Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
CS 103: Representation Learning, Information Theory and ControlLecture 8, Mar 1, 2019
2
Recap
Group nuisances- Group convolutions- Canonical reference frames- SIFT descriptorsGeneral nuisances- Minimal information in activation ⇒ Invariance to nuisances- Information Bottleneck- IB loss can upper-bounded by introducing an auxiliary variable- Aside: Variational Auto-Encoder can be seen as a particular case- Aside: Disentanglement in VAE
How does this relate to standard deep learning?
3
The Kolmogorov Structure of a Task
How can we define the structure of a task? Define the Kolmogorov Structure Function:
Trai
ning
Los
s
Kolmogorov complexity of model
S𝒟(t) = minK(M)≤t
L(𝒟; M)
Increasing the complexity of the model leadsto big gains in accuracy: We are learning the structure of the problem.
After learning all the structure, we can only memorize: inefficient asymptotic phase.
Optimal
Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018
Tangent = 1 in the asymptote: Need to store 1 bit in the model to decrease the loss by 1 bit
Kolmogorov minimal sufficient statistic
Kolmogorov's Structure Functions and Model Selection, Vereshchagin and Vitanyi, 2002
4
Optimizing using Deep Neural Networks
How do we find the optimal solution?
S𝒟(t) = minK(M)<t
L(𝒟; M)
ℒ(M) = L(𝒟; M) + λK(M)
Corresponding Lagrangian
K(M) ≤ KL(q(w |𝒟)∥p(w))
ℒ(M) = L(𝒟; M) + λ KL(q(w |𝒟)∥p(w))
Use the boundLet w be the parameters of the model
This loss can be implemented using a DNN and the local reparametrization trick.** Variational Dropout and the Local Reparameterization Trick, Kingma et al., 2015
Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018
5
Let’s rewrite it using Information Theory
We used an upperbound, what is the best we value it can assume?
I(w; 𝒟) ≤ 𝔼𝒟[KL(q(w |𝒟)∥p(w))],Recall that:
which is obtained when p(w) = q(w|D). Hence, on expectation over the datasets, the best function loss function to use to recover the task structure is:
ℒ(M) = 𝔼𝒟[H(𝒟 |w)] + λI(w; 𝒟) .
IB Lagrangian for the weights
ℒ(M) = 𝔼w∼q(w|𝒟)[Hp,q(𝒟 |w)] + λ KL(q(w |𝒟)∥p(w)) .
A new Information Bottleneck
dataset weights real distributionD w p(y|x)
data activations labelx z y
Activations IBInvariance
Weights IBOverfitting
6
minwL = Hp,qw (y |z) + �I(D;w)
<latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit>
minq(z |x)
L = Hp,q(y |z) + �I(z ; x)<latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit>
7
PAC-Bayes bound (Catoni, 2007; McAllester 2013).
Corollary. Minimizing the IB Lagrangian for the weights minimizes an upper bound on the test error.
This gives non-vacuous generalization bounds! (Dziugaite and Roy, 2017)
Catoni, 2007; McAllester 2013The PAC-Bayes generalization bound
8
Can we really minimize the IBL for the weights?
We are making an approximation by assuming that both q(w|D) and p(w) are Gaussian.
Let w* be a local minimum. The optimal amount of gaussian noise is to add is:
where F(w*) is the Fisher Information Matrix (equiv. Hessian) computed in w*.
Flat minima have low information in the weights.
Σ = (I +2λ2
βF (w0))
−1,
I (w ;D) kwk2
�2+ log |2�2 N F (w⇤) + I |
<latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit><latexit sha1_base64="QDF8UiYJ9M3yOcB2uVn0sVGhLVM=">AAACUHicbVDLSgMxFL1T3/VVdekmWIRWpcwUQcGNqIhuRNFqoaklk2ZqMPMwyVjKtB/mb7hz5U70D9xp+lCseiHkcM69ObnHjQRX2rafrNTI6Nj4xORUenpmdm4+s7B4qcJYUlaioQhl2SWKCR6wkuZasHIkGfFdwa7c2/2ufnXPpOJhcKFbEav6pBFwj1OiDVXLnB/nmjsI+0TfUCKSg04eYcHuEPYkoQluN3H7uthJsDBP1omBaN00hA3ULn5zeOMEbxzmmtdreaMet2uZrF2we4X+AmcAsjCo01rmBddDGvss0FQQpSqOHelqQqTmVLBOGseKRYTekgarGBgQn6lq0lu+g1YNU0deKM0JNOqxPycS4ivV8l3T2d1S/da65H9aJdbedjXhQRRrFtC+kRcLpEPUTRLVuWRUi5YBhEpu/oroDTGxaZN3eshGeT0XE4zzO4a/4LJYcOyCc7aZ3d0bRDQJy7ACOXBgC3bhCE6hBBQe4Ble4c16tN6tj5TVb/26YQmGKpX+BMWJsk0=</latexit>
Weight Information is bounded by the geometry of the loss landscape*
9
Is this approximation good? Phase transition
𝛽 < 1 ⇒ overfitting
𝛽 > 1 ⇒ underfitting
𝛽 < 1 ⇒ overfitting
𝛽 >> 1 ⇒ underfitting
fitting
Consider a dataset with random labels. There is no structure, so for λ < 1we should not fit anything, and for λ > 1 we should memorize everything (see the structure function).
10�2 10�1 100 101 102
Value of �
0%
20%
40%
60%
80%
100%
Tra
iner
ror
All-CNN
ResNet
Small AlexNet
Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018
Phase transition
Even with local approximation, we can observe this behavior in real deep networks.For real label, we have a “Goldilocks Zone” where we fit without overfitting for λ > 1.
Information is a better measure of complexity than number of parameters
10
Bias-variance tradeoff
Variance
Bias2
Total error
Model complexity
Err
or
Optimal DNN
Optimal model
Optimal DNN
Information complexity
Parametrizing the complexity with information in the weights, we recover bias-variance trade-off trend.
Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018Arora et al., Stronger generalization bounds for deep nets via a compression approach, ICML 2018
A new Information Bottleneck
dataset weights real distributionD w p(y|x)
data activations labelx z y
Activations IBInvariance
Weights IBOverfitting
11
minwL = Hp,qw (y |z) + �I(D;w)
<latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit><latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">AAACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+eecw/3ciZMpDDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUexx6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD002zFLI66QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S866s+E67je35nrTUdeZedbIs3L3HgEoLL3n</latexit>
minq(z |x)
L = Hp,q(y |z) + �I(z ; x)<latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit><latexit sha1_base64="eu/9thb7n8qPJF11t/o556inDEc=">AAACfHicdVFdSxtBFJ1d26qxrVEffbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z33pryidfCDV1AMDh3Pu4V7OJJkUFoPgp+cvvXj5anlltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtkklrbDoMMOwU1KJjkZSXKLc8ou6F93nZUU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bII/if/gLpa73S</latexit>
12
The Emergence Bound
Minimality of the weights (representation of the training set) induces minimality (hence invariance) and disentanglement of the activations.
g(I (w ;D)) I (z ; x) + TC (z) g(I (w ;D)) + O(1/ dim(x))<latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit><latexit sha1_base64="1GEXv2gC0eqEpKMRcwOX24wTxDo=">AAACU3icbVBNS8MwGM7q15xfU49egkNoEWYrgoKXwTy4i05xOljLSLN0CyZtTVLdVva//CMevHhV/AtezLYedPpA4OF5n/cjjx8zKpVtv+aMufmFxaX8cmFldW19o7i5dSujRGDSwBGLRNNHkjAakoaiipFmLAjiPiN3/n11XL97JELSKLxRg5h4HHVDGlCMlJbaxeuuWTOfTqHLkephxNKzkWVBl5EHWDOHp7BvwX14UzWHmfivfR9ems6B26Hc7FtWu1iyy/YE8C9xMlICGert4qfbiXDCSagwQ1K2HDtWXoqEopiRUcFNJIkRvkdd0tI0RJxIL538fQT3tNKBQST0CxWcqD87UsSlHHBfO8c3y9naWPyv1kpUcOKlNIwTRUI8XRQkDKoIjoOEHSoIVmygCcKC6lsh7iGBsNJxF36NwtxLZTDZpMNxZqP4S24Py45ddq6OSpWLLKY82AG7wAQOOAYVcA7qoAEweAZv4B185F5yX4ZhzE+tRi7r2Qa/YKx9A10Xr2Y=</latexit>
I(zL;x) mink<L{dim(zk)⇥g( I(Wk;D)
dim(Wk) ) + 1⇤}
Tight for one layer; for more layers we have:
Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018
Info in activations Info in weights
13
How do minimize the information in the weights?
1. Explicitly minimize the IBL for the weights using local reparametrization trick.2. Let SGD do it for you:• Empirically we know that SGD tend to find flatter minima.• We know from the local information bound that flat minima have less
information in the weights.• Hence, SGD implicitly minimizes the information in the weights.
3. Modify SGD to reduce information more aggressively.
Warning: These are still open questions and the claims are not proved.
Counterpoint: Are flat minima the whole story?
14
Information in Weights during training
Training Epoch
Info
rmat
ion
in W
eigh
ts
What should we expect from the information in the weights during training?
Maybe something like this?
15
Information during training
Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018
Information extraction Information consolidation
16
A snag: Critical periods in Deep Networks
Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018
Deficit Normal training
0 160+NN
Show network blurred images to simulate cataract
The network does not classify correctly if the deficit is removed to late
A short deficit at epoch ~40 is enough to permanently damage the network.
17
Critical periods
Critical periods. A time-period in early development where sensory deficits can permanently impair the acquisition of a skill
Examples: monocular deprivation, cataracts, imprinting, language acquisition, …
Deficit Normal training
0 180 + NN
Kitten does not recover vision in covered eye
Image from Cnops et al., 2008
Hubel and Wiesel
18
Critical learning periods and Information in Weights
Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018
Sensitivity to deficits peaks when network is absorbing information.Is minimal when the network is consolidating information.
19
Are flat minima an epiphenomenon?
Final sharpness correlates with generalization…
…but generalization quality is decided here, far from convergence to minima