Learning and Memorization
Sat Chatterjee ([email protected])¹

¹ This work was done at Two Sigma Investments.²
² Yes, it was not quite my day job.
ICML 2018
Motivation
Neural Networks can memorize large amounts of random data
Since it is believed that memorization and generalization are incompatible, if nets can memorize random data, why do they generalize on real data?
Understanding deep learning requires rethinking generalization.
Zhang et al. ICLR '17.
[Figure: Inception model]
One Possibility
Perhaps networks do different things with random data than with real data.
There are qualitative differences in DNN optimization behavior on real data vs. noise.
A Closer Look at Memorization in Deep Networks. Arpit et al. ICML '17.
The “hardness” of training examples is distributed differently for random data and real data.
Another Possibility
Perhaps networks do the same thing with random data as with real data.
So, if they memorize random data, is it possible that they also memorize real data?
But then how do they generalize? After all, aren’t memorization and generalization at odds?
Can memorization alone lead to generalization?
Let’s Make This Concrete
Consider this learning task (Binary MNIST):
Each pixel is quantized to 1 bit.
Separate ‘0’-‘4’ from ‘5’-‘9’ (binary classification).
We want to see if memorization can lead to generalization. Two obvious ways to memorize:
1. A giant lookup table (maps a 784-bit, i.e. 28x28, vector to 0 or 1), but it does not generalize
2. Nearest neighbor, but it needs a distance (metric)
[Figure: example images for class 0 and class 1]
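To make the setup concrete, here is a hypothetical preprocessing sketch for Binary MNIST: threshold each pixel to one bit and relabel digits ‘0’-‘4’ as class 0 and ‘5’-‘9’ as class 1. The 128 threshold and the tensorflow.keras loader are my own choices; the slide does not specify them.

```python
# Hypothetical Binary MNIST preprocessing: 1-bit pixels, 0-4 vs. 5-9 labels.
import numpy as np
from tensorflow.keras.datasets import mnist  # any MNIST loader would do here

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train_bits = (x_train.reshape(-1, 28 * 28) >= 128).astype(np.uint8)  # 784-bit vectors
x_test_bits = (x_test.reshape(-1, 28 * 28) >= 128).astype(np.uint8)
y_train_bin = (y_train >= 5).astype(np.uint8)  # digits 0-4 -> class 0, 5-9 -> class 1
y_test_bin = (y_test >= 5).astype(np.uint8)
```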
Ok, so what can we do?
Instead of a single large lookup table, build a network of small lookup tables.
Each lookup table (lut) is connected to k random outputs from the previous layer, i.e. it maps a k-wide bit-vector to 0 or 1 (a code sketch follows the diagram below).
k is typically less than 16
[Figure: example network with k = 2 luts — input layer, first layer of lookup tables, second layer of lookup tables, output]
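Below is a minimal NumPy sketch of this idea (my reconstruction, not the author's code). Each lut is wired to k random bits of the previous layer and, during "training", simply memorizes the majority training label for each of its 2^k input patterns. The tie-breaking rule and the final majority vote over the last layer's bits are assumptions made for illustration.

```python
# Minimal sketch of a network of small lookup tables (luts), filled by memorization.
import numpy as np

rng = np.random.default_rng(0)

def fit_lut_net(bits, labels, n_layers, n_luts, k):
    """bits: (n, width) array of 0/1; labels: (n,) array of 0/1."""
    powers = 1 << np.arange(k)
    layers = []
    for _ in range(n_layers):
        wiring = rng.integers(0, bits.shape[1], size=(n_luts, k))  # k random inputs per lut
        tables = np.zeros((n_luts, 2 ** k), dtype=np.uint8)
        nxt = np.zeros((bits.shape[0], n_luts), dtype=np.uint8)
        for j in range(n_luts):
            addr = bits[:, wiring[j]] @ powers                     # k-bit input pattern per sample
            ones = np.bincount(addr[labels == 1], minlength=2 ** k)
            zeros = np.bincount(addr[labels == 0], minlength=2 ** k)
            tables[j] = (ones >= zeros).astype(np.uint8)           # majority label per pattern
            nxt[:, j] = tables[j][addr]
        layers.append((wiring, tables))
        bits = nxt                                                 # feed this layer's outputs forward
    return layers

def predict(layers, bits, k):
    powers = 1 << np.arange(k)
    for wiring, tables in layers:
        nxt = np.zeros((bits.shape[0], wiring.shape[0]), dtype=np.uint8)
        for j in range(wiring.shape[0]):
            nxt[:, j] = tables[j][bits[:, wiring[j]] @ powers]
        bits = nxt
    return (bits.mean(axis=1) >= 0.5).astype(np.uint8)             # assumed: majority vote at the output
```

Note that there is no gradient descent or search anywhere: filling the tables is a single pass over the training data per layer.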
So, does this work? Surprisingly, yes!
5 hidden layers with 1024 luts in each layer
Each lut has k = 8 inputs from the previous layer
training accuracy = 0.89, test accuracy = 0.87
random chance = 0.50
Robust to randomness in topology
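Using the fit_lut_net/predict sketch from the previous slide (my names, not the paper's), the stated configuration would look like this; exact accuracies will differ from run to run.

```python
# Slide's configuration: 5 hidden layers of 1024 luts, each with k = 8 random inputs.
layers = fit_lut_net(x_train_bits, y_train_bin, n_layers=5, n_luts=1024, k=8)
test_acc = (predict(layers, x_test_bits, k=8) == y_test_bin).mean()
print(f"test accuracy = {test_acc:.2f}")  # slide reports 0.87 (train 0.89, chance 0.50)
```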
What happens as we vary the lut size k?
k controls “brute force” memorization
Random data is harder to memorize!
At k = 14 we see neural-network-like behavior: memorize random data, yet generalize on real data!
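One back-of-envelope way to see why k acts as a memorization knob (my own arithmetic, not from the slides): a k-input lut holds 2^k entries and can realize 2^(2^k) distinct Boolean functions, so its capacity to fit arbitrary labelings grows very rapidly with k.

```python
# Entries and realizable Boolean functions per lut as k grows (rough capacity intuition).
for k in (2, 8, 14):
    print(f"k = {k:2d}: {2 ** k:5d} entries per lut, 2^{2 ** k} possible Boolean functions")
```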
How does it compare to other methods?
Not state-of-the-art, but much better than chance and closer to the other methods than to chance
No search, no domain-specific architecture or distance function
45 Pairwise MNIST Tasks (e.g. separate ‘3’ from ‘7’)
Similar results: k controls brute force memorization and small k generalizes
Why is the variance so high?
More mixing with deeper networks
But with k = 2 there is no overfitting no matter how deep!
Binary CIFAR-10
Pairwise CIFAR-10
Memorizing a line (inputs are 2 x 10-bit fixed point)
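A hypothetical encoding for this synthetic task (the slide gives only the input format): quantize each coordinate in [0, 1) to 10 bits, concatenate into a 20-bit input, and label points by which side of an assumed line (here y = x) they fall on. The circle task on the next slide would use the same encoding with an inside/outside label.

```python
# Hypothetical data generation for the "memorize a line" task: 2 x 10-bit fixed-point inputs.
import numpy as np

rng = np.random.default_rng(0)

def to_bits(v, n_bits=10):
    # Quantize values in [0, 1) to n_bits-bit fixed point and unpack into individual bits.
    q = np.minimum((v * (1 << n_bits)).astype(int), (1 << n_bits) - 1)
    return ((q[:, None] >> np.arange(n_bits)) & 1).astype(np.uint8)

x, y = rng.random(10_000), rng.random(10_000)
inputs = np.hstack([to_bits(x), to_bits(y)])  # shape (10000, 20): 2 coordinates x 10 bits
labels = (y >= x).astype(np.uint8)            # assumed rule: which side of the line y = x
```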
Memorizing a circle (inputs are 2 x 10-bit fixed point)
Conclusions
1. Pure memorization can lead to generalization
2. This model replicates some interesting features of neural networks:
○ Depth helps
○ It memorizes random data and yet generalizes on real data
○ Memorizing random data is harder than memorizing real data
3. So observations like these cannot be used to argue that there is no memorization
Future Directions
1. Can we understand this better theoretically?
○ Prove generalization bounds (via bounding Rademacher complexity?). Useful for small k
○ For larger k we need new ideas to get non-vacuous bounds. Similar to the problem with neural networks, but in a simpler setting (cf. recent work on margin-based analysis)
2. Can we get a practically useful learner from this?
○ Small-k networks are conservative signal hunters: if you find a signal, you know one really exists (i.e. no overfitting).
○ Currently there is no representation learning. A next step: search over lut functions, but with explicit control on overfitting.
○ Low-hanging fruit: distillation from quantized neural nets for fast, cheap inference with no arithmetic
Questions?
Answers?