Learning and Memorization
Sat Chatterjee ([email protected])¹

¹ This work was done at Two Sigma Investments.²
² Yes, it was not quite my day job.
ICML 2018
Motivation
Neural Networks can memorize large amounts of random data
Since it is believed that memorization and generalization are incompatible, if nets can memorize random data, why do they generalize on real data?
Understanding deep learning requires rethinking generalization.
Zhang et al. ICLR '17.
[Figure: Inception model]
One Possibility
Perhaps networks do different things with random data than with real data.
There are qualitative differences in DNN optimization behavior on real data vs. noise.
A Closer Look at Memorization in Deep Networks. Arpit et al. ICML '17.
The “hardness” of training examples is distributed differently for random data and real data.
Another Possibility
Perhaps networks do the same thing with random data as with real data.
So, if they memorize random data, is it possible that they also memorize real data?
But then how do they generalize? After all, aren’t memorization and generalization at odds?
Can memorization alone lead to generalization?
Let’s Make This Concrete
Consider this learning task (Binary MNIST):
Each pixel is quantized to 1 bit.
Separate ‘0’-‘4’ from ‘5’-‘9’ (binary classification).
We want to see if memorization can lead to generalization. Two obvious ways to memorize:
1. A giant lookup table (maps a 784-bit, i.e. 28x28, vector to 0 or 1), but it does not generalize
2. Nearest neighbor, but it needs a distance (metric)
[Figure: example images for class 0 and class 1]
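To make the setup concrete, here is a hypothetical preprocessing sketch for Binary MNIST: threshold each pixel to one bit and relabel digits ‘0’-‘4’ as class 0 and ‘5’-‘9’ as class 1. The 128 threshold and the tensorflow.keras loader are my own choices; the slide does not specify them.

```python
# Hypothetical Binary MNIST preprocessing: 1-bit pixels, 0-4 vs. 5-9 labels.
import numpy as np
from tensorflow.keras.datasets import mnist  # any MNIST loader would do here

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train_bits = (x_train.reshape(-1, 28 * 28) >= 128).astype(np.uint8)  # 784-bit vectors
x_test_bits = (x_test.reshape(-1, 28 * 28) >= 128).astype(np.uint8)
y_train_bin = (y_train >= 5).astype(np.uint8)  # digits 0-4 -> class 0, 5-9 -> class 1
y_test_bin = (y_test >= 5).astype(np.uint8)
```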
Ok, so what can we do?
Instead of a single large lookup table, build a network of small lookup tables.
Each lookup table (lut) is connected to k random outputs from the previous layer, i.e. it maps a k-wide bit-vector to 0 or 1 (a code sketch follows the diagram below).
k is typically less than 16
[Figure: example network with k = 2 luts — input layer, first layer of lookup tables, second layer of lookup tables, output]
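Below is a minimal NumPy sketch of this idea (my reconstruction, not the author's code). Each lut is wired to k random bits of the previous layer and, during "training", simply memorizes the majority training label for each of its 2^k input patterns. The tie-breaking rule and the final majority vote over the last layer's bits are assumptions made for illustration.

```python
# Minimal sketch of a network of small lookup tables (luts), filled by memorization.
import numpy as np

rng = np.random.default_rng(0)

def fit_lut_net(bits, labels, n_layers, n_luts, k):
    """bits: (n, width) array of 0/1; labels: (n,) array of 0/1."""
    powers = 1 << np.arange(k)
    layers = []
    for _ in range(n_layers):
        wiring = rng.integers(0, bits.shape[1], size=(n_luts, k))  # k random inputs per lut
        tables = np.zeros((n_luts, 2 ** k), dtype=np.uint8)
        nxt = np.zeros((bits.shape[0], n_luts), dtype=np.uint8)
        for j in range(n_luts):
            addr = bits[:, wiring[j]] @ powers                     # k-bit input pattern per sample
            ones = np.bincount(addr[labels == 1], minlength=2 ** k)
            zeros = np.bincount(addr[labels == 0], minlength=2 ** k)
            tables[j] = (ones >= zeros).astype(np.uint8)           # majority label per pattern
            nxt[:, j] = tables[j][addr]
        layers.append((wiring, tables))
        bits = nxt                                                 # feed this layer's outputs forward
    return layers

def predict(layers, bits, k):
    powers = 1 << np.arange(k)
    for wiring, tables in layers:
        nxt = np.zeros((bits.shape[0], wiring.shape[0]), dtype=np.uint8)
        for j in range(wiring.shape[0]):
            nxt[:, j] = tables[j][bits[:, wiring[j]] @ powers]
        bits = nxt
    return (bits.mean(axis=1) >= 0.5).astype(np.uint8)             # assumed: majority vote at the output
```

Note that there is no gradient descent or search anywhere: filling the tables is a single pass over the training data per layer.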
So, does this work? Surprisingly, yes!
5 hidden layers with 1024 luts in each layer
Each lut has k = 8 inputs from the previous layer
training accuracy = 0.89, test accuracy = 0.87
random chance = 0.50
Robust to randomness in topology
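Using the fit_lut_net/predict sketch from the previous slide (my names, not the paper's), the stated configuration would look like this; exact accuracies will differ from run to run.

```python
# Slide's configuration: 5 hidden layers of 1024 luts, each with k = 8 random inputs.
layers = fit_lut_net(x_train_bits, y_train_bin, n_layers=5, n_luts=1024, k=8)
test_acc = (predict(layers, x_test_bits, k=8) == y_test_bin).mean()
print(f"test accuracy = {test_acc:.2f}")  # slide reports 0.87 (train 0.89, chance 0.50)
```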
What happens as we vary the lut size k?
k controls “brute force” memorization
Random data is harder to memorize!
At k = 14 we see neural-network-like behavior: memorize random data, yet generalize on real data!
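One back-of-envelope way to see why k acts as a memorization knob (my own arithmetic, not from the slides): a k-input lut holds 2^k entries and can realize 2^(2^k) distinct Boolean functions, so its capacity to fit arbitrary labelings grows very rapidly with k.

```python
# Entries and realizable Boolean functions per lut as k grows (rough capacity intuition).
for k in (2, 8, 14):
    print(f"k = {k:2d}: {2 ** k:5d} entries per lut, 2^{2 ** k} possible Boolean functions")
```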
How does it compare to other methods?
Not state-of-the-art, but much better than chance and closer to the other methods than to chance
No search, no domain-specific architecture or distance function
45 Pairwise MNIST Tasks (e.g. separate ‘3’ from ‘7’)
Similar results: k controls brute force memorization and small k generalizes
Why is the variance so high?
More mixing with deeper networks
But with k = 2 there is no overfitting no matter how deep!
Binary CIFAR-10
Pairwise CIFAR-10
Memorizing a line (inputs are 2 x 10-bit fixed point)
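A hypothetical encoding for this synthetic task (the slide gives only the input format): quantize each coordinate in [0, 1) to 10 bits, concatenate into a 20-bit input, and label points by which side of an assumed line (here y = x) they fall on. The circle task on the next slide would use the same encoding with an inside/outside label.

```python
# Hypothetical data generation for the "memorize a line" task: 2 x 10-bit fixed-point inputs.
import numpy as np

rng = np.random.default_rng(0)

def to_bits(v, n_bits=10):
    # Quantize values in [0, 1) to n_bits-bit fixed point and unpack into individual bits.
    q = np.minimum((v * (1 << n_bits)).astype(int), (1 << n_bits) - 1)
    return ((q[:, None] >> np.arange(n_bits)) & 1).astype(np.uint8)

x, y = rng.random(10_000), rng.random(10_000)
inputs = np.hstack([to_bits(x), to_bits(y)])  # shape (10000, 20): 2 coordinates x 10 bits
labels = (y >= x).astype(np.uint8)            # assumed rule: which side of the line y = x
```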
Memorizing a circle (inputs are 2 x 10-bit fixed point)
Conclusions
1. Pure memorization can lead to generalization
2. This model replicates some interesting features of neural networks:
○ Depth helps
○ It memorizes random data and yet generalizes on real data
○ Memorizing random data is harder than memorizing real data
3. So observations like these cannot be used to argue that there is no memorization
Future Directions
1. Can we understand this better theoretically?
○ Prove generalization bounds (via bounding Rademacher complexity?). Useful for small k
○ For larger k we need new ideas to get non-vacuous bounds. Similar to the problem with neural networks, but in a simpler setting (cf. recent work on margin-based analysis)
2. Can we get a practically useful learner from this?
○ Small-k networks are conservative signal hunters: if you find a signal, you know one really exists (i.e. no overfitting).
○ Currently there is no representation learning. A next step: search over lut functions, but with explicit control on overfitting.
○ Low-hanging fruit: distillation from quantized neural nets for fast, cheap inference with no arithmetic
Questions?
Answers?