
calculation | consulting

why deep learning works: perspectives from theoretical chemistry

Dr. Charles H. Martin
charles@calculationconsulting.com

MMDS 2016 · mmds-data.org/presentations/2016/s-martin.pdf

Who Are We?

Dr. Charles H. Martin, PhD, University of Chicago, Chemical Physics; NSF Fellow in Theoretical Chemistry

Over 10 years of experience in applied Machine Learning. Developed ML algos for Demand Media, the first $1B IPO since Google.

Tech: Aardvark (now Google), eHow, GoDaddy, …
Wall Street: BlackRock
Fortune 500: Big Pharma, Telecom, eBay


Data Scientists are Different

• theoretical physics: machine learning specialist

• experimental physics: data scientist

• engineer: software, browser tech, dev ops, …

not all techies are the same


Problem: How can SGD possibly work? Aren't Neural Nets non-convex?!


can Spin Glass models suggest why?

what other models are out there?

expected vs. observed?


Outline

Random Energy Model (REM)

Temperature, regularization and the glass transition

extending REM: Spin Glass of Minimal Frustration

protein folding analogy: Funneled Energy Landscapes

example: Dark Knowledge

Recent work: Spin Glass models for Deep Nets


Warning

condensed matter theory is about qualitative analogies

we may seek:

• a toy model

• a mean field theory

• a phenomenological description

What problem is Deep Learning solving?

minimize cross-entropy
https://www.ics.uci.edu/~pjsados.pdf

Problem: What is a good theoretical model for deep networks?

p-spherical spin glass

LeCun … 2015

• L: the Hamiltonian (energy function)

• X: Gaussian random variables

• w: real-valued spins, with a spherical constraint

• H >= 3 (the network depth H plays the role of p)

can be solved analytically, simulated easily
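For reference, a standard form of this Hamiltonian (the normalization of Auffinger, Ben Arous and Černý, as used by Choromanska et al.; the slides do not spell it out):

H_{N,p}(w) = \frac{1}{N^{(p-1)/2}} \sum_{i_1,\dots,i_p=1}^{N} X_{i_1 \dots i_p} \, w_{i_1} \cdots w_{i_p}, \qquad \frac{1}{N} \sum_{i=1}^{N} w_i^2 = 1

with the couplings X_{i_1 \dots i_p} i.i.d. standard Gaussians.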

What is a spin glass?

Frustration: constraints that cannot be satisfied

J = X = weights; S = w = spins

Energetically: all spins should be paired
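For concreteness, the textbook Ising spin glass Hamiltonian (Edwards-Anderson form; an assumption here, since the slide only gives the dictionary J = X, S = w) is

H(S) = -\sum_{\langle i,j \rangle} J_{ij} \, S_i S_j, \qquad S_i = \pm 1

with random couplings J_{ij}. A loop of spins is frustrated when the product of the J_{ij} around the loop is negative: no assignment of the S_i can satisfy every bond at once.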

why p-spherical spin glass?

crudely: deep networks (effectively) have no local minima!

[Figure: the critical points are ordered by index k. Local minima sit in a band just above the floor / ground state; k = 1, 2, 3, … saddle points sit at successively higher energies.]

why p-spherical spin glass?

crudely: deep networks (effectively) have no local minima!

http://cims.nyu.edu/~achoroma/NonFlash/Papers/PAPER_AMMGY.pdf

any local minimum will do; the ground state is a state of overtraining

[Figure: schematic contrasting "good generalization" with "overtraining".]

Early Stopping: to avoid the ground state?

it's easy to find the ground state; it's hard to generalize?

Early Stopping: to avoid the ground state?

Current Interpretation

• finding the ground state is easy (sic); generalizing is hard

• finding the ground state is irrelevant: any local minimum will do

• the ground state is a state of overtraining

recent p-spherical spin glass results

actually: recent results (2013) on the behavior (distribution of critical points, concentration of the means) of an isotropic random function on a high-dimensional manifold

these require:

• the variables actually concentrate on their means

• the weights are drawn from an isotropic random function

related to: old results on TAP solutions (1977); # critical points ~ TAP complexity

avoid local minima?: increase Temperature
harder problem: the low-Temperature behavior of the spin glass

What problem is Deep Learning solving?

minimize cross-entropy of output layer

entropic effects: not just min energy

more like min free energy (divergence)

Statistical Physics and Information Theory: Neri Merhav
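In standard statistical-mechanics notation, "min free energy" means minimizing

F = -T \log Z = \langle E \rangle - T S, \qquad Z = \sum_j e^{-E_j / T}

so the entropy S is traded off against the average energy, not just the minimum energy.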

A related approach: the Random Energy Model (REM)

the p -> inf limit of the p-spherical spin glass

https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf

Random Energy Model (REM)

ground state is governed by Extreme Value Statistics
http://guava.physics.uiuc.edu/~nigel/courses/563/essays2000/pogorelov.pdf

http://scitation.aip.org/content/aip/journal/jcp/111/14/10.1063/1.479951

old result from protein folding theory
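Concretely, in the convention of the Mézard-Montanari chapter linked above, the REM assigns i.i.d. Gaussian energies to its 2^N configurations, so the ground state is the minimum of 2^N Gaussians, a textbook extreme-value problem:

E_j \sim \mathcal{N}(0, N/2), \quad j = 1, \dots, 2^N, \qquad E_{\min} = \min_j E_j \approx -N \sqrt{\log 2}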

REM: What is Temperature?

We can use statistical mechanics to analyze known algorithms

I don’t mean in the traditional sense of algorithmic analysis

take E_j as the objective = loss function + regularizer

study Z: form a mean field theory; take limits N -> inf, T -> 0
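A sketch of that setup: treat each configuration's objective value as its energy and study the resulting partition function,

Z(T) = \sum_j e^{-E_j / T}, \qquad E_j = \mathrm{loss}(w_j) + \mathrm{regularizer}(w_j)

then take the limits above.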

REM: What is Temperature?

let E(T) be the effective energy:

E(T) = E/T ~ sum of weights * activations

as T -> 0, the effective energies E(T) diverge; the weights explode

Temperature is a proxy for weight constraints

T sets the Energy Scale
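One way to read this: the Boltzmann factor only ever sees the ratio E/T,

e^{-E/T} = e^{-E(T)}, \qquad E(T) = E/T \sim \frac{1}{T} \sum_i w_i a_i

so rescaling T is equivalent to rescaling all the weights, and a hard constraint on the weight norms acts like a floor on the effective temperature.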

Temperature: as Weight Constraints

• traditional weight regularization

• max-norm constraints (i.e. w/ dropout)

• batch norm regularization (2015)

we avoid situations where the weights explode

in deep networks, we temper the weights and the distribution of the activations (i.e. local entropy)

REM: a toy model for real Glasses

the glass transition is not well understood

but it is believed that entropy collapse 'drives' the glass transition

what is a real (structural) Glass?

Sand + Fire = Glass

what is a real (structural) Glass?

all liquids can be made into glasses, if we cool them fast enough

the glass transition is not a normal phase transition; it is not the melting point

arrangement of atoms is amorphous; not completely random

different cooling rates produce different glassy states

universal phenomena, but not universal physics: molecular details affect the thermodynamics

REM: the Glass Transition

Entropy collapses when T <~ Tc

[Phase diagram: entropy density, energy density, and free energy density as functions of temperature]

https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf
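In this convention the REM entropy density is

s(e) = \log 2 - e^2, \qquad |e| \le \sqrt{\log 2}

which vanishes at the ground-state energy density e_0 = -\sqrt{\log 2}; solving 1/T = \partial s / \partial e there gives T_c = 1 / (2 \sqrt{\log 2}). Below T_c the Boltzmann measure condenses onto the few lowest-lying states: entropy collapse.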

REM: Dynamics on the Energy Landscape

let us assume some states trap the solver for some time;

of course, there is a great effort to design solvers that can avoid traps

Energy Landscapes and Protein Folding

let us assume the solver gets trapped in a state with energy E(j) for a short time,

and that the transitions E(j) -> E(j-1) are finite and reversible (i.e. SGD oscillates back and forth for a while)

classic result(s): for T near the glass Temp (Tc), the traversal times are slower than exponential!

in a physical system, like a protein or polymer, it would take longer than the known lifetime of the universe to find the ground (folded) state
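A sketch of the trap argument behind this (Bouchaud-style, not derived on the slides): escaping a trap of depth E takes an Arrhenius time, and averaging over an exponential density of trap depths,

\tau(E) \sim e^{E/T}, \qquad \langle \tau \rangle \sim \int \rho(E) \, e^{E/T} \, dE, \quad \rho(E) \sim e^{-E/T_c}

the mean trapping time \langle \tau \rangle diverges as T \to T_c^{+}: dynamics slow catastrophically near the glass temperature.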

Protein Folding: the Levinthal Paradox

folding could take longer than the known lifetime of the universe?

Protein Folding: around the Levinthal Paradox

Old analogy between Protein folding and Hopfield Associative Memories
http://arxiv.org/pdf/cond-mat/9904060v2.pdf

Nature (i.e. folding) cannot operate this way!

Natural pattern recognition could

• use a mechanism with a glass Temp (Tc) that is as low as possible

• avoid the glass transition entirely, via energetics

Spin Glasses: Minimizing Frustration

http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf

Spin Glasses vs. Disordered Ferromagnets

http://arxiv.org/pdf/cond-mat/9904060v2.pdf

the Spin Glass of Minimal Frustration

REM + strongly correlated ground state = no glass transition

https://arxiv.org/pdf/1312.7283.pdf

the Spin Glass of Minimal Frustration

Training a model induces an energy gap, with few local minima
http://arxiv.org/pdf/1312.0867v1.pdf

Energy Funnels: Entropy vs Energy

there is a tradeoff between Energy and Entropy minimization

Energy Landscape Theory of Protein Folding

there is a tradeoff between Energy and Entropy minimization

Energy Landscape Theory of Protein Folding

Avoids the glass transition by having more favorable energetics

[Figure: two landscapes contrasted. Glassy surface: Levinthal paradox, vanishing gradients. Funneled landscape: rugged convexity, energy / entropy tradeoff.]

Dark Knowledge: an Energy Funnel?

784 -> 800 -> 800 -> 10 MLP on MNIST (10,000 test cases, 10 classes)

• baseline net: 146 errors

• Distilled net (same 784 -> 800 -> 800 -> 10, fit to ensemble soft-max probabilities): 99 errors

same entropy (capacity); better loss function
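A minimal sketch of the distillation recipe behind these numbers (Hinton, Vinyals & Dean's "Dark Knowledge"), assuming a trained teacher/ensemble; the function names, temperature, and blending weight are illustrative choices, not taken from the slides:

import numpy as np

def softmax(logits, T=1.0):
    # temperature-scaled soft-max: higher T gives softer probabilities
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # cross-entropy against the teacher's softened targets, blended with
    # ordinary cross-entropy against the hard labels
    n = student_logits.shape[0]
    soft_targets = softmax(teacher_logits, T)   # the teacher's "dark knowledge"
    soft_student = softmax(student_logits, T)
    soft_loss = -np.mean(np.sum(soft_targets * np.log(soft_student + 1e-12), axis=1))
    hard_student = softmax(student_logits)      # T = 1 for the true labels
    hard_loss = -np.mean(np.log(hard_student[np.arange(n), labels] + 1e-12))
    # the T**2 factor keeps the soft-target gradients on the same scale
    return alpha * T**2 * soft_loss + (1.0 - alpha) * hard_loss

# toy usage: a batch of 32 examples, 10 classes
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(32, 10))
teacher_logits = rng.normal(size=(32, 10))
labels = rng.integers(0, 10, size=32)
print(distillation_loss(student_logits, teacher_logits, labels))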

Adversarial Deep Nets: an Energy Funnel?

Discriminator learns a complex loss function

Generator: fake data

Discriminator: fake vs. real?

http://soumith.ch/eyescream/


Summary

Random Energy Model (REM): simpler theoretical model

Glass Transition: temperature ~ weight constraints

extending REM: Spin Glass of Minimal Frustration

Funneled Energy Landscapes

possible examples: Dark Knowledge, Adversarial Deep Nets

c | c

charles@calculationconsulting.com