Kernels, Random Embeddings and Deep Learning
Vikas Sindhwani
IBM Research, NY
October 28, 2014
Acknowledgements
At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan
Summer Interns: Jiyan Yang (Stanford), Po-Sen Huang (UIUC); Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)
IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)
Setting

Given labeled data in the form of input-output pairs

{(x_i, y_i)}_{i=1}^n,   x_i ∈ X ⊂ R^d,   y_i ∈ Y ⊂ R^m,

estimate the unknown dependency f : X → Y.

Regularized Risk Minimization in a suitable hypothesis space H:

argmin_{f ∈ H}  Σ_{i=1}^n V(f(x_i), y_i) + Ω(f)

Large n ⟹ big models: H "rich"/non-parametric/nonlinear.

Two great ML traditions around choosing H:
– Deep Neural Networks: f(x) = s_n(. . . s_2(W_2 s_1(W_1 x)) . . .)
– Kernel Methods: a general nonlinear function space generated by a kernel function k(x, z) on X × X.

This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.
Outline

Motivation and Background

Scalable Kernel Methods
– Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
– libSkylark: an open source software stack
– Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?
Deep Learning is "Supercharging" Machine Learning

Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.

Many statistical and computational ingredients:
– Large datasets (ILSVRC since 2010)
– Large statistical capacity (1.2M images, 60M params)
– Distributed computation
– Depth, invariant feature learning (transferable to other tasks)
– Engineering: Dropout, ReLU . . .

Very active area in Speech and Natural Language Processing.
Machine Learning in 1990s

Convolutional Neural Networks (Fukushima, 1980; LeCun et al., 1989)
– 3 days to train on USPS (n = 7291; digit recognition) on a Sun SPARCstation 1 (33MHz clock speed, 64MB RAM)

Personal history:
– 1998: First ML experiment - train a DNN on the UCI Wine dataset.
– 1999: Introduced to Kernel Methods - by DNN researchers!
– 2003-4: NN paper at JMLR rejected; accepted in IEEE Trans. Neural Nets with a kernel methods section!

Why Kernel Methods?
– Local-minima free - stronger role of Convex Optimization.
– Theoretically appealing.
– Handle non-vectorial and high-dimensional data.
– Easier model selection via continuous optimization.
– Matched NNs in many cases, although they didn't scale as well w.r.t. n.

So what changed?
– More data, parallel algorithms, hardware? Better DNN training? . . .
Kernel Methods and Neural Networks

From the Geoff Hinton facts meme maintained at http://yann.lecun.com/ex/fun/:

– All kernels that ever dared approaching Geoff Hinton woke up convolved.
– The only kernel Geoff Hinton has ever used is a kernel of truth.
– If you defy Geoff Hinton, he will maximize your entropy in no time. Your free energy will be gone even before you reach equilibrium.

Are there synergies between these fields towards the design of even better (faster and more accurate) algorithms?
The Mathematical Naturalness of Kernel Methods

Data x ∈ X ⊂ R^d; models f ∈ H, f : X → R.

Geometry in H: inner product ⟨·, ·⟩_H, norm ‖·‖_H (Hilbert spaces).

Theorem: All "nice" Hilbert spaces are generated by a symmetric positive definite function (the kernel) k(x, z) on X × X.
– If f, g ∈ H are close, i.e. ‖f − g‖_H is small, then f(x) and g(x) are close ∀x ∈ X.
– Reproducing Kernel Hilbert Spaces (RKHSs)

Functional Analysis (Aronszajn, Bergman (1950s)); Statistics (Parzen (1960s)); PDEs; Numerical Analysis . . .
– ML: nonlinear classification, regression, clustering, time-series analysis, dynamical systems, hypothesis testing, causal modeling, . . .

In principle, it is possible to compose Deep Learning pipelines using more general nonlinear functions drawn from RKHSs.
Scalable Kernel Methods
Scalability Challenges for Kernel Methods

f* = argmin_{f ∈ H_k}  (1/n) Σ_{i=1}^n V(y_i, f(x_i)) + λ‖f‖²_{H_k},   x_i ∈ R^d

Representer Theorem: f*(x) = Σ_{i=1}^n α_i k(x, x_i)

Regularized Least Squares: (K + λI)α = Y
– O(n²) storage
– O(n³ + n²d) training
– O(nd) test speed

Hard to parallelize when working directly with K_ij = k(x_i, x_j).
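To make these costs concrete, here is a minimal MATLAB sketch of exact kernel ridge regression with a Gaussian kernel (an illustration, not from the slides); X (n-by-d), y, xtest, sigma, and lambda are assumed inputs, and pdist2 requires the Statistics Toolbox:

% Exact kernel ridge regression (Gaussian kernel).
K = exp(-pdist2(X, X).^2/(2*sigma^2));    % n-by-n Gram matrix: O(n^2) storage
alpha = (K + lambda*eye(size(X,1)))\y;    % dense solve: O(n^3) training
Ktest = exp(-pdist2(xtest, X).^2/(2*sigma^2));
ypred = Ktest*alpha;                      % O(nd) work per test point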
Randomized Algorithms

Explicit approximate feature map Ψ : R^d → C^s such that

k(x, z) ≈ ⟨Ψ(x), Ψ(z)⟩_{C^s}   ⟹   (Z(X)^T Z(X) + λI) w = Z(X)^T Y
– O(ns) storage
– O(ns²) training
– O(s) test speed

Interested in data-oblivious maps that depend only on the kernel function, not on the data.
– Should be very cheap to apply and parallelizable.
Random Fourier Features (Rahimi & Recht, 2007)

Theorem [Bochner, 1930/33]: There is a one-to-one Fourier-pair correspondence between any (normalized) shift-invariant kernel k and a density p such that

k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw

– Gaussian kernel: k(x, z) = e^{−‖x−z‖²₂/(2σ²)} ⟺ p = N(0, σ^{−2} I_d)
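As a sanity check (not from the slides), the correspondence can be verified numerically: averaging e^{−i(x−z)^T w} over samples w ∼ N(0, σ^{−2} I_d) recovers the Gaussian kernel value. The choices of d, sigma, and s below are arbitrary:

% Numerical check of Bochner's theorem for the Gaussian kernel.
d = 5; sigma = 2; s = 1e5;
x = randn(d,1); z = randn(d,1);
W = randn(d, s)/sigma;               % s samples from p = N(0, sigma^-2 I_d)
approx = mean(cos((x - z)'*W));      % imaginary parts cancel in expectation
exact = exp(-norm(x - z)^2/(2*sigma^2));
fprintf('exact %.4f, MC estimate %.4f\n', exact, approx);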
DNNs vs Kernel Methods on TIMIT (Speech)
Joint work with the IBM Speech Group and P. Huang: can "shallow", convex, randomized kernel methods "match" DNNs? (Predicting HMM states given a short window of coefficients representing acoustic input.)
G = randn(size(X,2), s)/sigma;          % frequencies w ~ N(0, sigma^-2 I_d)
Z = exp(1i*X*G);                        % n-by-s random Fourier feature map
C = Z'*Z;                               % s-by-s
alpha = (C + lambda*eye(s))\(Z'*y(:));  % regularized least squares
ztest = exp(1i*xtest*G)*alpha;          % predictions on test inputs
DNNs vs Kernel Methods on TIMIT (Speech)

Kernel Methods Match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech group.
High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

– ∼2 hours on 256 IBM BlueGene/Q nodes.
– Phone error rate of 21.3% - the best reported for kernel methods.
– Competitive with HMM/DNN systems.
– New record: 16.7% with CNNs (ICASSP 2014).
– Only two hyperparameters: σ and s (early stopping is the regularizer).
– Z ≈ 6.4TB, C ≈ 1.2TB; materialized in blocks, used, and discarded on-the-fly, in parallel.
Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Figure: the data (Y_i, X_i) is partitioned by rows across MPI ranks (node 1, 2, 3, . . .). Each rank materializes its blocks Z_ij of the random feature matrix on the fly via the transform T[X_i, j], using T cores/OpenMP threads per rank. Each iteration combines a local proximal step prox_l on the loss, a graph projection proj_{Z_ij} per block, a reduce of the local models W_ij to node 0, a proximal step prox_r on the regularizer, and a broadcast of the updated model W back to all ranks.]
Scalability

Graph projection (per block Z_ij):

U = (Z_ij^T Z_ij + λI)^{−1} (X + Z_ij^T Y),   V = Z_ij U

(the inverse factor is cached; the remaining products are gemm calls)

High-performance implementation that can handle large column splits C = κs/d by reorganizing updates to exploit shared-memory access and the structure of the graph projection.
– Memory: (nm/R)(T + 5) + nd/R + 5sm + (1/κ)(Tnd/R + Tmd + sd) + κ·nms/(dR)
– Computation: O(nd²/(TRκ)) [transform] + O(smd/(κT)) [graph projection] + O(nsm/(TR)) [gemm]
– Communication: O(sm log R) (model reduce/broadcast)

Stochastic, asynchronous versions may be possible to develop.
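A serial MATLAB sketch of the cached graph-projection step above, for a single feature block Z (n-by-s) with iterates X (s-by-m) and Y (n-by-m); the Cholesky-based caching is an assumed implementation choice, not necessarily what libSkylark does:

% Graph projection onto {(U, V) : V = Z*U}, with the factor cached.
R = chol(Z'*Z + lambda*eye(size(Z,2)));   % computed once: O(ns^2 + s^3)
U = R\(R'\(X + Z'*Y));                    % two triangular solves per iteration
V = Z*U;                                  % gemm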
Randomized Kernel methods on thousands of cores

[Figure: MNIST strong-scaling and weak-scaling experiments on Triloka and BG/Q, plotting speedup, training time (secs), and classification accuracy (%) against the number of MPI processes (t = 6 threads/process on Triloka, t = 64 threads/process on BG/Q), with measured speedup tracking the ideal curve.]

Triloka: 20 nodes/16 cores per node; BG/Q: 1024 nodes/16 (×4) cores per node.
s = 100K, C = 200; strong scaling (n = 250k), weak scaling (n = 250k per node).
Comparisons on MNIST

See "no distortion" results at http://yann.lecun.com/exdb/mnist/

Method                                             Test error (%)   Reference
Gaussian kernel                                    1.4
Poly(4) kernel                                     1.1
3-layer NN, 500+300 HU                             1.53             Hinton, 2005
LeNet5                                             0.8              LeCun et al., 1998
Large CNN (pretrained)                             0.60             Ranzato et al., NIPS 2006
Large CNN (pretrained)                             0.53             Jarrett et al., ICCV 2009
Scattering transform + Gaussian kernel             0.4              Bruna and Mallat, 2012
20000 random features                              0.45             my experiments
5000 random features                               0.52             my experiments
Committee of 35 conv. nets [elastic distortions]   0.23             Ciresan et al., 2012

When similar prior knowledge is enforced (invariant learning), performance gaps vanish.

RKHS mappings can, in principle, be used to implement CNNs.
Efficiency of Random Embeddings

TIMIT: 58.8M parameters (s = 400k, m = 147) vs 19.9M parameters for the DNN.

Draw S = {w_1 . . . w_s} ∼ p and approximate the integral:

k(x, z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw ≈ (1/|S|) Σ_{w∈S} e^{−i(x−z)^T w}   (6)

Integration error:

ε_{p,S}[f] = ∫_{[0,1]^d} f(x) p(x) dx − (1/s) Σ_{w∈S} f(w)

Monte Carlo convergence rate: E[ε_{p,S}[f]²]^{1/2} ∝ s^{−1/2}
– a 4-fold increase in s will only cut the error in half.

Can we do better with a different sequence S?
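A small MATLAB experiment (an illustration, not from the slides) makes the s^{−1/2} rate visible: the error of the random Fourier estimate of a Gaussian kernel value roughly halves each time s is quadrupled.

% Monte Carlo rate check for random Fourier features.
d = 10; sigma = 1;
x = randn(d,1); z = randn(d,1);
exact = exp(-norm(x - z)^2/(2*sigma^2));
for s = [100 400 1600 6400]
    err = zeros(50,1);
    for trial = 1:50                          % average over random draws
        W = randn(d, s)/sigma;
        err(trial) = abs(mean(cos((x - z)'*W)) - exact);
    end
    fprintf('s = %5d, mean abs error = %.4f\n', s, mean(err));
end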
Quasi-Monte Carlo Sequences: Intuition

Weyl, 1916; Koksma, 1942; Dick et al., 2013; Caflisch, 1998

Consider approximating ∫_{[0,1]²} f(x) dx with (1/s) Σ_{w∈S} f(w).

[Figure: two point sets in [0,1]²: a uniform (pseudo-random) sample and a Halton point set, which covers the square more evenly.]

Deterministic, correlated QMC sampling avoids the clustering and clumping effects seen in MC point sets.

Hierarchical structure: coarse-to-fine sampling as s increases.
Star Discrepancy

Integration error depends on the variation of f and the uniformity of S.

Theorem (Koksma-Hlawka inequality (1941, 1961)):

ε_S[f] ≤ D*(S) V_{HK}[f],   where

D*(S) = sup_{x ∈ [0,1]^d} | vol(J_x) − |{i : w_i ∈ J_x}|/s |

(Here J_x denotes the anchored box [0, x): the star discrepancy measures the worst-case gap between the volume of J_x and the fraction of points of S falling in it.)
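In d = 1 the supremum can be computed exactly from the sorted points (a standard formula); a self-contained MATLAB sketch, here applied to uniform random points:

% Exact star discrepancy of a 1-D point set.
s = 100;
w = sort(rand(s, 1));
i = (1:s)';
Dstar = max(max(i/s - w, w - (i-1)/s))    % worst gap over anchored intervals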
Quasi-Monte Carlo Sequences

We seek sequences with low discrepancy.

Theorem (Roth, 1954): D*(S) ≥ c_d (log s)^{(d−1)/2} / s.

For the regular grid on [0, 1)^d with s = m^d points, D* ≥ s^{−1/d} ⟹ optimal only for d = 1.

Theorem (Doerr, 2013): Let S with |S| = s be chosen uniformly at random from [0, 1)^d. Then E[D*(S)] ≥ √d/√s - the Monte Carlo rate.

There exist low-discrepancy sequences that achieve a C_d (log s)^d / s bound, conjectured to be optimal.
– Matlab QMC generators: haltonset, sobolset, . . .

With d fixed, this bound actually grows until s ∼ e^d, reaching (d/e)^d!
– "On the unreasonable effectiveness of QMC" in high dimensions.
– RKHSs and kernels show up in modern QMC analysis!
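For reference, minimal usage of the MATLAB generators named above (Statistics and Machine Learning Toolbox); the dimension and point counts are arbitrary:

% First 5 points of 3-dimensional Halton and Sobol' sequences.
Ph = haltonset(3, 'Skip', 1);    % skip the initial all-zeros point
ph = net(Ph, 5)
Ps = sobolset(3, 'Skip', 1);
ps = net(Ps, 5)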
How do standard QMC sequences perform?

Compare K ≈ Z(X) Z(X)^*, where Z(X) = e^{−iXG} with G drawn from a QMC sequence generator instead.

[Figure: relative error on ‖K‖₂ against the number of random features, on USPST (n = 1506) and CPU (n = 6554), for MC, Halton, Sobol', Digital net, and Lattice sequences.]

QMC sequences are consistently better.

Why are some QMC sequences better than others, e.g., Halton over Sobol'?

Can we learn sequences even better adapted to our problem class?
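A scaled-down MATLAB sketch of this comparison on synthetic data (assumes the Statistics Toolbox for haltonset, norminv, and pdist2); Halton points in [0, 1)^d are pushed through the Gaussian inverse CDF to sample the spectral density:

% Kernel approximation error: MC vs QMC (Halton) Fourier features.
n = 500; d = 10; sigma = 1; s = 400;
X = randn(n, d);
K = exp(-pdist2(X, X).^2/(2*sigma^2));    % exact Gaussian Gram matrix
G = cell(2,1); labels = {'MC', 'Halton'};
G{1} = randn(d, s)/sigma;                 % MC frequencies
P = haltonset(d, 'Skip', 1000);
G{2} = norminv(net(P, s))'/sigma;         % QMC frequencies, d-by-s
for j = 1:2
    Z = exp(1i*X*G{j})/sqrt(s);           % feature map, n-by-s
    fprintf('%-6s relative error on ||K||_2: %.4f\n', ...
        labels{j}, norm(K - real(Z*Z'))/norm(K));
end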
RKHSs in QMC Theory

"Nice" integrands: f ∈ H_h, where h(·, ·) is a kernel function.

By the reproducing property and the Cauchy-Schwarz inequality,

| ∫_{R^d} f(x) p(x) dx − (1/s) Σ_{w∈S} f(w) | ≤ ‖f‖_{H_h} D_{h,p}(S)   (7)

where D_{h,p} is a discrepancy measure,

D_{h,p}(S)² = ‖ m_p(·) − (1/s) Σ_{w∈S} h(w, ·) ‖²_{H_h},

defined via the mean embedding m_p = ∫_{R^d} h(x, ·) p(x) dx.
Box Discrepancy

Assume that the (shifted) data lives in a box:

□_b = {Δ : −b ≤ Δ ≤ b},   with x − z ∈ □_b for all x, z ∈ X.

Class of functions we want to integrate:

F_b = { f(w) = e^{−iΔ^T w}, Δ ∈ □_b }

Theorem: The integration error is proportional to the "box discrepancy":

E_{f ∼ U(F_b)} [ ε_{S,p}[f]² ]^{1/2} ∝ D□_p(S)   (9)

where D□_p(S) is the discrepancy associated with the sinc kernel

sinc_b(u, v) = π^{−d} Π_{j=1}^d sin(b_j(u_j − v_j)) / (u_j − v_j)

Can be evaluated in closed form for the Gaussian density.
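A direct MATLAB transcription of the sinc kernel above (a sketch; u, v, b are d-vectors, and coordinates with u_j = v_j use the limit of sin(b_j t)/t as t → 0):

function k = sincb(u, v, b)
% Box-discrepancy sinc kernel between frequency points u and v.
t = u - v;
f = sin(b.*t)./t;
f(t == 0) = b(t == 0);            % limit of sin(b_j*t)/t as t -> 0 is b_j
k = pi^(-numel(u)) * prod(f);
end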
Does Box discrepancy explain behaviour of QMC sequences?

[Figure: D□(S)² against the number of samples, on CPU (d = 21), for Digital net, MC (expected), Halton, Sobol', and Lattice sequences.]
Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

S* = argmin_{S = (w_1 . . . w_s) ∈ R^{ds}} D□(S),   S^(0) = Halton.   (10)

[Figure: on the CPU dataset (s = 100), the normalized D□(S)², the maximum squared error, the mean squared error, and ‖K̂ − K‖₂/‖K‖₂, tracked over the optimization iterations.]

However, the full impact on large-scale problems is an open research topic.
Synergies?
Randomization-vs-Optimization

[Figure: eigenfunctions eig_1(x), . . . , eig_5(x) of a kernel k(x, z), illustrating the correspondence: randomization (sampling from p) ⇔ optimization (over the kernel k).]

Jarrett et al., 2009, What is the Best Multi-Stage Architecture for Object Recognition?: "The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance."

On Random Weights and Unsupervised Feature Learning, Saxe et al., ICML 2011: "surprising fraction of performance can be attributed to architecture alone."
Deep Learning with Kernels?

Maps across layers can be parameterized using more general nonlinearities (kernels).

[Figure: image pixels x_1 are mapped by f_1 ∈ H_{k_1} over patches Ω_1 to a feature map x_2, which is mapped in turn by f_2 ∈ H_{k_2} over patches Ω_2. Figure adapted from Mairal et al., NIPS 2014.]

Mathematics of Neural Response, Smale et al., FoCM (2010).
Convolutional Kernel Networks, Mairal et al., NIPS 2014.
SimNets: A Generalization of Conv. Nets, Cohen and Shashua, 2014.
– learns networks 1/8 the size of comparable ConvNets.
Conclusions

Some empirical evidence suggests that once kernel methods are scaled up and embody similar statistical principles, they are competitive with Deep Neural Networks.
– Randomization and distributed computation are both required.
– Ideas from QMC integration techniques are promising.

Opportunities exist for designing new algorithms that combine insights from deep learning with the generality and mathematical richness of kernel methods.