Kernels, Random Embeddings and Deep Learning
Vikas Sindhwani
IBM Research, NY
October 28, 2014
Acknowledgements
At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan
Summer Interns: Jiyan Yang (Stanford), Po-Sen Huang (UIUC); Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)
IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)
Setting

Given labeled data in the form of input-output pairs

{(x_i, y_i)}_{i=1}^n,   x_i ∈ X ⊂ R^d,   y_i ∈ Y ⊂ R^m,

estimate the unknown dependency f : X → Y.

Regularized Risk Minimization in a suitable hypothesis space H:

argmin_{f ∈ H}  Σ_{i=1}^n V(f(x_i), y_i) + Ω(f)

Large n ⟹ big models: H "rich"/non-parametric/nonlinear.

Two great ML traditions around choosing H:
– Deep Neural Networks: f(x) = s_n(. . . s_2(W_2 s_1(W_1 x)) . . .)
– Kernel Methods: a general nonlinear function space generated by a kernel function k(x, z) on X × X.

This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.
Outline

Motivation and Background

Scalable Kernel Methods
– Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
– libSkylark: an open source software stack
– Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?
Deep Learning is "Supercharging" Machine Learning

Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.

Many statistical and computational ingredients:
– Large datasets (ILSVRC since 2010)
– Large statistical capacity (1.2M images, 60M params)
– Distributed computation
– Depth, invariant feature learning (transferable to other tasks)
– Engineering: Dropout, ReLU . . .

Very active area in Speech and Natural Language Processing.
Machine Learning in 1990s

Convolutional Neural Networks (Fukushima, 1980; LeCun et al., 1989)
– 3 days to train on USPS (n = 7291; digit recognition) on a Sun SPARCstation 1 (33MHz clock speed, 64MB RAM)

Personal history:
– 1998: First ML experiment - train a DNN on the UCI Wine dataset.
– 1999: Introduced to Kernel Methods - by DNN researchers!
– 2003-4: NN paper at JMLR rejected; accepted in IEEE Trans. Neural Nets with a kernel methods section!

Why Kernel Methods?
– Local-minima free - stronger role of Convex Optimization.
– Theoretically appealing.
– Handle non-vectorial and high-dimensional data.
– Easier model selection via continuous optimization.
– Matched NNs in many cases, although they didn't scale as well w.r.t. n.

So what changed?
– More data, parallel algorithms, hardware? Better DNN training? . . .
Kernel Methods and Neural Networks

From the Geoff Hinton facts meme maintained at http://yann.lecun.com/ex/fun/:

– All kernels that ever dared approaching Geoff Hinton woke up convolved.
– The only kernel Geoff Hinton has ever used is a kernel of truth.
– If you defy Geoff Hinton, he will maximize your entropy in no time. Your free energy will be gone even before you reach equilibrium.

Are there synergies between these fields towards the design of even better (faster and more accurate) algorithms?
The Mathematical Naturalness of Kernel Methods

Data x ∈ X ⊂ R^d; models f ∈ H, f : X → R.

Geometry in H: inner product ⟨·, ·⟩_H, norm ‖·‖_H (Hilbert spaces).

Theorem: All "nice" Hilbert spaces are generated by a symmetric positive definite function (the kernel) k(x, z) on X × X.
– If f, g ∈ H are close, i.e. ‖f − g‖_H is small, then f(x) and g(x) are close ∀x ∈ X.
– Reproducing Kernel Hilbert Spaces (RKHSs)

Functional Analysis (Aronszajn, Bergman (1950s)); Statistics (Parzen (1960s)); PDEs; Numerical Analysis . . .
– ML: nonlinear classification, regression, clustering, time-series analysis, dynamical systems, hypothesis testing, causal modeling, . . .

In principle, it is possible to compose Deep Learning pipelines using more general nonlinear functions drawn from RKHSs.
Scalable Kernel Methods
Scalability Challenges for Kernel Methods

f* = argmin_{f ∈ H_k}  (1/n) Σ_{i=1}^n V(y_i, f(x_i)) + λ‖f‖²_{H_k},   x_i ∈ R^d

Representer Theorem: f*(x) = Σ_{i=1}^n α_i k(x, x_i)

Regularized Least Squares: (K + λI)α = Y
– O(n²) storage
– O(n³ + n²d) training
– O(nd) test speed

Hard to parallelize when working directly with K_ij = k(x_i, x_j).
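To make these costs concrete, here is a minimal MATLAB sketch of exact kernel ridge regression with a Gaussian kernel (an illustration, not from the slides); X (n-by-d), y, xtest, sigma, and lambda are assumed inputs, and pdist2 requires the Statistics Toolbox:

% Exact kernel ridge regression (Gaussian kernel).
K = exp(-pdist2(X, X).^2/(2*sigma^2));    % n-by-n Gram matrix: O(n^2) storage
alpha = (K + lambda*eye(size(X,1)))\y;    % dense solve: O(n^3) training
Ktest = exp(-pdist2(xtest, X).^2/(2*sigma^2));
ypred = Ktest*alpha;                      % O(nd) work per test point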
Randomized Algorithms

Explicit approximate feature map Ψ : R^d → C^s such that

k(x, z) ≈ ⟨Ψ(x), Ψ(z)⟩_{C^s}   ⟹   (Z(X)^T Z(X) + λI) w = Z(X)^T Y
– O(ns) storage
– O(ns²) training
– O(s) test speed

Interested in data-oblivious maps that depend only on the kernel function, not on the data.
– Should be very cheap to apply and parallelizable.
Random Fourier Features (Rahimi & Recht, 2007)

Theorem [Bochner, 1930/33]: There is a one-to-one Fourier-pair correspondence between any (normalized) shift-invariant kernel k and a density p such that

k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw

– Gaussian kernel: k(x, z) = e^{−‖x−z‖²₂/(2σ²)} ⟺ p = N(0, σ^{−2} I_d)
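As a sanity check (not from the slides), the correspondence can be verified numerically: averaging e^{−i(x−z)^T w} over samples w ∼ N(0, σ^{−2} I_d) recovers the Gaussian kernel value. The choices of d, sigma, and s below are arbitrary:

% Numerical check of Bochner's theorem for the Gaussian kernel.
d = 5; sigma = 2; s = 1e5;
x = randn(d,1); z = randn(d,1);
W = randn(d, s)/sigma;               % s samples from p = N(0, sigma^-2 I_d)
approx = mean(cos((x - z)'*W));      % imaginary parts cancel in expectation
exact = exp(-norm(x - z)^2/(2*sigma^2));
fprintf('exact %.4f, MC estimate %.4f\n', exact, approx);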
DNNs vs Kernel Methods on TIMIT (Speech)
Joint work with the IBM Speech Group and P. Huang: can "shallow", convex, randomized kernel methods "match" DNNs? (Predicting HMM states given a short window of coefficients representing acoustic input.)
G = randn(size(X,2), s)/sigma;          % frequencies w ~ N(0, sigma^-2 I_d)
Z = exp(1i*X*G);                        % n-by-s random Fourier feature map
C = Z'*Z;                               % s-by-s
alpha = (C + lambda*eye(s))\(Z'*y(:));  % regularized least squares
ztest = exp(1i*xtest*G)*alpha;          % predictions on test inputs
DNNs vs Kernel Methods on TIMIT (Speech)

Kernel Methods Match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech group.
High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

– ∼2 hours on 256 IBM BlueGene/Q nodes.
– Phone error rate of 21.3% - the best reported for kernel methods.
– Competitive with HMM/DNN systems.
– New record: 16.7% with CNNs (ICASSP 2014).
– Only two hyperparameters: σ and s (early stopping is the regularizer).
– Z ≈ 6.4TB, C ≈ 1.2TB; materialized in blocks, used, and discarded on-the-fly, in parallel.
Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Figure: the data (Y_i, X_i) is partitioned by rows across MPI ranks (node 1, 2, 3, . . .). Each rank materializes its blocks Z_ij of the random feature matrix on the fly via the transform T[X_i, j], using T cores/OpenMP threads per rank. Each iteration combines a local proximal step prox_l on the loss, a graph projection proj_{Z_ij} per block, a reduce of the local models W_ij to node 0, a proximal step prox_r on the regularizer, and a broadcast of the updated model W back to all ranks.]
Scalability

Graph projection (per block Z_ij):

U = (Z_ij^T Z_ij + λI)^{−1} (X + Z_ij^T Y),   V = Z_ij U

(the inverse factor is cached; the remaining products are gemm calls)

High-performance implementation that can handle large column splits C = κs/d by reorganizing updates to exploit shared-memory access and the structure of the graph projection.
– Memory: (nm/R)(T + 5) + nd/R + 5sm + (1/κ)(Tnd/R + Tmd + sd) + κ·nms/(dR)
– Computation: O(nd²/(TRκ)) [transform] + O(smd/(κT)) [graph projection] + O(nsm/(TR)) [gemm]
– Communication: O(sm log R) (model reduce/broadcast)

Stochastic, asynchronous versions may be possible to develop.
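A serial MATLAB sketch of the cached graph-projection step above, for a single feature block Z (n-by-s) with iterates X (s-by-m) and Y (n-by-m); the Cholesky-based caching is an assumed implementation choice, not necessarily what libSkylark does:

% Graph projection onto {(U, V) : V = Z*U}, with the factor cached.
R = chol(Z'*Z + lambda*eye(size(Z,2)));   % computed once: O(ns^2 + s^3)
U = R\(R'\(X + Z'*Y));                    % two triangular solves per iteration
V = Z*U;                                  % gemm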
Randomized Kernel methods on thousands of cores

[Figure: MNIST strong-scaling and weak-scaling experiments on Triloka and BG/Q, plotting speedup, training time (secs), and classification accuracy (%) against the number of MPI processes (t = 6 threads/process on Triloka, t = 64 threads/process on BG/Q), with measured speedup tracking the ideal curve.]

Triloka: 20 nodes/16 cores per node; BG/Q: 1024 nodes/16 (×4) cores per node.
s = 100K, C = 200; strong scaling (n = 250k), weak scaling (n = 250k per node).
Comparisons on MNIST

See "no distortion" results at http://yann.lecun.com/exdb/mnist/

Method                                             Test error (%)   Reference
Gaussian kernel                                    1.4
Poly(4) kernel                                     1.1
3-layer NN, 500+300 HU                             1.53             Hinton, 2005
LeNet5                                             0.8              LeCun et al., 1998
Large CNN (pretrained)                             0.60             Ranzato et al., NIPS 2006
Large CNN (pretrained)                             0.53             Jarrett et al., ICCV 2009
Scattering transform + Gaussian kernel             0.4              Bruna and Mallat, 2012
20000 random features                              0.45             my experiments
5000 random features                               0.52             my experiments
Committee of 35 conv. nets [elastic distortions]   0.23             Ciresan et al., 2012

When similar prior knowledge is enforced (invariant learning), performance gaps vanish.

RKHS mappings can, in principle, be used to implement CNNs.
Efficiency of Random Embeddings

TIMIT: 58.8M parameters (s = 400k, m = 147) vs 19.9M parameters for the DNN.

Draw S = {w_1 . . . w_s} ∼ p and approximate the integral:

k(x, z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw ≈ (1/|S|) Σ_{w∈S} e^{−i(x−z)^T w}   (6)

Integration error:

ε_{p,S}[f] = ∫_{[0,1]^d} f(x) p(x) dx − (1/s) Σ_{w∈S} f(w)

Monte Carlo convergence rate: E[ε_{p,S}[f]²]^{1/2} ∝ s^{−1/2}
– a 4-fold increase in s will only cut the error in half.

Can we do better with a different sequence S?
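A small MATLAB experiment (an illustration, not from the slides) makes the s^{−1/2} rate visible: the error of the random Fourier estimate of a Gaussian kernel value roughly halves each time s is quadrupled.

% Monte Carlo rate check for random Fourier features.
d = 10; sigma = 1;
x = randn(d,1); z = randn(d,1);
exact = exp(-norm(x - z)^2/(2*sigma^2));
for s = [100 400 1600 6400]
    err = zeros(50,1);
    for trial = 1:50                          % average over random draws
        W = randn(d, s)/sigma;
        err(trial) = abs(mean(cos((x - z)'*W)) - exact);
    end
    fprintf('s = %5d, mean abs error = %.4f\n', s, mean(err));
end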
Quasi-Monte Carlo Sequences: Intuition

Weyl, 1916; Koksma, 1942; Dick et al., 2013; Caflisch, 1998

Consider approximating ∫_{[0,1]²} f(x) dx with (1/s) Σ_{w∈S} f(w).

[Figure: two point sets in [0,1]²: a uniform (pseudo-random) sample and a Halton point set, which covers the square more evenly.]

Deterministic, correlated QMC sampling avoids the clustering and clumping effects seen in MC point sets.

Hierarchical structure: coarse-to-fine sampling as s increases.
Star Discrepancy

Integration error depends on the variation of f and the uniformity of S.

Theorem (Koksma-Hlawka inequality (1941, 1961)):

ε_S[f] ≤ D*(S) V_{HK}[f],   where

D*(S) = sup_{x ∈ [0,1]^d} | vol(J_x) − |{i : w_i ∈ J_x}|/s |

(Here J_x denotes the anchored box [0, x): the star discrepancy measures the worst-case gap between the volume of J_x and the fraction of points of S falling in it.)
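In d = 1 the supremum can be computed exactly from the sorted points (a standard formula); a self-contained MATLAB sketch, here applied to uniform random points:

% Exact star discrepancy of a 1-D point set.
s = 100;
w = sort(rand(s, 1));
i = (1:s)';
Dstar = max(max(i/s - w, w - (i-1)/s))    % worst gap over anchored intervals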
Quasi-Monte Carlo Sequences

We seek sequences with low discrepancy.

Theorem (Roth, 1954): D*(S) ≥ c_d (log s)^{(d−1)/2} / s.

For the regular grid on [0, 1)^d with s = m^d points, D* ≥ s^{−1/d} ⟹ optimal only for d = 1.

Theorem (Doerr, 2013): Let S with |S| = s be chosen uniformly at random from [0, 1)^d. Then E[D*(S)] ≥ √d/√s - the Monte Carlo rate.

There exist low-discrepancy sequences that achieve a C_d (log s)^d / s bound, conjectured to be optimal.
– Matlab QMC generators: haltonset, sobolset, . . .

With d fixed, this bound actually grows until s ∼ e^d, reaching (d/e)^d!
– "On the unreasonable effectiveness of QMC" in high dimensions.
– RKHSs and kernels show up in modern QMC analysis!
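For reference, minimal usage of the MATLAB generators named above (Statistics and Machine Learning Toolbox); the dimension and point counts are arbitrary:

% First 5 points of 3-dimensional Halton and Sobol' sequences.
Ph = haltonset(3, 'Skip', 1);    % skip the initial all-zeros point
ph = net(Ph, 5)
Ps = sobolset(3, 'Skip', 1);
ps = net(Ps, 5)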
How do standard QMC sequences perform?

Compare K ≈ Z(X) Z(X)^*, where Z(X) = e^{−iXG} with G drawn from a QMC sequence generator instead.

[Figure: relative error on ‖K‖₂ against the number of random features, on USPST (n = 1506) and CPU (n = 6554), for MC, Halton, Sobol', Digital net, and Lattice sequences.]

QMC sequences are consistently better.

Why are some QMC sequences better than others, e.g., Halton over Sobol'?

Can we learn sequences even better adapted to our problem class?
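A scaled-down MATLAB sketch of this comparison on synthetic data (assumes the Statistics Toolbox for haltonset, norminv, and pdist2); Halton points in [0, 1)^d are pushed through the Gaussian inverse CDF to sample the spectral density:

% Kernel approximation error: MC vs QMC (Halton) Fourier features.
n = 500; d = 10; sigma = 1; s = 400;
X = randn(n, d);
K = exp(-pdist2(X, X).^2/(2*sigma^2));    % exact Gaussian Gram matrix
G = cell(2,1); labels = {'MC', 'Halton'};
G{1} = randn(d, s)/sigma;                 % MC frequencies
P = haltonset(d, 'Skip', 1000);
G{2} = norminv(net(P, s))'/sigma;         % QMC frequencies, d-by-s
for j = 1:2
    Z = exp(1i*X*G{j})/sqrt(s);           % feature map, n-by-s
    fprintf('%-6s relative error on ||K||_2: %.4f\n', ...
        labels{j}, norm(K - real(Z*Z'))/norm(K));
end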
RKHSs in QMC Theory

"Nice" integrands: f ∈ H_h, where h(·, ·) is a kernel function.

By the reproducing property and the Cauchy-Schwarz inequality,

| ∫_{R^d} f(x) p(x) dx − (1/s) Σ_{w∈S} f(w) | ≤ ‖f‖_{H_h} D_{h,p}(S)   (7)

where D_{h,p} is a discrepancy measure,

D_{h,p}(S)² = ‖ m_p(·) − (1/s) Σ_{w∈S} h(w, ·) ‖²_{H_h},

defined via the mean embedding m_p = ∫_{R^d} h(x, ·) p(x) dx.
Box Discrepancy

Assume that the (shifted) data lives in a box:

□_b = {Δ : −b ≤ Δ ≤ b},   with x − z ∈ □_b for all x, z ∈ X.

Class of functions we want to integrate:

F_b = { f(w) = e^{−iΔ^T w}, Δ ∈ □_b }

Theorem: The integration error is proportional to the "box discrepancy":

E_{f ∼ U(F_b)} [ ε_{S,p}[f]² ]^{1/2} ∝ D□_p(S)   (9)

where D□_p(S) is the discrepancy associated with the sinc kernel

sinc_b(u, v) = π^{−d} Π_{j=1}^d sin(b_j(u_j − v_j)) / (u_j − v_j)

Can be evaluated in closed form for the Gaussian density.
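A direct MATLAB transcription of the sinc kernel above (a sketch; u, v, b are d-vectors, and coordinates with u_j = v_j use the limit of sin(b_j t)/t as t → 0):

function k = sincb(u, v, b)
% Box-discrepancy sinc kernel between frequency points u and v.
t = u - v;
f = sin(b.*t)./t;
f(t == 0) = b(t == 0);            % limit of sin(b_j*t)/t as t -> 0 is b_j
k = pi^(-numel(u)) * prod(f);
end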
Does Box discrepancy explain behaviour of QMC sequences?

[Figure: D□(S)² against the number of samples, on CPU (d = 21), for Digital net, MC (expected), Halton, Sobol', and Lattice sequences.]
Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

S* = argmin_{S = (w_1 . . . w_s) ∈ R^{ds}} D□(S),   S^(0) = Halton.   (10)

[Figure: on the CPU dataset (s = 100), the normalized D□(S)², the maximum squared error, the mean squared error, and ‖K̂ − K‖₂/‖K‖₂, tracked over the optimization iterations.]

However, the full impact on large-scale problems is an open research topic.
Synergies?
Randomization-vs-Optimization

[Figure: eigenfunctions eig_1(x), . . . , eig_5(x) of a kernel k(x, z), illustrating the correspondence: randomization (sampling from p) ⇔ optimization (over the kernel k).]

Jarrett et al., 2009, What is the Best Multi-Stage Architecture for Object Recognition?: "The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance."

On Random Weights and Unsupervised Feature Learning, Saxe et al., ICML 2011: "surprising fraction of performance can be attributed to architecture alone."
Deep Learning with Kernels?

Maps across layers can be parameterized using more general nonlinearities (kernels).

[Figure: image pixels x_1 are mapped by f_1 ∈ H_{k_1} over patches Ω_1 to a feature map x_2, which is mapped in turn by f_2 ∈ H_{k_2} over patches Ω_2. Figure adapted from Mairal et al., NIPS 2014.]

Mathematics of Neural Response, Smale et al., FoCM (2010).
Convolutional Kernel Networks, Mairal et al., NIPS 2014.
SimNets: A Generalization of Conv. Nets, Cohen and Shashua, 2014.
– learns networks 1/8 the size of comparable ConvNets.
Conclusions

Some empirical evidence suggests that once kernel methods are scaled up and embody similar statistical principles, they are competitive with Deep Neural Networks.
– Randomization and distributed computation are both required.
– Ideas from QMC integration techniques are promising.

Opportunities exist for designing new algorithms that combine insights from deep learning with the generality and mathematical richness of kernel methods.