Coresets, Discrepancy, and Sketches in Machine Learning
Edo Liberty – Research Director, Amazon
Zohar Karnin – Principal Scientist, Amazon
What is a Coreset?

A coreset replaces the original data $x_i$ with a small weighted subset $\{(x_i, w_i)\}_{i \in S}$ such that, for every query $q$,

$$F(q) = \sum_i f(x_i, q) \quad \text{is approximated by} \quad \tilde{F}(q) = \sum_{i \in S} w_i f(x_i, q)$$
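To make this concrete, here is a minimal numerical sketch (mine, not from the slides); the kernel $f$, the uniform subsample, and the weights $n/|S|$ are illustrative assumptions only.

```python
import numpy as np

# Toy coreset: approximate F(q) = sum_i f(x_i, q) by a weighted subset.
# f(x, q) = exp(-(x - q)^2) is an arbitrary example; any query function works.
def f(x, q):
    return np.exp(-(x - q) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                      # original data x_1..x_n
S = rng.choice(len(x), size=100, replace=False)  # coreset indices (uniform sample)
w = np.full(len(S), len(x) / len(S))             # each kept point stands in for n/|S|

q = 0.5
F_full = f(x, q).sum()               # F(q)  over all n points
F_core = (w * f(x[S], q)).sum()      # F~(q) over the 100-point coreset
print(F_full, F_core)                # the two should be close
```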
Approximate CDF

[Figure: the empirical CDF of points $x_1, \ldots, x_n$, plotted as $F(q)$ against $q$.]

The (empirical) CDF is given by

$$F(q) = \sum_i f(x_i, q), \qquad f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$$
Approximate CDF

[Figure: $F(q)$ together with an approximation $\tilde{F}(q)$ staying within a band of width $\varepsilon n$.]

An approximate CDF is an $\tilde{F}(q)$ with $|\tilde{F}(q) - F(q)| \le \varepsilon n$.
Approximate CDF

[Figure: a staircase $\tilde{F}(q)$ jumping in steps of $\varepsilon n$.]

There is a trivial coreset of size $1/\varepsilon$: keep every $\lceil \varepsilon n \rceil$-th point in sorted order, each with weight $\varepsilon n$.
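A runnable sketch of this construction (the "every $\lceil \varepsilon n \rceil$-th point in sorted order" reading of the figure is my interpretation):

```python
import numpy as np

def trivial_cdf_coreset(x, eps):
    """Keep every ceil(eps*n)-th point of sorted x, each with weight ceil(eps*n).
    Roughly 1/eps points; the weighted CDF stays within eps*n of the true one."""
    x = np.sort(x)
    step = int(np.ceil(eps * len(x)))
    idx = np.arange(step - 1, len(x), step)
    return x[idx], np.full(len(idx), float(step))

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
pts, w = trivial_cdf_coreset(x, eps=0.01)         # ~100 points

qs = np.linspace(-4, 4, 400)                      # query grid approximating max_q
F  = (x[None, :]    < qs[:, None]).sum(axis=1)    # F(q)  = #{x_i < q}
Ft = ((pts[None, :] < qs[:, None]) * w).sum(axis=1)
print(np.abs(Ft - F).max(), "<=", 0.01 * len(x))  # max error vs eps*n
```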
Approximate CDF

[Figure: $F(q)$ and $\tilde{F}(q)$ after one halving step.]

One way to get there is like this: keep half the points (a subset $S$) and double their weight,

$$\tilde{F}(q) = \sum_{i \in S} 2 f(x_i, q)$$
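A sketch of one halving step; taking every other point of the sorted data as $S$ is my reading of the figure:

```python
import numpy as np

def halve(x):
    """One halving step: sort, keep every other point (it will carry weight 2).
    In sorted order the effective signs +1 (kept) and -1 (dropped) alternate,
    so the error F~(q) - F(q) telescopes and never exceeds 1."""
    return np.sort(x)[1::2]

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
kept = halve(x)                                    # 5,000 points, weight 2 each

qs = np.linspace(-4, 4, 500)
F  = (x[None, :]    < qs[:, None]).sum(axis=1)
Ft = 2 * (kept[None, :] < qs[:, None]).sum(axis=1)
print(np.abs(Ft - F).max())                        # prints 1 (or 0)
```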
Approximate CDF

[Figure: the signed error of the halving step.]

The discrepancy of this step is

$$\tilde{F}(q) - F(q) = \sum_i \sigma_i f(x_i, q) \le 1,$$

with $\sigma_i = +1$ for kept points (which contribute weight $2 - 1$) and $\sigma_i = -1$ for dropped ones. Since $S$ takes every other point in sorted order, the signs alternate and the sum telescopes to at most $1$.
Question:

For the function

$$f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$$

1) We have that $\min_\sigma \max_q \left| \sum_i \sigma_i f(x_i, q) \right| \le 1$
2) We have a coreset of size $1/\varepsilon$

Does this generalize?
Answer: Yes
Definition: Class Discrepancy

$$D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$

Lemma: For any function whose Class Discrepancy is $D_m = O(c/m)$:
• Its coreset complexity is $O(c/\varepsilon)$
• Its streaming coreset complexity is $O\big(c/\varepsilon \cdot \log^2(\varepsilon n / c)\big)$
• Its randomized streaming coreset complexity is $O\big(c/\varepsilon \cdot \log^2 \log(|Q_\varepsilon|/\delta)\big)$
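The streaming bounds come from compacting repeatedly as data arrives. Below is a hedged sketch in the CDF setting, loosely patterned after the compactor hierarchy of Karnin, Lang, and Liberty (FOCS 2016); the fixed capacity and deterministic halving are simplifications for illustration, not the actual algorithm.

```python
import numpy as np

class CompactorSketch:
    """Simplified streaming quantile coreset. Level h holds points of weight
    2**h; when a level fills, it is sorted and halved, pushing every other
    point one level up. Real KLL-style sketches randomize the halving offset
    and shrink capacities geometrically; this sketch fixes both."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.levels = [[]]                   # levels[h]: items of weight 2**h

    def update(self, item):
        self.levels[0].append(item)
        for h in range(len(self.levels)):    # compact any full level
            if len(self.levels[h]) >= self.capacity:
                buf = sorted(self.levels[h])
                if h + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[h + 1].extend(buf[1::2])   # halving step
                self.levels[h] = []

    def cdf(self, q):
        """Estimate F(q) = #{x_i < q} from the weighted summary."""
        return sum(2 ** h * sum(1 for v in buf if v < q)
                   for h, buf in enumerate(self.levels))

rng = np.random.default_rng(0)
xs = rng.normal(size=50_000)
sk = CompactorSketch()
for v in xs:
    sk.update(v)
print(sk.cdf(0.0), (xs < 0.0).sum())         # estimate vs exact count
```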
Bounding the Class Discrepancy

[Figure: $f(x, q)$ as a function of $q$ for the two families below.]

$D_m = c/m$ for $f(x, q) = 1/(1 + e^{x - q})$ (sigmoid, scalar case)
$D_m = c/m$ for $f(x, q) = \exp(-(x - q)^2)$ (Gaussian, scalar case)
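Such bounds can be checked numerically. This brute-force estimate of $D_m$ for the sigmoid, with $\max_q$ approximated on a finite query grid, is my own illustration:

```python
import itertools
import numpy as np

# Brute-force D_m = min_sigma max_q (1/m)|sum_i sigma_i f(x_i, q)| for the
# sigmoid f(x, q) = 1/(1 + exp(x - q)); max_q is taken over a grid of queries.
rng = np.random.default_rng(0)
m = 12
x = rng.normal(size=m)
qs = np.linspace(-4, 4, 200)
fx = 1.0 / (1.0 + np.exp(x[None, :] - qs[:, None]))  # f(x_i, q) on the grid

D_m = min(np.abs(fx @ np.array(s)).max()
          for s in itertools.product([-1, 1], repeat=m)) / m
print(f"D_m ~ {D_m:.4f}   (compare to 1/m = {1/m:.4f})")
```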
Bounding the Class Discrepancy

$D_m = ?$ for $f(x, q) = \exp(-\|x - q\|^2)$ (Gaussian Kernel Density)
$D_m = ?$ for $f(x, q) = 1/(1 + e^{-\langle x, q \rangle})$ (Sigmoid Activation Regression)
Interesting Connection

Class Discrepancy:
$$D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$
Usually: $D_m = O(c/m)$
Governs: Coreset Complexity

Rademacher Complexity:
$$R_m = \mathbb{E}_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$
Usually: $R_m \approx O(c/\sqrt{m})$
Governs: Sample Complexity
Look at techniques for bounding the Rademacher Complexity for inspiration...
Bounding sums of vectors

[Figure: a ball of radius $r = \sqrt{d}$.]

For random signs, $\mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \approx \sqrt{n}$. But for the best signs (unit vectors $x_i$),

$$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \le \sqrt{d}$$

Does not depend on $n$. That's encouraging...
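A brute-force comparison (my own demo): for small $n$ we can enumerate all $2^n$ sign patterns.

```python
import itertools
import numpy as np

# Random signs give ||sum sigma_i x_i|| ~ sqrt(n); the best of all 2^n sign
# patterns gets below sqrt(d), independent of n. Brute force, so n stays small.
rng = np.random.default_rng(0)
n, d = 14, 3
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # unit vectors

random_avg = np.mean([np.linalg.norm(rng.choice([-1, 1], n) @ x)
                      for _ in range(1_000)])
best = min(np.linalg.norm(np.array(s) @ x)
           for s in itertools.product([-1, 1], repeat=n))

print(f"random signs: {random_avg:.2f}   (sqrt(n) = {np.sqrt(n):.2f})")
print(f"best signs:   {best:.2f}   (sqrt(d) = {np.sqrt(d):.2f})")
```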
Bounding sums of matrices

$$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i x_i^T \right\| = O(\sqrt{d})$$

Proof [due to Nikhil Bansal, private communication]: a clever application of Banaszczyk's theorem together with standard bounds on the spectral norm of random matrices. It's also tight.
Bounding sums of all tensor powers

Lemma [Karnin, L]: For any set of vectors $x_i \in \mathbb{R}^d$ with $\|x_i\| \le 1$ there exist signs $\sigma$ such that for all $k$ simultaneously

$$\left\| \sum_{i=1}^n \sigma_i x_i^{\otimes k} \right\| \le \sqrt{d} \cdot \mathrm{poly}(k)$$

Still does not depend on $n$!

Notation: $x^{\otimes 1} = x$, $\quad x^{\otimes 2} = x x^T$, $\quad x^{\otimes 3} = \mathrm{outer}(x, x, x)$, ...
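A small numerical check (my own, using the Frobenius norm as a stand-in for the norm in the lemma): one sign pattern keeps all three tensor powers small at once.

```python
import itertools
import numpy as np

# Brute-force check: a single sign pattern can keep ||sum_i s_i x_i^(tensor k)||
# small for k = 1, 2, 3 simultaneously (Frobenius norm, small n).
rng = np.random.default_rng(0)
n, d = 10, 3
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # unit vectors

def tensor_power(v, k):
    t = v
    for _ in range(k - 1):
        t = np.multiply.outer(t, v)                  # x, x x^T, outer(x,x,x), ...
    return t.ravel()

def norms(signs):
    """||sum_i s_i x_i^(tensor k)||_F for k = 1, 2, 3."""
    return [np.linalg.norm(sum(s * tensor_power(xi, k)
                               for s, xi in zip(signs, x)))
            for k in (1, 2, 3)]

best = min(itertools.product([-1, 1], repeat=n), key=lambda s: max(norms(s)))
print("best signs  :", [f"{v:.2f}" for v in norms(best)])
print("random signs:", [f"{v:.2f}" for v in norms(rng.choice([-1, 1], n))])
print("sqrt(d)     =", f"{np.sqrt(d):.2f}")
```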
The key computation, for $f$ with Taylor expansion $f(t) = \sum_k \alpha_k t^k$:

$$\sum_i \sigma_i f(\langle x_i, q \rangle) = \sum_i \sigma_i \sum_k \alpha_k \langle x_i, q \rangle^k = \sum_k \alpha_k \Big\langle \sum_i \sigma_i x_i^{\otimes k}, \; q^{\otimes k} \Big\rangle \le \sum_k |\alpha_k| \cdot \Big\| \sum_i \sigma_i x_i^{\otimes k} \Big\| \le \sqrt{d} \sum_k |\alpha_k| \cdot \mathrm{poly}(k)$$

(using $\langle x, q \rangle^k = \langle x^{\otimes k}, q^{\otimes k} \rangle$ and $\|q^{\otimes k}\| \le 1$ for $\|q\| \le 1$)
Bounding the Class Discrepancy

Lemma [Karnin, L]: The Class Discrepancy of any analytic function of the dot product is

$$D_m = O(\sqrt{d}/m)$$

Proof: the Taylor expansion above; $\sum_k |\alpha_k| \cdot \mathrm{poly}(k)$ is a constant if $f$ is analytic.
Results

This resolves the open problem; see Phillips and Tai 2018.

• Sigmoid Activation Regression, Logistic Regression
• Covariance approximation, Graph Laplacians, Quadratic forms
• Gaussian Kernel Density estimation

All of the above have Class Discrepancy $D_m = O(\sqrt{d}/m)$ and therefore admit:

1) Coresets of size $O(\sqrt{d}/\varepsilon)$
2) Streaming coresets of size $O\big(\sqrt{d}/\varepsilon \cdot \log^2(\varepsilon n / \sqrt{d})\big)$
3) Randomized streaming coresets of size $O\big(\sqrt{d}/\varepsilon \cdot \log^2 \log(|Q_\varepsilon|/\delta)\big)$
Back to square one (same, only different...)

$D_m = ?$ for $f(x, q) = \exp(-\|x - q\|)$ (Exponential Kernel Density)
$D_m = ?$ for $f(x, q) = \begin{cases} 1 & \text{if } \langle q, x \rangle > 0 \\ 0 & \text{else} \end{cases}$ (Classification with 0-1 loss)
References

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
Jeff M. Phillips. Small and stable descriptors of distributions for geometric statistical problems. PhD thesis, 2009.
Jeff M. Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. SODA 2018.
Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. 2011.
Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates.
Elad Tolochinsky and Dan Feldman. Coresets for monotonic functions with applications to deep learning.
Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. IJCAI 2007.
Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. The 21st ACM Symposium on Computational Geometry (SoCG), 2005.
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression.
Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. FOCS 2016.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning.
Wojciech Banaszczyk. Balancing vectors and Gaussian measures of n-dimensional convex bodies. Random Structures & Algorithms, 12(4):351–360, 1998.