Coresets, Discrepancy, and Sketches in Machine Learning
Edo Liberty – Research Director, Amazon
Zohar Karnin – Principal Scientist, Amazon
What is a Coreset?

A coreset replaces the original data $x_i$ with a small weighted subset $\{(x_i, w_i)\}_{i \in S}$ such that, for every query $q$,

$$F(q) = \sum_i f(x_i, q) \quad \text{is approximated by} \quad \tilde{F}(q) = \sum_{i \in S} w_i f(x_i, q)$$
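To make this concrete, here is a minimal numerical sketch (mine, not from the slides); the kernel $f$, the uniform subsample, and the weights $n/|S|$ are illustrative assumptions only.

```python
import numpy as np

# Toy coreset: approximate F(q) = sum_i f(x_i, q) by a weighted subset.
# f(x, q) = exp(-(x - q)^2) is an arbitrary example; any query function works.
def f(x, q):
    return np.exp(-(x - q) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                      # original data x_1..x_n
S = rng.choice(len(x), size=100, replace=False)  # coreset indices (uniform sample)
w = np.full(len(S), len(x) / len(S))             # each kept point stands in for n/|S|

q = 0.5
F_full = f(x, q).sum()               # F(q)  over all n points
F_core = (w * f(x[S], q)).sum()      # F~(q) over the 100-point coreset
print(F_full, F_core)                # the two should be close
```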
Approximate CDF

[Figure: the empirical CDF of points $x_1, \ldots, x_n$, plotted as $F(q)$ against $q$.]

The (empirical) CDF is given by

$$F(q) = \sum_i f(x_i, q), \qquad f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$$
Approximate CDF

[Figure: $F(q)$ together with an approximation $\tilde{F}(q)$ staying within a band of width $\varepsilon n$.]

An approximate CDF is an $\tilde{F}(q)$ with $|\tilde{F}(q) - F(q)| \le \varepsilon n$.
Approximate CDF

[Figure: a staircase $\tilde{F}(q)$ jumping in steps of $\varepsilon n$.]

There is a trivial coreset of size $1/\varepsilon$: keep every $\lceil \varepsilon n \rceil$-th point in sorted order, each with weight $\varepsilon n$.
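A runnable sketch of this construction (the "every $\lceil \varepsilon n \rceil$-th point in sorted order" reading of the figure is my interpretation):

```python
import numpy as np

def trivial_cdf_coreset(x, eps):
    """Keep every ceil(eps*n)-th point of sorted x, each with weight ceil(eps*n).
    Roughly 1/eps points; the weighted CDF stays within eps*n of the true one."""
    x = np.sort(x)
    step = int(np.ceil(eps * len(x)))
    idx = np.arange(step - 1, len(x), step)
    return x[idx], np.full(len(idx), float(step))

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
pts, w = trivial_cdf_coreset(x, eps=0.01)         # ~100 points

qs = np.linspace(-4, 4, 400)                      # query grid approximating max_q
F  = (x[None, :]    < qs[:, None]).sum(axis=1)    # F(q)  = #{x_i < q}
Ft = ((pts[None, :] < qs[:, None]) * w).sum(axis=1)
print(np.abs(Ft - F).max(), "<=", 0.01 * len(x))  # max error vs eps*n
```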
Approximate CDF

[Figure: $F(q)$ and $\tilde{F}(q)$ after one halving step.]

One way to get there is like this: keep half the points (a subset $S$) and double their weight,

$$\tilde{F}(q) = \sum_{i \in S} 2 f(x_i, q)$$
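A sketch of one halving step; taking every other point of the sorted data as $S$ is my reading of the figure:

```python
import numpy as np

def halve(x):
    """One halving step: sort, keep every other point (it will carry weight 2).
    In sorted order the effective signs +1 (kept) and -1 (dropped) alternate,
    so the error F~(q) - F(q) telescopes and never exceeds 1."""
    return np.sort(x)[1::2]

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
kept = halve(x)                                    # 5,000 points, weight 2 each

qs = np.linspace(-4, 4, 500)
F  = (x[None, :]    < qs[:, None]).sum(axis=1)
Ft = 2 * (kept[None, :] < qs[:, None]).sum(axis=1)
print(np.abs(Ft - F).max())                        # prints 1 (or 0)
```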
Approximate CDF

[Figure: the signed error of the halving step.]

The discrepancy of this step is

$$\tilde{F}(q) - F(q) = \sum_i \sigma_i f(x_i, q) \le 1,$$

with $\sigma_i = +1$ for kept points (which contribute weight $2 - 1$) and $\sigma_i = -1$ for dropped ones. Since $S$ takes every other point in sorted order, the signs alternate and the sum telescopes to at most $1$.
Question:

For the function

$$f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$$

1) We have that $\min_\sigma \max_q \left| \sum_i \sigma_i f(x_i, q) \right| \le 1$
2) We have a coreset of size $1/\varepsilon$

Does this generalize?
Answer: Yes
Definition: Class Discrepancy

$$D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$

Lemma: For any function whose Class Discrepancy is $D_m = O(c/m)$:
• Its coreset complexity is $O(c/\varepsilon)$
• Its streaming coreset complexity is $O\big(c/\varepsilon \cdot \log^2(\varepsilon n / c)\big)$
• Its randomized streaming coreset complexity is $O\big(c/\varepsilon \cdot \log^2 \log(|Q_\varepsilon|/\delta)\big)$
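The streaming bounds come from compacting repeatedly as data arrives. Below is a hedged sketch in the CDF setting, loosely patterned after the compactor hierarchy of Karnin, Lang, and Liberty (FOCS 2016); the fixed capacity and deterministic halving are simplifications for illustration, not the actual algorithm.

```python
import numpy as np

class CompactorSketch:
    """Simplified streaming quantile coreset. Level h holds points of weight
    2**h; when a level fills, it is sorted and halved, pushing every other
    point one level up. Real KLL-style sketches randomize the halving offset
    and shrink capacities geometrically; this sketch fixes both."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.levels = [[]]                   # levels[h]: items of weight 2**h

    def update(self, item):
        self.levels[0].append(item)
        for h in range(len(self.levels)):    # compact any full level
            if len(self.levels[h]) >= self.capacity:
                buf = sorted(self.levels[h])
                if h + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[h + 1].extend(buf[1::2])   # halving step
                self.levels[h] = []

    def cdf(self, q):
        """Estimate F(q) = #{x_i < q} from the weighted summary."""
        return sum(2 ** h * sum(1 for v in buf if v < q)
                   for h, buf in enumerate(self.levels))

rng = np.random.default_rng(0)
xs = rng.normal(size=50_000)
sk = CompactorSketch()
for v in xs:
    sk.update(v)
print(sk.cdf(0.0), (xs < 0.0).sum())         # estimate vs exact count
```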
Bounding the Class Discrepancy

[Figure: $f(x, q)$ as a function of $q$ for the two families below.]

$D_m = c/m$ for $f(x, q) = 1/(1 + e^{x - q})$ (sigmoid, scalar case)
$D_m = c/m$ for $f(x, q) = \exp(-(x - q)^2)$ (Gaussian, scalar case)
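Such bounds can be checked numerically. This brute-force estimate of $D_m$ for the sigmoid, with $\max_q$ approximated on a finite query grid, is my own illustration:

```python
import itertools
import numpy as np

# Brute-force D_m = min_sigma max_q (1/m)|sum_i sigma_i f(x_i, q)| for the
# sigmoid f(x, q) = 1/(1 + exp(x - q)); max_q is taken over a grid of queries.
rng = np.random.default_rng(0)
m = 12
x = rng.normal(size=m)
qs = np.linspace(-4, 4, 200)
fx = 1.0 / (1.0 + np.exp(x[None, :] - qs[:, None]))  # f(x_i, q) on the grid

D_m = min(np.abs(fx @ np.array(s)).max()
          for s in itertools.product([-1, 1], repeat=m)) / m
print(f"D_m ~ {D_m:.4f}   (compare to 1/m = {1/m:.4f})")
```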
Bounding the Class Discrepancy

$D_m = ?$ for $f(x, q) = \exp(-\|x - q\|^2)$ (Gaussian Kernel Density)
$D_m = ?$ for $f(x, q) = 1/(1 + e^{-\langle x, q \rangle})$ (Sigmoid Activation Regression)
Interesting Connection

Class Discrepancy:
$$D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$
Usually: $D_m = O(c/m)$
Governs: Coreset Complexity

Rademacher Complexity:
$$R_m = \mathbb{E}_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$$
Usually: $R_m \approx O(c/\sqrt{m})$
Governs: Sample Complexity
Look at techniques for bounding the Rademacher Complexity for inspiration...
Bounding sums of vectors

[Figure: a ball of radius $r = \sqrt{d}$.]

For random signs, $\mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \approx \sqrt{n}$. But for the best signs (unit vectors $x_i$),

$$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \le \sqrt{d}$$

Does not depend on $n$. That's encouraging...
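A brute-force comparison (my own demo): for small $n$ we can enumerate all $2^n$ sign patterns.

```python
import itertools
import numpy as np

# Random signs give ||sum sigma_i x_i|| ~ sqrt(n); the best of all 2^n sign
# patterns gets below sqrt(d), independent of n. Brute force, so n stays small.
rng = np.random.default_rng(0)
n, d = 14, 3
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # unit vectors

random_avg = np.mean([np.linalg.norm(rng.choice([-1, 1], n) @ x)
                      for _ in range(1_000)])
best = min(np.linalg.norm(np.array(s) @ x)
           for s in itertools.product([-1, 1], repeat=n))

print(f"random signs: {random_avg:.2f}   (sqrt(n) = {np.sqrt(n):.2f})")
print(f"best signs:   {best:.2f}   (sqrt(d) = {np.sqrt(d):.2f})")
```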
Bounding sums of matrices

$$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i x_i^T \right\| = O(\sqrt{d})$$

Proof [due to Nikhil Bansal, private communication]: a clever application of Banaszczyk's theorem together with standard bounds on the spectral norm of random matrices. It's also tight.
Bounding sums of all tensor powers

Lemma [Karnin, L]: For any set of vectors $x_i \in \mathbb{R}^d$ with $\|x_i\| \le 1$ there exist signs $\sigma$ such that for all $k$ simultaneously

$$\left\| \sum_{i=1}^n \sigma_i x_i^{\otimes k} \right\| \le \sqrt{d} \cdot \mathrm{poly}(k)$$

Still does not depend on $n$!

Notation: $x^{\otimes 1} = x$, $\quad x^{\otimes 2} = x x^T$, $\quad x^{\otimes 3} = \mathrm{outer}(x, x, x)$, ...
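A small numerical check (my own, using the Frobenius norm as a stand-in for the norm in the lemma): one sign pattern keeps all three tensor powers small at once.

```python
import itertools
import numpy as np

# Brute-force check: a single sign pattern can keep ||sum_i s_i x_i^(tensor k)||
# small for k = 1, 2, 3 simultaneously (Frobenius norm, small n).
rng = np.random.default_rng(0)
n, d = 10, 3
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # unit vectors

def tensor_power(v, k):
    t = v
    for _ in range(k - 1):
        t = np.multiply.outer(t, v)                  # x, x x^T, outer(x,x,x), ...
    return t.ravel()

def norms(signs):
    """||sum_i s_i x_i^(tensor k)||_F for k = 1, 2, 3."""
    return [np.linalg.norm(sum(s * tensor_power(xi, k)
                               for s, xi in zip(signs, x)))
            for k in (1, 2, 3)]

best = min(itertools.product([-1, 1], repeat=n), key=lambda s: max(norms(s)))
print("best signs  :", [f"{v:.2f}" for v in norms(best)])
print("random signs:", [f"{v:.2f}" for v in norms(rng.choice([-1, 1], n))])
print("sqrt(d)     =", f"{np.sqrt(d):.2f}")
```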
The key computation, for $f$ with Taylor expansion $f(t) = \sum_k \alpha_k t^k$:

$$\sum_i \sigma_i f(\langle x_i, q \rangle) = \sum_i \sigma_i \sum_k \alpha_k \langle x_i, q \rangle^k = \sum_k \alpha_k \Big\langle \sum_i \sigma_i x_i^{\otimes k}, \; q^{\otimes k} \Big\rangle \le \sum_k |\alpha_k| \cdot \Big\| \sum_i \sigma_i x_i^{\otimes k} \Big\| \le \sqrt{d} \sum_k |\alpha_k| \cdot \mathrm{poly}(k)$$

(using $\langle x, q \rangle^k = \langle x^{\otimes k}, q^{\otimes k} \rangle$ and $\|q^{\otimes k}\| \le 1$ for $\|q\| \le 1$)
Bounding the Class Discrepancy

Lemma [Karnin, L]: The Class Discrepancy of any analytic function of the dot product is

$$D_m = O(\sqrt{d}/m)$$

Proof: the Taylor expansion above; $\sum_k |\alpha_k| \cdot \mathrm{poly}(k)$ is a constant if $f$ is analytic.
Results

This resolves the open problem; see Phillips and Tai 2018.

• Sigmoid Activation Regression, Logistic Regression
• Covariance approximation, Graph Laplacians, Quadratic forms
• Gaussian Kernel Density estimation

All of the above have Class Discrepancy $D_m = O(\sqrt{d}/m)$ and therefore admit:

1) Coresets of size $O(\sqrt{d}/\varepsilon)$
2) Streaming coresets of size $O\big(\sqrt{d}/\varepsilon \cdot \log^2(\varepsilon n / \sqrt{d})\big)$
3) Randomized streaming coresets of size $O\big(\sqrt{d}/\varepsilon \cdot \log^2 \log(|Q_\varepsilon|/\delta)\big)$
Back to square one (same, only different...)

$D_m = ?$ for $f(x, q) = \exp(-\|x - q\|)$ (Exponential Kernel Density)
$D_m = ?$ for $f(x, q) = \begin{cases} 1 & \text{if } \langle q, x \rangle > 0 \\ 0 & \text{else} \end{cases}$ (Classification with 0-1 loss)
References

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
Jeff M. Phillips. Small and stable descriptors of distributions for geometric statistical problems. PhD thesis, 2009.
Jeff M. Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. SODA 2018.
Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. 2011.
Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates.
Elad Tolochinsky and Dan Feldman. Coresets for monotonic functions with applications to deep learning.
Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. IJCAI 2007.
Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. The 21st ACM Symposium on Computational Geometry (SoCG), 2005.
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression.
Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. FOCS 2016.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning.
Wojciech Banaszczyk. Balancing vectors and Gaussian measures of n-dimensional convex bodies. Random Structures & Algorithms, 12(4):351–360, 1998.