
Page 1

Coresets, Discrepancy, and Sketches in Machine Learning

Edo Liberty – Research Director, Amazon
Zohar Karnin – Principal Scientist, Amazon

Page 2

What is a Coreset?

Original data: points $x_i$. Coreset: a weighted subset $\{(x_i, w_i)\}$.

Page 3

What is a Coreset?

A query $q$ is evaluated on the original data and on the coreset:

$F(q) = \sum_i f(x_i, q)$ on the points $x_i$, and $\tilde{F}(q) = \sum_{i \in S} w_i f(x_i, q)$ on the weighted subset $\{(x_i, w_i)\}$.

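To make the definitions concrete, here is a minimal Python sketch of the two quantities above. The query function $f$ and the uniform every-10th-point subsample are placeholders for illustration, not a real coreset construction:

```python
import numpy as np

# F(q) sums f(x_i, q) over all points; F~(q) sums w_i * f(x_i, q) over a
# small weighted subset S. The subsample below is naive and illustrative only.

def F(x, q, f):
    """Exact answer: sum of f(x_i, q) over the full data set."""
    return sum(f(xi, q) for xi in x)

def F_tilde(coreset, q, f):
    """Coreset answer: weighted sum over the pairs (x_i, w_i)."""
    return sum(wi * f(xi, q) for xi, wi in coreset)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
f = lambda xi, q: np.exp(-((xi - q) ** 2))  # an example query function
coreset = [(xi, 10.0) for xi in x[::10]]    # every 10th point, weight 10

q = 0.3
print(F(x, q, f), F_tilde(coreset, q, f))   # the two should be close
```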

Page 5

Approximate CDF

[Figure: data points $x_1, \ldots, x_n$ on the $q$-axis with the empirical CDF $F$ rising from $0$ to $n$.]

The (empirical) CDF is given by

$F(q) = \sum_i f(x_i, q)$ where $f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$

Page 6

Approximate CDF

[Figure: $\tilde{F}(q)$ tracking $F(q)$ within a vertical band of width $\varepsilon n$.]

An approximate CDF is given by $|\tilde{F}(q) - F(q)| \le \varepsilon n$.

Page 7

Approximate CDF

[Figure: a staircase $\tilde{F}$ with steps of height $\varepsilon n$.]

There is a trivial coreset of size $1/\varepsilon$: keep every $(\varepsilon n)$-th point of the sorted data, each with weight $\varepsilon n$.

Page 8

Approximate CDF

[Figure: $\tilde{F}$ overlaid on $F$ after one halving.]

One way to get there: keep every other point of the sorted data with weight 2,

$\tilde{F}(q) = \sum_{i \in S} 2 f(x_i, q)$

Page 9

Approximate CDF

The discrepancy of this halving is

$\tilde{F}(q) - F(q) = \sum_i \sigma_i f(x_i, q) \le 1$

with $\sigma_i = +1$ for the kept points and $\sigma_i = -1$ for the discarded ones; for alternating signs over sorted points the sum never exceeds 1.

Page 10

Question:

For the function $f(x_i, q) = \begin{cases} 1 & \text{if } q > x_i \\ 0 & \text{else} \end{cases}$

1) We have that $\min_\sigma \max_q \left| \sum_i \sigma_i f(x_i, q) \right| \le 1$

2) We have a coreset of size $1/\varepsilon$

Does this generalize?

Page 11

Answer: Yes

Definition (Class Discrepancy): $D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$

Lemma: For any function whose class discrepancy is $D_m = O(c/m)$:

Its coreset complexity is $O(c/\varepsilon)$

Its streaming coreset complexity is $O\left(c/\varepsilon \cdot \log^2(\varepsilon n / c)\right)$

Its randomized streaming coreset complexity is $O\left(c/\varepsilon \cdot \log^2 \log(|Q_\varepsilon| / \delta)\right)$
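One way to see how a discrepancy bound turns into a coreset (a simplified sketch, not the full streaming construction of the lemma): repeatedly split the data with low-discrepancy signs and keep the $+1$ side with doubled weights. For the CDF example, alternating signs over the sorted points play the role of the low-discrepancy signs:

```python
import numpy as np

# Halving: split the data with low-discrepancy signs, keep the +1 side
# with doubled weights, and repeat until the target size is reached.
# Each halving step adds at most the current point weight to the error,
# so over all steps the total error stays below n / (final size).

def low_discrepancy_signs(points):
    signs = np.empty(len(points))
    signs[np.argsort(points)] = [(-1.0) ** j for j in range(len(points))]
    return signs

def halve(points, weights):
    keep = low_discrepancy_signs(points) > 0
    return points[keep], 2.0 * weights[keep]

x = np.random.default_rng(2).normal(size=4096)
w = np.ones_like(x)
while len(x) > 64:                  # stop at the target coreset size
    x, w = halve(x, w)
print(len(x), w[0])                 # 64 points, each of weight 64
```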

Page 12

Bounding the Class Discrepancy

$f(x, q) = 1/(1 + e^{x - q})$: $D_m = c/m$

$f(x, q) = \exp(-(x - q)^2)$: $D_m = c/m$
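For the one-dimensional sigmoid, one way to see the $c/m$ rate is the same alternating-sign trick as for the CDF: $f$ is monotone in $x$ and bounded in $[0, 1]$, so the signed sum telescopes. A quick numerical check:

```python
import numpy as np

# The 1-d sigmoid f(x, q) = 1 / (1 + exp(x - q)) is decreasing in x and
# bounded in [0, 1], so alternating signs over the sorted x_i telescope:
# |sum_i sigma_i f(x_i, q)| <= 1 for every q, hence D_m <= 1/m.

rng = np.random.default_rng(5)
x = np.sort(rng.normal(size=1001))
sigma = np.array([(-1.0) ** j for j in range(len(x))])

worst = max(
    abs(np.sum(sigma / (1.0 + np.exp(x - q))))
    for q in np.linspace(-5.0, 5.0, 200)
)
print(worst)    # never exceeds 1
```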

Page 13

Bounding the Class Discrepancy

Sigmoid Activation Regression: $f(x, q) = 1/(1 + e^{-\langle x, q \rangle})$, $D_m = ?$

Gaussian Kernel Density: $f(x, q) = \exp(-\|x - q\|^2)$, $D_m = ?$

Page 14

Interesting Connection

Class Discrepancy: $D_m = \min_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$
Usually: $D_m = O(c/m)$
Governs: Coreset Complexity

Rademacher Complexity: $R_m = \mathbb{E}_\sigma \max_q \frac{1}{m} \left| \sum_{i=1}^m \sigma_i f(x_i, q) \right|$
Usually: $R_m \approx O(c/\sqrt{m})$
Governs: Sample Complexity

The two differ only in taking the minimum over signs versus the expectation over random signs. Look at techniques for bounding the Rademacher complexity for inspiration...

Page 15

Bounding sums of vectors

For unit vectors $x_i$, random signs give $\mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \approx \sqrt{n}$, but the best signs achieve

$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i \right\| \le \sqrt{d}$

[Figure: the signed sum lands in a ball of radius $r = \sqrt{d}$.]

This does not depend on $n$. That's encouraging...
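A small numerical illustration of the gap between random and well-chosen signs. The greedy rule below (sign each vector against the running sum) is only a heuristic, not the construction behind the $\sqrt{d}$ theorem, but it already keeps the norm bounded independently of $n$:

```python
import numpy as np

# Random signs vs. chosen signs for sums of unit vectors: random signs
# give norm ~ sqrt(n); the greedy rule (flip each x_i against the
# running sum) keeps the norm small regardless of n.

rng = np.random.default_rng(3)
n, d = 10_000, 16
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # unit vectors

random_sum = (rng.choice([-1.0, 1.0], size=n)[:, None] * x).sum(axis=0)

greedy_sum = np.zeros(d)
for xi in x:
    greedy_sum += -xi if greedy_sum @ xi > 0 else xi

print(np.linalg.norm(random_sum))    # about sqrt(n) = 100
print(np.linalg.norm(greedy_sum))    # stays around sqrt(d) = 4
```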

Page 16

Bounding sums of matrices

$\min_\sigma \left\| \sum_{i=1}^n \sigma_i x_i x_i^T \right\| = O(\sqrt{d})$

Proof [due to Nikhil Bansal, private communication]: a clever application of Banaszczyk's theorem together with standard bounds on the spectral norm of random matrices.

It's also tight.
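The same experiment can be run for matrices; again, this greedy heuristic is only an illustration, not Bansal's argument:

```python
import numpy as np

# Greedily sign the rank-one terms x_i x_i^T, keeping whichever sign
# leaves the running sum with the smaller spectral norm.

rng = np.random.default_rng(6)
n, d = 2000, 16
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # unit vectors

s = np.zeros((d, d))
for xi in x:
    outer = np.outer(xi, xi)
    if np.linalg.norm(s + outer, 2) < np.linalg.norm(s - outer, 2):
        s += outer
    else:
        s -= outer

print(np.linalg.norm(s, 2))    # small compared to n; sqrt(d) = 4 here
```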

Page 17

Bounding sums of all tensor powers

Lemma [Karnin, L]: For any set of vectors $x_i \in \mathbb{R}^d$ there exist signs $\sigma$ such that, for all $k$ simultaneously,

$\left\| \sum_{i=1}^n \sigma_i x_i^{\otimes k} \right\| \le \sqrt{d} \cdot \mathrm{poly}(k)$

Still does not depend on $n$!

(Notation: $x = x^{\otimes 1}$, $x x^T = x^{\otimes 2}$, $\mathrm{outer}(x, x, x) = x^{\otimes 3}$.)

Page 18

Bounding the Class Discrepancy

Lemma [Karnin, L]: The class discrepancy of any analytic function of the dot product is $O(\sqrt{d}/m)$.

Proof: Taylor-expand $f(\langle x_i, q \rangle) = \sum_k \alpha_k \langle x_i, q \rangle^k$ and regroup:

$\sum_i \sigma_i f(\langle x_i, q \rangle) = \sum_i \sigma_i \sum_k \alpha_k \langle x_i, q \rangle^k = \sum_k \alpha_k \left\langle \sum_i \sigma_i x_i^{\otimes k},\, q^{\otimes k} \right\rangle \le \sum_k |\alpha_k| \cdot \left\| \sum_i \sigma_i x_i^{\otimes k} \right\| \le \sqrt{d} \sum_k |\alpha_k| \cdot \mathrm{poly}(k)$

The first inequality is Cauchy-Schwarz (with $\|q\| \le 1$); the second is the tensor-power lemma. The final sum is a constant if $f$ is analytic.
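The regrouping step in the proof is easy to check numerically for $k = 2$, where $x^{\otimes 2} = x x^T$:

```python
import numpy as np

# Check for k = 2: sum_i sigma_i <x_i, q>^2 equals
# <sum_i sigma_i x_i x_i^T, q q^T> = q^T (sum_i sigma_i x_i x_i^T) q.

rng = np.random.default_rng(4)
n, d = 50, 5
x = rng.normal(size=(n, d))
sigma = rng.choice([-1.0, 1.0], size=n)
q = rng.normal(size=d)

lhs = np.sum(sigma * (x @ q) ** 2)
M = np.einsum("i,ij,ik->jk", sigma, x, x)    # sum_i sigma_i x_i x_i^T
rhs = q @ M @ q
print(np.allclose(lhs, rhs))                 # True
```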

Page 19

Results

This resolves the open problem (see Phillips and Tai 2018).

• Sigmoid activation regression, logistic regression
• Covariance approximation, graph Laplacians, quadratic forms
• Gaussian kernel density estimation

All of the above have class discrepancy $D_m = O(\sqrt{d}/m)$ and therefore have:

1) Coresets of size $O(\sqrt{d}/\varepsilon)$
2) Streaming coresets of size $O\left(\sqrt{d}/\varepsilon \cdot \log^2(\varepsilon n / \sqrt{d})\right)$
3) Randomized streaming coresets of size $O\left(\sqrt{d}/\varepsilon \cdot \log^2 \log(|Q_\varepsilon| / \delta)\right)$

Page 20

Back to square one (same, only different...)

Exponential Kernel Density: $f(x, q) = \exp(-\|x - q\|)$, $D_m = ?$

Classification with 0-1 loss: $f(x, q) = \begin{cases} 1 & \text{if } \langle q, x \rangle > 0 \\ 0 & \text{else} \end{cases}$, $D_m = ?$

Page 21

References

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.

Jeff M. Phillips. Small and stable descriptors of distributions for geometric statistical problems. PhD thesis, 2009.

Jeff M. Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018.

Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. 2011

Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates

Elad Tolochinsky and Dan Feldman. Coresets for monotonic functions with applications to deep learning.

Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. IJCAI 2007

Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. The 21st ACM Symposium on Computational Geometry, Pisa, Italy, June 6-8, 2005.

Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.

Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression.

Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, March 2003

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning

Wojciech Banaszczyk. Balancing vectors and gaussian measures of n-dimensional convex bodies. Random Struct. Algorithms, 12(4):351–360, July 1998
