
Polynomial Tensor Sketch for Element-wise Function of Low-Rank Matrix

Insu Han 1   Haim Avron 2   Jinwoo Shin 3 1

Abstract

This paper studies how to sketch element-wise functions of low-rank matrices. Formally, given low-rank matrix A = [A_ij] and scalar non-linear function f, we aim for finding an approximated low-rank representation of the (possibly high-rank) matrix [f(A_ij)]. To this end, we propose an efficient sketching-based algorithm whose complexity is significantly lower than the number of entries of A, i.e., it runs without accessing all entries of [f(A_ij)] explicitly. The main idea underlying our method is to combine a polynomial approximation of f with the existing tensor sketch scheme for approximating monomials of entries of A. To balance the errors of the two approximation components in an optimal manner, we propose a novel regression formula to find polynomial coefficients given A and f. In particular, we utilize a coreset-based regression with a rigorous approximation guarantee. Finally, we demonstrate the applicability and superiority of the proposed scheme under various machine learning tasks.

1. Introduction

Given a low-rank matrix A = [A_ij] ∈ R^{n×n} with A = UV^T for some matrices U, V ∈ R^{n×d} with d ≪ n and a scalar non-linear function f : R → R, we are interested in the following element-wise matrix function:*

    f^⊙(A) ∈ R^{n×n},  where  f^⊙(A)_{ij} := f(A_{ij}).

Our goal is to design a fast algorithm for computing tall-and-thin matrices T_U, T_V in time o(n²) such that f^⊙(A) ≈ T_U T_V^T. In particular, our algorithm should run without computing all entries of A or f^⊙(A) explicitly. As a side benefit, we obtain an o(n²)-time approximation scheme f^⊙(A)x ≈ T_U T_V^T x for an arbitrary vector x ∈ R^n due to the associative property of matrix multiplication, whereas the exact computation of f^⊙(A)x requires Θ(n²) operations.

The matrix-vector multiplication f^⊙(A)x or the low-rank decomposition f^⊙(A) ≈ T_U T_V^T is useful in many machine learning algorithms. For example, the Gram matrices of certain kernel functions, e.g., polynomial and radial basis function (RBF) kernels, are element-wise matrix functions where the rank is the dimension of the underlying data. Such matrices are the cornerstone of so-called kernel methods, and the ability to multiply the Gram matrix by a vector suffices for most kernel learning. The Sinkhorn-Knopp algorithm is a powerful, yet simple, tool for computing optimal transport distances (Cuturi, 2013; Altschuler et al., 2017) and also involves the matrix-vector multiplication f^⊙(A)x with f(x) = exp(x). Finally, f^⊙(UV^T) can also describe the non-linear computation of activations in a layer of deep neural networks (Goodfellow et al., 2016), where U, V and f correspond to the input, weight and activation function (e.g., sigmoid or ReLU) of the previous layer, respectively.

Unlike the element-wise matrix function f^⊙(A), a traditional matrix function f(A) is defined on the eigenvalues of A (Higham, 2008) and possesses clean algebraic properties, i.e., it preserves eigenvectors. For example, given a diagonalizable matrix A = PDP^{-1}, it is defined as f(A) = P f^⊙(D) P^{-1}. A classical problem addressed in the literature is approximating the trace of f(A) (or f(A)x) efficiently (Han et al., 2017; Ubaru et al., 2017). However, these methods are not applicable to our problem because element-wise matrix functions are fundamentally different from traditional matrix functions, e.g., they do not preserve the spectral structure. We are unaware of any approximation algorithm that targets general element-wise matrix functions. The special case of element-wise kernel functions such as f(x) = exp(x) and f(x) = x^k has been addressed in the literature, e.g., using random Fourier features (RFF) (Rahimi & Recht, 2008; Pennington et al., 2015), the Nyström method (Williams & Seeger, 2001), sparse greedy approximation (Smola & Schölkopf, 2000) and sketch-based methods (Pham & Pagh, 2013; Avron et al., 2014; Meister et al., 2019; Ahle et al., 2020), just to name a few examples. We aim for not only designing an approximation framework for general f, but also outperforming the previous approximations (e.g., RFF) even for the special case f(x) = exp(x).

* In this paper, we primarily focus on the square matrix A for simplicity, but it is straightforward to extend our results to the case of non-square matrices.

1 School of Electrical Engineering, KAIST, Daejeon, Korea  2 School of Mathematical Sciences, Tel Aviv University, Israel  3 Graduate School of AI, KAIST, Daejeon, Korea. Correspondence to: Jinwoo Shin <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s). arXiv:1905.11616v3 [cs.LG] 29 Jun 2020.

The high-level idea underlying our method is to combine two approximation schemes: TENSORSKETCH, approximating the element-wise matrix function of the monomial x^k, and a degree-r polynomial approximation p_r(x) = Σ_{j=0}^r c_j x^j of the function f, e.g., f(x) = exp(x) ≈ p_r(x) = Σ_{j=0}^r x^j / j!. More formally, we consider the following approximation:

    f^⊙(A)  ≈(a)  Σ_{j=0}^r c_j · (x^j)^⊙(A)  ≈(b)  Σ_{j=0}^r c_j · TENSORSKETCH((x^j)^⊙(A)),

which we call POLY-TENSORSKETCH. This is a linear-time approximation scheme with respect to n under the choice of r = O(1). Here, a non-trivial challenge occurs: a larger degree r is required to approximate an arbitrary function f better, while the approximation error of TENSORSKETCH is known to increase exponentially with respect to r (Pham & Pagh, 2013; Avron et al., 2014). Hence, it is important to choose good c_j's for balancing the two approximation errors in both (a) and (b). The known (truncated) polynomial expansions such as Taylor or Chebyshev (Mason & Handscomb, 2002) are far from being optimal for this purpose (see Sections 3.2 and 4.1 for more details).

To address this challenge, we formulate an optimization problem on the polynomial coefficients (the c_j's) by relating them to the approximation error of POLY-TENSORSKETCH. However, this optimization problem is intractable since the objective involves an expectation taken over random variables whose supports are exponentially large. Thus, we derive a novel tractable alternative optimization problem based on an upper bound of the approximation error of POLY-TENSORSKETCH. This problem turns out to be an instance of generalized ridge regression (Hemmerle, 1975), and as such has a closed-form solution. Furthermore, we observe that the regularization term effectively forces the coefficients to decay exponentially, and this compensates the exponentially growing errors of TENSORSKETCH with respect to the degree, while simultaneously maintaining a good polynomial approximation to the scalar function f given entries of A. We further reduce its complexity by regressing only a subset of the matrix entries found by an efficient coreset clustering algorithm (Har-Peled, 2008): if the matrix entries are close to the selected coreset, then the resulting coefficients are provably close to the optimal ones. Finally, we construct the regression with respect to the Chebyshev polynomial basis (instead of monomials), i.e., p_r(x) = Σ_{j=0}^r c_j t_j(x) for the Chebyshev polynomial t_j(x) of degree j, in order to resolve numerical issues arising for large degrees.

We evaluate the approximation quality of our algorithm under the RBF kernels of synthetic and real-world datasets. Then, we apply the proposed method to classification using kernel SVM (Cristianini et al., 2000; Schölkopf & Smola, 2001) and to the computation of optimal transport distances (Cuturi, 2013), both of which require computing element-wise matrix functions with f(x) = exp(x). Our experimental results confirm that our scheme is orders of magnitude faster than the exact method with a marginal loss of accuracy. Furthermore, our scheme also significantly outperforms a state-of-the-art approximation method, RFF, for the aforementioned applications. Finally, we demonstrate the wide applicability of our method by applying it to the linearization of neural networks, which requires computing element-wise matrix functions with f(x) = sigmoid(x).

2. Preliminaries

In this section, we provide background on the randomized sketching algorithms COUNTSKETCH and TENSORSKETCH, which are crucial components of the proposed scheme.

First, COUNTSKETCH (Charikar et al., 2002; Weinberger et al., 2009) was proposed for an effective dimensionality reduction of a high-dimensional vector u ∈ R^d. Formally, consider a random hash function h : [d] → [m] and a random sign function s : [d] → {−1, +1}, where [d] := {1, ..., d}. Then, COUNTSKETCH transforms u into Cu ∈ R^m such that [Cu]_j := Σ_{i : h(i)=j} s(i) u_i for j ∈ [m]. The algorithm takes O(d) time to run since it requires a single pass over the input. It is known that applying the same† COUNTSKETCH transform to two vectors preserves the dot-product, i.e., ⟨u, v⟩ = E[⟨Cu, Cv⟩].
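To make this concrete, here is a minimal NumPy sketch (ours, not part of the paper's implementation; the function and variable names are hypothetical) that applies the same hash/sign pair to two vectors and checks that ⟨Cu, Cv⟩ averages to ⟨u, v⟩ over independent draws.

```python
import numpy as np

def countsketch(X, h, s, m):
    """CountSketch of each row of X with hash h: [d]->[m] and signs s: [d]->{-1,+1}."""
    n, d = X.shape
    C = np.zeros((n, m))
    for i in range(d):                 # single pass over the input columns, O(nd) total
        C[:, h[i]] += s[i] * X[:, i]   # bucket h(i) accumulates s(i) * x_i
    return C

rng = np.random.default_rng(0)
d, m, trials = 50, 20, 5000
u, v = rng.normal(size=d), rng.normal(size=d)
estimates = []
for _ in range(trials):
    h = rng.integers(0, m, size=d)            # random hash function
    s = rng.choice([-1.0, 1.0], size=d)       # random sign function
    Cu = countsketch(u[None, :], h, s, m)[0]  # the *same* (h, s) is used for u and v
    Cv = countsketch(v[None, :], h, s, m)[0]
    estimates.append(Cu @ Cv)
print(np.mean(estimates), u @ v)              # unbiased: the mean approaches <u, v>
```

A single draw is an unbiased but noisy estimate of the dot-product; its variance shrinks as the sketch dimension m grows.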

TENSORSKETCH (Pham & Pagh, 2013) was proposed as a generalization of COUNTSKETCH to tensor products of vectors. Given u ∈ R^d and a degree k, consider i.i.d. random hash functions h_1, ..., h_k : [d] → [m] and i.i.d. random sign functions s_1, ..., s_k : [d] → {−1, +1}. TENSORSKETCH applied to u is defined as the m-dimensional vector T_u^{(k)} ∈ R^m such that, for j ∈ [m],

    [T_u^{(k)}]_j := Σ_{H(i_1,...,i_k)=j} s_1(i_1) ··· s_k(i_k) u_{i_1} ··· u_{i_k},

where H(i_1, ..., i_k) ≡ h_1(i_1) + ··· + h_k(i_k) (mod m).‡ In (Pham & Pagh, 2013; Avron et al., 2014; Pennington et al., 2015; Meister et al., 2019), TENSORSKETCH was used for approximating powers of dot-products between vectors.

† I.e., use the same hash and sign functions.
‡ Unless stated otherwise, we define T_u^{(0)} := 1.

Algorithm 1 TENSORSKETCH (Pham & Pagh, 2013)
1: Input: U ∈ R^{n×d}, degree k, sketch dimension m
2: Draw k i.i.d. random hash functions h_1, ..., h_k : [d] → [m]
3: Draw k i.i.d. random sign functions s_1, ..., s_k : [d] → {+1, −1}
4: C_U^{(i)} ← COUNTSKETCH on each row of U using h_i and s_i, for i = 1, ..., k
5: T_U^{(k)} ← FFT^{-1}( FFT(C_U^{(1)}) ⊙ ··· ⊙ FFT(C_U^{(k)}) )
6: Output: T_U^{(k)}

In other words, let T_u^{(k)}, T_v^{(k)} be the same TENSORSKETCH of u, v ∈ R^d with degree k. Then, it holds that

    ⟨u, v⟩^k = ⟨u^{⊗k}, v^{⊗k}⟩ = E[⟨T_u^{(k)}, T_v^{(k)}⟩],

where ⊗ denotes the tensor product (or outer-product) and u^{⊗k} := u ⊗ ··· ⊗ u ∈ R^{d^k} (k − 1 tensor products). This can be naturally extended to matrices U, V ∈ R^{n×d}: suppose T_U^{(k)}, T_V^{(k)} ∈ R^{n×m} are the same TENSORSKETCH applied to each row of U, V; then it follows that (UV^T)^{⊙k} = E[T_U^{(k)} T_V^{(k)T}], where A^{⊙k} := [A_{ij}^k]. Pham & Pagh (2013) devised a fast way to compute TENSORSKETCH using the Fast Fourier Transform (FFT), as described in Algorithm 1.

In Algorithm 1, ⊙ is the element-wise multiplication (also called the Hadamard product) between two matrices of the same dimension, and FFT(C), FFT^{-1}(C) are the Fast Fourier Transform and its inverse applied to each row of a matrix C ∈ R^{n×m}, respectively. The cost of the FFTs is O(nm log m), and so the total cost of Algorithm 1 is O(nk(d + m log m)) time since it runs FFT and COUNTSKETCH k times. Avron et al. (2014) proved a tight bound on the variance (or error) of TENSORSKETCH as follows.
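Before stating the bound, here is a concrete reading of Algorithm 1 as a small NumPy sketch (ours, not the authors' code). It computes degree-k TensorSketches of the rows of U and V with shared hash/sign functions by multiplying the row-wise FFTs of k CountSketches and inverting the FFT, and then checks the identity (UV^T)^{⊙k} ≈ E[T_U^{(k)} T_V^{(k)T}] on one random draw.

```python
import numpy as np

def tensorsketch_pair(U, V, k, m, rng):
    """Degree-k TensorSketch (Algorithm 1) of the rows of U and V, using the *same*
    hash/sign functions for both so that (UV^T)^{⊙k} ≈ E[T_U^{(k)} T_V^{(k)T}]."""
    def countsketch(X, h, s):
        C = np.zeros((X.shape[0], m))
        for i in range(X.shape[1]):
            C[:, h[i]] += s[i] * X[:, i]
        return C
    FU = np.ones((U.shape[0], m), dtype=complex)   # running Hadamard product of FFTs
    FV = np.ones((V.shape[0], m), dtype=complex)
    for _ in range(k):
        h = rng.integers(0, m, size=U.shape[1])    # fresh hash/sign pair per factor
        s = rng.choice([-1.0, 1.0], size=U.shape[1])
        FU *= np.fft.fft(countsketch(U, h, s), axis=1)
        FV *= np.fft.fft(countsketch(V, h, s), axis=1)
    # inverse FFT of the product = circular convolution of the k CountSketches
    return np.real(np.fft.ifft(FU, axis=1)), np.real(np.fft.ifft(FV, axis=1))

rng = np.random.default_rng(1)
n, d, m, k = 50, 10, 256, 2
U, V = rng.random((n, d)), rng.random((n, d))
TU, TV = tensorsketch_pair(U, V, k, m, rng)
exact = (U @ V.T) ** k                             # (UV^T)^{⊙k}
# single-draw relative Frobenius error; the estimator is unbiased and the error shrinks as m grows
print(np.linalg.norm(exact - TU @ TV.T) / np.linalg.norm(exact))
```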

Theorem 1 (Avron et al. (2014)) Given U, V ∈ R^{n×d}, let T_U^{(k)}, T_V^{(k)} ∈ R^{n×m} be the same TENSORSKETCH of U, V with degree k ≥ 0 and sketch dimension m. Then, it holds that

    E[ ‖(UV^T)^{⊙k} − T_U^{(k)} T_V^{(k)T}‖_F^2 ] ≤ (2 + 3^k) ( Σ_i (Σ_j U_{ij}^2)^k ) ( Σ_i (Σ_j V_{ij}^2)^k ) / m.    (2.1)

The above theorem implies that the error of TENSORSKETCH becomes small for a large sketch dimension m, but can grow fast with respect to the degree k. Recently, Ahle et al. (2020) proposed another method whose error bound is tighter with respect to k, but can be worse with respect to another factor, the so-called statistical dimension (Avron et al., 2017).§

Algorithm 2 POLY-TENSORSKETCH
1: Input: U, V ∈ R^{n×d}, degree r, coefficients c_0, ..., c_r, sketch dimension m
2: Draw r i.i.d. random hash functions h_1, ..., h_r : [d] → [m]
3: Draw r i.i.d. random sign functions s_1, ..., s_r : [d] → {+1, −1}
4: T_U^{(1)}, T_V^{(1)} ← COUNTSKETCH of U, V using h_1 and s_1, respectively
5: F_U, F_V ← FFT(T_U^{(1)}), FFT(T_V^{(1)})
6: Γ ← c_0 T_U^{(0)} T_V^{(0)T} + c_1 T_U^{(1)} T_V^{(1)T}
7: for j = 2 to r do
8:    C_U, C_V ← COUNTSKETCH of U, V using h_j and s_j, respectively
9:    F_U, F_V ← FFT(C_U) ⊙ F_U, FFT(C_V) ⊙ F_V
10:   T_U^{(j)}, T_V^{(j)} ← FFT^{-1}(F_U), FFT^{-1}(F_V)
11:   Γ ← Γ + c_j T_U^{(j)} T_V^{(j)T}
12: end for
13: Output: Γ

3. Linear-time Approximation of Element-wise Matrix Functions

Given a scalar function f : R → R and matrices U, V ∈ R^{n×d} with d ≪ n, our goal is to design an efficient algorithm to find T_U, T_V ∈ R^{n×d′} with d′ ≪ n in o(n²) time such that

    f^⊙(UV^T) ≈ T_U T_V^T ∈ R^{n×n}.

Namely, we aim for finding a low-rank approximation of f^⊙(UV^T) without computing all n² entries of UV^T. We first describe the proposed approximation scheme in Section 3.1 and provide further ideas for minimizing the approximation gap in Section 3.2.

3.1. POLY-TENSORSKETCH Transform

Suppose we have a polynomial p_r(x) = Σ_{j=0}^r c_j x^j approximating f(x), e.g., f(x) = exp(x) and its (truncated) Taylor series p_r(x) = Σ_{j=0}^r x^j / j!. Then, we consider the following approximation scheme, coined POLY-TENSORSKETCH:

    f^⊙(UV^T)  ≈(a)  Σ_{j=0}^r c_j (UV^T)^{⊙j}  ≈(b)  Σ_{j=0}^r c_j T_U^{(j)} T_V^{(j)T},    (3.1)

where T_U^{(j)}, T_V^{(j)} ∈ R^{n×m} are the same TENSORSKETCH of U, V with degree j and sketch dimension m, respectively.

§ In our experiments, TENSORSKETCH and the method by Ahle et al. (2020) show comparable approximation errors, but the latter requires up to 4 times more computation.

Namely, our main idea is to combine (a) a polynomial approximation of a scalar function with (b) the randomized tensor sketch of a matrix. Instead of running Algorithm 1 independently for each j ∈ [r], we utilize the following recursive relation to amortize operations:

    T_U^{(j)} = FFT^{-1}( FFT(C_U) ⊙ FFT(T_U^{(j−1)}) ),

where C_U is the COUNTSKETCH on each row of U whose randomness is drawn independently of that of T_U^{(j−1)}. Since each recursive step can be computed in O(n(d + m log m)) time, computing all T_U^{(j)} for j ∈ [r] requires O(nr(d + m log m)) operations. Hence, the overall complexity of POLY-TENSORSKETCH, formally described in Algorithm 2, is O(n) if r, d, m = O(1).

Observe that the multiplication of Γ (i.e., the output of Algorithm 2) with an arbitrary vector x ∈ R^n can be done in O(nmr) time due to Γx = Σ_{j=0}^r c_j T_U^{(j)} (T_V^{(j)T} x). Hence, for r, m, d = O(1), POLY-TENSORSKETCH can approximate f^⊙(UV^T)x in O(nr(d + m log m)) = O(n) time. As for the error, we prove the following error bound.
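Before stating it, the associativity point above can be seen directly in code. The sketch below is ours; it assumes the degree-j sketch pairs have already been produced (by Algorithm 2, or by the `tensorsketch_pair` helper shown earlier) and applies Γ to a vector by multiplying right-to-left, so no n×n matrix is ever formed.

```python
import numpy as np

def poly_ts_matvec(TU_list, TV_list, c, x):
    """Apply Γ = Σ_{j=0}^r c_j T_U^{(j)} T_V^{(j)T} to x in O(nmr) time.
    TU_list[j-1], TV_list[j-1] hold the degree-j sketches for j = 1, ..., r."""
    n = TU_list[0].shape[0]
    y = c[0] * np.full(n, x.sum())       # degree-0 term: (UV^T)^{⊙0} is the all-ones matrix
    for cj, TU, TV in zip(c[1:], TU_list, TV_list):
        y += cj * (TU @ (TV.T @ x))      # right-to-left: two thin matrix-vector products
    return y
```

For instance, with c_j = 1/j! this approximates exp^⊙(UV^T) x, and no step in the loop touches more than n·m numbers at a time.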

Proposition 2 Given U, V ∈ R^{n×d} and f : R → R, suppose that |f(x) − Σ_{j=0}^r c_j x^j| ≤ ε in a closed interval containing all entries of UV^T, for some ε > 0. Then, it holds that

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2n²ε² + Σ_{j=1}^r 2r c_j² (2 + 3^j) ( Σ_i (Σ_k U_{ik}^2)^j ) ( Σ_i (Σ_k V_{ik}^2)^j ) / m,    (3.2)

where Γ is the output of Algorithm 2.

The proof of Proposition 2 is given in the supplementary material. Note that even when ε is close to 0, the error bound (3.2) may increase exponentially with respect to the degree r. This is because the approximation error of TENSORSKETCH grows exponentially with respect to the degree (see Theorem 1). Therefore, in order to compensate, it is desirable to use exponentially decaying (or even truncated) coefficients {c_j}. However, we are now faced with the challenge of finding exponentially decaying coefficients while not hurting the approximation quality of f. Namely, we want to balance the two approximation components of POLY-TENSORSKETCH: TENSORSKETCH and the polynomial approximation of f. In the following section, we propose a novel approach to find the optimal coefficients minimizing the approximation error of POLY-TENSORSKETCH.

3.2. Optimal Coefficient via Ridge Regression

A natural choice for the coefficients is to utilize a polynomial series such as Taylor, Chebyshev (Mason & Handscomb, 2002) or other orthogonal basis expansions (see Szego (1939)). However, constructing the coefficients in this way focuses on the error of the polynomial approximation of f and ignores the error of TENSORSKETCH, which depends on the decay of the coefficients, as reflected in the error bound (3.2). To address the issue, we study the following optimization problem to find the optimal coefficients:

    min_{c ∈ R^{r+1}}  E[ ‖f^⊙(UV^T) − Γ‖_F^2 ],    (3.3)

where Γ is the output of POLY-TENSORSKETCH and c = [c_0, ..., c_r] is the vector of coefficients. However, it is not easy to solve the above optimization directly, as its objective involves an expectation over random variables with a huge support, i.e., uniform hash and binary sign functions. Instead, we aim for minimizing an upper bound of the approximation error (3.3). To this end, we define the following notation.

Definition 3 Let X ∈ R^{n²×(r+1)} be the matrix¶ whose k-th column corresponds to the vectorization of (UV^T)_{ij}^{k−1} for all i, j ∈ [n], let f ∈ R^{n²} be the vectorization of f^⊙(UV^T), and let W ∈ R^{(r+1)×(r+1)} be a diagonal matrix such that W_{11} = 0 and, for i = 2, ..., r + 1,

    W_{ii} = sqrt( r (2 + 3^i) ( Σ_j (Σ_k U_{jk}^2)^i ) ( Σ_j (Σ_k V_{jk}^2)^i ) / m ).

¶ X is known as the Vandermonde matrix.

Using the above notation, we establish the following error bound.

Lemma 4 Given U, V ∈ R^{n×d} and f : R → R, consider X, f and W defined in Definition 3. Then, it holds that

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2‖Xc − f‖_2^2 + 2‖Wc‖_2^2,    (3.4)

where Γ is the output of Algorithm 2.

The proof of Lemma 4 is given in the supplementary material. Observe that the error bound (3.4) is a quadratic form in c ∈ R^{r+1}, so it is straightforward to obtain a closed-form solution minimizing (3.4):

    c* := argmin_{c ∈ R^{r+1}} ‖Xc − f‖_2^2 + ‖Wc‖_2^2 = ( X^T X + W² )^{-1} X^T f.    (3.5)

This optimization task is also known as generalized ridge regression (Hemmerle, 1975). The solution (3.5) minimizes the regression error (i.e., the error of the polynomial), while it is regularized by W, i.e., W_{ii} is a regularizer of c_i. Namely, if W_{ii} grows exponentially with respect to i, then c*_i may decay exponentially (this compensates the error of TENSORSKETCH with degree i). By substituting c* into the error bound (3.4), we obtain the following multiplicative error bound of POLY-TENSORSKETCH.
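Before stating the bound, here is a small-scale illustration (ours, not the authors' implementation) of the closed form (3.5): it builds the Vandermonde matrix X and the squared diagonal of W following Definition 3 as stated, and solves the generalized ridge regression directly. It materializes all n² entries of UV^T, so it is only meant for small n.

```python
import numpy as np

def optimal_coefficients(U, V, f, r, m):
    """Closed-form c* = (X^T X + W^2)^{-1} X^T f of (3.5), monomial basis, small n only."""
    x = (U @ V.T).reshape(-1)                      # vectorized entries of UV^T (n^2 of them)
    X = np.vander(x, N=r + 1, increasing=True)     # columns 1, x, ..., x^r (Vandermonde)
    u2, v2 = (U ** 2).sum(axis=1), (V ** 2).sum(axis=1)
    W2 = np.zeros(r + 1)                           # squared diagonal of W (Definition 3 as stated)
    for i in range(2, r + 2):                      # W_11 = 0; column i regularizes c_{i-1}
        W2[i - 1] = r * (2 + 3 ** i) * (u2 ** i).sum() * (v2 ** i).sum() / m
    return np.linalg.solve(X.T @ X + np.diag(W2), X.T @ f(x))

# e.g. coefficients balancing the two error sources for f = exp:
# c_star = optimal_coefficients(U, V, np.exp, r=5, m=20)
```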

Theorem 5 Given U, V ∈ R^{n×d} and f : R → R, consider X defined in Definition 3. Then, it holds that

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2 / (1 + mσ²/(rC)) · ‖f^⊙(UV^T)‖_F^2,    (3.6)

where Γ is the output of Algorithm 2 with the coefficients c* in (3.5), σ ≥ 0 is the smallest singular value of X, and C = max( 5‖U‖_F^2 ‖V‖_F^2, (2 + 3^r) ( Σ_j (Σ_k U_{jk}^2)^r ) ( Σ_j (Σ_k V_{jk}^2)^r ) ).

The proof of Theorem 5 is given in the supplementary material. Observe that the error bound (3.6) is bounded by 2‖f^⊙(UV^T)‖_F^2 since m, r, σ, C ≥ 0. On the other hand, we recall that the error bound (3.2), i.e., POLY-TENSORSKETCH without using the optimal coefficients c*, can grow exponentially with respect to r.‖ We indeed observe that c* is empirically superior to the coefficients of the popular Taylor and Chebyshev series expansions with respect to the error of POLY-TENSORSKETCH (see Section 4.1 for more details). We also remark that the error bound (3.6) of POLY-TENSORSKETCH can be even better than the bound (2.1) of TENSORSKETCH, even for the case of a monomial f(x) = x^k. This is primarily because the former is achieved by paying an additional cost for computing the optimal coefficient vector (3.5). In what follows, we discuss this extra cost.

3.3. Reducing Complexity via Coreset Regression

To obtain the optimal coefficients c* in (3.5), one can check that O(r²n² + r³) operations are required because of computing X^T X for X ∈ R^{n²×(r+1)} (see Definition 3). This hurts the overall complexity of POLY-TENSORSKETCH. Instead, we choose a subset of the entries of UV^T and approximately find the coefficients c* based on the selected entries.

More formally, suppose Ū ∈ R^{k×d} is a matrix containing certain k rows of U ∈ R^{n×d}, and we use the following approximation:

    c* = ( X^T X + W² )^{-1} X^T f ≈ ( X̄^T D X̄ + W² )^{-1} X̄^T D f̄,    (3.7)

where X̄, D and f̄ are defined as follows.∗∗

‖ Note that σ is decreasing with respect to r, and the error bound (3.6) may decrease with respect to r.
∗∗ Note that X̄, f̄ are defined similarly to X, f in Definition 3.

Algorithm 3 Greedy k-center
1: Input: u_1, ..., u_n ∈ R^d, number of clusters k
2: a ← uniformly random in {1, 2, ..., n} and S ← {a}
3: Δ_i ← ‖u_i − u_a‖_2 for all i ∈ [n]
4: for i = 2 to k do
5:    a ← argmax_i Δ_i and S ← S ∪ {a}
6:    Δ_i ← min(Δ_i, ‖u_i − u_a‖_2) for all i
7: end for
8: P(i) ← argmin_{j ∈ S} ‖u_i − u_j‖_2 for all i
9: Output: {u_i : i ∈ S}, P

Algorithm 4 Coefficient approximation via coreset
1: Input: U, V ∈ R^{n×d}, a function f, number of clusters k, degree r
2: {ū_i}_{i=1}^k, P_1 ← Algorithm 3 with the rows of U and k
3: {v̄_i}_{i=1}^k, P_2 ← Algorithm 3 with the rows of V and k
4: ε_U ← Σ_i ‖u_i − ū_{P_1(i)}‖_2,  ε_V ← Σ_i ‖v_i − v̄_{P_2(i)}‖_2
5: if ε_U Σ_i ‖v_i‖_2 < ε_V Σ_j ‖u_j‖_2 then
6:    X̄, D and f̄ ← matrices and vector defined in Definition 6 with Ū, V and P_1
7: else
8:    X̄, D and f̄ ← matrices and vector defined in Definition 6 with V̄, U and P_2
9: end if
10: Output: ( X̄^T D X̄ + W² )^{-1} X̄^T D f̄

Definition 6 Let X̄ ∈ R^{kn×(r+1)} be the matrix whose ℓ-th column is the vectorization of (ŪV^T)_{ij}^{ℓ−1} for i ∈ [k], j ∈ [n], and let f̄ ∈ R^{kn} be the vectorization of f^⊙(ŪV^T). Given a mapping P : [n] → [k], let D ∈ R^{kn×kn} be a diagonal matrix where D_{ii} = |{s : P(s) = ⌈i/n⌉}|.

Computing (3.7) requires O(r²kn + r³) = O(n) operations for r, k = O(1). In what follows, we show that if the rows of U are close to those of Ū under the mapping P, then the approximation (3.7) becomes tighter. To this end, we say Ū is an ε-coreset of U for some ε > 0 if there exists a mapping P : [n] → [k] such that Σ_{i=1}^n ‖u_i − ū_{P(i)}‖_2 ≤ ε, where u_i, ū_i are the i-th rows of U, Ū, respectively. Using this notation, we now present the following error bound of POLY-TENSORSKETCH under the approximated coefficient vector in (3.7).

Theorem 7 Given U, V ∈ R^{n×d} and f : R → R, consider an ε-coreset Ū ∈ R^{k×d} of U with X̄, D, f̄ and P defined in Definition 6. For the approximated coefficient vector c̄ in (3.7), assume that (f(x) − Σ_{j=0}^r c̄_j x^j)² is an L-Lipschitz function.†† Then, it holds that

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2 / (1 + mσ²/(rC)) · ‖f̄‖_2^2 + 2εL ( Σ_{i=1}^n ‖v_i‖_2 ),    (3.8)

where Γ is the output of Algorithm 2 with c̄, v_i is the i-th row of V, σ is the smallest singular value of D^{1/2} X̄ and C := max( 5‖U‖_F^2 ‖V‖_F^2, (2 + 3^r) ( Σ_j (Σ_k U_{jk}^2)^r ) ( Σ_j (Σ_k V_{jk}^2)^r ) ).

†† For some constant L > 0, a function g is called L-Lipschitz if |g(x) − g(y)| ≤ L|x − y| for all x, y.

The proof of Theorem 7 is given in the supplementary material. Observe that when ε is small, the error bound (3.8) becomes closer to that under the optimal coefficients (3.6). To find a coreset with small ε, we use the greedy k-center algorithm (Har-Peled, 2008) described in Algorithm 3. We remark that the greedy k-center runs in O(ndk) time and only marginally increases the overall complexity of POLY-TENSORSKETCH in our experiments. Moreover, we indeed observe that various real-world datasets used in our experiments are well-clustered, which leads to a small ε (see the supplementary material for details).

Algorithm 4 summarizes the proposed scheme for computing the approximated coefficient vector c̄ in (3.7) with the greedy k-center algorithm. Here, we run the greedy k-center on the rows of both U and V, and choose the one with the smaller value of the second term in (3.8), i.e., ε_U Σ_i ‖v_i‖_2 or ε_V Σ_i ‖u_i‖_2 (see also lines 4-8 in Algorithm 4). We remark that Algorithm 4 requires O(nk(d + r²) + r³) operations; hence, applying it to POLY-TENSORSKETCH results in O(nk(d + r²) + r³) + O(nr(d + m log m)) time in total. If one chooses m, k, r = O(1), the overall running time becomes O(nd), which is linear in the size of the input matrix. For example, we choose r, k, m = 10 in our experiments.
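A compact sketch of the coreset-based regression follows. It is ours, not the reference implementation: it mirrors Algorithm 3 and the regression step of Algorithm 4 on the U side only, omitting the U-versus-V selection of lines 4-8, and the regularizer follows Definition 3 as stated.

```python
import numpy as np

def greedy_k_center(X, k, rng):
    """Algorithm 3: greedily pick k center rows of X; return center indices and assignment P."""
    n = X.shape[0]
    centers = [int(rng.integers(n))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        a = int(np.argmax(dist))                          # farthest point becomes a new center
        centers.append(a)
        dist = np.minimum(dist, np.linalg.norm(X - X[a], axis=1))
    P = np.argmin(np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2), axis=1)
    return np.array(centers), P

def coreset_coefficients(U, V, f, r, m, k, rng):
    """Approximate coefficients via (3.7): regress on the k*n entries of Ū V^T only."""
    centers, P = greedy_k_center(U, k, rng)
    xbar = (U[centers] @ V.T).reshape(-1)                 # k*n entries instead of n^2
    Xbar = np.vander(xbar, N=r + 1, increasing=True)
    counts = np.bincount(P, minlength=k)                  # cluster sizes |{s : P(s) = l}|
    D = np.repeat(counts, V.shape[0]).astype(float)       # diagonal of D (Definition 6)
    u2, v2 = (U ** 2).sum(axis=1), (V ** 2).sum(axis=1)
    W2 = np.zeros(r + 1)
    for i in range(2, r + 2):                             # same regularizer as Definition 3
        W2[i - 1] = r * (2 + 3 ** i) * (u2 ** i).sum() * (v2 ** i).sum() / m
    A = Xbar.T @ (D[:, None] * Xbar) + np.diag(W2)
    return np.linalg.solve(A, Xbar.T @ (D * f(xbar)))
```

Only k·n products ū_i^T v_j are ever formed, which is where the O(r²kn) cost quoted above comes from.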

Chebyshev polynomial regression for avoiding a numerical issue. Recall that X contains (UV^T)_{ij}^r (see Definition 3). If entries of UV^T are greater (or smaller) than 1 and the degree r is large, X can have huge (or negligible) values. This can cause a numerical issue when computing the optimal coefficients (3.5) (or (3.7)) using X. To alleviate the issue, we suggest constructing a matrix X′ ∈ R^{n²×(r+1)} whose entries are the outputs of the Chebyshev polynomials: (X′)_{ij} = t_j((UV^T)_{kℓ}), where k, ℓ ∈ [n] correspond to the index i ∈ [n²], j ∈ [r+1], and t_j(x) is the Chebyshev polynomial of degree j (Mason & Handscomb, 2002). Now, the value of t_j(x) is always in [−1, 1] and does not monotonically increase or decrease with respect to the degree j. Then, we find the optimal coefficients c′ ∈ R^{r+1} based on Chebyshev polynomials as follows:

    c′ = ( X′^T X′ + R^T W^T W R )^{-1} X′^T f,    (3.9)

where R ∈ R^{(r+1)×(r+1)} satisfies c* = Rc′ and can be easily computed. Specifically, it converts c′ ∈ R^{r+1} into the coefficients based on monomials, i.e., Σ_{j=0}^r c*_j x^j = Σ_{j=0}^r c′_j t_j(x). We finally remark that, to define the t_j, one needs to find a closed interval containing all (UV^T)_{ij}. To this end, we use the interval [−a, a] where a = (max_i ‖u_i‖_2)(max_j ‖v_j‖_2) and u_i is the i-th row of the matrix U. Computing a takes O(nd) time and contributes marginally to the overall complexity of POLY-TENSORSKETCH.
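A sketch of the Chebyshev-basis variant is given below. It is ours and makes two simplifications relative to (3.9): it regularizes the Chebyshev coefficients directly rather than through the R^T W^T W R term, and it uses numpy.polynomial.chebyshev, whose cheb2poly routine plays the role of the basis-change matrix R; the rescaling by a maps [−a, a] to [−1, 1].

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def chebyshev_coefficients(U, V, f, r, m):
    """Regress f on the entries of UV^T in the Chebyshev basis on [-a, a], then return
    monomial coefficients c* with Σ_j c*_j x^j = Σ_j c'_j t_j(x) (cf. (3.9)); small n only."""
    a = np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max()
    x = (U @ V.T).reshape(-1)
    z = x / a                                      # all entries of UV^T lie in [-a, a]
    Xc = cheb.chebvander(z, r)                     # columns t_0(z), ..., t_r(z), values in [-1, 1]
    u2, v2 = (U ** 2).sum(axis=1), (V ** 2).sum(axis=1)
    W2 = np.zeros(r + 1)
    for i in range(2, r + 2):
        W2[i - 1] = r * (2 + 3 ** i) * (u2 ** i).sum() * (v2 ** i).sum() / m
    # simplification: penalize the Chebyshev coefficients directly instead of R^T W^T W R
    c_cheb = np.linalg.solve(Xc.T @ Xc + np.diag(W2), Xc.T @ f(x))
    b = cheb.cheb2poly(c_cheb)                     # monomial coefficients in z = x / a
    return b / a ** np.arange(r + 1)               # monomial coefficients in x
```

The returned monomial coefficients are then used exactly as c_0, ..., c_r in Algorithm 2.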

4. Experiments

In this section, we report the empirical results of POLY-TENSORSKETCH for element-wise matrix functions under various machine learning applications.‡‡ All results are reported by averaging over 100 and 10 independent trials for the experiments in Section 4.1 and those in other sections, respectively. Our implementation and experiments are available at https://github.com/insuhan/polytensorsketch.

4.1. RBF Kernel Approximation

Given U = [u_1, ..., u_n]^T ∈ R^{n×d}, the RBF kernel K = [K_{ij}] is defined as K_{ij} := exp(−‖u_i − u_j‖_2^2 / γ) for i, j ∈ [n] and γ > 0. It can be represented using the element-wise matrix exponential function:

    K = Z exp^⊙(2UU^T/γ) Z,    (4.1)

where Z is the n-by-n diagonal matrix with Z_{ii} = exp(−‖u_i‖_2^2 / γ). One can approximate the element-wise matrix function exp^⊙(2UU^T/γ) ≈ Γ, where Γ is the output of POLY-TENSORSKETCH with f(x) = exp(2x/γ), so that K ≈ ZΓZ. The RBF kernel has been used in many applications including classification (Pennington et al., 2015), covariance estimation (Wang et al., 2015), Gaussian processes (Rasmussen, 2003) and determinantal point processes (Affandi et al., 2014), which commonly require multiplications between the kernel matrix and vectors.
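The identity (4.1) is easy to verify numerically. The short sketch below (ours) compares the direct RBF kernel with the factored form and indicates where the POLY-TENSORSKETCH output Γ would be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 200, 20, 2.0
U = rng.normal(size=(n, d)) / np.sqrt(d)

sq = (U ** 2).sum(axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2 * U @ U.T) / gamma)   # K_ij = exp(-||u_i-u_j||^2/γ)

z = np.exp(-sq / gamma)                    # diagonal of Z
G = np.exp(2 * (U @ U.T) / gamma)          # exp^⊙(2UU^T/γ)
K_factored = z[:, None] * G * z[None, :]   # Z exp^⊙(2UU^T/γ) Z, i.e. identity (4.1)
print(np.abs(K_exact - K_factored).max())  # agrees up to floating-point round-off

# In POLY-TENSORSKETCH, G is replaced by Γ = Σ_j c_j T_U^{(j)} T_U^{(j)T}, and a product
# K x is then approximated as z * (Γ @ (z * x)) without ever forming an n×n matrix.
```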

For synthetic kernels, we generate random matrices U ∈ R^{1,000×50} whose entries are drawn from the normal distribution N(0, 1/50). For real-world kernels, we use the segment and usps datasets. We report the average relative error over the n² entries of K under varying parameters, i.e., sketch dimension m and polynomial degree r. We choose m = 10, r = 10 and k = 10 as the default configuration.

‡‡ The datasets used in Sections 4.1 and 4.2 are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and http://archive.ics.uci.edu/.


[Figure 1: panels of relative kernel approximation error (log scale) versus sketch dimension m, polynomial degree r, scaling factor and matrix dimension n for the synthetic dataset, and versus m and r for the segment and usps datasets, comparing RFF, Taylor-TS, Chebyshev-TS and Coreset-TS.]

    Figure 1: Kernel approximation under (a) synthetic, (b) segment and (c) usps dataset. We set sketch dimension m = 10,polynomial degree r = 10 and coreset size k = 10 unless stated otherwise. All test methods have comparable running times.

Table 1: Classification error, kernel approximation error and speedup for classification (training time) with kernel SVM under various real-world datasets.

Dataset  |      n |   d | Classification error (%): Exact / RFF / Coreset-TS | Kernel approximation error: RFF / Coreset-TS | Speedup: Coreset-TS
segment  |  2,310 |  19 | 3.20 / 3.28 / 3.22    | 1.7×10^-3 / 3.21×10^-4   | 8.26
satimage |  4,335 |  36 | 23.54 / 24.11 / 23.74 | 6.5×10^-3 / 3.54×10^-3   | 10.06
anuran   |  7,195 |  22 | 37.26 / 40.33 / 37.36 | 6.1×10^-3 / 3.1×10^-4    | 49.45
usps     |  7,291 | 256 | 2.11 / 6.50 / 5.36    | 1.2×10^-3 / 1.61×10^-4   | 27.15
grid     | 10,000 |  12 | 3.66 / 18.27 / 15.99  | 2.34×10^-3 / 1.02×10^-3  | 4.23
mapping  | 10,845 |  28 | 15.04 / 18.57 / 17.35 | 6.09×10^-3 / 2.84×10^-4  | 4.17
letter   | 20,000 |  16 | 2.62 / 11.38 / 10.30  | 3.06×10^3 / 1.29         | 40.60

We compare our algorithm, denoted by Coreset-TS (i.e., POLY-TENSORSKETCH using the coefficients computed as described in Section 3.3), with random Fourier features (RFF) (Rahimi & Recht, 2008) of the same computational complexity, i.e., its embedding dimension is chosen to be mr so that the rank of the approximated kernels is the same and the running times are comparable. We also benchmark POLY-TENSORSKETCH using coefficients from the Taylor and Chebyshev series expansions, denoted by Taylor-TS and Chebyshev-TS, respectively. As reported in Figure 1, we observe that Coreset-TS consistently outperforms the competitors under all tested settings and datasets, and its error is often orders of magnitude smaller than that of RFF. In particular, observe that the error of Coreset-TS tends to decrease with respect to the polynomial degree r, which is not the case for Taylor-TS and Chebyshev-TS (suboptimal versions of our algorithm). We additionally run POLY-TENSORSKETCH using the optimal coefficients and compare it with Coreset-TS in the supplementary material.

[Figure 2: plots of per-iteration speedup over the exact Sinkhorn algorithm and of the ratio of the approximate to the exact Sinkhorn objective, both against the matrix dimension n, for RFF, Coreset-TS and Nyström; panel (c) shows the source/target images.]

Figure 2: Speedup per iteration (a) and the ratio to the Sinkhorn objective (b) of our algorithm, RFF and the Nyström approximation, averaged over the 3 tested images in (c). The Sinkhorn algorithm transforms the color information of the source images to the targets.

4.2. Classification with Kernel SVM

Next, we apply our algorithm (Coreset-TS) to classification tasks using the support vector machine (SVM) based on the RBF kernel. Given input data U ∈ R^{n×d}, our algorithm can find a feature matrix T_U ∈ R^{n×m′} such that K ≈ T_U T_U^T, where K is the RBF kernel of U (see Section 4.1). One can expect that a linear SVM using T_U shows a similar performance compared to the kernel SVM using K. However, for m′ ≪ n, the complexity of a linear SVM is much cheaper than that of the kernel method, both for training and testing. In order to utilize our algorithm, we construct

    T_U = [ √c_0 T_U^{(0)}  ...  √c_r T_U^{(r)} ],

where T_U^{(0)}, ..., T_U^{(r)} are the TENSORSKETCHes of U in Algorithm 2. Here, the coefficients c_j should be positive, and one can compute the optimal coefficients (3.5) under an additional non-negativity constraint, which is solvable using simple quadratic programming with marginal additional cost.
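A sketch (ours) of this explicit feature map: the degree-j sketches are assumed to come from Algorithm 2 (or the `tensorsketch_pair` helper shown earlier), and the c_j are assumed non-negative; stacking the scaled sketches makes the Gram matrix of the features reproduce Γ with V = U.

```python
import numpy as np

def svm_feature(TU_list, c):
    """Stack [sqrt(c_0)*1, sqrt(c_1) T_U^{(1)}, ..., sqrt(c_r) T_U^{(r)}] so that the
    Gram matrix of the rows equals Σ_j c_j T_U^{(j)} T_U^{(j)T}, i.e. the POLY-TENSORSKETCH Γ."""
    n = TU_list[0].shape[0]
    blocks = [np.sqrt(c[0]) * np.ones((n, 1))]        # degree-0 sketch is the all-ones column
    blocks += [np.sqrt(cj) * TU for cj, TU in zip(c[1:], TU_list)]
    return np.hstack(blocks)                          # n x (1 + r*m) feature matrix

# after multiplying each row i by Z_ii = exp(-||u_i||^2 / γ), the rows can be fed to a
# linear SVM (e.g. LIBSVM's linear solver or scikit-learn's LinearSVC) in place of K.
```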

We use the open-source SVM package LIBSVM (Chang & Lin, 2011) and compare our algorithm with the kernel SVM using the exact RBF kernel and with the linear SVM using the embedded feature from RFF with the same running time as ours. We run all experiments with 10 cross-validations and report the average classification error on the validation dataset. We set m = 20 for the dimension of the sketches and r = 3 for the degree of the polynomial. Table 1 summarizes the classification errors, kernel approximation errors and speedups of our algorithm under various real-world datasets. Ours (Coreset-TS) shows better classification errors compared to RFF of comparable running time and runs up to 50 times faster than the exact method. We also observe that a larger m can improve the performance (see the supplementary material).

4.3. Sinkhorn Algorithm for Optimal Transport

We apply Coreset-TS to the Sinkhorn algorithm for computing the optimal transport distance (Cuturi, 2013). The algorithm uses entropic regularization to approximate the distance between two discrete probability densities. It is a fixed-point algorithm in which each iteration requires the multiplication of an element-wise matrix exponential function with a vector. More formally, given a, b ∈ R^n and two sets of data points of the same dimension, {x_i}_{i=1}^n and {y_j}_{j=1}^n, the algorithm iteratively computes u = a ⊘ (exp^⊙(−γD) v) and v = b ⊘ (exp^⊙(−γD)^T u) for some initial v, where D_{ij} := ‖x_i − y_j‖_2^2 and ⊘ denotes element-wise division. Hence, the computation can be efficiently approximated using our algorithm or RFF. We also evaluate the Nyström approximation (Williams & Seeger, 2001), which was recently used for developing a scalable Sinkhorn algorithm (Altschuler et al., 2019). We provide all the details in the supplementary material.
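A sketch (ours; names hypothetical) of the Sinkhorn fixed-point iteration in which the kernel exp^⊙(−γD) is only touched through matrix-vector products, so the exact products can be swapped for the O(n) Coreset-TS (or RFF / Nyström) surrogates.

```python
import numpy as np

def sinkhorn(a, b, apply_K, apply_KT, n_iter=10):
    """Sinkhorn iteration u = a ⊘ (K v), v = b ⊘ (K^T u), where K = exp^⊙(-γD) is accessed
    only through the black-box products apply_K and apply_KT (exact or sketched)."""
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / apply_K(v)
        v = b / apply_KT(u)
    return u, v

# exact products, for reference; in the paper these are replaced by the low-rank approximation
# of the element-wise exponential of 2γ X Y^T, wrapped by the diagonal factors
# exp(-γ||x_i||^2) and exp(-γ||y_j||^2).
def make_exact_K(X, Y, gamma):
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * D)
    return (lambda w: K @ w), (lambda w: K.T @ w)
```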

Given a pair of source and target images as shown in Figure 2 (c),∗ we randomly sample {x_i}_{i=1}^n from the RGB pixels of the source image and {y_j}_{j=1}^n from those of the target image. We set m = 20, d = 3, r = 3 and γ = 1. Figure 2 (a) reports the speedup per iteration of the tested approximation algorithms over the exact computation, and Figure 2 (b) shows the ratio of the objective value of the approximated Sinkhorn algorithm to the exact one after 10 iterations. As reported in Figure 2 (a), all algorithms run orders of magnitude faster than the exact Sinkhorn algorithm. Furthermore, the approximation ratio of ours is much more stable, while both RFF and Nyström have huge variances without any tendency in the dimension n.

∗ Images from https://github.com/rflamary/POT.

Table 2: Test errors on CIFAR100 for the linearization of 2 fully-connected layers of AlexNet, i.e., W_2 sigmoid^⊙(W_1 x) ≈ (W_2 T_{W_1}) T_x, where both the input and hidden dimensions are d = d_h = 1,024 and the degree is r = 3.

Model                | Complexity        |  m  | Error (%)
Wx                   | O(d)              |  −  | 45.67
(W_2 T_{W_1}) T_x    | O(r(d + m log m)) | 100 | 53.09
                     |                   | 200 | 47.18
                     |                   | 400 | 43.90
                     |                   | 800 | 41.73
W_2 sigmoid^⊙(W_1 x) | O(d_h d)          |  −  | 39.02

4.4. Linearization of Neural Networks

Finally, we demonstrate that our method has the potential to produce a low-complexity model by linearizing a neural network. To this end, we consider 2 fully-connected layers of AlexNet (Krizhevsky et al., 2012), where each has 1,024 hidden nodes, trained on the CIFAR100 dataset (Krizhevsky et al., 2009). Formally, given an input x ∈ R^{1,024}, its predicted class corresponds to argmax_i [W_2 sigmoid^⊙(W_1 x)]_i, where W_1 ∈ R^{1,024×1,024}, W_2 ∈ R^{1,024×100} are model parameters. We first train the model for 300 epochs using the ADAM optimizer (Kingma & Ba, 2015) with a 0.0005 learning rate. Then, we approximate sigmoid^⊙(W_1 x) ≈ T_{W_1} T_x for T_{W_1} ∈ R^{800×mr}, T_x ∈ R^{mr} using Coreset-TS. After that, we fine-tune the parameters T_{W_1}, W_2 for 200 epochs and evaluate the final test error as reported in Table 2. We choose r = 3 and explore various m ∈ {100, 200, 400, 800}. Observe that the obtained model (W_2 T_{W_1}) T_x is linear with respect to T_x and its complexity (i.e., inference time) is much smaller than that of the original neural network. Moreover, by choosing the sketch dimension m appropriately, its performance is better than that of the vanilla linear model Wx, as reported in Table 2. These results suggest a broad applicability of our generic approximation scheme, and more exploration along this line would be an interesting direction for the future.

5. Conclusion

In this paper, we design a fast algorithm for sketching element-wise matrix functions. Our method is based on combining (a) a polynomial approximation with (b) the randomized matrix tensor sketch. Our main novelty is in finding the optimal polynomial coefficients that minimize the overall approximation error bound by balancing the errors of (a) and (b). We expect that this generic scheme will enjoy broader usage in the future.

Acknowledgements

IH and JS were partially supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921). HA was partially supported by BSF grant 2017698 and ISF grant 1272/17.

References

Affandi, R. H., Fox, E., Adams, R., and Taskar, B. Learning the parameters of determinantal point process kernels. In International Conference on Machine Learning (ICML), 2014.

Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A., Woodruff, D. P., and Zandieh, A. Oblivious sketching of high-degree polynomial kernels. In Symposium on Discrete Algorithms (SODA), 2020.

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems (NIPS), 2017.

Altschuler, J., Bach, F., Rudi, A., and Niles-Weed, J. Massively scalable Sinkhorn distances via the Nyström method. In Advances in Neural Information Processing Systems (NIPS), 2019.

Avron, H., Nguyen, H., and Woodruff, D. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems (NIPS), 2014.

Avron, H., Clarkson, K. L., and Woodruff, D. P. Sharper bounds for regularized data fitting. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2017.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. Transactions on Intelligent Systems and Technology (TIST), 2011.

Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages and Programming (ICALP), 2002.

Cristianini, N., Shawe-Taylor, J., et al. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NIPS), 2013.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Han, I., Malioutov, D., Avron, H., and Shin, J. Approximating the spectral sums of large-scale matrices using Chebyshev approximations. SIAM Journal on Scientific Computing, 2017.

Har-Peled, S. Geometric Approximation Algorithms. Lecture Notes for CS598, UIUC, 2008.

Hemmerle, W. J. An explicit solution for generalized ridge regression. Technometrics, 1975.

Higham, N. J. Functions of Matrices: Theory and Computation. SIAM, 2008.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.

Krizhevsky, A. et al. Learning multiple layers of features from tiny images. 2009.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 2008.

Mason, J. C. and Handscomb, D. C. Chebyshev Polynomials. Chapman and Hall/CRC, 2002.

Meister, M., Sarlos, T., and Woodruff, D. Tight dimensionality reduction for sketching low degree polynomial kernels. In Advances in Neural Information Processing Systems (NIPS), 2019.

Pennington, J., Yu, F. X. X., and Kumar, S. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems (NIPS), 2015.

Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. In Conference on Knowledge Discovery and Data Mining (KDD), 2013.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2008.

Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning, Springer, 2003.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Smola, A. J. and Schölkopf, B. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning (ICML), 2000.

Szego, G. Orthogonal Polynomials. American Mathematical Society, 1939.

Ubaru, S., Chen, J., and Saad, Y. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 2017.

Vazirani, V. V. Approximation Algorithms. Springer Science & Business Media, 2013.

Wang, L., Zhang, J., Zhou, L., Tang, C., and Li, W. Beyond covariance: Feature representation with nonlinear kernel matrices. In International Conference on Computer Vision (ICCV), 2015.

Weinberger, K., Dasgupta, A., Attenberg, J., Langford, J., and Smola, A. Feature hashing for large scale multitask learning. In International Conference on Machine Learning (ICML), 2009.

Williams, C. K. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.


Supplementary Material: Polynomial Tensor Sketch for Element-wise Function of Low-Rank Matrix

A. Greedy k-center Algorithm

We recall that the greedy k-center algorithm runs in O(ndk) time given u_1, ..., u_n ∈ R^d, since it computes the distances between the inputs and the selected coreset k times. The greedy k-center algorithm was originally studied as a 2-approximation algorithm for the k-center problem (Vazirani, 2013). Beyond that problem, it has been reported to be useful in many clustering applications. Due to its simple yet efficient clustering performance, we adopt it to approximate the optimal coefficients (3.5) (see Section 3.3).

B. t-SNE Plots of Real-world Datasets

The error analysis in Theorem 7 depends on an upper bound of the distortion of the input matrix, i.e., the ε of the ε-coreset (see Section 3.3). We recall that a smaller ε gives a tighter error bound (Theorem 7), and this indeed occurs in real-world settings. To verify this, we empirically investigate the t-SNE plots (Maaten & Hinton, 2008) of real-world datasets including segment, satimage and usps in Figure 3. Observe that the data points in t-SNE are well-clustered, so the distortion ε is expected to be small. This justifies utilizing our coreset-based coefficient construction (Algorithms 3 and 4) for real-world problems.


Figure 3: t-SNE plots of the (a) segment, (b) satimage and (c) usps datasets. Different colors represent distinct clusters.

C. Scalable Sinkhorn Algorithm via Nyström Method

Recently, Altschuler et al. (2019) proposed a scalable Sinkhorn algorithm based on the Nyström approximation. Motivated by their work, we compare our Coreset-TS to the Nyström method as a fast Sinkhorn approximation. In a nutshell, the Nyström method randomly samples a subset S ⊆ [n] with |S| = s and approximates the kernel matrix K ∈ R^{n×n} as

    K ≈ K̃ := K_{[n],S} (K_{S,S})^† K_{S,[n]},

where K_{A,B} is defined as the sub-matrix of K whose rows and columns are indexed by A, B ⊆ [n], respectively, and K^† is the pseudo-inverse of a matrix K. Observe that, given K_{[n],S} and (K_{S,S})^†, multiplications of K̃ with vectors can be computed efficiently in O(ns) time for |S| = s. It is easy to check that the approximation takes O(nds + ns² + s³) time; recall that ours takes O(nk(d + r²) + r³ + nr(d + m log m)). Thus, for comparable complexity, we choose s = ⌈max(√(rm), r)⌉ for the Nyström method and report the results in Section 4.3.
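A minimal sketch (ours) of the Nyström matrix-vector product described above: sample s landmark rows, precompute the n×s block and the pseudo-inverse of K_{S,S}, and apply K̃ to a vector with three thin products.

```python
import numpy as np

def nystrom_matvec(kernel_fn, X, s, rng):
    """Return a function w -> K̃ w with K̃ = K_{[n],S} (K_{S,S})^† K_{S,[n]}, |S| = s."""
    n = X.shape[0]
    S = rng.choice(n, size=s, replace=False)
    K_nS = kernel_fn(X, X[S])                             # n x s block of the kernel matrix
    K_SS_pinv = np.linalg.pinv(kernel_fn(X[S], X[S]))     # s x s pseudo-inverse
    return lambda w: K_nS @ (K_SS_pinv @ (K_nS.T @ w))    # O(ns) per product

# e.g. for the Sinkhorn kernel:
# kernel_fn = lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
```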


D. Comparison between Coreset-based Estimation and Optimal Coefficient

The optimal coefficients c* (3.5) can provide a smaller approximation error than the coefficients c̄ computed using the coreset (3.7), and a larger k (number of clusters) guarantees a smaller gap ‖c* − c̄‖_2. We additionally evaluate the coefficient gap and the corresponding kernel approximation errors on the segment data, varying k from 5 to 30, as reported in Table 3. Observe that a larger k reduces not only the coefficient gap but also the kernel error of the coreset. We use k = 10 for all reported experiments in our paper since the improvement in the kernel error for k > 10 tends to be marginal.

Table 3: Kernel approximation error of the optimal coefficients versus the coreset-based estimation under the segment data. The optimal coefficients achieve a kernel approximation error of 5.17×10^-4.

k  | Coefficient gap | Kernel approximation error by coreset
5  | 3.28×10^-1      | 5.52×10^-4
10 | 1.72×10^-1      | 5.39×10^-4
15 | 1.31×10^-1      | 5.33×10^-4
20 | 1.15×10^-1      | 5.34×10^-4
25 | 9.07×10^-2      | 5.28×10^-4
30 | 8.01×10^-2      | 5.23×10^-4

E. Classification with Kernel SVM using Other Parameters

In Section 4.2, we use the sketch dimension m = 20, but a larger m can reduce the classification error of both random Fourier features (RFF) and Coreset-TS, while it increases both time and memory complexity. We additionally conduct experiments with m = 50 on the last 4 datasets in Table 1 and report the results in Table 4.

Table 4: Classification error (%) of kernel SVM with m = 20 and m = 50 under 4 real-world datasets. A larger m yields better classification errors for both RFF and Coreset-TS.

Dataset  | RFF (m = 20) | RFF (m = 50) | Coreset-TS (m = 20) | Coreset-TS (m = 50)
usps     | 6.50         | 3.55         | 5.36                | 3.33
grid     | 18.27        | 11.78        | 15.99               | 11.17
mapping  | 18.57        | 17.64        | 17.35               | 17.11
letter   | 11.38        | 5.31         | 10.30               | 5.13

F. Proofs

F.1. Proof of Proposition 2

Proof. Let p_r(x) = Σ_{j=0}^r c_j x^j and [p_r^⊙(A)]_{ij} = p_r(A_{ij}). Split the approximation error into the error from (1) the polynomial approximation and (2) the tensor sketch:

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ E[ 2‖f^⊙(UV^T) − p_r^⊙(UV^T)‖_F^2 + 2‖p_r^⊙(UV^T) − Γ‖_F^2 ]
                              = 2‖f^⊙(UV^T) − p_r^⊙(UV^T)‖_F^2 + 2 E[ ‖p_r^⊙(UV^T) − Γ‖_F^2 ],

where the inequality comes from (a + b)² ≤ 2(a² + b²) for a, b ∈ R. The first term is straightforward from the assumption:

    ‖f^⊙(UV^T) − p_r^⊙(UV^T)‖_F^2 = Σ_{i,j} ( f((UV^T)_{ij}) − p_r((UV^T)_{ij}) )² ≤ n²ε².

For the second term, we use Theorem 1 to obtain

    E[ ‖p_r^⊙(UV^T) − Γ‖_F^2 ] ≤ r Σ_{j=1}^r c_j² E[ ‖(UV^T)^{⊙j} − T_U^{(j)} T_V^{(j)T}‖_F^2 ]
                               ≤ r Σ_{j=1}^r c_j² (2 + 3^j) ( Σ_i (Σ_k U_{ik}^2)^j ) ( Σ_i (Σ_k V_{ik}^2)^j ) / m.

Putting everything together, we conclude the result. This completes the proof of Proposition 2.

F.2. Proof of Lemma 4

Proof. Recall that Γ = Σ_{j=0}^r c_j T_U^{(j)} T_V^{(j)T}. Similarly to the proof of Proposition 2, we have

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2‖f^⊙(UV^T) − p_r^⊙(UV^T)‖_F^2 + 2 E[ ‖p_r^⊙(UV^T) − Γ‖_F^2 ],

where the inequality comes from (a + b)² ≤ 2(a² + b²). The first term can be written as

    ‖f^⊙(UV^T) − p_r^⊙(UV^T)‖_F^2 = Σ_{i,j} ( f((UV^T)_{ij}) − p_r((UV^T)_{ij}) )² = ‖Xc − f‖_2^2,

where we recall Definition 3, that is,

    X := [ 1  [UV^T]_{11}  [UV^T]_{11}^2  ...  [UV^T]_{11}^r
           1  [UV^T]_{12}  [UV^T]_{12}^2  ...  [UV^T]_{12}^r
           ⋮       ⋮             ⋮        ⋱        ⋮
           1  [UV^T]_{nn}  [UV^T]_{nn}^2  ...  [UV^T]_{nn}^r ]  ∈ R^{n²×(r+1)}

(also known as the Vandermonde matrix) and f ∈ R^{n²} is the vectorization of f((UV^T)_{ij}). For the second term, we use Theorem 1 to obtain

    E[ ‖p_r^⊙(UV^T) − Γ‖_F^2 ] ≤ r Σ_{j=1}^r c_j² E[ ‖(UV^T)^{⊙j} − T_U^{(j)} T_V^{(j)T}‖_F^2 ]
                               ≤ r Σ_{j=1}^r c_j² (2 + 3^j) ( Σ_i (Σ_k U_{ik}^2)^j ) ( Σ_i (Σ_k V_{ik}^2)^j ) / m = ‖Wc‖_2^2,

where W ∈ R^{(r+1)×(r+1)} is the diagonal matrix with (see Definition 3)

    W_{ii} = sqrt( r (2 + 3^i) ( Σ_j (Σ_k U_{jk}^2)^i ) ( Σ_j (Σ_k V_{jk}^2)^i ) / m )  for i = 2, ..., r + 1,  and  W_{11} = 0.

Putting everything together, we obtain the result, that is,

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2( ‖Xc − f‖_2^2 + ‖Wc‖_2^2 ).

This completes the proof of Lemma 4.

F.3. Proof of Theorem 5

Proof. Denote

    g(c) := ‖Xc − f‖_2^2 + ‖Wc‖_2^2 = c^T (X^T X + W^T W) c − 2 f^T X c + f^T f

for c ∈ R^{r+1}. Substituting c* = (X^T X + W^T W)^{-1} X^T f into the above, we have

    g(c*) = f^T ( I − X (X^T X + W^T W)^{-1} X^T ) f ≤ ‖I − X (X^T X + W^T W)^{-1} X^T‖_2 ‖f‖_2^2.    (F.1)

Since X^T X is positive semi-definite, it has an eigendecomposition X^T X = V Σ V^T, and we can write X = Σ^{1/2} V^T. Then, we have

    X (X^T X + W^T W)^{-1} X^T = Σ^{1/2} V^T ( V Σ V^T + W^T W )^{-1} V Σ^{1/2}
                               = ( I + Σ^{-1/2} V^T W^T W V Σ^{-1/2} )^{-1}.

For simplicity, let M := Σ^{-1/2} V^T W^T W V Σ^{-1/2}, and observe that M is symmetric and positive semi-definite because W is diagonal with W_{ii} ≥ 0 for all i (see Definition 3). Then, we have

    ‖I − (I + M)^{-1}‖_2 = ‖M (I + M)^{-1}‖_2 = ‖M‖_2 / (1 + ‖M‖_2) = ( 1 + ‖M‖_2^{-1} )^{-1},    (F.2)

since x/(1 + x) is an increasing function. From the submultiplicativity of ‖·‖_2, we have

    ‖M‖_2 = ‖Σ^{-1/2} V^T W^T W V Σ^{-1/2}‖_2 ≤ ‖Σ^{-1/2}‖_2 ‖V^T‖_2 ‖W^T W‖_2 ‖V‖_2 ‖Σ^{-1/2}‖_2 = ‖Σ^{-1}‖_2 ‖W^T W‖_2,    (F.3)

where the last equality follows from ‖V‖_2 = 1. We recall that

    W_{ii} = sqrt( r (2 + 3^i) ( Σ_j (Σ_k U_{jk}^2)^i ) ( Σ_j (Σ_k V_{jk}^2)^i ) / m )

for i = 2, ..., r + 1 and W_{11} = 0. Therefore,

    ‖W^T W‖_2 = max_i W_{ii}^2 = (r/m) max( 5‖U‖_F^2 ‖V‖_F^2, (2 + 3^r) ( Σ_j (Σ_k U_{jk}^2)^r ) ( Σ_j (Σ_k V_{jk}^2)^r ) ) = rC/m,

where the maximum defines C. Recall that σ ≥ 0 denotes the smallest singular value of X, so that ‖Σ^{-1}‖_2 ≤ 1/σ². Substituting the above bounds on ‖W^T W‖_2 and ‖Σ^{-1}‖_2 into (F.3), we have ‖M‖_2 ≤ rC/(mσ²), and putting this bound and (F.2) into (F.1), we have

    g(c*) ≤ ( 1 + mσ²/(rC) )^{-1} ‖f‖_2^2 = ( 1 + mσ²/(rC) )^{-1} ‖f^⊙(UV^T)‖_F^2.

Combining this with Lemma 4, we obtain the proposed bound. This completes the proof of Theorem 5.

F.4. Proof of Theorem 7

Proof. Recall that P : [n] → [k] is a mapping that satisfies

    Σ_{i=1}^n ‖u_i − ū_{P(i)}‖_2 ≤ ε

for some ε > 0, where u_i, ū_i are the i-th rows of U, Ū, respectively. Let r(x) := f(x) − Σ_{i=0}^r c̄_i x^i. Then,

    | ‖Xc̄ − f‖_2^2 − ‖D^{1/2}(X̄c̄ − f̄)‖_2^2 |
      = | Σ_{i,j=1}^n r(u_i^T v_j)² − Σ_{ℓ=1}^k Σ_{j=1}^n |P_ℓ| · r(ū_ℓ^T v_j)² |    (F.4)
      = | Σ_{i,j=1}^n ( r(u_i^T v_j)² − r(ū_{P(i)}^T v_j)² ) |    (F.5)
      ≤ Σ_{i,j=1}^n | r(u_i^T v_j)² − r(ū_{P(i)}^T v_j)² |    (F.6)
      ≤ Σ_{i,j} L | u_i^T v_j − ū_{P(i)}^T v_j |    (F.7)
      ≤ L ( Σ_{i=1}^n ‖u_i − ū_{P(i)}‖_2 ) ( Σ_{j=1}^n ‖v_j‖_2 )    (F.8)
      ≤ Lε ( Σ_{j=1}^n ‖v_j‖_2 ),    (F.9)

where |P_ℓ| denotes the number of indices mapped to cluster ℓ, i.e., |{s : P(s) = ℓ}|. From Lemma 4 and Theorem 5, we have

    E[ ‖f^⊙(UV^T) − Γ‖_F^2 ] ≤ 2( ‖Xc̄ − f‖_2^2 + ‖Wc̄‖_2^2 )
      ≤ 2( ‖D^{1/2}(X̄c̄ − f̄)‖_2^2 + ‖Wc̄‖_2^2 ) + 2Lε ( Σ_{j=1}^n ‖v_j‖_2 )
      ≤ 2 / (1 + mσ²/(rC)) · ‖f̄‖_2^2 + 2Lε ( Σ_{j=1}^n ‖v_j‖_2 ),

where Γ is the output of Algorithm 2 with the approximated coefficients in (3.7), v_i is the i-th row of V, σ is the smallest singular value of D^{1/2} X̄ and C := max( 5‖U‖_F^2 ‖V‖_F^2, (2 + 3^r) ( Σ_j (Σ_k U_{jk}^2)^r ) ( Σ_j (Σ_k V_{jk}^2)^r ) ). This completes the proof of Theorem 7.