Smoothed Analysis of Tensor Decompositions
Aravindan Vijayaraghavan
NYU ⇒ Northwestern University
based on joint work with
Aditya Bhaskara (Google Research), Moses Charikar (Princeton), Ankur Moitra (MIT)
Tensors: Multi-Dimensional Arrays
[Figure: an $n \times n$ matrix and an $n \times n \times n$ array]
• Tensor of order $p$ ≡ $p$-tensor of size $n \times n \times \cdots \times n$ ($p$ times)
• Elements are over $\mathbb{R}$
Low-Rank Decompositions
A tensor can be written as a sum of a few rank-one tensors. For 3-tensors:
$$T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$$
[Figure: $T$ as a sum of $k$ rank-one tensors with factors $a_1, \dots, a_k$, $b_1, \dots, b_k$, $c_1, \dots, c_k$]
• Rank($T$) = smallest $k$ such that $T$ can be written as a sum of $k$ rank-1 tensors
• Rank of a $p$-tensor $T_{n \times \cdots \times n}$ is at most $n^{p-1}$
Low-rank $\epsilon$-approximation: a low-rank decomposition approximating $T$ up to error $\epsilon$ in Frobenius norm, i.e. $\left\| T - \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i \right\|_F \le \epsilon$
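A minimal NumPy sketch (ours, not from the talk) of these definitions: build a rank-$k$ 3-tensor as a sum of outer products and measure the Frobenius error of a lower-rank truncation. All names and sizes are illustrative.

```python
import numpy as np

n, k = 10, 4
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, k)) for _ in range(3))

# Sum of k rank-one terms: T[x,y,z] = sum_i A[x,i] * B[y,i] * C[z,i]
T = np.einsum('xi,yi,zi->xyz', A, B, C)

# Dropping the last rank-one term gives a rank-(k-1) decomposition;
# its Frobenius error is exactly the norm of the dropped term.
T_approx = np.einsum('xi,yi,zi->xyz', A[:, :-1], B[:, :-1], C[:, :-1])
print('Frobenius error of rank-(k-1) truncation:',
      np.linalg.norm(T - T_approx))
```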
Tensor Decomposition: Uniqueness
Thm [Kruskal'77]. Rank-$k$ decompositions of 3-tensors are unique (non-algorithmic) under a rank condition ($k \le \frac{3n}{2} - 1$).
• $p$-tensors: the rank condition gives $k \le (pn - p + 1)/2$ [SB'01]
Thm [Jennrich via Harshman'70]. Finds the unique rank-$k$ decomposition of a 3-tensor when the vectors of the decomposition are linearly independent (hence $k \le n$).
• The "full-rank" case. Rediscovered in [Leurgans et al.'93, Chang'96]
Thm [De Lathauwer, Castaing, Cardoso'07]. Algorithm for 4-tensors of rank $k$, generically, when $k \le c \cdot n^2$.
• $p$-tensors: generically handle $k \le c \cdot n^{\lfloor p/2 \rfloor}$
Thm [Chiantini, Ottaviani'14]. Uniqueness for 3-tensors of rank $k \le n^2/3$, generically.
Algorithms for Tensor Decompositions
NP-hard in general when rank $k \gg n$ (except in special settings) [Hillar-Lim].
This talk:
• Polynomial-time algorithms* for robust tensor decompositions
• Introduce smoothed analysis to overcome worst-case intractability
• Handle rank $k \gg n$ for higher-order tensors ($p \ge 5$)
*Algorithms run in time $\mathrm{poly}_p(n, k, 1/\epsilon)$ and recover the vectors of the decomposition up to $\epsilon$ error in $\ell_2$ norm.
Talk Plan
1. Applications of Tensor Decompositions to ML
– Motivating the desired algorithm properties
2. Smoothed Analysis Model and Results
3. Overview of the Proof
Learning Probabilistic Models: Parameter Estimation
Question: Can given data be "explained" by a simple probabilistic model?
• HMMs for speech recognition
• Mixtures of Gaussians for clustering points
• Multiview models
Learning goal: Can the parameters of the model be learned from polynomially many samples generated by the model?

Probabilistic Model for Clustering in $n$ Dimensions: Mixtures of (Axis-Aligned) Gaussians
Parameters:
• Mixing weights $w_1, w_2, \dots, w_k$
• Gaussian $G_i$ with mean $\mu_i$ and diagonal covariance $\Sigma_i$
Learning problem: Given many sample points, find $(w_i, \mu_i, \Sigma_i)$.
[Figure: sample points $x \in \mathbb{R}^n$ clustered around the means $\mu_i$]
• Known algorithms use $O(\exp(k) \cdot \mathrm{poly}(n))$ samples and time [FOS'06, MV'10]
• Lower bound of $\Omega(\exp(k))$ in the worst case [MV'10]
Aim: $\mathrm{poly}(k, n)$ guarantees in realistic settings
Method of Moments and Tensor Decompositions
Step 1: Compute a tensor whose decomposition encodes the model parameters.
Step 2: Find the decomposition (and hence the parameters).
Third-moment tensor: the $n \times n \times n$ tensor with entries $E[x_i x_j x_l]$ satisfies
$$T = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i$$
• Uniqueness ⟹ recover the parameters $w_i$ and $\mu_i$ [Chang] [Allman, Matias, Rhodes]
• Algorithm for decomposition ⟹ efficient learning [Anandkumar, Ge, Hsu, Kakade, Telgarsky]
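A sketch of step 1 (illustrative, ours): estimating the third-moment tensor empirically. For simplicity the snippet samples a mixture of point masses at the means, where $E[x \otimes x \otimes x] = \sum_i w_i\, \mu_i^{\otimes 3}$ holds exactly; for axis-aligned Gaussians the moment carries extra covariance correction terms that this sketch ignores.

```python
import numpy as np

n, k, N = 5, 3, 200_000
rng = np.random.default_rng(1)
w = np.full(k, 1 / k)                    # mixing weights w_i
mu = rng.standard_normal((k, n))         # component means mu_i

comp = rng.choice(k, size=N, p=w)        # latent component of each sample
X = mu[comp]                             # samples from the point-mass mixture

T_hat = np.einsum('si,sj,sl->ijl', X, X, X) / N    # empirical E[x ⊗ x ⊗ x]
T = np.einsum('c,ci,cj,cl->ijl', w, mu, mu, mu)    # sum_i w_i mu_i^{⊗3}
print('moment estimation error:', np.linalg.norm(T_hat - T))
```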
Aim:
1. Uniqueness of tensor decompositions
2. Algorithms taking time $\mathrm{poly}(n, k)$, robust to noise $\epsilon = 1/\mathrm{poly}(n, k)$
Robustness to Errors
Beware: sampling error!
Empirical estimate $\widehat{T} \approx_{\epsilon} \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i$
With $\mathrm{poly}(n)$ samples, the error is $\epsilon \approx 1/\mathrm{poly}(n, k)$.
Thm [Goyal-Vempala-Xiao]. Jennrich's polynomial-time algorithm for tensor decompositions of rank $k \le n$ is robust up to $1/\mathrm{poly}(n)$ error.
⟹ Efficient learning for many probabilistic models when the number of clusters/topics $k \le$ dimension $n$ [Chang'96, Mossel-Roch'06, Hsu-Kakade'12, Anandkumar et al.'09-14]
Overcomplete Setting
Number of clusters/topics/states $k \gg$ dimension $n$: common in applications such as computer vision and speech.
NP-hard in the worst case (for rank $k \ge 6n$).
Polynomial-time decomposition of tensors of rank $k \gg n$? (Recall the rank can be as large as $n^{p-1}$.)
Smoothed Analysis [Spielman & Teng 2000]
• A small random perturbation of the input makes instances easy.
• Gives the best polynomial-time guarantees in the absence of any worst-case guarantees.
Good smoothed-analysis guarantees mean the worst instances are isolated.
Classic example: the simplex algorithm solves LPs efficiently under smoothed analysis (explaining its performance in practice).
Smoothed Analysis for Learning
Learning setting (e.g. mixtures of Gaussians):
• Worst-case instances: means $\mu_i$ in pathological configurations.
• But means are not in adversarial configurations in the real world!
What if the means $\mu_i$ are perturbed slightly ($\mu_i \to \widetilde{\mu}_i$)?
The parameters of the model are perturbed slightly.
Smoothed Analysis for Tensor Decompositions
1. Given tensor $T_{n \times n \times \cdots \times n} = \sum_{i=1}^{k} a_i^{(1)} \otimes a_i^{(2)} \otimes \cdots \otimes a_i^{(p)}$.
2. Each $\widetilde{a}_i^{(j)}$ is a random $\rho$-perturbation of $a_i^{(j)}$, i.e. add an independent (Gaussian) random vector of length $\approx \rho$.
3. Input: $\widetilde{T} = \sum_{i=1}^{k} \widetilde{a}_i^{(1)} \otimes \widetilde{a}_i^{(2)} \otimes \cdots \otimes \widetilde{a}_i^{(p)} + \text{noise}$. Analyze the algorithm on $\widetilde{T}$.
The factors of the decomposition are perturbed. For a mixture of Gaussians, the means $\mu_i$ are perturbed slightly:
$$T = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \cdots \otimes \mu_i$$
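A small sketch (ours) of this perturbation model: Gaussian entries of variance $\rho^2/n$ give perturbation vectors of expected length $\approx \rho$.

```python
import numpy as np

n, k, p, rho = 8, 6, 3, 0.1
rng = np.random.default_rng(2)

factors = [rng.standard_normal((n, k)) for _ in range(p)]      # a_i^{(j)}
perturbed = [F + rng.normal(0, rho / np.sqrt(n), size=(n, k))  # ã_i^{(j)}
             for F in factors]

# T̃ = sum_i ã_i^{(1)} ⊗ ã_i^{(2)} ⊗ ã_i^{(3)}  (sampling noise omitted)
T_tilde = np.einsum('xi,yi,zi->xyz', *perturbed)
```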
Smoothed Analysis Model
$$\widetilde{T} = \sum_{i=1}^{k} \widetilde{a}_i^{(1)} \otimes \widetilde{a}_i^{(2)} \otimes \cdots \otimes \widetilde{a}_i^{(p)} + \text{noise}$$
• Different from the entries of $T$ being perturbed.
• More similar in spirit to generic results than to average-case analysis: no regions of the instance space remain hard.
[Figure: problem instance space, with isolated hard instances surrounded by easy ones]
• A robust analog of generic results?
Algorithmic Guarantees
Thm. Polynomial-time algorithm for decomposing a $p$-tensor (size $n^p$) under smoothed analysis, when rank $k \le n^{\lfloor (p-1)/2 \rfloor}$, w.p. $1 - \exp(-n^{f(p)})$.
Running time and sample complexity: $\mathrm{poly}_p(n, 1/\rho)$.
Guarantees for order-$p$ tensors in $n$ dimensions, with rank $k$ (the number of clusters):
• Previous algorithms: $k \le n$
• Our algorithms (smoothed case): $k \le n^{\lfloor (p-1)/2 \rfloor}$
Corollary. Polynomial-time algorithms (smoothed analysis) for learning the parameters of mixtures of axis-aligned Gaussians, multiview models, etc., when the number of clusters $k \le n^C$ for any constant $C$, w.h.p.
Interpreting Smoothed Analysis Guarantees
Time and sample complexity $= \mathrm{poly}_p(n, 1/\rho)$; works with probability $1 - \exp(-\rho\, n^{3^{-p}})$.
• Exponentially small failure probability in polynomial time (for constant $p$).
Smooth interpolation between worst case and average case [Anderson, Belkin, Goyal, Rademacher, Voss '14]:
Time and sample complexity $= \mathrm{poly}_p(n, 1/\rho, 1/\tau)$; works with probability $1 - \tau$.
Algorithm Details

Algorithm Outline
1. An algorithm for 3-tensors in the "full-rank setting" ($k \le n$): [Jennrich'70], a simple (robust) algorithm for a 3-tensor $T$ when $\sigma_k(A), \sigma_k(B), \sigma_2(C) \ge 1/\mathrm{poly}(n, k)$.
2. For higher-order tensors, "tensoring / flattening": helps handle the overcomplete setting ($k \gg n$).
Recall: $T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$, where $A$ is the $n \times k$ matrix with columns $a_i$ (similarly $B$, $C$).
Aim: Recover $A$, $B$, $C$.
• Any algorithm for full-rank tensors suffices.
Blast from the Past
Recall: $T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$, with $A$ the $n \times k$ matrix with columns $a_i$. Aim: recover $A$, $B$, $C$.
[Jennrich via Harshman'70] Algorithm for a 3-tensor $T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$ when:
• $A$, $B$ have rank $k$, i.e. the $a_i$ (and the $b_i$) are linearly independent
• $C$ has rank $\ge 2$
• Reduces to matrix eigendecompositions.
Algorithm:
1. Take a random combination of the slices of $T$ along direction $w_1$, giving $M_1$.
2. Take a random combination along $w_2$, giving $M_2$.
3. Find the eigendecomposition of $M_1 M_2^{\dagger}$ to get $A$. Similarly for $B$, $C$.
Qn. Is this algorithm robust to errors?
Yes! It needs perturbation bounds for eigenvectors [Stewart-Sun].
Thm. Efficiently decompose $T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$ and recover $A$, $B$, $C$ up to $\epsilon \cdot \mathrm{poly}(n, k)$ error (in Frobenius norm) when 1) $A$, $B$ are full rank, i.e. min singular value $\ge 1/\mathrm{poly}(n)$, and 2) $C$ doesn't have parallel columns (in a robust sense).
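A compact NumPy sketch of these three steps (our illustrative code, not the talk's robust version). It assumes the generic case $k \le n$ with $A$, $B$ full column rank; then $M_1 M_2^{\dagger} = A\, \mathrm{diag}(\langle c_i, w_1 \rangle / \langle c_i, w_2 \rangle)\, A^{\dagger}$, and the transposed pair plays the same role for $B$, so sorting by the shared eigenvalues pairs the recovered columns.

```python
import numpy as np

def jennrich(T, k, rng):
    """Sketch of Jennrich's algorithm; assumes k <= n, A and B full
    column rank, and C without parallel columns (the generic case)."""
    n = T.shape[0]
    w1, w2 = rng.standard_normal(n), rng.standard_normal(n)
    M1 = np.einsum('xyz,z->xy', T, w1)   # = A @ diag(C.T @ w1) @ B.T
    M2 = np.einsum('xyz,z->xy', T, w2)   # = A @ diag(C.T @ w2) @ B.T

    # M1 M2^+ = A diag(<c_i,w1>/<c_i,w2>) A^+; its transpose pair has the
    # same eigenvalues with eigenvectors the columns of B.
    def top_eigvecs(M):
        vals, vecs = np.linalg.eig(M)
        idx = np.argsort(-np.abs(vals))[:k]        # drop the ~zero eigenvalues
        idx = idx[np.argsort(-vals[idx].real)]     # consistent pairing order
        return vecs[:, idx].real
    A_hat = top_eigvecs(M1 @ np.linalg.pinv(M2))
    B_hat = top_eigvecs(M1.T @ np.linalg.pinv(M2.T))

    # Recover C by least squares: T[(x,y), z] = sum_i (a_i⊗b_i)[(x,y)] c_i[z].
    # Column scalings lost in the eigendecomposition are absorbed into C_hat.
    KR = np.einsum('xi,yi->xyi', A_hat, B_hat).reshape(n * n, k)
    C_hat = np.linalg.lstsq(KR, T.reshape(n * n, n), rcond=None)[0].T
    return A_hat, B_hat, C_hat

# Quick check: reconstruct a random rank-k tensor
rng = np.random.default_rng(7)
n, k = 8, 5
A, B, C = (rng.standard_normal((n, k)) for _ in range(3))
T = np.einsum('xi,yi,zi->xyz', A, B, C)
A_hat, B_hat, C_hat = jennrich(T, k, rng)
T_rec = np.einsum('xi,yi,zi->xyz', A_hat, B_hat, C_hat)
print('reconstruction error:', np.linalg.norm(T - T_rec))
```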
Handling High Rank: Techniques

Mapping to Higher Dimensions
How do we handle the case rank $k = \Omega(n^2)$ (or vectors with "many" linear dependencies)?
Consider a 6th-order tensor of rank $k \le n^2$:
$$T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i \otimes d_i \otimes e_i \otimes f_i$$
Trick: view $T$ as an $n^2 \times n^2 \times n^2$ object. The vectors in the decomposition are $\{a_i \otimes b_i\}$, $\{c_i \otimes d_i\}$, $\{e_i \otimes f_i\}$.
Qn: Are these vectors $\{a_i \otimes b_i\}_{i=1 \ldots k}$ linearly independent? Is the "dimensionality" $\Omega(n^2)$?
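A sketch of the flattening trick in NumPy (ours): reshaping an order-6 rank-$k$ tensor into an $n^2 \times n^2 \times n^2$ 3-tensor whose factors are the product vectors.

```python
import numpy as np

n, k = 4, 10                                   # k <= n^2 = 16
rng = np.random.default_rng(3)
A, B, C, D, E, F = (rng.standard_normal((n, k)) for _ in range(6))

T6 = np.einsum('ui,vi,wi,xi,yi,zi->uvwxyz', A, B, C, D, E, F)
T3 = T6.reshape(n * n, n * n, n * n)           # flattened 3-tensor

# Factors of the flattened tensor are the product vectors a_i⊗b_i etc.
AB = np.einsum('ui,vi->uvi', A, B).reshape(n * n, k)
CD = np.einsum('wi,xi->wxi', C, D).reshape(n * n, k)
EF = np.einsum('yi,zi->yzi', E, F).reshape(n * n, k)
assert np.allclose(T3, np.einsum('pi,qi,ri->pqr', AB, CD, EF))
print('rank of [a_i⊗b_i] factor matrix:', np.linalg.matrix_rank(AB))
```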
Bad Cases
Smoothed analysis: can we hope for the "dimension" to multiply, typically?
Setup: $A$, $B$ are $n \times k$ matrices of rank $n$, with columns $a_i$, $b_i$; $Z$ is the $n^2 \times k$ matrix with columns $z_i = a_i \otimes b_i \in \mathbb{R}^{n^2}$.
Bad example with $k = 2n$:
• Columns of $A = B$ composed of two orthonormal bases of $\mathbb{R}^n$
• Every $n$ vectors of $A$ and of $B$ are linearly independent
• But the $2n$ vectors of $Z$ are linearly dependent: they span only $2n - 1$ dimensions!
Dimension does not grow multiplicatively in the worst case. But such bad examples are pathological and hard to construct!
Qn: Are $\{a_i \otimes b_i\}_{i=1 \ldots k}$ linearly independent? Is the "dimensionality" $\Omega(n^2)$?
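A quick NumPy check (ours) of the pathological example: with $A = B = [I \mid Q]$ for an orthonormal $Q$, both halves of $Z$ sum to $\mathrm{vec}(I)$, so $\mathrm{rank}(Z) = 2n - 1 < k$.

```python
import numpy as np

n = 6
rng = np.random.default_rng(4)
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthonormal basis
A = np.hstack([np.eye(n), Q])                      # k = 2n columns

# z_i = a_i ⊗ a_i; each half of Z sums to vec(I), giving one linear relation
Z = np.stack([np.outer(A[:, i], A[:, i]).ravel() for i in range(2 * n)],
             axis=1)
print('rank(Z) =', np.linalg.matrix_rank(Z), ' vs  k =', 2 * n)  # 2n-1 vs 2n
```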
Product Vectors and Linear Structure
• "Flattening" of the order-$3p$ moment tensor.
• The new factor matrix is full rank, using smoothed analysis.
Theorem. For any matrix $A_{n \times k}$ with $\rho$-perturbation $\widetilde{A}$, and $k \le n^p/2$: the Khatri-Rao power $Z$ (the $n^p \times k$ matrix with columns $\widetilde{a}_i \otimes \widetilde{a}_i \otimes \cdots \otimes \widetilde{a}_i$) satisfies $\sigma_k(Z) \ge 1/\mathrm{poly}(k, n, 1/\rho)$ with probability $1 - \exp(-\mathrm{poly}(n))$.
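An empirical sanity check (ours) for $p = 2$: even starting from the pathological $A$ above, a $\rho$-perturbation makes the Khatri-Rao power well conditioned, while the unperturbed $\sigma_k$ is zero.

```python
import numpy as np

def khatri_rao_power(A, p):
    """Matrix with columns a_i^{⊗p}, the p-fold Khatri-Rao power of A."""
    Z = A
    for _ in range(p - 1):
        Z = np.einsum('ui,vi->uvi', Z, A).reshape(-1, A.shape[1])
    return Z

n, k, rho = 6, 12, 0.1                         # k = 2n <= n^2 / 2
rng = np.random.default_rng(5)
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = np.hstack([np.eye(n), Q])                  # bad example: sigma_k(Z) = 0
A_pert = A + rng.normal(0, rho / np.sqrt(n), A.shape)

for name, M in [('original ', A), ('perturbed', A_pert)]:
    sigma_k = np.linalg.svd(khatri_rao_power(M, 2), compute_uv=False)[k - 1]
    print(name, 'sigma_k(Z) =', sigma_k)
```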
Proof Sketch (Two-Wise Product, p = 2)
Prop. For any matrix $A_{n \times k}$ with $k \le n^2/2$, the $n^2 \times k$ matrix $Z$ with columns $\widetilde{a}_i \otimes \widetilde{b}_i$ has $\sigma_k(Z) \ge 1/\mathrm{poly}(k, n, 1/\rho)$ with probability $1 - \exp(-\mathrm{poly}(n))$.
Main issue: the perturbation happens before the product.
• Easy if the columns were perturbed after the tensor product (simple anti-concentration bounds).
• Here there are only $2n$ "bits" of randomness in $n^2$ dimensions, with block dependencies.
Technical component: show that products of perturbed vectors behave like random vectors in $\mathbb{R}^{n^2}$.
Projections of Product Vectors
Easy case: for an $n^2$-dimensional vector $x$, a $\rho$-perturbation $\widetilde{x}$ has projection $\ge 1/\mathrm{poly}(1/\rho, n)$ onto $S$ w.h.p.
Much tougher for a product of perturbations (inherent block structure)! Anti-concentration for polynomials implies the bound only with probability $1 - 1/\mathrm{poly}(n)$.
Question. Given any vectors $a, b \in \mathbb{R}^n$ and Gaussian $\rho$-perturbations $\widetilde{a}, \widetilde{b}$, does $\widetilde{a} \otimes \widetilde{b}$ have projection $\ge \mathrm{poly}(\rho, 1/n)$ onto any given $n^2/2$-dimensional subspace $S \subset \mathbb{R}^{n^2}$, with probability $1 - \exp(-\sqrt{n})$?
Projections of Product Vectors
Let $\Pi_S$ be the $n^2/2 \times n^2$ projection matrix onto $S$. Group the coordinates of $\widetilde{a} \otimes \widetilde{b} \in \mathbb{R}^{n^2}$ into $n$ blocks of size $n$ (block $j$ equals $\widetilde{a}_j\, \widetilde{b}$). Then
$$\Pi_S(\widetilde{a} \otimes \widetilde{b}) = \Pi_S(\widetilde{b})\, \widetilde{a},$$
where $\Pi_S(\widetilde{b})$ is the $n^2/2 \times n$ matrix whose $j$-th column is the dot products of the $j$-th column-block of $\Pi_S$ with $\widetilde{b}$.
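A short numerical check (ours) of this identity, with $\Pi_S$ built from an orthonormal basis of a random $n^2/2$-dimensional subspace:

```python
import numpy as np

n = 6
rng = np.random.default_rng(6)
# Rows of Pi_S: an orthonormal basis of a random n^2/2-dimensional subspace
Pi_S = np.linalg.qr(rng.standard_normal((n * n, n * n // 2)))[0].T
a, b = rng.standard_normal(n), rng.standard_normal(n)

blocks = Pi_S.reshape(-1, n, n)          # blocks[:, j, :] = j-th column-block
M = np.einsum('sjl,l->sj', blocks, b)    # M[:, j] = blocks[:, j, :] @ b

# Pi_S (a ⊗ b) = M a, since block j of a ⊗ b equals a_j * b
assert np.allclose(Pi_S @ np.kron(a, b), M @ a)
```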
Two Steps of the Proof
1. W.h.p. (over the perturbation of $b$), $\Pi_S(\widetilde{b})$ has at least $r$ singular values $> \mathrm{poly}(\rho, 1/n)$. (We will show this with $r = \sqrt{n}$.)
2. If $\Pi_S(\widetilde{b})$ has $r$ singular values $> \mathrm{poly}(\rho, 1/n)$, then w.p. $1 - \exp(-r)$ (over the perturbation of $\widetilde{a}$), $\widetilde{a} \otimes \widetilde{b}$ has a large projection onto $S$. (Follows easily by analyzing the projection of a perturbed vector onto an $r$-dimensional space.)
Structure in Any Subspace S
Suppose the first $n$ of the $n \times n$ "blocks" of $\Pi_S$ were orthogonal. Then, restricted to those $n$ columns, entry $(i, j)$ of $\Pi_S(\widetilde{b})$ is $\langle v_{ij}, \widetilde{b} \rangle$ for vectors $v_{ij} \in \mathbb{R}^n$:
• a translated i.i.d. Gaussian matrix, which has many ($\approx \sqrt{n}$) big singular values.
Main claim: every $c \cdot n^2$-dimensional subspace $S$ has $\sim \sqrt{n}$ vectors with such a structure.
Property: the picked blocks ($n$-dimensional vectors) have a "reasonable" component orthogonal to the span of the rest.
Finding Structure in Any Subspace S
The earlier argument goes through even when the blocks are not fully orthogonal!
Idea: obtain "good" columns $v_1, v_2, \dots, v_{\sqrt{n}}$ one by one:
• Show that there exists a block with many linearly independent "choices".
• Fix some choices and argue that the same property still holds, ...
Main Claim (Sketch)
• Uses a delicate inductive argument; crucially uses the fact that we have an $\Omega(n^2)$-dimensional subspace.
Generalization: a similar result holds for higher-order products, and implies the main result.
Summary
• Polynomial-time algorithms when rank $k \gg n$:
– Tensor decompositions when rank $k = n^{O(1)}$
– Learning when the number of clusters/topics $k = n^{O(1)}$
• Smoothed analysis for tensor decompositions and learning
Future Directions
Smoothed analysis for tensor decompositions and learning:
• Handling ranks that match the generic results for uniqueness and algorithms?
• Polynomially robust analogs of [Chiantini-Ottaviani] or [Cardoso]?
• Proofs of generic results that are more amenable to noise?
Better robustness to errors:
• Tensor decomposition algorithms that are more robust to errors? Promising: [Barak-Kelner-Steurer'14] using the Lasserre hierarchy.
• Modelling errors?
Thank You!
Questions?