Beyond matrices: statistical method for higher-order …pages.stat.wisc.edu/~miaoyan/tensor_tutorial2_fudan.pdfBeyond matrices: statistical method for higher-order tensors and its

Beyond matrices: statistical method for higher-ordertensors and its application. II

Miaoyan Wang

University of Wisconsin, Madison

Fudan University,

July, 2019

gene expression low-rank binary tensor tensor block model Conclusions and future work

Talk outline

Prohibitive Computational Complexity

Most higher-order tensor problems are NP-hard [Hillar & Lim, 2013].

Topics I will address:

I Application of tensor decomposition to genetics

I Low-rank tensor estimation from binary data

I Multiway clustering via tensor block models

1 / 51


Applications of tensor decomposition to biology

I Many biomedical datasets come naturally in a multiway form, e.g.,patients × genes × experimental conditions.

I For example, time-series gene expression measurements could be orga-nized as a order-3 tensor A = JagstK ∈ RN×S×T .

gene

time

individual

I g = 1, . . . ,N indexes (often thou-sands of) the genes

I s = 1, . . . ,S indexes the individu-als/samples

I t = 1, . . . ,T indexes the timepoints.

I Typically, N � ST .

Goal

Identify subsets of genes that are similarly expressed within subsets of indi-viduals, and learn the time trajectories of the associated expression level.

2 / 51


Time-series gene expression

We simulate tensorial data A ∈ RN×S×T for time-series gene expression:

I Given a time t, the matrix slide A(·, ·, t) are transposable (i.e., boththe gene and sample indexes are permutable)⇒ simulate a “checkerboard” pattern for the latent bicluster member-ships at the initial time.

I Given a gene g and a sample s, the trajectory A(g , s, ·) is temporallystructured (i.e., the time indexes are meaningful)⇒ for each bicluster, select a time trajectory from a library of candidatefunctions, such as sigmoid, sines, cosines, etc.

time

gene

individual

timegene

individual

3 / 51


Tensor Applications to Time-Series Gene Expression Data

I Data (denoted A): noisy time-series gene expression with shuffled geneand sample indexes.

I Goal: Identify subsets of genes that are similarly expressed within sub-sets of individuals, and learn the time trajectories of the associatedexpression level.

We propose to

1 first perform truncated CP decomposition of A via two-mode HOSVDalgorithm:

gene

individual

time

2 then perform K-mean clusterings on the gene factor matrixG = [g1, . . . , gk ] and the sample factor matrix S = [s1, . . . , sk ].

4 / 51


Comparison to matrix-based clustering

Tensor-based clustering outperforms the matrix-based clustering.Matrix-based clustering:

I PCA on the time-averaged matrix M1 = 1T

∑Tt=1 A(·, ·, t)

I PCA on the sample-averaged matrix M2 = 1S

∑Ss=1 A(·, s, ·)

5 / 51


Comparison to HOSVD-based clustering

In our simulations, clustering based on CP decomposition is more accuratethan that based on Tucker decomposition/HOSVD.

HOSVD:

gene

individual

time

CP decomposition:

gene

individual

time

●

●●

●

●

5 10 15 20

0.10

0.15

0.20

0.25

Classification Error Rate of Gene Clustering

Model Complexity (# of sample groups among 20 samples)

clas

sific

atio

n er

ror

rate

(C

ER

)

● CP Decomposition via Two−mode HOSVD

Tucker/HOSVD Decomposition

6 / 51


Multi-tissue gene expression analysis

I Central dogma of genetics: DNAtranscription−−−−−−−→ RNA

translation−−−−−−→ protein.I GTEx RNA-seq data: expression profiles (∼ 25, 000 genes) measured

from 544 individuals across 53 tissues.

gene

individual

individual

gene

Genotype-tissue expression (GTEx) project, Aguet et al. Nature (2017) 550, 204-213.

7 / 51


Review: Matrix SVD for biclustering

=

samples

fea

ture

s

I Columns of U describe patterns across samples

I Columns of VT describe patterns across genesY Kluger et al, Genome Research (2003). 13(4): 703-71

Data Science Specialization (COURSERA) by Brian Caffo and Jeff Leek

8 / 51


Multi-tissue gene expression analysis

We have developed a semi-nonnegative tensor decomposition for three-wayclustering of multi-tissue multi-individual gene expression data.

entriwise

Wang et al., under review (2017). doi.org/10.1101/229245

9 / 51


Performance comparison

Our algorithm identifies 3-way clusters with higher accuracy compared toexisting methods.

0.10

0.20

0.30

rela

tive

err

or

0.6 0.8 1.0 1.2 1.4

HOSVD

5 5.5 6 6.5 7

0.20

0.25

0.30

0.35

rela

tive

err

or

our method

our method

SDA: sparse decomposition of array (Hore et al. Nat. Gen. 2016).HOSVD: higher-order singular value decomposition; Tucker decomposition (Omberg et al. PNAS. 2007).

details10 / 51


Component I: shared, global expression

I Both tissue- and individual-loadings are essentially flat ⇒ this compo-nent captures global expression common to all samples.

a. Tissue Loadings (Component 1)

c. Individual Loadings (Component 1)

ranked individuals

b. Gene Loadings (Component 1)

loa

din

g

0.04

0.02

0.00

loa

din

g

0.000

0.010

0.015

loa

din

g

0.00

0.05

0.10

0.15 brain

blood

artery

esophagus

skin

heart

digestion

others

adipos

0.005

ranked genes

11 / 51



I Both tissue- and individual-loadings are essentially flat ⇒ this compo-nent captures global expression common to all samples.



ranked individuals


loa

din

g

0.04

0.02

0.00

loa

din

g

0.000

0.010

0.015

loa

din

g

0.00

0.05

0.10

0.15 brain

blood

artery

esophagus

skin

heart

digestion

others

adipos

0.005

ranked genes

11 / 51



I Top genes are mainly mitochondrial genes (15/20 top genes); othernon-mitochondrial genes include ACTB, EEF1A1, and EEF2.

loa

din

g

0.000

0.015

0 20 40 60

2.5e−11

5.0e−11

7.5e−11

1.0e−10

SRP-dependent cotraslational protein targeting to membrane

d. Enriched gene ontologies (GOs) among top genes (Component 1)

cotraslational protein targeting to membrane

protein targeting to endoplasmic reticulum

ribosomal large subunit biogenesis

cytoplasmic translation

top 200 genes

Benjamini-Hochberg

adjusted pvalue

0.010

Number of top genes in the GO


0.005

12 / 51


Component II: brain tissues

• Tissue vector clearly separates brain tissues from non-brain tissues.



ranked individuals

loa

din

g

0.04

0.02

0.00

loa

din

g

-0.03

0.00

0.03

load

ing

0.00

0.10

0.20

0.30brain

blood

artery

esophagus

skin

heart

digestion

adipos


ranked genes

13 / 51



• Tissue vector clearly separates brain tissues from non-brain tissues.



ranked individuals

loa

din

g

0.04

0.02

0.00

loa

din

g

-0.03

0.00

0.03

load

ing

0.00

0.10

0.20

0.30brain

blood

artery

esophagus

skin

heart

digestion

adipos


ranked genes

details

13 / 51



I Enriched GOs are mostly related to glutamate receptor signaling path-way and chemical synaptic transmission.

0 10 20 30

glutamate receptor signaling pathway

d. Enriched GOs among top genes (Component 2)

chemical synaptic transmission, postsynpatic

excitotary postsynaptic potential

memory

synaptic vesicle exocytosis

3e−14

6e−14

9e−14

B-H corrected p-value

Number of top genes in the GO

loa

din

g

-0.03

0.00

0.03 top 899 genes


ranked genes

14 / 51



I Age explains more variation (24.4%) than gender (0.3%) or ethnicity(4.3%).

d.


ranked individuals

loa

din

g0.04

0.02

0.00

I How to pinpoint the age-related genes?

15 / 51


Tensor projection to detect differentially-expressed genes

I We incorporate information across tissues by projecting the tensorthrough the tissue-vector.

I For a given test gene, we perform the following regression analysis

A(test gene, ·,Ti ) = Wβ0 + Xβ1 + ε,

where W is the covariate, and X is the primary variable of interest, andε ∼ N(0, I). Consider H0 : β1 = 0 vs. Hα : β1 6= 0.

indivdiuals

ge

ne

s

tissues

gene to be tested

The projected gene-by-individual matrix:

16 / 51


Power to detect differentially-expressed genes

I Using tensor-projection procedures, we identify 694 age-related genes,of which only 514 were detected by separate analyses.

I Receiver operating characteristic (ROC) plot from simulated expressiondata with age-related genes.

0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

Tru

e p

ositi

ve r

ate

0.0

0.2

0.4

0.6

0.8

1.0

single tissue ( = 10)

Tensor projection with

r = 10 = # of tissues

r = 5

r = 3 = # of tissue groups

17 / 51


Gene expression in the brain

I We analyze 13 brain tissues (“subtensor”) and apply our tensor methodto this subtensor.

I Expression modules are spatially restricted to specific brain regions.

18 / 51



I We analyze 13 brain tissues (“subtensor”) and apply our tensor methodto this subtensor.

I Expression modules are spatially restricted to specific brain regions.

loa

din

g 0.6

0.4

0.2

0.0

0.8

0.4

0.0

loa

din

g 0.8

0.4

0.0

ranked tissues

loa

din

g

0.4

0.2

0.0

ranked tissues

loa

din

g

Tissue-vector (comp 3)


Cerebellum Cortex Basal ganglia

Others (Hippocampus, Amygdala, Spinal cord, etc)


ranked tissues

Tissue-vector (comp 2)Recall: i tensor component, i = 1, 2, ... th

ranked tissues

18 / 51



Table 1: Three-way clustering analysis of gene expression in the brain

Tissue Gene Individual

enriched region enriched ontologyvariance explained by

age gender ethnicity

cerebellum dorsal spinal cord development 0.0% 8.0% 0.2%cortex behavior defense response 16.7% 0.6% 1.4%

basal ganglia forebrain generation of neurons 1.3% 0.8% 1.7%others embryonic skeletal system morphogenesis 10.5% 0.7% 5.2%

Red number indicates p-value < 0.001.

19 / 51


Power to detect differentially-expressed genes

loa

din

g

Tissue-vector Top genes

Tensor component 8

0.6

0.4

0.2

0.0

IL1RL1, SERPINA3

etc

variance explained

age 22.1%

gender 0.6%

ethnicity 1.9%

Cerebellum Cortex Basal ganglia Others

Individual-vector

ranked tissues

I The 2nd top gene SERPINA3 is implicated in Alzheimer’s disease [Kam-

boh et al, Neurobiol Aging 2006].

I This gene has a moderate age effect in each single tissue (p rangingfrom 0.05 to 1.1× 10−6 with all effects in the same direction).

I Tensor projection method yields p = 1.0× 10−8 for the age effect ofthis gene ⇒ an increase of significance by two orders of magnitude.

20 / 51


Outline




20 / 51


Review: Matrix SVD

=

samples

fea

ture

s

I Columns of U describe patterns across samples

I Columns of VT describe patterns across genes

21 / 51


Tensor rank decomposition

I Tensor analogue of matrix SVD:

Y =R∑

r=1

λrar ⊗ br⊗cr ,

where λ1 ≥ λ2 ≥ . . . ≥ λR , ‖ar‖2 = ‖br‖2 = ‖cr‖2 = 1.

I Example: a rank-3 order-3 tensor.

+ +=

22 / 51


Low-rank approximation

I A noisy rank-R tensor:

Y︸︷︷︸observation

=R∑

r=1

λrar ⊗ br ⊗ cr︸︷︷︸signal

+ E︸︷︷︸noise

,

I Assume that R is known and the error E = JεijkK ∼ i.i.d. N(0, σ2).I Maximizing the likelihood under the i.i.d. Gaussian error model is equiv-

alent to minimizing the squared loss:

minar ,br ,cr ,λr :r∈[R]

‖Y −R∑

r=1

λrar ⊗ br ⊗ cr‖2F

I CANDECOMP/PARAFAC (CP) tensor approximation (Carroll & Chang, 1970,

Kolda & Bader, 2009, Casey et al, 2018) 23 / 51


Bibliography dataset in DBLP

I Three-way bibliography relationships (author, conference, keyword) ex-tracted from DBLP database.

I A tensor entry is 1 if the triplet co-occurs in a bibliography entry.I Could be organized into a tensor of the form Y ∈ {0, 1}10k×200×10k

(10k authors, 200 conferences, and 10k keywords)Conferences

24 / 51


Latent variable models for binary tensors

Directly applying rank-R approximation to Y = JYijkK is sub-optimal. why?(symmetry w.r.t. 0↔ 1; non-negativity; non-linearity)

"Preference" tensor

Binary tensor

I Θ is unknown. Θ has low rank.I π : R 7→ [0, 1] is a known function (e.g., the logistic curve).I Θ ∈ Rd1×d2×d3, Y ∈ {0, 1}d1×d2×d3 .

25 / 51


Probabilistic models for binary tensors

Latent variable formulation:

I We observe Y = JYijkK = {0, 1}d1×d2×d3 .

I Assume an underlying continuous-valued tensor Z = JZijkK:

Yijk |Zijk =

{1 if Zijk ≥ 0,

0 if Zijk < 0.

I The latent tensor Z is further modeled as:

Z = Θ︸︷︷︸signal

+ E︸︷︷︸noise

, where Rank(Θ) ≤ R,

and the entries in E = JεijkK follow i.i.d. N(0, σ2).

I Rank (Θ) = min{R ∈ N+ : Θ =∑R

r=1 a1r ⊗ · · · ⊗ ak

r }.I Assume R is known or otherwise could be estimated (e.g., using BIC).

26 / 51


Connection with GLM

More generally, we can model the binary tensor Y as

Y = JYijkK ∼ind. Bernoulli(π(Θ)), where Rank(Θ) ≤ R,

and π : R 7→ [0, 1] is the link function (we allow π to be applied to tensorsin a point-wise manner):

I Logistic link: π(θ) = 11+e−θ/σ

⇐⇒ latent noise E follows i.i.d. logisticdistribution with scale parameter σ.

I Probit link: π(θ) = Φ(θ/σ), where Φ(·) is the CDF for standard normaldistribution ⇐⇒ latent noise E follows i.i.d. N(0, σ2).

27 / 51


Low-rank tensor estimation

I Goal: estimate low-rank Θ ∈ Rd1×d2×d3 and/or π(Θ) ∈ [0, 1]d1×d2×d3 .

I Data: Y ∈ {0, 1}d1×d2×d3 .

Log-likelihood:

LY(Θ) =∑ijk

(1(Yijk=1) log π(θijk) + 1(Yijk=0) log (1− π(θijk))

).

We propose to estimate Θ using the constrained MLE:

ΘMLE = arg maxΘ∈DLY(Θ).

where D(R, α) ={

Θ ∈ Rd1×d2×d3 : rank(Θ) ≤ R, ‖Θ‖∞ ≤ α}

.

28 / 51


Convergence rate

Define Loss (Θ1,Θ2) = 1√∏k dk‖Θ1 −Θ2‖F , where Θ1,Θ2 ∈ Rd1×···×dK .

Error bound for constrained MLE (W. and Li, 2019)

Assume that Θtrue ∈ D(R, α) and regularity conditions on the link functionπ. Then with high probability,

Loss(

ΘMLE,Θtrue)≤ C

Lαγα

√log(K )RK−1

∑k dk∏

k dk,

where

Lα := sup|θ|≤α

π(θ)

π(θ)(1− π(θ)), γα = inf

|θ|≤α

(π2(θ)

π2(θ)− π(θ)

π(θ)

)> 0.

Is this bound tight?

29 / 51


Special case: Probit link

Lα,σ ≈ασ + 1

σ, γα,σ ≈

ασ + 1

6

σ2e−α

2/σ2.

Theorem (W. and Li, 2019)

I Upper bound for constrained MLE: w.h.p.

Loss(

ΘMLE,Θtrue

)≤ min

{2α,Cσ

(α + σ

6α + σ

)exp

(α2

σ2

)√dmax∏k dk

}.

I Lower bound for any estimator:

infΘ

supΘtrue∈D(R,α)

ELoss(Θ,Θtrue) ≥ min

{α, σC ′

√dmax∏k dk

}.

I The constrained-MLE is rate-optimal in terms of {dk} in the class oflow-rank tensors.

I Tensor vs. “best” matricization: O(d−K+1) vs O(d−bK/2c).

30 / 51


Phase Diagram

I Define signal-to-noise ratio (SNR) = Θtrue

σ = ασ .

SNR � O(1) O(1) & SNR� O(d−(K−1)/2) O(d−(K−1)/2) & SNR

Binary tensor σeα2/σ2

d−(K−1)/2 ↘ σd−(K−1)/2 ↗ α

Continuous-valued tensor σd−(K−1)/2 σd−(K−1)/2 α

Table: Error bound for estimating Θtrue ∈ (Rd )⊗K from noisy observations.

I Noise helps in the high SNR regime! (Davenport et al 2014, Cai-Zhou 2013.)

''noise helps'' region

"noise hurts" region

31 / 51


Noise helps!

Intuition: If σ = 0, we cannot recover the magnitude of the underlyingtensor Θ ∈ Rd1×···×dK from the observed binary tensor Y ∈ {0, 1}d1×···×dK .

I Consider the rank-1 probit model without noise.

Yijk = 1{θijk>0}, Θ = a ⊗ a ⊗ a.

I Cannot distinguish a = (a1, . . . , ad)T and a = (sign(a1), . . . , sign(ad))T .

I In contrary to many statistical problems: noise is essential for recoveringΘ.

I Our constrained MLE based on binary observations achieves the samedegree of accuracy as if it were given access to the completely unquan-tized measurements.

32 / 51


Algorithm

Alternating optimization

maxΘ=JθijkK

LY(Θ) =∑ijk

log π (qijkθijk) ,

subject to Θ =R∑

r=1

ar ⊗ br ⊗ cr , Θ ≤ α,

where qijk = 2Yijk − 1 taking values -1 or 1 and π(·) is the link function.

I Denote by A = [a1, . . . , aR ], B = [b1, . . . ,bR ] and C = [c1, . . . , cR ].I Objective function is multi-convex in A,B and C .I Update At+1 ← maxA LY(A; Bt ,C t); solved by GLM.I Modify At+1 ← γAt + (1−γ)At+1 subject to infinity-norm constraint;

1-dimension optimization over γ ∈ [0, 1].

33 / 51


Empirical performance

Interplay between statistical efficiency and computational cost

Let M(t) = [A(t),B(t),C (t)] denote a sequence of estimators generatedfrom alternating GLM Algorithm. Under certain conditions∗ on the initialpoint M(0) and the true point Mtrue, we have, with very high probability,

Loss(

Θ(M(t)),Θtrue

)≤ C1ρ

tε︸︷︷︸algorithmic error

+C2Lαγα

√RK−1

∑k dk∏

k dk︸︷︷︸statistical error

,

where ρ ∈ (0, 1) is a contraction parameter, and C1,C2 > 0 are two con-stants.

Stopping criterium:

t ≥ T � log1/ρ

{d (k−1)/2

}.

34 / 51


A long but fun journey

The spectral analysis for higher-order tensors is much more dedicated thanthe matrix analysis.

I Norm landscape over all possible tensor unfoldings (W. et al, LAA’17)

I Perturbation analysis of Higher-order SVD algorithm (W. and Song,AISTATS’17)

I Exponential-valued tensor analysis (W and Li’19, Zeng and W.’19)

I Regularity on decomposition algorithm (W. et al’, AOAS’19)

35 / 51


Simulation

I Rank selection via BIC:σ = 0.1 σ = 0.01

True rank d = 20 d = 40 d = 60 d = 20 d = 40 d = 60

R = 10 8.7 (0.9) 10 (0) 10 (0) 8.8 (0.4) 10 (0) 10 (0)

R = 40 36.8 (1.1) 39.6(1.7) 40.2 (0.4) 36.0 (1.2) 38.8(1.6) 40.3 (1.1)

I Comparison with alternative methods

BooleanTF (Boolean tensor factorization, Miettinen 2011, Rukat et al 2018); BTF Bayesian

(Rai et al 2014); BTF Gradient (Hong et al 2018)36 / 51


Experiments with real data

I Nations (Nickel et al., 2011). 14× 14× 56 binary tensor consisting of56 relations among 14 countries.

I Human connectome project (HCP) (Wang et al., 2017a). 68×68×212 binary tensor consisting of structural connectivity patterns between68 brain regions among 212 individuals.

I Enron (Zhe et al., 2016). 581 × 124 × 48 binary tensor depictingthree-way relationships (sender, receiver, time) from the Enron emaildataset.

I Kinship (Nickel et al., 2011). 104× 104× 26 binary tensor consistingof 26 types of relations among 104 individuals.

37 / 51


Political relationships in 1950-1965

I 56 types of relationships among 14 countries between 1950-1965.

I Multi-relational data can be organized as a tensor Y ∈ {0, 1}14×14×56.A tensor entry is 1 if the relation holds between two countries.

I Relations: “sends tourists to”, “exports books to”, “joint membershipof NGOs”, “conferences”, etc.

I Countries: USSR, Poland, China, UK, Brazil, etc.

38 / 51


Estimates from binary tensor decomposition

39 / 51


Human connectome project (HCP)

I Structural connectivity patterns among 68 brain regions for 212 indi-viduals from HCP.

I Brain images were parcellated to 68 regions-of-interest following theDesikan atlas (Desikan et al 2006).

I Strong spatial patterns in the brain connectivity. e.g. edges capturedby tensor component 2 are located within the cerebral hemisphere.

I Nodes with high connectivity intensity: superior frontal gyrus (sensorysystem), corpus callosum (commissural tract).

Component 6

40 / 51


Comparison with continuous-valued tensor decomposition

I Binary tensor decomposition has prediction performance compared toclassical CP tensor decomposition.

I Link prediction from 5-fold cross-validation:

tensor decomposition methoddataset non-zeros binary (logistic link) continuous-valued

AUC MSE AUC MSEHPC 35.3% 0.9860 1.3× 10−3 0.9314 1.4× 10−2

Nations 21.1% 0.9169 1.1× 10−2 0.8619 2.2× 10−2

Kinship 3.8% 0.9708 1.2× 10−4 0.9436 1.4× 10−3

Enron 0.01% 0.9432 6.4× 10−3 0.7956 6.3× 10−5

AUC: area under the receiver operating characteristic (ROC) curve;

MSE: mean squared error

41 / 51


Outline




41 / 51


Tensor block model

I In many applications, the data tensors are often expected to have un-derlying block structure.

I Examples of tensor block model (TBM):

Input Output Input Output

Figure: (a) Our TBM method is used for multiway clustering and for revealing the underlyingcheckerbox structure in a noisy tensor. (b) The sparse TBM method is used for detecting sub-tensors of elevated means.

Related work:I Low-rank estimation: block structure implies low-rankness, but directly ap-

plying low-rank estimation to a block tensors yields an inferior estimator.I Multiway clustering: typically two steps, by first estimating a low-rank repre-

sentation, and then performing clustering on the resulting factors.42 / 51


Tensor block model

I Tensor block model on Y = Jyi1,...,iK K ∈ Rd1×···×dK :

I Suppose the tensor entry yi1,...,iK belongs to the block determined bythe rkth cluster in the mode k for rk ∈ [Rk ], then

yi1,...,iK = cr1,...,rK + εi1,...,iK , for (i1, . . . , iK ) ∈ [d1]× · · · × [dK ],

I Can be viewed as a super sparse Tucker model:

Y = C ×1 M1 ×2 · · · ×K MK + E ,

where C ∈ RR1×···×RK is a core tensor consisting of block means, Mk ∈{0, 1}Rk×dk is a membership matrix for mode k , and E is the sub-Gaussian (σ) noise tensor.

I Special cases: Gaussian tensor block model (real tensors), Stochastictensor block model (binary tensors, σ = 1/4)

43 / 51


Identifiability

Irreducible core

The core tensor C is irreducible if it cannot be written as a block tensor withthe number of mode-k clusters smaller than Rk , for any k ∈ [K ].

I K = 2: C has no two identical rows and no two identical columns.

I General K : none of order-(K -1) fibers of C are identical.

I Irreducibility is a weaker assumption than full-rankness.

Identifiability

Under the irreducibility assumption, the factor matrices Mk ’s are identifiableup to permutations of cluster labels.

I Recall that Tucker and many other factor analyses suffer from rotationalinvariance.

44 / 51


Least-square estimation

I We propose a least-square estimation:

Θ = arg minΘ∈P

{−2〈Y,Θ〉+ ‖Θ‖2

F

}= arg min

Θ∈P‖Y −Θ‖2

F .

where the parameter space P is

P ={

Θ ∈ Rd1×···×dK : Θ = C ×1 M1 ×2 · · · ×K MK ,with some

membership matrices Mk ’s and a core tensor C ∈ RR1×···×RK}.

I Algorithm: alternating optimization with suitable initialization.

45 / 51


Convergence rate (Zeng and W., 2019)

Let Θ be the least-square estimator of under tensor block model. There exists twoconstants C1,C2 > 0 such that, with w.h.p.,

Loss2(Θtrue, Θ) ≤ C1σ2∏

k dk(

∏k

Rk︸︷︷︸estimating block means

+∑k

dk logRk︸︷︷︸estimating block allocations

).

In the special case d = d1 = . . . = dK :

Our rate Tucker rate (Zhang & Xia’ 18) Lasso rate (Chi et al ’18)

O(d logR) O(dR) O(dK−1)

Partition consistency (Zeng and W., 2019)

Under mild separation assumption∗ on block means, there exist permutation ma-trices Pk ’s such that, the misclassification rate (MCR) satisfies∑

k

MCR(Mk ,PkMtrue)→ 0, in probability.

∗block-mean gap = 1‖C‖F

minrk ,r′k,I−k|fiber(Crk ,I−k

)− fiber(Cr′k,I−k

)| ≥ O(d) for all k ∈ [K ].46 / 51


Extension to sparse estimation

I In many large-scale applications, not every block in a data tensor is ofequal importance.

I For example, in the genome-wise expression data analysis, only a fewentries represent the signals while the majority come from the back-ground noise

I We propose the regularized least-square estimation:

Θsparse = arg minΘ∈P

{‖Y −Θ‖2

F + λ ‖‖ Cρ},

where ρ = 1 (lasso), ρ = 0 (subset sparse), or ρ = Frobenius (ridge),among many others.

I Sparse estimation incurs slight changes to the previous Algorithm.

Input Output Input Output

47 / 51


Simulation

I Simulate order-3 Gaussian block tensors with d1 logR1 ≈ d2 logR2 ≈d3 logR3.

I Performance assessment by root mean squared error (RMSE):

Figure: (a) Average RMSE against d1. (b) Average RMSE against rescaled samplesize N =

√d2d3/ logR1.

48 / 51


Comparison with alternative methods

We compare our tensor block model (TBM) with two popular low-ranktensor estimation: (i) CP decomposition, and (ii) Tucker decomposition.

I Non-sparse case

TBM

I Sparse case

Sparsity (ρ) Noise (σ) Penalization (λ) Estimated Sparsity Rate Correct Zero Rate Sparsity Error Rate

0.5 4λ = 0 0(0) 0(0) 0.49(0.03)

λ = 136.4 0.56(0.04) 0.99(0.02) 0.06(0.03)

0.5 8λ = 0 0(0) 0(0) 0.49(0.03)

λ = 439.7 0.59(0.05) 0.99(0.01) 0.14(0.06)

0.8 8λ = 0 0(0) 0(0) 0.80(0.05)

λ = 241.3 0.83(0.06) 0.95(0.04) 0.12(0.06)

49 / 51


Conclusions and future work

I We have developed a framework of statistical models, scalable algo-rithms, and statistical theory to analyze tensor-valued data.

I The general strategy is to carve out a broad range of specially-structuredtensors that are useful in practice, and to develop efficient algorithmsfor analyzing these high-dimensional tensor data.

I Other structures are also possible, e.g. non-negativity (M. et al, AOAS2019, in press) and smoothness (on-going).

I Extend the modeling for second-order structure in tensors, e.g. tensornormal model with factorized covariance

E ∼ N(0,Φ1 ⊗ Φ2 ⊗ Φ3).

Tensor analogy of matrix normal models (W. et al, PNAS 2018).

50 / 51


References:

I Yuchen Zeng and M. Wang. Multiway clustering via tensor block model. Under Review inNIPS. (2019).

I M. Wang and L. Li. Learning from binary multiway data: probabilistic tensor decompositionand its statistical optimality. Under Review in JMLR. (2019).

I M. Wang, J. Fischer, and Y. S. Song. Three-way clustering of multi-tissue multi-individualgene expression data using semi-nonnegative tensor tensor decomposition. Annals of Ap-plied Statistics. (2019), Vol. 13, No. 2, 1124-1148.

I M. Wang et al. Two-way Mixed-Effects Methods for Joint Association Analyses Using BothHost and Pathogen Genomes. PNAS. Vol. 115 (24), E5440-E5449, (2018)

I M. Wang and Y. S. Song. Tensor decomposition via two-mode higher-order SVD (HOSVD).Journal of Machine Learning Research W&CP (AISTATS track), Vol. 54, (2017) 614-622.

I M. Wang, K. Dao Duc, J. Fischer, and Y. S. Song. Operator norm inequalities betweentensor unfoldings on the partition lattice. Linear Algebra and its Applications, Vol. 520(2017) 44-66.

Thank you!

51 / 51

1 / 6

2 / 6

3 / 6

4 / 6

5 / 6

6 / 6

Documents

Beyond matrices: statistical method for higher-order …pages.stat.wisc.edu/~miaoyan/tensor_tutorial2_fudan.pdfBeyond matrices: statistical method for higher-order tensors and its