Robust and Scalable Algorithms for Big Data Analytics
Georgios B. Giannakis
Acknowledgment: Drs. G. Mateos, K. Slavakis, G. Leus, and M. Mardani
Arlington, VA, USA, March 22, 2013
Roadmap
- Robust principal component analysis
  - Linear low-rank models and sparse outliers
- Scalable algorithms for big network data analytics
  - (De-)centralized and online rank minimization
- Robust sparse embedding via dictionary learning
  - Nonlinear low-rank models
  - Data-adaptive compressed sensing
- Concluding remarks
[Slide graphic: BIG and fast; BIG and messy]
Principal component analysis
- Motivation: (statistical) learning from high-dimensional data
- Principal component analysis (PCA) [Pearson'1901]
  - Extraction of low(est)-dimensional structure
  - Applications: source (de)coding, anomaly identification, recommender systems, ...
  - PCA is non-robust to outliers [Huber'81], [Jolliffe'86], [Wright et al'09-12]
- Objective: robustify PCA by controlling outlier sparsity
[Figure: example data sources: DNA microarray, traffic surveillance]
PCA formulations
- Training data: high-dimensional vectors collected as columns of a data matrix
- Minimum reconstruction error formulation
  - Compression operator maps each datum to a low-dimensional code
  - Reconstruction operator maps the code back to the ambient space
- Component analysis model: data approximated by a mean plus a low-rank (subspace) component
- Solution: the principal subspace, i.e., the dominant eigenvectors of the sample covariance (equivalently, the truncated SVD of the centered data)
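A minimal PCA sketch consistent with the formulation above, assuming the columns of the data matrix are the training vectors; the function name pca and the rank parameter q are illustrative choices, not notation from the slides.

```python
import numpy as np

# Minimal PCA sketch: given data Y (p x T, columns are training vectors),
# find a rank-q subspace and low-dimensional scores that minimize the
# reconstruction error; the solution is the truncated SVD of the centered data.
def pca(Y, q):
    m = Y.mean(axis=1, keepdims=True)                      # sample mean
    U, s, Vt = np.linalg.svd(Y - m, full_matrices=False)   # SVD of centered data
    Uq = U[:, :q]                                          # principal directions
    S = Uq.T @ (Y - m)                                     # compression: low-dimensional scores
    Y_hat = m + Uq @ S                                     # reconstruction
    return m, Uq, S, Y_hat
```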
Robustifying PCA
- Outlier variables: one per datum, nonzero if the datum is an outlier and zero otherwise
  - Nominal data obey the low-rank subspace model; outliers obey something else
  - Tied to robust linear regression [Fuchs'99], [Giannakis et al'11]
  - Both the subspace and the outliers are unknown; outliers are typically sparse!
- Natural (but intractable) estimator (P0): least-squares fit of the subspace model with a penalty on the number of outlier-contaminated data (an l0-type penalty)

G. Mateos and G. B. Giannakis, "Robust PCA as bilinear decomposition with outlier sparsity regularization," IEEE Transactions on Signal Processing, pp. 5176-5190, Oct. 2012.
Universal robustness
- (P0) is NP-hard; relax it to a convex surrogate (P1), e.g., [Tropp'06]
  - Role of the sparsity-controlling parameter is central
- Q: Does (P1) yield robust estimates? A: Yes! The Huber estimator is a special case
Alternating minimization (P1)
- Subspace update: SVD of the outlier-compensated data
- Outlier update: row-wise soft-thresholding of the residuals (thresholds at ±γ)
- Proposition: Algorithm 1's iterates converge to a stationary point of (P1)
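A minimal sketch of the alternating minimization, assuming (P1) takes the form of a least-squares fit with a row-wise group-lasso penalty on the outlier matrix; the exact weighting and mean-vector handling of the paper are omitted, and robust_pca_am, q, and lam are illustrative names.

```python
import numpy as np

# Sketch of alternating minimization for a robust-PCA surrogate of the form
#   min_{L, O}  0.5*||Y - L - O||_F^2 + lam * sum_t ||o_t||_2,  rank(L) <= q,
# where rows of Y are data vectors and O collects the (row-sparse) outliers.
def robust_pca_am(Y, q, lam, n_iter=50):
    O = np.zeros_like(Y)
    L = np.zeros_like(Y)
    for _ in range(n_iter):
        # Subspace update: rank-q truncated SVD of the outlier-compensated data
        U, s, Vt = np.linalg.svd(Y - O, full_matrices=False)
        L = U[:, :q] @ np.diag(s[:q]) @ Vt[:q, :]
        # Outlier update: row-wise soft-thresholding (vector shrinkage) of residuals
        R = Y - L
        row_norms = np.linalg.norm(R, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - lam / np.maximum(row_norms, 1e-12))
        O = R * shrink
    return L, O
```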
Video surveillance
- Background modeling from video feeds [De la Torre-Black '01]
[Figure panels: Data, PCA, Robust PCA, "Outliers"]
Data: http://www.cs.cmu.edu/~ftorre/
Robust unveiling of communities
- Robust kernel PCA for identification of cohesive subgroups
- Network: NCAA football teams (vertices), Fall '00 games (edges)
  - Identified exactly: Big 10, Big 12, ACC, SEC, ...; Outliers: independent teams
  - ARI = 0.8967
Data: http://www-personal.umich.edu/~mejn/netdata/
Online robust PCA
- Nominal data: obey the low-rank subspace model
- Outliers: sparse, as in the batch formulation
- Motivation: real-time big data and memory limitations
  - At time t, do not re-estimate past quantities in batch form
  - Scalability via exponentially weighted subspace tracking
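A simplified online sketch in the spirit of the bullets above, assuming per-datum outlier soft-thresholding followed by a stochastic-gradient subspace update with re-orthonormalization; this is not the exact exponentially weighted recursion of the tracker, and online_robust_pca, mu, and lam are illustrative.

```python
import numpy as np

# Simplified online robust PCA sketch: per new datum y_t, estimate the outlier
# by soft-thresholding the residual, refit the projection coefficients, then
# take a stochastic-gradient step on the subspace and re-orthonormalize.
def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_robust_pca(data_stream, p, q, lam, mu=0.5, seed=0):
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((p, q)))[0]   # initial orthonormal basis
    for y in data_stream:
        s = U.T @ y                                    # projection coefficients
        o = soft(y - U @ s, lam)                       # sparse outlier estimate
        s = U.T @ (y - o)                              # refit after outlier removal
        r = y - o - U @ s                              # residual off the subspace
        U = np.linalg.qr(U + mu * np.outer(r, s))[0]   # gradient step + re-orthonormalize
        yield U, s, o
```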
Roadmap
- Robust principal component analysis
  - Linear low-rank models and sparse outliers
- Scalable algorithms for big network data
  - (De-)centralized and online rank minimization
- Robust embedding via dictionary learning
  - Nonlinear low-rank models
  - Data-adaptive compressed sensing
- Concluding remarks
Modeling traffic anomalies
- Graph G(N, L) with N nodes, L links, and F flows (F >> L); OD flow z_{f,t}
- Routing matrix R ∈ {0,1}^{L×F}
- Packet counts per link l and time slot t: y_{l,t} = Σ_f r_{l,f} (z_{f,t} + a_{f,t}) + v_{l,t}
- Matrix model across T time slots: Y = R(Z + A) + V, with Y of size L×T
[Figure: toy network with flows f1, f2 and link l, with an anomaly on one flow]
- Anomalies: changes in origin-destination (OD) flows [Lakhina et al'04]
  - Failures, congestions, DoS attacks, intrusions, flooding
Low-rank plus sparse matrices
- Z has low rank, e.g., [Zhang et al'05]; A is sparse across time and flows
Data: http://math.bu.edu/people/kolaczyk/datasets.html
[Figure: anomaly amplitudes |a_{f,t}| vs. time index t]
General decomposition problem
- Given Y and the routing matrix R, identify sparse A when X is low rank
  - R is fat, but X = RZ is still low rank
- (P1): minimize over {X, A} the LS fit ||Y − X − RA||_F² plus a nuclear-norm penalty on X and an l1-norm penalty on A
- Rank minimization with the nuclear norm, e.g., [Recht-Fazel-Parrilo'10]
  - Principal Components Pursuit (PCP) [Candes et al'10], [Chandrasekaran et al'11]
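A hedged sketch of a solver for the convex estimator, assuming (P1) is the least-squares fit with nuclear-norm and l1 penalties written above; it uses plain proximal-gradient (ISTA-style) iterations rather than the authors' solver, and lam_star/lam_1 are illustrative parameter names.

```python
import numpy as np

# Proximal-gradient sketch for
#   min_{X,A} 0.5*||Y - X - R A||_F^2 + lam_star*||X||_* + lam_1*||A||_1.
def svt(M, tau):
    """Singular value thresholding: prox of tau*||.||_*"""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau*||.||_1"""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lowrank_plus_compressed_sparse(Y, R, lam_star, lam_1, n_iter=300):
    L_links, T = Y.shape
    F = R.shape[1]
    X, A = np.zeros((L_links, T)), np.zeros((F, T))
    step = 1.0 / (1.0 + np.linalg.norm(R, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        E = Y - X - R @ A                             # residual
        X = svt(X + step * E, step * lam_star)        # gradient + prox (nuclear norm)
        A = soft(A + step * (R.T @ E), step * lam_1)  # gradient + prox (l1 norm)
    return X, A
```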
Challenges and importance
- RA is not necessarily sparse and R is fat, so PCP is not applicable
- Important special cases
  - R = I: matrix decomposition with PCP [Candes et al'10]
  - X = 0: compressive sampling with basis pursuit [Chen et al'01]
  - X = C_{L×ρ} W'_{ρ×T} and A = 0: PCA [Pearson 1901]
  - X = 0, R = D unknown: dictionary learning [Olshausen'97]
- Many more unknowns (LT + FT) than observations (LT)
Exact recovery
- Noise-free case (P0)
- Q: Can one recover sparse A and low-rank X exactly? A: Yes! Under certain conditions on {X0, A0, R}
- Theorem: Given Y and R, assume every row and column of A0 has at most k < s nonzero entries, and R has full row rank. If conditions C1) and C2) hold, then (P0) exactly recovers {X0, A0}.

M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies," IEEE Transactions on Information Theory, 2013.
In-network processing
- Robust imputation of the network data matrix
  - Applications: network health cartography, smart metering
- Goal: given a few rows per agent, perform distributed cleansing and imputation by leveraging the low rank of the nominal data and the sparsity of the outliers
- Challenge: the nuclear norm is not separable across rows (links/agents)

G. Mateos and K. Rajawat, "Dynamic network cartography," IEEE Signal Processing Magazine, May 2013.
Separable regularization
- Key property: the nuclear norm admits the separable characterization ||X||_* = min over factorizations X = C W' of (1/2)(||C||_F² + ||W||_F²), where C is L×ρ with ρ ≥ rank[X]
- Separable formulation (P2) equivalent to (P1): replace the nuclear norm of X by (1/2)(||C||_F² + ||W||_F²) with X = C W'
  - Nonconvex, but fewer optimization variables
- Proposition: If {C, W, A} is a stationary point of (P2) and the residual Y − CW' − RA has spectral norm no larger than the nuclear-norm weight, then {CW', A} is a global optimum of (P1).
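A small numerical check of the key property, assuming the separable characterization of the nuclear norm stated above; the dimensions and the SVD-based construction of the factors are illustrative.

```python
import numpy as np

# Verify ||X||_* = min_{X = C W'} 0.5*(||C||_F^2 + ||W||_F^2),
# attained, e.g., by C = U sqrt(S), W = V sqrt(S) from the SVD X = U S V'.
rng = np.random.default_rng(0)
L_dim, T, r = 40, 60, 5
X = rng.standard_normal((L_dim, r)) @ rng.standard_normal((r, T))   # rank-r matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
C = U @ np.diag(np.sqrt(s))
W = Vt.T @ np.diag(np.sqrt(s))

nuc = s.sum()                                                        # nuclear norm
sep = 0.5 * (np.linalg.norm(C, 'fro')**2 + np.linalg.norm(W, 'fro')**2)
print(nuc, sep)   # the two values agree up to floating-point error
```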
Decentralized rank minimization
- Alternating-direction method of multipliers (ADMM) solver for (P2)
  - Method [Glowinski-Marrocco'75], [Gabay-Mercier'76]
  - Learning over networks [Schizas-Ribeiro-Giannakis'07]
- Consensus-based optimization; attains centralized performance

M. Mardani, G. Mateos, and G. B. Giannakis, "In-network sparsity-regularized rank minimization: Algorithms and applications," IEEE Transactions on Signal Processing, 2013.
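A toy global-variable consensus-ADMM sketch on distributed least squares, included only to illustrate the consensus mechanism behind in-network optimization; it is not the paper's ADMM recursions for (P2), and consensus_admm and rho are illustrative.

```python
import numpy as np

# Global-variable consensus ADMM for min_x sum_i 0.5*||A_i x - b_i||^2,
# where agent i holds (A_i, b_i) and all agents must agree on x.
def consensus_admm(A_list, b_list, rho=1.0, n_iter=100):
    n = A_list[0].shape[1]
    N = len(A_list)
    x = [np.zeros(n) for _ in range(N)]      # local copies
    u = [np.zeros(n) for _ in range(N)]      # scaled dual variables
    z = np.zeros(n)                          # consensus variable
    for _ in range(n_iter):
        for i in range(N):
            # Local update: min_x 0.5||A_i x - b_i||^2 + (rho/2)||x - z + u_i||^2
            x[i] = np.linalg.solve(A_list[i].T @ A_list[i] + rho * np.eye(n),
                                   A_list[i].T @ b_list[i] + rho * (z - u[i]))
        z = np.mean([x[i] + u[i] for i in range(N)], axis=0)   # averaging (consensus)
        for i in range(N):
            u[i] += x[i] - z                                   # dual update
    return z
```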
Internet2 data
- Real network data
  - Dec. 8-28, 2008
  - N = 11, L = 41, F = 121, T = 504
[Figure: detection probability vs. false-alarm probability (ROC) for the proposed method and for [Lakhina04] and [Zhang05] at ranks 1-3]
[Figure: true vs. estimated anomaly volume across flows and time; Pfa = 0.03, Pd = 0.92]
Data: http://www.cs.bu.edu/~crovella/links.html
Online rank minimization
- Construct an estimated map of anomalies in real time
  - Streaming data model: link counts y_t arrive sequentially, t = 1, 2, ...
- Approach: regularized exponentially-weighted LS formulation (a simplified sketch follows the figure below)

M. Mardani, G. Mateos, and G. B. Giannakis, "Dynamic anomalography: Tracking network anomalies via sparsity and low rank," IEEE Journal of Selected Topics in Signal Processing, pp. 50-66, Feb. 2013.
[Figure: tracking cleansed link traffic (links such as ATLA--HSTN, DNVR--KSCY, HSTN--ATLA) and real-time unveiling of anomalies (flows such as CHIN--ATLA, WASH--STTL, WASH--WASH); estimated vs. true curves over the time index t]
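A simplified streaming sketch, assuming the model y_t = x_t + R a_t + v_t with x_t = L q_t in a slowly varying subspace; the paper's exponentially weighted LS recursions are replaced here by per-datum ISTA steps for the anomaly and a stochastic-gradient subspace update, and all names (q_dim, lam, mu) are illustrative.

```python
import numpy as np

# Online unveiling of anomalies from streaming link counts y_t = L q_t + R a_t + v_t.
def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_anomaly_tracking(y_stream, R, q_dim, lam, mu=0.1, inner=10, seed=0):
    L_links, F = R.shape
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((L_links, q_dim)) / np.sqrt(L_links)  # initial subspace
    step = 1.0 / np.linalg.norm(R, 2) ** 2
    for y in y_stream:
        a = np.zeros(F)
        for _ in range(inner):
            q = np.linalg.lstsq(L, y - R @ a, rcond=None)[0]            # subspace coefficients
            a = soft(a + step * R.T @ (y - L @ q - R @ a), step * lam)  # anomaly (ISTA step)
        L = L + mu * np.outer(y - L @ q - R @ a, q)                     # subspace update
        yield a, L @ q                                                  # anomaly map, cleansed traffic
```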
Roadmap
- Robust principal component analysis
  - Linear low-rank models and sparse outliers
- Scalable algorithms for big network data analytics
  - (De-)centralized and online rank minimization
- Robust sparse embedding via dictionary learning
  - Nonlinear low-rank models; data-adaptive compressed sensing
- Concluding remarks
Nonlinear low-dimensional models?
- Compressive sampling (CS) [Donoho/Candes'06]: linear operator
  - CS vs. data-adaptive principal component analysis (PCA) [Pearson'1901]
  - Data-adaptive nonlinear CS?; quad-CS [Ohlsson et al'13]
- Nonlinear dimensionality reduction for data on manifolds
  - Kernel PCA [Scholkopf et al'98]; SDE [Weinberger'04]; reconstruction?
  - Local linear embedding (LLE) [Roweis-Saul'00]; LEM; MDS; Isomap ...
  - Sparsity-aware embeddings [Huang et al'10], [Vidal'11], [Kong et al'12]
  - Dictionary learning (DL) [Olshausen'97]; online DL [Mairal et al'10], [Carin et al'11]
Learning sparse manifold models
- Training data lie on a smooth but unknown manifold
- Robust sparse embedding via dictionary learning (RSE-DL)
  - The embedding reduces and morphs the training data to yield a smoother basis for the manifold
  - The reduced data matrix is used to learn the dictionary
- Criterion combines a sparse training-data fit with a smooth affine-manifold fit
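A minimal dictionary-learning sketch (alternating l1 sparse coding and a least-squares dictionary update); it omits the smooth affine-manifold fit term of RSE-DL, and n_atoms and lam are illustrative parameters.

```python
import numpy as np

# Classic dictionary learning: alternate sparse coding (ISTA for the lasso)
# and a least-squares dictionary update with atom renormalization.
def soft(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def dictionary_learning(Y, n_atoms, lam, n_outer=30, n_ista=50, seed=0):
    p, T = Y.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)        # unit-norm atoms
    S = np.zeros((n_atoms, T))
    for _ in range(n_outer):
        # Sparse coding: min_S 0.5||Y - D S||_F^2 + lam ||S||_1
        step = 1.0 / np.linalg.norm(D, 2) ** 2
        for _ in range(n_ista):
            S = soft(S + step * D.T @ (Y - D @ S), step * lam)
        # Dictionary update: least squares, then renormalize atoms
        D = Y @ S.T @ np.linalg.pinv(S @ S.T + 1e-8 * np.eye(n_atoms))
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, S
```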
Parsimonious nonlinear embedding
- Embedding preserves the local structure of the data
- Reduced-complexity embedding step
- RSE-DL appropriate for (de-)compression and reconstruction
- Robust sparse coding: also works for clustering/classification
RSE-DL compression and reconstruction
- Operational phase @ Tx: per data vector
  - Compress: sparse-code the datum against the learned dictionary
- Operational phase @ Rx: given the (possibly noisy) code
  - Reconstruct: synthesize the datum from the dictionary and the received code
- Less computationally demanding modules
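A hedged sketch of the operational phases, assuming compression amounts to sparse coding against the learned dictionary D and reconstruction to synthesis from the received code; compress and reconstruct are illustrative names.

```python
import numpy as np

def soft(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def compress(y, D, lam, n_ista=100):
    """Tx: sparse-code data vector y against dictionary D (ISTA for the lasso)."""
    s = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_ista):
        s = soft(s + step * D.T @ (y - D @ s), step * lam)
    return s

def reconstruct(s_noisy, D):
    """Rx: synthesize an estimate of y from the (possibly noisy) sparse code."""
    return D @ s_noisy
```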
Test case: Swiss roll
- Noise on the manifold and channel noise added

Comparisons with LLE, RSE, RSGE
- (Average over 100 realizations)

Missing data
- "USC girl" image (predates Lena!) with 50% misses
- RSE-DL: reduced complexity relative to, e.g., Bayesian-type methods [Chen et al'10]
Concluding summary
- Robust PCA; online via robust subspace tracking
  - Leveraging linear low-rank models and outlier sparsity
- Unveiling anomalies in large-scale network data
  - Scalable decentralized and online algorithms
- Data-adaptive, nonlinear, low-dimensional models
- The road ahead
  - Performance bounds? Dynamical network data?
  - Learning via quantized big data (few bits)?
  - RSE-DL for nonlinear compressive sampling?

Thank you!
Numerical validation
- Setup: L = 105, F = 210, T = 420; R ~ Bernoulli(1/2); X0 = R P Q' with P, Q ~ N(0, 1/FT); a_{ij} ∈ {-1, 0, 1} w.p. {π/2, 1-π, π/2}
- Relative recovery error
[Figure: relative recovery error as a function of rank(X0) (r) and percentage of nonzero entries (s/FT, %)]
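A sketch that generates synthetic data per the stated setup; the particular rank r and anomaly probability pi are illustrative values that the experiment sweeps.

```python
import numpy as np

# Synthetic data per the validation setup: L=105, F=210, T=420,
# R ~ Bernoulli(1/2), X0 = R P Q' with P, Q ~ N(0, 1/(F T)),
# and anomalies a_ij in {-1, 0, 1} w.p. {pi/2, 1-pi, pi/2}.
rng = np.random.default_rng(0)
L_dim, F, T, r, pi = 105, 210, 420, 5, 0.01                # r and pi are swept in the experiment

R = (rng.random((L_dim, F)) < 0.5).astype(float)           # Bernoulli(1/2) routing matrix
P = rng.normal(0.0, np.sqrt(1.0 / (F * T)), (F, r))
Q = rng.normal(0.0, np.sqrt(1.0 / (F * T)), (T, r))
X0 = R @ P @ Q.T                                           # low-rank component (L x T)

A0 = rng.choice([-1.0, 0.0, 1.0], size=(F, T), p=[pi / 2, 1 - pi, pi / 2])
Y = X0 + R @ A0                                            # noise-free observations
```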