
Page 1

Georgios B. Giannakis

Acknowledgments: D. Berberidis, F. Sheikholeslami, P. Traganitis; and Prof. G. Mateos. Funding: NSF 1343860, 1442686, 1514056, NSF-AFOSR 1500713, MURI-FA9550-10-1-0567, and NIH 1R01GM104975-01

Censor, Sketch, and Validate for Learning from Large-Scale Data

ISR, U. of Maryland, March 4, 2016

Page 2

Learning from “Big Data”

Big size (many samples and/or high dimensionality)

Challenges

Incomplete data

Noise and outliers

Fast streaming

Opportunities in key tasks

Dimensionality reduction

Online and robust regression, classification, and clustering

Denoising and imputation

Page 3

Roadmap

Context and motivation

Large-scale linear regressions
Random projections for data sketching
Adaptive censoring of uninformative data
Tracking high-dimensional dynamical data

Large-scale nonlinear function approximation

Large-scale data and graph clustering

Leveraging sparsity and low rank for anomalies and tensors

Closing comments

Page 4

Random projections for data sketching

Ordinary least-squares (LS): given $\mathbf{y} \in \mathbb{R}^D$ and $X \in \mathbb{R}^{D \times p}$, find $\hat{\boldsymbol{\theta}}_{LS} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{y} - X\boldsymbol{\theta}\|_2^2$

The SVD solution incurs complexity $O(Dp^2)$. Q: What if $D \gg p$?

LS estimate via a (preconditioning) random projection matrix $R \in \mathbb{R}^{d \times D}$: $\hat{\boldsymbol{\theta}}_{RP} = \arg\min_{\boldsymbol{\theta}} \|R(\mathbf{y} - X\boldsymbol{\theta})\|_2^2$. For $d \ll D$, complexity reduces to $O(dp^2)$ plus the cost of forming $RX$

M. W. Mahoney, Randomized Algorithms for Matrices and Data, Foundations and Trends in Machine Learning, vol. 3, no. 2, pp. 123-224, Nov. 2011.
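As a concrete illustration, here is a minimal numpy sketch of sketched LS, using a Gaussian random projection as a stand-in for the Hadamard-preconditioned sampling discussed on the next slide; all dimensions and the noise level are illustrative, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p, d = 10_000, 50, 500          # tall data matrix; sketch dimension d << D

X = rng.standard_normal((D, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(D)

# Full LS: O(D p^2) via a least-squares solve
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sketched LS with a Gaussian random projection R in R^{d x D}
# (a JL-type surrogate for Hadamard preconditioning)
R = rng.standard_normal((d, D)) / np.sqrt(d)
theta_rp, *_ = np.linalg.lstsq(R @ X, R @ y, rcond=None)

print("residual, full LS    :", np.linalg.norm(y - X @ theta_ls))
print("residual, sketched LS:", np.linalg.norm(y - X @ theta_rp))
```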


Page 5

Performance of randomized LS

Based on the Johnson-Lindenstrauss lemma [JL '84]

Theorem. For any $\epsilon \in (0,1)$, if $d = O\big((p \log D)/\epsilon\big)$, then w.h.p. $\|\mathbf{y} - X\hat{\boldsymbol{\theta}}_{RP}\|_2 \le (1+\epsilon)\,\|\mathbf{y} - X\hat{\boldsymbol{\theta}}_{LS}\|_2$; the attendant bound on $\|\hat{\boldsymbol{\theta}}_{RP} - \hat{\boldsymbol{\theta}}_{LS}\|_2$ depends on the condition number of $X$, and on how much of $\mathbf{y}$ lies in the range of $X$

Uniform sampling versus Hadamard preconditioning: D = 10,000 and p = 50; performance depends on X

D. P. Woodruff, "Sketching as a Tool for Numerical Linear Algebra," Foundations and Trends in Theoretical Computer Science, vol. 10, pp. 1-157, 2014.

Page 6

Online censoring for large-scale regressions

D. K. Berberidis, G. Wang, G. B. Giannakis, and V. Kekatos, "Adaptive Estimation from Big Data via Censored Stochastic Approximation," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2014.

Key idea: Sequentially test and update LS estimates only for informative data

Adaptive censoring (AC) rule: censor $(\mathbf{x}_n, y_n)$ if its normalized residual is small, i.e., if $|y_n - \mathbf{x}_n^\top \hat{\boldsymbol{\theta}}_{n-1}| \le \tau \sigma$

Criterion: update the LS estimate only on data that survive censoring

Threshold $\tau$ controls the avg. data reduction: with Gaussian residuals, a fraction $1 - 2Q(\tau)$ of the data is censored

Page 7

Censoring algorithms and performance

AC recursive least-squares (AC-RLS): updates at $O(p^2)$ complexity per retained datum

AC least mean-squares (AC-LMS): updates at $O(p)$ complexity per retained datum

Proposition 1 establishes convergence and quantifies the performance of the AC-RLS and AC-LMS estimates

D. K. Berberidis and G. B. Giannakis, "Online Censoring for Large-Scale Regressions," IEEE Trans. on SP, July 2015 (submitted); also Proc. of ICASSP, Brisbane, Australia, April 2015.
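A minimal numpy sketch of the AC-LMS idea follows; the threshold tau, step size mu, and noise scale sigma are illustrative placeholders, not values from the paper.

```python
import numpy as np

def ac_lms(X, y, tau=0.7, mu=0.01, sigma=1.0):
    """Adaptive-censoring LMS: update only when the residual
    exceeds the censoring threshold (names/values illustrative)."""
    _, p = X.shape
    theta = np.zeros(p)
    updates = 0
    for x_n, y_n in zip(X, y):
        resid = y_n - x_n @ theta
        if abs(resid) > tau * sigma:       # informative datum: update
            theta += mu * resid * x_n
            updates += 1                   # censored data cost nothing
    return theta, updates

rng = np.random.default_rng(1)
X = rng.standard_normal((20_000, 20))
y = X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(20_000)
theta, updates = ac_lms(X, y, sigma=0.5)
print(f"updated on {updates} of {len(y)} data")
```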


Page 8

Censoring vis-a-vis random projections

RPs for linear regressions [Mahoney '11], [Woodruff '14]: data-agnostic reduction; preconditioning adds cost

AC for linear regressions: data-driven measurement selection; suitable also for streaming data; minimal memory requirements

AC interpretations: reveals "causal" support vectors; censors data with low log-likelihood ratios (LLRs)

Page 9

Highly non-uniform data

AC-RLS outperforms alternatives at comparable complexity

Robust even when rows of X are uniformly "important"

Performance comparison. Synthetic: D = 10,000, p = 300 (50 MC runs); real data: benchmark estimated from the full set

Q: Time-varying parameters?

Page 10

Tracking high-dimensional dynamical data

[Block diagram: a time-varying state evolves via a transition model; at each time, observations feed a regularized regression with predictor-based regularization to produce the estimate.]

Low-complexity, reduced-dimension KF with large-scale ($D \gg 1$) correlated data

Prediction: $\hat{\mathbf{x}}_{t|t-1} = A_t \hat{\mathbf{x}}_{t-1|t-1}$, $\; P_{t|t-1} = A_t P_{t-1|t-1} A_t^\top + Q_t$

Correction: $\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t(\mathbf{y}_t - H_t \hat{\mathbf{x}}_{t|t-1})$, with gain $K_t = P_{t|t-1} H_t^\top (H_t P_{t|t-1} H_t^\top + R_t)^{-1}$

Page 11

Sketching for dynamical processes

[Diagram: data at time t pass through dimensionality reduction into a reduced-dimension KF predictor/corrector.]

Weighted LS correction incurs prohibitive complexity for high-dimensional data

Our economical KF: sketch informative observations

Related works are either costly at the fusion center [Varshney et al '14], or data-agnostic with model-driven ensemble optimality [Krause-Guestrin '11]

Page 12

RP-based KF: same predictor, sketched corrector

RP-based sketching: compress each measurement with a JL-type matrix $R_t \in \mathbb{R}^{d \times D}$, $d \ll D$

Sketched correction: run the corrector on $R_t \mathbf{y}_t$ and $R_t H_t$ instead of the full-dimensional data

Proposition 2. With JL-type sketching, if $d$ is sufficiently large, then w.h.p. the sketched correction incurs bounded performance loss relative to the full-data corrector

D. K. Berberidis and G. B. Giannakis, "Data sketching for tracking large-scale dynamical processes," Proc. of Asilomar Conf., Pacific Grove, CA, Nov. 2015.
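The following numpy sketch illustrates one way to realize an RP-based corrector: the predictor is the standard KF predictor, while the correction uses a fresh JL-type sketch of the measurements at each step. All model matrices and sizes are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
p, D, d = 10, 5_000, 100              # state, measurement, and sketch dimensions

A = 0.99 * np.eye(p)                  # state transition (illustrative)
H = rng.standard_normal((D, p))       # tall measurement matrix
Q = 0.01 * np.eye(p)                  # process noise; unit measurement noise assumed

x_est, P = np.zeros(p), np.eye(p)
x_true = rng.standard_normal(p)

for t in range(20):
    x_true = A @ x_true + 0.1 * rng.standard_normal(p)
    y = H @ x_true + rng.standard_normal(D)

    # Prediction: same as the full-data KF
    x_pred = A @ x_est
    P_pred = A @ P @ A.T + Q

    # Sketched correction: replace (y, H) by (S y, S H) with a fresh d x D sketch
    S = rng.standard_normal((d, D)) / np.sqrt(d)
    Hs, ys = S @ H, S @ y
    Rs = S @ S.T                      # sketched noise covariance (unit noise)
    K = P_pred @ Hs.T @ np.linalg.inv(Hs @ P_pred @ Hs.T + Rs)
    x_est = x_pred + K @ (ys - Hs @ x_pred)
    P = (np.eye(p) - K @ Hs) @ P_pred

print("final state error:", np.linalg.norm(x_est - x_true))
```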


Page 13

Simulated test

AC-KF outperforms RP-KF and KF with greedy design of experiments (DOE)

AC-KF complexity is much lower than that of greedy DOE-KF

A. Krause and C. Guestrin, "Submodularity and its Applications in Optimized Information Gathering: An Introduction," ACM Trans. on Intelligent Systems and Technology, vol. 2, July 2011.

Page 14

Roadmap

Context and motivation

Large-scale linear regressions

Large-scale nonlinear function approximation
Online kernel regression on a budget
Online kernel classification on a budget

Large-scale data and graph clustering

Leveraging sparsity and low rank for anomalies and tensors

Closing comments

Page 15

Linear or nonlinear functions for learning?

Regression or classification: given $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$, find $f$ such that $y_n \approx f(\mathbf{x}_n)$; e.g., linear $f(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\theta}$, or nonlinear

Pre-select a kernel (inner product) function $\kappa(\mathbf{x}, \mathbf{x}')$; lift $\mathbf{x}$ via a nonlinear map $\boldsymbol{\phi}(\mathbf{x})$ so that $f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\theta}$ is linear in the lifted domain

RKHS basis expansion: $f(\mathbf{x}) = \sum_{n=1}^N \alpha_n \kappa(\mathbf{x}, \mathbf{x}_n)$

Kernel-based nonparametric ridge regression: $\min_{f \in \mathcal{H}} \sum_{n=1}^N \big(y_n - f(\mathbf{x}_n)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2$, with memory requirement $O(N^2)$ and complexity $O(N^3)$
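A minimal numpy sketch of kernel ridge regression, making the $O(N^3)$ solve explicit; the Gaussian kernel, bandwidth, and regularization weight are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X1, X2, bw=1.0):
    """Gaussian (RBF) kernel matrix between two sets of row-vectors."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

lam = 1e-2
K = gaussian_kernel(X, X)                              # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # O(N^3) bottleneck

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = gaussian_kernel(X_test, X) @ alpha   # f(x) = sum_n alpha_n k(x, x_n)
print(f_test)
```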


Page 16

Low-rank lifting for function approximation: low-rank ($r$) subspace learning [Mardani-Mateos-GG '14], here on lifted data, with a group-sparsity regularizer

BCD solver: at iteration $t$, with the previous subspace factor and projection coefficients available
S1. Find projection coefficients via regularized least-squares
S2. Find the subspace factor via (in)exact group-shrinkage solutions

Low-rank subspace tracking via stochastic approximation (also with a "budget")

F. Sheikholeslami and G. B. Giannakis, "Scalable Kernel-based Learning via Low-rank Approximation of Lifted Data," Proc. of Intl. Conf. on Machine Learning, New York City, submitted Feb. 2016.

The Nyström approximation arises as a special case
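Since the Nyström method appears here as a special case, a small numpy sketch of the plain Nyström kernel-matrix approximation (not the BCD solver itself) may help; the landmark count m and kernel are illustrative.

```python
import numpy as np

def nystrom(kernel_fn, X, m, rng):
    """Rank-m Nystrom approximation of the kernel matrix
    from m randomly chosen landmark points."""
    idx = rng.choice(len(X), size=m, replace=False)
    K_nm = kernel_fn(X, X[idx])                 # N x m
    K_mm = kernel_fn(X[idx], X[idx])            # m x m
    return K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 5))
kfun = lambda A, B: np.exp(-((A[:, None] - B[None]) ** 2).sum(-1) / 2)

K = kfun(X, X)
K_hat = nystrom(kfun, X, m=50, rng=rng)
print("relative mismatch:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```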


Page 17

Online kernel regression and classification

High-performance online kernel-based feature extraction on a budget (OK-FEB): the extracted features turn kernel regression into linear regression!

Kernel matrix approximation. Proposition 4. For i.i.d. data, the kernel matrix can be approximated from the budgeted features, and w.p. at least $1-\delta$ the approximation error is bounded

Bounds carry over to support vector machines for regression and classification

Page 18

Adult dataset: infer whether annual income exceeds 50K using features such as education, age, and gender; 80%-20% split for training-testing

Kernel approximation via low-rank features; run time for OK-FEB + LSVM vs. K-SVM

OK-FEB + LSVM outperforms K-SVM (LibSVM) in both training and testing phases

C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 1-27, April 2011.

[Plots, Adult dataset: kernel matrix mismatch and run time (sec) versus rank r, comparing BCD-Ex, BCD-PG, Nyst, Prb-Nyst, ImpNyst, and SVD.]

Page 19

OK-FEB with linear classification and regression

[Plots. Adult dataset (classification): misclassification probability and CPU time (sec) versus iteration, comparing Perceptron, OGD, RBP, Forgetron, Projectron, BOGD, BPAs, and OK-FEB. Slice dataset (regression): average LS error and CPU time (sec) versus iteration, comparing OGD, Perceptron, Norma, RBP, Forgetron, BOGD, and OK-FEB.]

OK-FEB + LSVM outperforms budgeted K-SVM/SVR variants in classification/regression

F. Sheikholeslami, D. K. Berberidis, and G. B. Giannakis, "Kernel-based Low-rank Feature Extraction on a Budget for Big Data Streams," Proc. of GlobalSIP Conf., Dec. 14-16, 2015; also in arXiv:1601.07947.

Page 20

Roadmap

Context and motivation

Large-scale linear regressions

Large-scale nonlinear function approximation

Large-scale data and graph clustering
Random sketching and validation (SkeVa)
SkeVa-based spectral and subspace clustering

Leveraging sparsity and low rank for anomalies and tensors

Closing comments

Page 21

Big data clustering

Clustering: given $\{\mathbf{x}_n\}_{n=1}^N \subset \mathbb{R}^D$, or their pairwise distances, assign them to $K$ clusters

K-means: alternates between centroid updates and cluster assignments; locally optimal, but simple; complexity $O(NDKI)$ over $I$ iterations. Hard clustering is NP-hard; soft clustering assigns data to clusters probabilistically

Probabilistic clustering amounts to pdf estimation, e.g., Gaussian mixtures (EM-based estimation); a regularizer can account for unknown $K$

Q. What if $N$ and/or $D$ is massive?

A1. Random projections: use a $d \times D$ matrix $R$ to form $RX$; apply K-means in $d$-space (see the sketch below)

C. Boutsidis, A. Zousias, P. Drineas, and M. W. Mahoney, "Randomized dimensionality reduction for K-means clustering," IEEE Trans. on Information Theory, vol. 61, pp. 1045-1062, Feb. 2015.
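A minimal numpy/scikit-learn sketch of A1 on synthetic blobs; note that data are stored here as rows, so the slide's $RX$ becomes `X @ R.T`. All sizes and the blob model are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
N, D, K, d = 2_000, 10_000, 3, 100     # d << D sketch dimension

# Three well-separated Gaussian blobs in D dimensions
centers = 5 * rng.standard_normal((K, D))
labels_true = rng.integers(0, K, N)
X = centers[labels_true] + rng.standard_normal((N, D))

# A1: project the data with a d x D JL matrix, then run K-means in d-space
R = rng.standard_normal((d, D)) / np.sqrt(d)
labels_rp = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X @ R.T)
print("ARI in d-space:", adjusted_rand_score(labels_true, labels_rp))
```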

Page 22

Random sketching and validation (SkeVa): randomly select "informative" dimensions

Algorithm, per draw (see the sketch below):
Sketch $d \ll D$ dimensions at random
Run K-means on the sketched data
Re-sketch dimensions for validation
Validate using the consensus set
Augment the centroids of the best-scoring draw

Similar approaches possible for massive $N$ (sketching data points instead of dimensions)

Sequential and kernel variants available

P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Sketch and Validate for Big Data Clustering," IEEE Journal on Special Topics in Signal Processing, vol. 9, pp. 678-690, June 2015.
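The following is a loose numpy/scikit-learn sketch of the dimension-based SkeVa loop above, scoring each draw by within-cluster scatter on a validation re-sketch; this scoring rule is a simplification of the paper's consensus-based validation.

```python
import numpy as np
from sklearn.cluster import KMeans

def skeva(X, K, d, n_draws, rng):
    """Loose SkeVa sketch: cluster on random dimension subsets, keep the
    draw whose clusters are tightest on an independent validation sketch."""
    N, D = X.shape
    best_score, best_labels = np.inf, None
    for _ in range(n_draws):
        dims = rng.choice(D, size=d, replace=False)        # sketch
        km = KMeans(n_clusters=K, n_init=5, random_state=0).fit(X[:, dims])
        val = rng.choice(D, size=d, replace=False)         # re-sketch to validate
        score = 0.0
        for k in range(K):
            members = X[km.labels_ == k][:, val]
            if len(members):                               # within-cluster scatter
                score += ((members - members.mean(0)) ** 2).sum()
        if score < best_score:
            best_score, best_labels = score, km.labels_
    return best_labels

# Usage: labels = skeva(X, K=3, d=50, n_draws=10, rng=np.random.default_rng(0))
```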

Page 23

Divergence-based SkeVa. Idea: "informative" draws yield reliable estimates of the multimodal data pdf!

Compare pdf estimates via "distances," e.g., the integrated squared error (ISE) between kernel density estimates

For each draw: sketch $n \ll N$ points and form a kernel density estimate; if its divergence from previously retained estimates flags the draw as uninformative, re-sketch; otherwise cluster the sketched points and associate the remaining data with the nearest centroids
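For Gaussian kernel density estimates with a common bandwidth, the ISE between two sketches has a closed form via pairwise kernel evaluations; below is a small numpy helper under that assumption (bandwidth h is illustrative, and this is not code from the paper).

```python
import numpy as np

def _gauss(diff2, var, dim):
    """Isotropic Gaussian density evaluated from squared distances."""
    return np.exp(-diff2 / (2 * var)) / (2 * np.pi * var) ** (dim / 2)

def ise(X1, X2, h=0.5):
    """Closed-form ISE between two Gaussian KDEs with bandwidth h,
    using  int N(x;a,h^2 I) N(x;b,h^2 I) dx = N(a; b, 2 h^2 I)."""
    dim = X1.shape[1]
    def cross(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return _gauss(d2, 2 * h ** 2, dim).mean()
    return cross(X1, X1) + cross(X2, X2) - 2 * cross(X1, X2)

# Usage: small ISE means the two sketches estimate similar pdfs
rng = np.random.default_rng(6)
print(ise(rng.standard_normal((100, 2)), rng.standard_normal((100, 2))))
```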


Page 24

RP versus SkeVa comparisons

KDDb dataset (subset): D = 2,990,384, N = 10,000, K = 2

RP [Boutsidis et al '15] versus SkeVa

Page 25

Performance and SkeVa generalizations

Q. How many samples/draws does SkeVa need?

A. For independent draws, the number of draws can be lower bounded

Proposition 5. For a given probability $r$ of a successful Di-SkeVa draw, quantified by the pdf distance $\Delta$, the number of draws needed to succeed w.h.p. $q$ is lower bounded by $\log(1-q)/\log(1-r)$

Bound can be estimated online

Di-SkeVa is fully parallelizable; the SkeVa module can be used for spectral clustering and subspace clustering

Page 26

Identification of network communities

Kernel K-means is instrumental for partitioning large graphs (spectral clustering); it relies on the graph Laplacian to capture nodal correlations

For large graphs, kernel-based SkeVa reduces the complexity of spectral clustering by operating on $n \ll N$ sampled nodes

arXiv collaboration network (General Relativity): N = 4,158 nodes, 13,422 edges, K = 36 [Leskovec '11]
Spectral clustering: 3.1 sec; SkeVa (n = 500): 0.5 sec; SkeVa (n = 1,000): 0.85 sec

P. A. Traganitis, K. Slavakis, and G. B. Giannakis, "Spectral clustering of large-scale communities via random sketching and validation," Proc. Conf. on Info. Science and Systems, Baltimore, Maryland, March 18-20, 2015.

Page 27

Modeling Internet traffic anomalies

Anomalies: changes in origin-destination (OD) flows [Lakhina et al '04]; failures, congestions, DoS attacks, intrusions, flooding

Graph G(N, L) with N nodes, L links, and F flows (F >> L); OD flow $z_{f,t}$

[Figure: example network with OD flows $f_1$ and $f_2$ sharing link $l$.]

Packet counts per link $l$ and time slot $t$: $y_{l,t} = \sum_{f} r_{l,f}\,(z_{f,t} + a_{f,t}) + v_{l,t}$, with routing coefficients $r_{l,f} \in \{0,1\}$ and anomalies $a_{f,t}$

Matrix model across $T$ time slots: $Y = R\,(Z + A) + V$

M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies," IEEE Transactions on Information Theory, pp. 5186-5205, Aug. 2013.

Page 28

Z (and X:=RZ) low rank, e.g., [Zhang et al‘05]; A is sparse across time and flows

Data: http://math.bu.edu/people/kolaczyk/datasets.html

[Plot: anomaly amplitudes $|a_{f,t}|$ versus time index $t$.]

Low-rank plus sparse matrices

(P1) $\;\min_{X, A} \; \tfrac{1}{2}\|Y - X - RA\|_F^2 + \lambda_* \|X\|_* + \lambda_1 \|A\|_1$, with $X := RZ$
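A minimal numpy sketch of (P1) in the simplified case R = I, solved by alternating exact minimization: singular-value thresholding for the low-rank block and soft thresholding for the sparse one. The regularization weights and the identity-routing assumption are illustrative, not the paper's setting.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft(M, tau):
    """Soft thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def low_rank_plus_sparse(Y, lam_star=1.0, lam_1=0.1, n_iter=100):
    """Alternating minimization of 0.5||Y - X - A||_F^2
    + lam_star ||X||_* + lam_1 ||A||_1  (R = I simplification of (P1))."""
    X = np.zeros_like(Y)
    A = np.zeros_like(Y)
    for _ in range(n_iter):
        X = svt(Y - A, lam_star)      # exact minimizer over the low-rank block
        A = soft(Y - X, lam_1)        # exact minimizer over the sparse block
    return X, A
```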


Page 29

Real network data, Dec. 8-28, 2003

[ROC plot, Internet2 data: detection versus false-alarm probability; the proposed method outperforms [Lakhina04] and [Zhang05] at ranks 1-3.]

Data: http://www.cs.bu.edu/~crovella/links.html

[Plot: true and estimated anomaly volume across flows and time; Pfa = 0.03, Pd = 0.92.]

Improved performance by leveraging sparsity and low rank; succinct depiction of the network health state across flows and time

Page 30

From low-rank matrices to tensors

PARAFAC decomposition per slab $t$ [Harshman '70]: $Y_t \approx \sum_{r=1}^{R} \gamma_{t,r}\, \mathbf{a}_r \mathbf{b}_r^\top = A\, \mathrm{diag}(\boldsymbol{\gamma}_t)\, B^\top$, with factors $A = [\mathbf{a}_1 \cdots \mathbf{a}_R]$, $B = [\mathbf{b}_1 \cdots \mathbf{b}_R]$, $C = [\boldsymbol{\gamma}_1 \cdots \boldsymbol{\gamma}_T]^\top$

Tensor subspace comprises $R$ rank-one matrices $\{\mathbf{a}_r \mathbf{b}_r^\top\}_{r=1}^R$

Data cube of, e.g., sub-sampled MRI frames

Goal: given streaming slabs $Y_t$, learn the subspace matrices $(A, B)$ recursively, and impute possible misses of $Y_t$

J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Rank regularization and Bayesian inference for tensor completion and extrapolation," IEEE Trans. on Signal Processing, vol. 61, no. 22, pp. 5689-5703, Nov. 2013.
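A minimal numpy sketch of one streaming update in the spirit of S1-S2: fit the slab loadings by least-squares, then take a stochastic gradient step on the factors. It assumes fully observed slabs (no misses) and an illustrative step size, so it is only a caricature of the recursive algorithm.

```python
import numpy as np

def stream_parafac_step(Y_t, A, B, mu=0.01):
    """One stochastic step of streaming PARAFAC subspace learning:
    S1) LS fit of slab loadings gamma; S2) SGD update of factors A, B."""
    R = A.shape[1]
    # S1: vec(Y_t) = Phi @ gamma, where column r of Phi is kron(a_r, b_r)
    Phi = np.stack([np.kron(A[:, r], B[:, r]) for r in range(R)], axis=1)
    gamma, *_ = np.linalg.lstsq(Phi, Y_t.ravel(), rcond=None)
    # S2: gradient step on 0.5 * ||Y_t - A diag(gamma) B^T||_F^2
    E = Y_t - A @ np.diag(gamma) @ B.T
    A += mu * E @ B @ np.diag(gamma)
    B += mu * E.T @ A @ np.diag(gamma)
    return A, B, gamma

# Usage: initialize A (M x R), B (N x R) randomly and feed slabs Y_t in sequence.
```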

Page 31

Online tensor subspace learning

Stochastic alternating minimization; parallelizable across bases

Real-time reconstruction (FFT per iteration); image-domain low tensor rank

M. Mardani, G. Mateos, and G. B. Giannakis, "Subspace learning and imputation for streaming big data matrices and tensors," IEEE Trans. on Signal Processing, vol. 63, pp. 2663-2677, May 2015.

Page 32

Dynamic cardiac MRI test, in vivo dataset: 256 k-space frames of size 200 x 256

[Images: ground-truth frame; sampling trajectory; reconstructions with R = 100 under 90% misses and R = 150 under 75% misses.]

A low-rank plus sparse model can also capture motion effects

Potential for accelerating MRI at high spatio-temporal resolution

M. Mardani and G. B. Giannakis, "Accelerating dynamic MRI via tensor subspace learning," Proc. of ISMRM 23rd Annual Meeting and Exhibition, Toronto, Canada, May 30 - June 5, 2015.

Page 33

Closing comments

Large-scale learning: regression and tracking dynamic data; nonlinear non-parametric function approximation; clustering massive, high-dimensional data and graphs

Other key Big Data tasks: acquisition, processing, and storage; visualization, mining, privacy, and security

Enabling tools for Big Data: scalable computing platforms; fundamental theory, performance analysis; decentralized, robust, and parallel algorithms

Big Data application domains: sustainable systems; social, health, and bio-systems; life-enriching multimedia; secure cyberspace; business and marketing systems ...

K. Slavakis, G. B. Giannakis, and G. Mateos, "Modeling and optimization for Big Data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18-31, Sep. 2014.

Thank You!