Compressed Sensing of Big Data Networks
Alexander Jung, Dept. of Computer Science, Aalto University
December 20, 2017
Some Brainy Quotes on The Data Deluge
“We’re Drowning in Information and Starving for Knowledge.” - Rutherford D. Rogers
“There is Nothing More Practical Than a Good Theory.” - Kurt Lewin
About Me
PhD 2012 at TU Vienna
Post-Doc stay at ETH Zurich 2012
Univ.-Ass. at TU Vienna 2013-2015
since 2015, Assistant Professor (tenure-track) at Aalto University
My Research Group
heading the group “Machine Learning for Big Data”
currently five PhD students, several MSc and BSc students
research revolves around fundamental limits and efficient algorithms for machine learning involving massive network-structured datasets (big data over networks)
My Teaching
since 2015, CS-E3210 “Machine Learning: Basic Principles” (this year 600 students)
since 2016, CS-E4020 “Convex Optimization for Big Data” (this year 50 students)
from 2018, CS-E4800 “Artificial Intelligence” (currently over 170 students enrolled)
My Service
together with Prof. Sergiy A. Vorobyov (Aalto) and Prof. Holger Rauhut (RWTH Aachen), I am currently co-editing a special research topic at Frontiers Appl. Math. Stat.
Overview
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Big Data Fuels Machine Learning
the availability of vast amounts of training data makes it possible to train extremely complex models, such as deep neural networks
Andrew Ng’s Rocket Picture
[diagram: Big Data + Complex Models → Modern AI / Deep Learning]
AI Everywhere
Shazam identifies the earworm tune you are listening to
spam filters keep your inbox tidy
Google.com has become a personal genie
Shazam - Live Demo
watched Kill Bill recently
fighting scene with a cool background song
the Shazam app dug out the title in seconds!
song unrelated to my preferences in Spotify/FB etc.
Shazam
A Key Principle
modern AI systems organize big data as networks
Big Data over Networks
datasets and models often have intrinsic network structure
examples: chip design, internet, bioinformatics, social networks, the universe, material science
cf. L. Lovász, “Large Networks and Graph Limits”
Supervised Learning
data points z_i from some input space X
data points labeled with values from an output space Y
data point z_i ∈ X labeled with x_i ∈ Y
hypothesis or predictor is a map x[·] : X → Y
GOAL: learn the predictor x[·] based on all available data
House Price Prediction (X = R, Y = R)
data point z_i is the living area (feet²) of a house
data point z_i labeled with the price x_i of the house
GOAL: learn a predictor x[·] : R → R based on the available data
(from Andrew Ng’s CS229 lecture notes, “Supervised learning”)
Let’s start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet²) | Price (1000$s)
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
... | ...
We can plot this data:
[scatter plot “housing prices”: x-axis square feet (500-5000), y-axis price in $1000s (0-1000)]
Given data like this, how can we learn to predict the prices of other housesin Portland, as a function of the size of their living areas?
Empirical Risk/Loss Minimization
learn predictor x[·] : R → R based on the available data {z_i, x_i}
[figure: scatter of points (z_i, x_i) with fitted line x[z] = a·z + b and residuals x_i − x[z_i]]
predictor x[·] modeled as linear: x[z] = a·z + b
minimize the empirical loss
min_{a,b} ∑_i (x_i − x[z_i])²
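As a minimal sketch (plain NumPy, reusing the five table rows above; the 2000 feet² query point is just an illustration), the least-squares fit has a closed-form solution:

```python
import numpy as np

# Living areas (feet^2) and prices (in $1000s) from the CS229 table above.
z = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
x = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Fit x[z] = a*z + b by minimizing the empirical squared loss
# sum_i (x_i - x[z_i])^2, i.e., ordinary least squares.
A = np.column_stack([z, np.ones_like(z)])
(a, b), *_ = np.linalg.lstsq(A, x, rcond=None)

print(f"learned predictor: x[z] = {a:.3f}*z + {b:.1f}")
print(f"predicted price for 2000 feet^2: {a * 2000 + b:.0f} ($1000s)")
```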
Machine Learning on Graphs
why embed the data points {z_i}_{i=1}^N into R?
a minimalist choice for the input space: X = V := {z_i}_{i=1}^N
“similar” points z_i, z_j connected by an edge with weight W_{i,j} > 0
data point z_i ∈ V with label x[i] ∈ R (output space)
the predictor is a graph signal x[·] : V → R
“learning the predictor x[·]” = “graph signal recovery”!
House Price Prediction on Graphs
[figure: houses as nodes z_i with labels x_i; similar houses z_i, z_j joined by an edge with weight W_{i,j}]
Empirical Loss Minimization over Graphs
acquire labels for data points in small sampling set M
learn x[·] : V → R using the edge weights W and the labels {x_i}_{i∈M}
aim at a small empirical (training) error
∑_{i∈M} f_i(x[i]; x_i)
the loss function f_i(·) might vary over the data points
e.g., f_i(...) := (x[i] − x_i)² or f_i(...) := |x[i] − x_i|
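A toy illustration of the two loss choices, with hypothetical predictor values and labels:

```python
# Hypothetical hypothesis values x[i] on V and initial labels x_i on M.
x_hat = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]
labels = {0: 1.0, 4: 5.0}

# Empirical training error for the squared and absolute loss choices.
sq_err = sum((x_hat[i] - xi) ** 2 for i, xi in labels.items())
abs_err = sum(abs(x_hat[i] - xi) for i, xi in labels.items())
print(sq_err, abs_err)   # roughly 0.05 and 0.3
```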
Clustering Hypothesis
dataset represented by a data graph G = (V, E, W)
hypothesis/predictor x[·] : V → R maps node z_i to the label x[i]
learn the predictor x[·] based on initial labels x_i for i ∈ M
clustering hypothesis: data points in well-connected subsets (clusters) of V have similar labels
amounts to requiring a small total variation (TV)
‖x[·]‖_TV := ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
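A minimal sketch of computing the TV of a graph signal, on an assumed toy chain graph with one weak (boundary) edge:

```python
import networkx as nx

# Weighted toy graph; the clustered signal changes only across the weak edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0)])
x = {0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0, 4: 5.0}

tv = sum(w * abs(x[i] - x[j]) for i, j, w in G.edges(data="weight"))
print(tv)   # only the weak edge {2,3} contributes: 0.1 * |1 - 5| = 0.4
```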
A Zero-Order Model for Clustered Signals
simple signal model conforming to clustering hypothesis
x[i] = ∑_{C∈F} a_C I_C[i], with I_C[i] = 1 if i ∈ C and I_C[i] = 0 otherwise
using a partition F = {C_1, ..., C_|F|} with disjoint clusters C_l
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
let us denote, for fixed F, the set of such clustered signals by X_F
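A toy construction of such a clustered signal (the partition and the values a_C are illustrative assumptions):

```python
import numpy as np

# Partition F = {C_1, C_2} with disjoint clusters and one value a_C each.
clusters = {0: [0, 1, 2], 1: [3, 4, 5]}
a = {0: 1.0, 1: 5.0}

x = np.zeros(6)
for c, nodes in clusters.items():
    x[nodes] = a[c]            # x[i] = sum_C a_C * I_C[i]
print(x)                       # [1. 1. 1. 5. 5. 5.]
```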
Good Clusters - Small Total Variation
we allow for an arbitrary partition F = {C_1, ..., C_|F|}
our results are most useful for “reasonable” clusters C_l
cluster boundary ∂F with small aggregate weight:
∑_{boundary} W_{i,j} ≪ ∑_{interior} W_{i,j}
amounts to requiring a small total variation (TV)
‖x[·]‖_TV := ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
note that the TV does not require a partition!
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
The Learning (Recovery) Problem
observe a few initial labels x_i for i ∈ M (with |M| ≪ |V|)
aim at learning all labels x[i] for i ∈ V
the empirical risk incurred by a particular hypothesis x[·] is
∑_{i∈M} f_i(x[i]; x_i)
with some real-valued loss function f_i(·; ·) associated with node i
balance the empirical risk with the total variation
‖x[·]‖_TV = ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
The Network Lasso
network Lasso (nLasso) [Hallac et al., 2015]
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
with initial labels x_i provided for z_i ∈ M
choosing a large λ enforces small total variation ‖x̂[·]‖_TV
choosing a small λ enforces a small empirical error
typical choices: f_i(x; y) := (x−y)² or f_i(x; y) := |x−y|
the nLasso does not require the partition F!
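A minimal nLasso sketch using the generic solver CVXPY with squared-error loss; the toy graph, labels, and λ below are assumptions for illustration, not data from the talk:

```python
import cvxpy as cp
import networkx as nx
import numpy as np

# Toy chain graph with two well-connected clusters and one weak edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1),
                           (3, 4, 2.0), (4, 5, 2.0)])
labels = {0: 1.0, 5: 5.0}      # initial labels x_i on the sampling set M
lam = 0.5                      # regularization parameter lambda

x = cp.Variable(G.number_of_nodes())
emp_err = sum(cp.square(x[i] - xi) for i, xi in labels.items())
tv = sum(w * cp.abs(x[i] - x[j]) for i, j, w in G.edges(data="weight"))
cp.Problem(cp.Minimize(emp_err + lam * tv)).solve()
print(np.round(x.value, 2))    # roughly constant on {0,1,2} and {3,4,5}
```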
Structure of Network Lasso
nLasso has particular structure:
x̂[·] ∈ argmin_{x[·]} f(x[·]), with f(x[·]) := ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
with convex loss functions f_i(x[i]; x_i)
the total variation ‖x[·]‖_TV is a non-differentiable convex function
the objective is a sum of two non-smooth convex components
minimizing each component individually is easy
Convex Optimization Problems
nLasso: x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
the objective is a sum of two non-smooth convex components
x̂[·] solves the nLasso if and only if 0 ∈ ∂f(x̂[·])
perfect prey for proximal methods
rewrite 0 ∈ ∂f(x̂[·]) as x̂[·] = P x̂[·] with some operator P
do a fixed-point iteration x^(k+1)[·] = P x^(k)[·]
Proximal Methods for nLasso
nLasso solutions characterized by x̂[·] = P x̂[·]
compute x̂[·] by a fixed-point iteration
different options for P (EXPLOIT THIS FREEDOM!)
particular choices of P yield ADMM, Pock-Chambolle, ...
these often allow for efficient (distributed) implementations
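One concrete sketch of such a fixed-point scheme: a primal-dual iteration in the style of Pock-Chambolle for nLasso with absolute-error loss. The toy problem data and step sizes are illustrative assumptions, not the specific operator used in the referenced papers:

```python
import numpy as np

# Toy nLasso instance: chain graph with one weak edge, absolute-error loss.
edges = [(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0), (4, 5, 2.0)]
labels = {0: 1.0, 5: 5.0}               # initial labels x_i on M
N, lam = 6, 0.5
tau = sigma = 0.2                       # step sizes, tau*sigma*||D||^2 <= 1

x = np.zeros(N); x_bar = x.copy()
u = np.zeros(len(edges))                # dual variable, one entry per edge

for _ in range(2000):
    # dual step: project u + sigma * (D x_bar) onto [-lam*W_e, lam*W_e]
    for e, (i, j, w) in enumerate(edges):
        u[e] = np.clip(u[e] + sigma * (x_bar[i] - x_bar[j]), -lam * w, lam * w)
    # primal step: v = x - tau * D^T u, then prox of the loss at the labels
    v = x.copy()
    for e, (i, j, _) in enumerate(edges):
        v[i] -= tau * u[e]; v[j] += tau * u[e]
    x_new = v.copy()
    for i, xi in labels.items():        # prox of tau * |x[i] - x_i|
        t = v[i] - xi
        x_new[i] = xi + np.sign(t) * max(abs(t) - tau, 0.0)
    x_bar = 2 * x_new - x               # over-relaxation step
    x = x_new

print(np.round(x, 2))                   # approx. [1, 1, 1, 5, 5, 5]
```

Each iteration only touches nodes and their incident edges, which is what makes distributed implementations natural.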
Sparse Label Propagation
nLasso
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
amounts to balancing the empirical error with the total variation
we might instead insist on consistency with the initial labels
this suggests sparse label propagation (SLP)
x̂[·] ∈ argmin_{x[·]} ‖x[·]‖_TV s.t. x[i] = x_i for all i ∈ M
propagate labels x_i such that {x[i] − x[j]}_{(i,j)∈E} is sparse
SLP is equivalent to a linear program!
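A sketch of that LP reformulation using scipy.optimize.linprog: introduce one auxiliary variable t_e per edge with −t_e ≤ x[i] − x[j] ≤ t_e and minimize ∑_e W_e t_e subject to the label constraints (toy data assumed):

```python
import numpy as np
from scipy.optimize import linprog

# Variables: the graph signal x (N entries) and one slack t_e per edge.
edges = [(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0), (4, 5, 2.0)]
labels = {0: 1.0, 5: 5.0}               # initial labels on M
N, E = 6, len(edges)

c = np.concatenate([np.zeros(N), [w for _, _, w in edges]])  # min sum w_e*t_e
A_ub = np.zeros((2 * E, N + E)); b_ub = np.zeros(2 * E)
for e, (i, j, _) in enumerate(edges):
    A_ub[2 * e, [i, j, N + e]] = [1, -1, -1]      #  (x[i]-x[j]) - t_e <= 0
    A_ub[2 * e + 1, [i, j, N + e]] = [-1, 1, -1]  # -(x[i]-x[j]) - t_e <= 0
A_eq = np.zeros((len(labels), N + E)); b_eq = np.zeros(len(labels))
for r, (i, xi) in enumerate(labels.items()):
    A_eq[r, i], b_eq[r] = 1.0, xi                 # x[i] = x_i for i in M

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * (N + E))
print(np.round(res.x[:N], 2))                     # recovered signal x[.]
```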
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
The Learning (Recovery) Problem
dataset with graph G and some initial labels x_i for i ∈ M
the graph signal x[·] representing the labels is clustered
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when does SLP recover x[·]?
The Intuition of Nullspace Conditions
stack the initial labels into a vector y ∈ R^M
stack the clustered graph signal into a vector x ∈ X_F ⊆ R^V
recover the signal x from the “measurements” y = M x
selector matrix M with rows {e_i}_{i∈M}
recovery is impossible for any x in the nullspace K(M) of M
we have to make sure that K(M) ∩ X_F = {0}
Network Flows
consider the empirical graph with a fixed orientation of its edges
a flow f[e] with demands d[i] is a mapping f[·] from the oriented edges to R satisfying:
the conservation law
∑_{(j,i)} f[(j,i)] − ∑_{(i,j)} f[(i,j)] = d[i], for every node i ∈ V
and the capacity constraints
f[(i,j)] ≤ W_{i,j} for every oriented edge (i,j).
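A small sketch verifying both conditions for a given flow on an assumed toy oriented graph:

```python
from collections import defaultdict

# Toy oriented graph: capacities W_{i,j}, a candidate flow f, and demands d.
cap = {(0, 1): 5.0, (1, 2): 5.0, (2, 3): 5.0}
flow = {(0, 1): 3.0, (1, 2): 3.0, (2, 3): 3.0}
demand = {0: -3.0, 1: 0.0, 2: 0.0, 3: 3.0}   # d[0] < 0: source, d[3] > 0: sink

net = defaultdict(float)                     # inflow minus outflow per node
for (i, j), f in flow.items():
    net[j] += f; net[i] -= f

ok_conservation = all(abs(net[i] - d) < 1e-9 for i, d in demand.items())
ok_capacity = all(flow[e] <= cap[e] for e in flow)
print(ok_conservation, ok_capacity)          # True True
```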
Flows with Demands
[figure: a flow network with a source node (d[1] < 0), a sink node (d[6] > 0), and d[i] = 0 at the intermediate nodes 2-5]
The Network Nullspace Property (NNSP)
consider a partition F = {C_1, ...} of the empirical graph G
the sampling set M satisfies the network nullspace property w.r.t. F, denoted NNSP-(M, F), if there exists a flow f[e] with demands d[i] such that
f[e] = 2 W_{i,j} for every boundary edge e = {i,j} ∈ ∂F, and
d[i] = 0 for every node i ∈ V \ M.
[figure: clusters C_1, C_2 with boundary ∂F; edge flows of value 5 route the required boundary flow to the sampled nodes]
When is NNSP Satisfied?
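Since NNSP only asks for the existence of a flow with prescribed boundary values and free demands at the sampled nodes, it can be tested numerically. The sketch below uses a standard reduction of “flow with demands” to max-flow; the construction, the orientation convention for the boundary edges, and the toy graph are all assumptions for illustration, not code from the referenced papers:

```python
import networkx as nx

# Toy two-cluster graph: C_1 = {0,1,2}, C_2 = {3,4,5}, one weak boundary edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 5.0),    # inside C_1
                           (3, 4, 5.0), (4, 5, 5.0),    # inside C_2
                           (2, 3, 1.0)])                # boundary edge in dF
M = {0, 5}                              # sampling set
boundary = [(2, 3, 1.0)]                # boundary edges, chosen orientation

# Removing an oriented boundary arc (i, j) carrying the fixed flow 2*W_{i,j}
# leaves a residual demand +2W at its tail i and a supply -2W at its head j.
b = {v: 0.0 for v in G}
for i, j, w in boundary:
    b[i] += 2 * w; b[j] -= 2 * w

# Each remaining undirected edge becomes two arcs of capacity W (|f[e]| <= W);
# sampled nodes get uncapacitated arcs to/from an auxiliary node, since their
# demands d[i] are unconstrained.
H = nx.DiGraph()
bset = {(i, j) for i, j, _ in boundary} | {(j, i) for i, j, _ in boundary}
for i, j, w in G.edges(data="weight"):
    if (i, j) not in bset:
        H.add_edge(i, j, capacity=w); H.add_edge(j, i, capacity=w)
for m in M:
    H.add_edge(m, "aux"); H.add_edge("aux", m)   # no capacity = unbounded
b["aux"] = -sum(b.values())                      # aux absorbs the imbalance

# Standard reduction: supplies are fed from a super source S, demands drain
# to a super sink T; the flow exists iff the max-flow saturates all demands.
for v, bv in b.items():
    if bv < 0:
        H.add_edge("S", v, capacity=-bv)
    elif bv > 0:
        H.add_edge(v, "T", capacity=bv)
need = sum(bv for bv in b.values() if bv > 0)
flow_value, _ = nx.maximum_flow(H, "S", "T")
print("NNSP flow exists:", abs(flow_value - need) < 1e-9)   # True here
```

The same reduction adapts to the NCC below by fixing the boundary flows to L·W_{i,j} and capping the arcs between the sampled nodes and the auxiliary node at K.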
NNSP implies SLP Recovers Clustered Signals
Theorem. Consider a clustered graph signal x[i] = ∑_{C∈F} a_C I_C[i] which is observed only on the sampling set M ⊆ V, yielding initial labels x_i = x[i] for i ∈ M. If NNSP-(M, F) holds, then the SLP problem
x̂[·] ∈ argmin_{x[·]} ‖x[·]‖_TV s.t. x[i] = x_i for all i ∈ M
has a unique solution, which coincides with x[·].
NNSP implies SLP is Robust
Theorem. Consider a graph signal x[·] observed only on the sampling set M. If NNSP-(M, F) holds, SLP delivers x̂[·] with
‖x̂[·] − x[·]‖_TV ≤ 6 min_{a ∈ R^{|F|}} ‖x[·] − ∑_{C∈F} a_C I_C[·]‖_TV.
Network Lasso for Learning Clustered Signals
learn the clustered graph signal (predictor) x[·] using the nLasso
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + λ ‖x[·]‖_TV
true graph signal x[·] assumed to be clustered:
x[i] = ∑_{C∈F} a_C I_C[i], with partition F = {C_1, ..., C_|F|}
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when is the solution x̂[·] of the nLasso close to x[·]?
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Network Lasso for Learning Clustered Signals
use the nLasso for learning the underlying graph signal x[·]:
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + λ ‖x[·]‖_TV
true graph signal x[·] assumed to be clustered:
x[i] = ∑_{C∈F} a_C I_C[i], with I_C[i] = 1 if i ∈ C and I_C[i] = 0 otherwise
using a partition F = {C_1, ..., C_|F|} with disjoint clusters C_l
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when is the solution x̂[·] of the nLasso close to x[·]?
The Intuition of Compatibility Conditions
stack the initial labels into a vector y ∈ R^M
stack the clustered graph signal into a vector x ∈ X_F ⊆ R^V
recover the signal x from the noisy “measurements” y = M x + n
selector matrix M with rows {e_i}_{i∈M}
stable recovery if M is well conditioned when restricted to X_F
The Network Compatibility Condition (NCC)
consider a partition F = {C_1, ...} of the data graph G
the sampling set M satisfies the network compatibility condition (NCC) w.r.t. the partition F, with parameters K > 0 and L > 1, if there exists a flow f[e] with demands d[i] such that
f[{i,j}] = L W_{i,j} for every boundary edge {i,j} ∈ ∂F,
|d[i]| ≤ K for every i ∈ M, and
d[i] = 0 for every node i ∈ V \ M.
The Network Compatibility Condition (NCC)
[figure: clusters C_1, C_2 with boundary ∂F; a flow with value 1/4 on the boundary edge and 1/2 on the interior edges reaches the sampled nodes]
NCC satisfied with K = 1 and L = 4
NCC implies Accurate Network Lasso
Theorem. Consider a clustered signal x_c[i] = ∑_{C∈F} a_C I_C[i] observed on the sampling set M ⊆ V, yielding noisy initial labels x_i = x_c[i] + n[i] for i ∈ M. If M satisfies the NCC with parameters K > 0 and L > 1, then any solution x̂[·] of the nLasso with λ = 1/K, i.e.,
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + (1/K) ‖x[·]‖_TV,
satisfies
‖x̂[·] − x_c[·]‖_TV ≤ (K + 4/(L−1)) ∑_{i∈M} |n[i]|.
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
So what...?
extended nullspace and compatibility conditions to graph signals
network nullspace property (NNSP) for SLP
network compatibility condition (NCC) for nLasso
NNSP and NCC amount to the existence of certain network flows
NNSP and NCC depend on the connectivity of the sampled nodes
Reading Material for Holidays
A. Mara and AJ, “Recovery Conditions and Sampling Strategies for Network Lasso”, Asilomar 2017.
AJ, N. Quang, and A. Mara, “When is Network Lasso Accurate?”, arXiv preprint, 2017.
AJ, A. Heimowitz, and Y.C. Eldar, “The Network Nullspace Property for Compressed Sensing over Networks”, SAMPTA 2017.
AJ, A.O. Hero III, A. Mara, and S. Jahromi, “Semi-Supervised Learning via Sparse Label Propagation”, arXiv 2017.
S. Basirian and AJ, “Random Walk Sampling for Big Data over Networks”, SAMPTA 2017.
R.T. Rockafellar, “Convex Analysis” - THE OLD TESTAMENT OF CONVEX ANALYSIS!
Thank God it's X-mas!
Merry Christmas and a Happy New Year!