Compressed Sensing of Big Data Networks
Alexander Jung, Dept. of Computer Science, Aalto University
December 20, 2017
Some Brainy Quotes on The Data Deluge
“We’re Drowning in Information and Starving for Knowledge.” - Rutherford D. Rogers
“There is Nothing More Practical Than a Good Theory.” - Kurt Lewin
About Me
PhD 2012 at TU Vienna
Post-Doc stay at ETH Zurich 2012
Univ.-Ass. at TU Vienna 2013-2015
since 2015, Assistant Professor (tenure-track) at Aalto University
My Research Group
heading the group “Machine Learning for Big Data”
currently five PhD students, several MSc and BSc students
research revolves around fundamental limits and efficient algorithms for machine learning involving massive network-structured datasets (big data over networks)
My Teaching
since 2015, CS-E3210 “Machine Learning: Basic Principles” (this year 600 students)
since 2016, CS-E4020 “Convex Optimization for Big Data” (this year 50 students)
from 2018, CS-E4800 “Artificial Intelligence” (currently over 170 students enrolled)
My Service
together with Prof. Sergiy A. Vorobyov (Aalto) and Prof. Holger Rauhut (RWTH Aachen), I am currently co-editing a special research topic at Frontiers Appl. Math. Stat.
Overview
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Big Data Fuels Machine Learning
the availability of vast amounts of training data makes it possible to train extremely complex models, such as deep neural networks
Andrew Ng’s Rocket Picture
[diagram: Big Data + Complex Models → Modern AI / Deep Learning]
AI Everywhere
Shazam identifies the earworm tune you are listening to
spam filters keep your inbox tidy
Google.com has become a personal genie
Shazam - Live Demo
watched Kill Bill recently
fighting scene with a cool background song
the Shazam app dug out the title in seconds!
song unrelated to my preferences in Spotify/FB etc.
Shazam
A Key Principle
modern AI systems organize big data as networks
Big Data over Networks
datasets and models often have intrinsic network structure
examples: chip design, internet, bioinformatics, social networks, the universe, material science
cf. L. Lovász, “Large Networks and Graph Limits”
Supervised Learning
data points z_i from some input space X
data points labeled with values from an output space Y
data point z_i ∈ X labeled with x_i ∈ Y
hypothesis or predictor is a map x[·] : X → Y
GOAL: learn the predictor x[·] based on all available data
House Price Prediction (X = R, Y = R)
data point z_i is the living area (feet²) of a house
data point z_i labeled with the price x_i of the house
GOAL: learn a predictor x[·] : R → R based on the available data
(from Andrew Ng’s CS229 lecture notes, “Supervised learning”)
Let’s start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:
Living area (feet²) | Price (1000$s)
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
... | ...
We can plot this data:
[scatter plot “housing prices”: x-axis square feet (500-5000), y-axis price in $1000s (0-1000)]
Given data like this, how can we learn to predict the prices of other housesin Portland, as a function of the size of their living areas?
Empirical Risk/Loss Minimization
learn predictor x[·] : R → R based on the available data {z_i, x_i}
[figure: scatter of points (z_i, x_i) with fitted line x[z] = a·z + b and residuals x_i − x[z_i]]
predictor x[·] modeled as linear: x[z] = a·z + b
minimize the empirical loss
min_{a,b} ∑_i (x_i − x[z_i])²
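As a minimal sketch (plain NumPy, reusing the five table rows above; the 2000 feet² query point is just an illustration), the least-squares fit has a closed-form solution:

```python
import numpy as np

# Living areas (feet^2) and prices (in $1000s) from the CS229 table above.
z = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
x = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Fit x[z] = a*z + b by minimizing the empirical squared loss
# sum_i (x_i - x[z_i])^2, i.e., ordinary least squares.
A = np.column_stack([z, np.ones_like(z)])
(a, b), *_ = np.linalg.lstsq(A, x, rcond=None)

print(f"learned predictor: x[z] = {a:.3f}*z + {b:.1f}")
print(f"predicted price for 2000 feet^2: {a * 2000 + b:.0f} ($1000s)")
```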
Machine Learning on Graphs
why embed the data points {z_i}_{i=1}^N into R?
a minimalist choice for the input space: X = V := {z_i}_{i=1}^N
“similar” points z_i, z_j connected by an edge with weight W_{i,j} > 0
data point z_i ∈ V with label x[i] ∈ R (output space)
the predictor is a graph signal x[·] : V → R
“learning the predictor x[·]” = “graph signal recovery”!
House Price Prediction on Graphs
[figure: houses as nodes z_i with labels x_i; similar houses z_i, z_j joined by an edge with weight W_{i,j}]
Empirical Loss Minimization over Graphs
acquire labels for data points in small sampling set M
learn x[·] : V → R using the edge weights W and the labels {x_i}_{i∈M}
aim at a small empirical (training) error
∑_{i∈M} f_i(x[i]; x_i)
the loss function f_i(·) might vary over the data points
e.g., f_i(...) := (x[i] − x_i)² or f_i(...) := |x[i] − x_i|
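A toy illustration of the two loss choices, with hypothetical predictor values and labels:

```python
# Hypothetical hypothesis values x[i] on V and initial labels x_i on M.
x_hat = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]
labels = {0: 1.0, 4: 5.0}

# Empirical training error for the squared and absolute loss choices.
sq_err = sum((x_hat[i] - xi) ** 2 for i, xi in labels.items())
abs_err = sum(abs(x_hat[i] - xi) for i, xi in labels.items())
print(sq_err, abs_err)   # roughly 0.05 and 0.3
```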
Clustering Hypothesis
dataset represented by a data graph G = (V, E, W)
hypothesis/predictor x[·] : V → R maps node z_i to the label x[i]
learn the predictor x[·] based on initial labels x_i for i ∈ M
clustering hypothesis: data points in well-connected subsets (clusters) of V have similar labels
amounts to requiring a small total variation (TV)
‖x[·]‖_TV := ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
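A minimal sketch of computing the TV of a graph signal, on an assumed toy chain graph with one weak (boundary) edge:

```python
import networkx as nx

# Weighted toy graph; the clustered signal changes only across the weak edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0)])
x = {0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0, 4: 5.0}

tv = sum(w * abs(x[i] - x[j]) for i, j, w in G.edges(data="weight"))
print(tv)   # only the weak edge {2,3} contributes: 0.1 * |1 - 5| = 0.4
```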
A Zero-Order Model for Clustered Signals
simple signal model conforming to clustering hypothesis
x[i] = ∑_{C∈F} a_C I_C[i], with I_C[i] = 1 if i ∈ C and I_C[i] = 0 otherwise
using a partition F = {C_1, ..., C_|F|} with disjoint clusters C_l
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
let us denote, for fixed F, the set of such clustered signals by X_F
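A toy construction of such a clustered signal (the partition and the values a_C are illustrative assumptions):

```python
import numpy as np

# Partition F = {C_1, C_2} with disjoint clusters and one value a_C each.
clusters = {0: [0, 1, 2], 1: [3, 4, 5]}
a = {0: 1.0, 1: 5.0}

x = np.zeros(6)
for c, nodes in clusters.items():
    x[nodes] = a[c]            # x[i] = sum_C a_C * I_C[i]
print(x)                       # [1. 1. 1. 5. 5. 5.]
```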
Good Clusters - Small Total Variation
we allow for an arbitrary partition F = {C_1, ..., C_|F|}
our results are most useful for “reasonable” clusters C_l
cluster boundary ∂F with small aggregate weight:
∑_{boundary} W_{i,j} ≪ ∑_{interior} W_{i,j}
amounts to requiring a small total variation (TV)
‖x[·]‖_TV := ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
note that the TV does not require a partition!
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
The Learning (Recovery) Problem
observe a few initial labels x_i for i ∈ M (with |M| ≪ |V|)
aim at learning all labels x[i] for i ∈ V
the empirical risk incurred by a particular hypothesis x[·] is
∑_{i∈M} f_i(x[i]; x_i)
with some real-valued loss function f_i(·; ·) associated with node i
balance the empirical risk with the total variation
‖x[·]‖_TV = ∑_{{i,j}∈E} W_{i,j} |x[i] − x[j]|
The Network Lasso
network Lasso (nLasso) [Hallac et al., 2015]
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
with initial labels x_i provided for z_i ∈ M
choosing a large λ enforces small total variation ‖x̂[·]‖_TV
choosing a small λ enforces a small empirical error
typical choices: f_i(x; y) := (x−y)² or f_i(x; y) := |x−y|
the nLasso does not require the partition F!
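A minimal nLasso sketch using the generic solver CVXPY with squared-error loss; the toy graph, labels, and λ below are assumptions for illustration, not data from the talk:

```python
import cvxpy as cp
import networkx as nx
import numpy as np

# Toy chain graph with two well-connected clusters and one weak edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1),
                           (3, 4, 2.0), (4, 5, 2.0)])
labels = {0: 1.0, 5: 5.0}      # initial labels x_i on the sampling set M
lam = 0.5                      # regularization parameter lambda

x = cp.Variable(G.number_of_nodes())
emp_err = sum(cp.square(x[i] - xi) for i, xi in labels.items())
tv = sum(w * cp.abs(x[i] - x[j]) for i, j, w in G.edges(data="weight"))
cp.Problem(cp.Minimize(emp_err + lam * tv)).solve()
print(np.round(x.value, 2))    # roughly constant on {0,1,2} and {3,4,5}
```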
Structure of Network Lasso
nLasso has particular structure:
x̂[·] ∈ argmin_{x[·]} f(x[·]), with f(x[·]) := ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
with convex loss functions f_i(x[i]; x_i)
the total variation ‖x[·]‖_TV is a non-differentiable convex function
the objective is a sum of two non-smooth convex components
minimizing each component individually is easy
Convex Optimization Problems
nLasso: x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
the objective is a sum of two non-smooth convex components
x̂[·] solves the nLasso if and only if 0 ∈ ∂f(x̂[·])
perfect prey for proximal methods
rewrite 0 ∈ ∂f(x̂[·]) as x̂[·] = P x̂[·] with some operator P
do a fixed-point iteration x^(k+1)[·] = P x^(k)[·]
Proximal Methods for nLasso
nLasso solutions characterized by x̂[·] = P x̂[·]
compute x̂[·] by a fixed-point iteration
different options for P (EXPLOIT THIS FREEDOM!)
particular choices of P yield ADMM, Pock-Chambolle, ...
these often allow for efficient (distributed) implementations
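One concrete sketch of such a fixed-point scheme: a primal-dual iteration in the style of Pock-Chambolle for nLasso with absolute-error loss. The toy problem data and step sizes are illustrative assumptions, not the specific operator used in the referenced papers:

```python
import numpy as np

# Toy nLasso instance: chain graph with one weak edge, absolute-error loss.
edges = [(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0), (4, 5, 2.0)]
labels = {0: 1.0, 5: 5.0}               # initial labels x_i on M
N, lam = 6, 0.5
tau = sigma = 0.2                       # step sizes, tau*sigma*||D||^2 <= 1

x = np.zeros(N); x_bar = x.copy()
u = np.zeros(len(edges))                # dual variable, one entry per edge

for _ in range(2000):
    # dual step: project u + sigma * (D x_bar) onto [-lam*W_e, lam*W_e]
    for e, (i, j, w) in enumerate(edges):
        u[e] = np.clip(u[e] + sigma * (x_bar[i] - x_bar[j]), -lam * w, lam * w)
    # primal step: v = x - tau * D^T u, then prox of the loss at the labels
    v = x.copy()
    for e, (i, j, _) in enumerate(edges):
        v[i] -= tau * u[e]; v[j] += tau * u[e]
    x_new = v.copy()
    for i, xi in labels.items():        # prox of tau * |x[i] - x_i|
        t = v[i] - xi
        x_new[i] = xi + np.sign(t) * max(abs(t) - tau, 0.0)
    x_bar = 2 * x_new - x               # over-relaxation step
    x = x_new

print(np.round(x, 2))                   # approx. [1, 1, 1, 5, 5, 5]
```

Each iteration only touches nodes and their incident edges, which is what makes distributed implementations natural.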
Sparse Label Propagation
nLasso
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} f_i(x[i]; x_i) + λ ‖x[·]‖_TV
amounts to balancing the empirical error with the total variation
we might instead insist on consistency with the initial labels
this suggests sparse label propagation (SLP)
x̂[·] ∈ argmin_{x[·]} ‖x[·]‖_TV s.t. x[i] = x_i for all i ∈ M
propagate labels x_i such that {x[i] − x[j]}_{(i,j)∈E} is sparse
SLP is equivalent to a linear program!
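A sketch of that LP reformulation using scipy.optimize.linprog: introduce one auxiliary variable t_e per edge with −t_e ≤ x[i] − x[j] ≤ t_e and minimize ∑_e W_e t_e subject to the label constraints (toy data assumed):

```python
import numpy as np
from scipy.optimize import linprog

# Variables: the graph signal x (N entries) and one slack t_e per edge.
edges = [(0, 1, 2.0), (1, 2, 2.0), (2, 3, 0.1), (3, 4, 2.0), (4, 5, 2.0)]
labels = {0: 1.0, 5: 5.0}               # initial labels on M
N, E = 6, len(edges)

c = np.concatenate([np.zeros(N), [w for _, _, w in edges]])  # min sum w_e*t_e
A_ub = np.zeros((2 * E, N + E)); b_ub = np.zeros(2 * E)
for e, (i, j, _) in enumerate(edges):
    A_ub[2 * e, [i, j, N + e]] = [1, -1, -1]      #  (x[i]-x[j]) - t_e <= 0
    A_ub[2 * e + 1, [i, j, N + e]] = [-1, 1, -1]  # -(x[i]-x[j]) - t_e <= 0
A_eq = np.zeros((len(labels), N + E)); b_eq = np.zeros(len(labels))
for r, (i, xi) in enumerate(labels.items()):
    A_eq[r, i], b_eq[r] = 1.0, xi                 # x[i] = x_i for i in M

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * (N + E))
print(np.round(res.x[:N], 2))                     # recovered signal x[.]
```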
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
The Learning (Recovery) Problem
dataset with graph G and some initial labels x_i for i ∈ M
the graph signal x[·] representing the labels is clustered
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when does SLP recover x[·]?
The Intuition of Nullspace Conditions
stack the initial labels into a vector y ∈ R^M
stack the clustered graph signal into a vector x ∈ X_F ⊆ R^V
recover the signal x from the “measurements” y = M x
selector matrix M with rows {e_i}_{i∈M}
recovery is impossible for any x in the nullspace K(M) of M
we have to make sure that K(M) ∩ X_F = {0}
Network Flows
consider the empirical graph with a fixed orientation of its edges
a flow f[e] with demands d[i] is a mapping f[·] from the oriented edges to R satisfying:
the conservation law
∑_{(j,i)} f[(j,i)] − ∑_{(i,j)} f[(i,j)] = d[i], for every node i ∈ V
and the capacity constraints
f[(i,j)] ≤ W_{i,j} for every oriented edge (i,j).
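A small sketch verifying both conditions for a given flow on an assumed toy oriented graph:

```python
from collections import defaultdict

# Toy oriented graph: capacities W_{i,j}, a candidate flow f, and demands d.
cap = {(0, 1): 5.0, (1, 2): 5.0, (2, 3): 5.0}
flow = {(0, 1): 3.0, (1, 2): 3.0, (2, 3): 3.0}
demand = {0: -3.0, 1: 0.0, 2: 0.0, 3: 3.0}   # d[0] < 0: source, d[3] > 0: sink

net = defaultdict(float)                     # inflow minus outflow per node
for (i, j), f in flow.items():
    net[j] += f; net[i] -= f

ok_conservation = all(abs(net[i] - d) < 1e-9 for i, d in demand.items())
ok_capacity = all(flow[e] <= cap[e] for e in flow)
print(ok_conservation, ok_capacity)          # True True
```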
Flows with Demands
[figure: a flow network with a source node (d[1] < 0), a sink node (d[6] > 0), and d[i] = 0 at the intermediate nodes 2-5]
The Network Nullspace Property (NNSP)
consider a partition F = {C_1, ...} of the empirical graph G
the sampling set M satisfies the network nullspace property w.r.t. F, denoted NNSP-(M, F), if there exists a flow f[e] with demands d[i] such that
f[e] = 2 W_{i,j} for every boundary edge e = {i,j} ∈ ∂F, and
d[i] = 0 for every node i ∈ V \ M.
[figure: clusters C_1, C_2 with boundary ∂F; edge flows of value 5 route the required boundary flow to the sampled nodes]
When is NNSP Satisfied?
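Since NNSP only asks for the existence of a flow with prescribed boundary values and free demands at the sampled nodes, it can be tested numerically. The sketch below uses a standard reduction of “flow with demands” to max-flow; the construction, the orientation convention for the boundary edges, and the toy graph are all assumptions for illustration, not code from the referenced papers:

```python
import networkx as nx

# Toy two-cluster graph: C_1 = {0,1,2}, C_2 = {3,4,5}, one weak boundary edge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 5.0),    # inside C_1
                           (3, 4, 5.0), (4, 5, 5.0),    # inside C_2
                           (2, 3, 1.0)])                # boundary edge in dF
M = {0, 5}                              # sampling set
boundary = [(2, 3, 1.0)]                # boundary edges, chosen orientation

# Removing an oriented boundary arc (i, j) carrying the fixed flow 2*W_{i,j}
# leaves a residual demand +2W at its tail i and a supply -2W at its head j.
b = {v: 0.0 for v in G}
for i, j, w in boundary:
    b[i] += 2 * w; b[j] -= 2 * w

# Each remaining undirected edge becomes two arcs of capacity W (|f[e]| <= W);
# sampled nodes get uncapacitated arcs to/from an auxiliary node, since their
# demands d[i] are unconstrained.
H = nx.DiGraph()
bset = {(i, j) for i, j, _ in boundary} | {(j, i) for i, j, _ in boundary}
for i, j, w in G.edges(data="weight"):
    if (i, j) not in bset:
        H.add_edge(i, j, capacity=w); H.add_edge(j, i, capacity=w)
for m in M:
    H.add_edge(m, "aux"); H.add_edge("aux", m)   # no capacity = unbounded
b["aux"] = -sum(b.values())                      # aux absorbs the imbalance

# Standard reduction: supplies are fed from a super source S, demands drain
# to a super sink T; the flow exists iff the max-flow saturates all demands.
for v, bv in b.items():
    if bv < 0:
        H.add_edge("S", v, capacity=-bv)
    elif bv > 0:
        H.add_edge(v, "T", capacity=bv)
need = sum(bv for bv in b.values() if bv > 0)
flow_value, _ = nx.maximum_flow(H, "S", "T")
print("NNSP flow exists:", abs(flow_value - need) < 1e-9)   # True here
```

The same reduction adapts to the NCC below by fixing the boundary flows to L·W_{i,j} and capping the arcs between the sampled nodes and the auxiliary node at K.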
NNSP implies SLP Recovers Clustered Signals
Theorem. Consider a clustered graph signal x[i] = ∑_{C∈F} a_C I_C[i] which is observed only on the sampling set M ⊆ V, yielding initial labels x_i = x[i] for i ∈ M. If NNSP-(M, F) holds, then the SLP problem
x̂[·] ∈ argmin_{x[·]} ‖x[·]‖_TV s.t. x[i] = x_i for all i ∈ M
has a unique solution, which coincides with x[·].
NNSP implies SLP is Robust
Theorem. Consider a graph signal x[·] observed only on the sampling set M. If NNSP-(M, F) holds, SLP delivers x̂[·] with
‖x̂[·] − x[·]‖_TV ≤ 6 min_{a ∈ R^{|F|}} ‖x[·] − ∑_{C∈F} a_C I_C[·]‖_TV.
Network Lasso for Learning Clustered Signals
learn the clustered graph signal (predictor) x[·] using the nLasso
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + λ ‖x[·]‖_TV
true graph signal x[·] assumed to be clustered:
x[i] = ∑_{C∈F} a_C I_C[i], with partition F = {C_1, ..., C_|F|}
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when is the solution x̂[·] of the nLasso close to x[·]?
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
Network Lasso for Learning Clustered Signals
use the nLasso for learning the underlying graph signal x[·]:
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + λ ‖x[·]‖_TV
true graph signal x[·] assumed to be clustered:
x[i] = ∑_{C∈F} a_C I_C[i], with I_C[i] = 1 if i ∈ C and I_C[i] = 0 otherwise
using a partition F = {C_1, ..., C_|F|} with disjoint clusters C_l
[figure: clusters C_1 (value a_1) and C_2 (value a_2), boundary ∂F, sampled nodes i ∈ M]
when is the solution x̂[·] of the nLasso close to x[·]?
The Intuition of Compatibility Conditions
stack the initial labels into a vector y ∈ R^M
stack the clustered graph signal into a vector x ∈ X_F ⊆ R^V
recover the signal x from the noisy “measurements” y = M x + n
selector matrix M with rows {e_i}_{i∈M}
stable recovery if M is well conditioned when restricted to X_F
The Network Compatibility Condition (NCC)
consider a partition F = {C_1, ...} of the data graph G
the sampling set M satisfies the network compatibility condition (NCC) w.r.t. the partition F, with parameters K > 0 and L > 1, if there exists a flow f[e] with demands d[i] such that
f[{i,j}] = L W_{i,j} for every boundary edge {i,j} ∈ ∂F,
|d[i]| ≤ K for every i ∈ M, and
d[i] = 0 for every node i ∈ V \ M.
The Network Compatibility Condition (NCC)
[figure: clusters C_1, C_2 with boundary ∂F; a flow with value 1/4 on the boundary edge and 1/2 on the interior edges reaches the sampled nodes]
NCC satisfied with K = 1 and L = 4
NCC implies Accurate Network Lasso
Theorem. Consider a clustered signal x_c[i] = ∑_{C∈F} a_C I_C[i] observed on the sampling set M ⊆ V, yielding noisy initial labels x_i = x_c[i] + n[i] for i ∈ M. If M satisfies the NCC with parameters K > 0 and L > 1, then any solution x̂[·] of the nLasso with λ = 1/K, i.e.,
x̂[·] ∈ argmin_{x[·]} ∑_{i∈M} |x[i] − x_i| + (1/K) ‖x[·]‖_TV,
satisfies
‖x̂[·] − x_c[·]‖_TV ≤ (K + 4/(L−1)) ∑_{i∈M} |n[i]|.
Outline
1 Machine Learning for Big Data over Networks
2 Network Lasso and Sparse Label Propagation
3 The Network Nullspace Property
4 The Network Compatibility Condition
5 The Final Three Slides
So what...?
extended nullspace and compatibility conditions to graph signals
network nullspace property (NNSP) for SLP
network compatibility condition (NCC) for nLasso
NNSP and NCC amount to the existence of certain network flows
NNSP and NCC depend on the connectivity of the sampled nodes
Reading Material for Holidays
A. Mara and AJ, “Recovery Conditions and Sampling Strategies for Network Lasso”, Asilomar 2017.
AJ, N. Quang, and A. Mara, “When is Network Lasso Accurate?”, arXiv preprint, 2017.
AJ, A. Heimowitz, and Y.C. Eldar, “The Network Nullspace Property for Compressed Sensing over Networks”, SAMPTA 2017.
AJ, A.O. Hero III, A. Mara, and S. Jahromi, “Semi-Supervised Learning via Sparse Label Propagation”, arXiv 2017.
S. Basirian and AJ, “Random Walk Sampling for Big Data over Networks”, SAMPTA 2017.
R.T. Rockafellar, “Convex Analysis” - THE OLD TESTAMENT OF CONVEX ANALYSIS!
Thank God it's X-mas!
Merry Christmas and a Happy New Year!