
Page 1: ISIT Tutorial Information theory and machine learning Part II

Emmanuel AbbePrinceton University

ISIT TutorialInformation theory and machine learning

Part II

1

Martin WainwrightUC Berkeley

Page 2: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

A large variety of machine learning and data-mining problems are about inferring global properties on a collection of agents by observing local noisy interactions of these agents.

Examples:
- community detection in social networks

2

Page 3: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation

3

Page 4: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval

4

Page 5: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
- ...

5

Page 6: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

In each case: observe information on the edges of a network that has been generated from hidden attributes on the nodes, and try to infer back these attributes.

Dual to the graphical model learning problem (previous part).

6

Page 7: ISIT Tutorial Information theory and machine learning Part II

[Figure: channel coding diagram; message X_1, ..., X_n encoded by a code C, sent through a channel W used N times, outputs Y_1, ..., Y_N]

What about graph-based codes?

Different: the code is a design parameter and typically uses specific non-local interactions of the bits (e.g., random, LDPC, polar codes).

(1) What are the relevant types of “codes” and “channels” behind machine learning problems?
(2) What are the fundamental limits for these?

7

Page 8: ISIT Tutorial Information theory and machine learning Part II

Outline of the talk

1. Community detection and clustering
2. Stochastic block models: fundamental limits and capacity-achieving algorithms
3. Open problems
4. Graphical channels and low-rank matrix recovery

8

Page 9: ISIT Tutorial Information theory and machine learning Part II

Community detection and clustering

9

Page 10: ISIT Tutorial Information theory and machine learning Part II

- social networks: “friendship”
- biological networks: “protein interactions”
- genome HiC networks: “DNA contacts”
- call graphs: “calls”

Networks provide local interactions among agents

10

Page 11: ISIT Tutorial Information theory and machine learning Part II


Networks provide local interactions among agents; one often wants to infer global similarity classes.

11

Page 12: ISIT Tutorial Information theory and machine learning Part II

Tutorial motto: Can one establish a clear line-of-sight, as in communications with the Shannon capacity?

[Figure: peak data efficiency vs. SNR for WAN wireless technologies (EV-DO, HSDPA, 802.16), plotted against the Shannon capacity and the infeasible region]

The challenges of community detection

A long-studied and notoriously hard problem:
- what is a good clustering? (assortative and disassortative relations) -> work with models
- how to get a good clustering? (computationally hard, many heuristics...)

12

Page 13: ISIT Tutorial Information theory and machine learning Part II

The Stochastic Block Model

13

Page 14: ISIT Tutorial Information theory and machine learning Part II

The stochastic block model

SBM(n, p, W)

p = (p_1, ..., p_k)  <- probability vector = relative sizes of the communities;  P = diag(p)
W  <- symmetric k x k matrix with entries in [0,1] = probabilities of connecting across communities:

    W = \begin{pmatrix} W_{11} & \cdots & W_{1k} \\ \vdots & \ddots & \vdots \\ W_{k1} & \cdots & W_{kk} \end{pmatrix}

[Figure: k = 4 communities of sizes np_1, ..., np_4, with connection probabilities W_{ij} within and across communities]

The DMC of clustering..?

14
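To make the model concrete, here is a minimal Python sketch of how one might sample a graph from SBM(n, p, W); the function name and the toy parameters are illustrative choices, not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, p, W):
    """Draw a graph from SBM(n, p, W): community labels i.i.d. from p, and an edge
    between u and v placed independently with probability W[label_u, label_v]."""
    p, W = np.asarray(p, float), np.asarray(W, float)
    labels = rng.choice(len(p), size=n, p=p)
    probs = W[labels][:, labels]                    # n x n matrix of pairwise edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, 1)  # sample only the upper triangle
    A = upper | upper.T                             # symmetric adjacency, no self-loops
    return labels, A

# Toy example with k = 4 communities (hypothetical parameters):
labels, A = sample_sbm(500, [0.4, 0.3, 0.2, 0.1],
                       np.full((4, 4), 0.01) + 0.04 * np.eye(4))
```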

Page 15: ISIT Tutorial Information theory and machine learning Part II

The (exact) recovery problem

We will see weaker recovery requirements later

Starting point: progress in science often comes from understanding special cases...

15

Let X^n = [X_1, ..., X_n] represent the community variables of the nodes (drawn under p).

Definition. An algorithm X̂^n(·) solves (exact) recovery in the SBM if, for a random graph G under the model, lim_{n→∞} P(X^n = X̂^n(G)) = 1.

Page 16: ISIT Tutorial Information theory and machine learning Part II

SBM with 2 symmetric communities: 2-SBM

16

Page 17: ISIT Tutorial Information theory and machine learning Part II

2-SBM

p_1 = p_2 = 1/2,   W_{11} = W_{22} = p,   W_{12} = q

[Figure: two communities of size n/2 each, with intra-community edge probability p and inter-community edge probability q]

17

Page 18: ISIT Tutorial Information theory and machine learning Part II

Some history for the 2-SBM

Recovery problem

[Timeline figure, 1983 to 2014: Holland-Laskey-Leinhardt; Bui-Chaudhuri-Leighton-Sipser; Boppana; Dyer-Frieze; Snijders-Nowicki; Jerrum-Sorkin; Condon-Karp; Carson-Impagliazzo; McSherry; Bickel-Chen; Rohe-Chatterjee-Yu]

Bui, Chaudhuri, Leighton, Sipser ’84 | maxflow-mincut | p = Ω(1/n), q = o(n^{-1-4/((p+q)n)})
Boppana ’87 | spectral methods | (p-q)/√(p+q) = Ω(√(log(n)/n))
Dyer, Frieze ’89 | min-cut via degrees | p-q = Ω(1)
Snijders, Nowicki ’97 | EM algorithm | p-q = Ω(1)
Jerrum, Sorkin ’98 | Metropolis algorithm | p-q = Ω(n^{-1/6+ε})
Condon, Karp ’99 | augmentation algorithm | p-q = Ω(n^{-1/2+ε})
Carson, Impagliazzo ’01 | hill-climbing algorithm | p-q = Ω(n^{-1/2} log^4(n))
McSherry ’01 | spectral methods | (p-q)/√p ≥ Ω(√(log(n)/n))
Bickel, Chen ’09 | N-G modularity | (p-q)/√(p+q) = Ω(log(n)/√n)
Rohe, Chatterjee, Yu ’11 | spectral methods | p-q = Ω(1)

Instead of ‘how’: when can we recover the clusters (information-theoretically)?

P(X̂^n = X^n) → 1

18

algorithm-driven...

Page 19: ISIT Tutorial Information theory and machine learning Part II

Information-theoretic view of clustering (über alles)

[Figure: channel coding (message X_1, ..., X_n, code C, channel W used N times, outputs Y_1, ..., Y_N) vs. the 2-SBM (community labels X_1, ..., X_n and N = \binom{n}{2} pairwise observations through W)]

Channel coding (BSC):  W = \begin{pmatrix} 1-ε & ε \\ ε & 1-ε \end{pmatrix},  rate R = n/N:  reliable comm. iff R < 1 - H(ε).

2-SBM:  W = \begin{pmatrix} 1-p & p \\ 1-q & q \end{pmatrix},  N = \binom{n}{2},  rate R = 2/n:  reliable comm. iff ???

exact recovery  ⇔  reliable communication, with an unorthodox code!

19

Page 20: ISIT Tutorial Information theory and machine learning Part II

Recovery problem

[Timeline figure, 1983 to 2014, as on the previous slide]

[Figure: (a, b) phase diagram showing the infeasible and efficiently achievable regions]


Some history for the 2-SBM

p = a log(n)/n,  q = b log(n)/n:   Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

20

P(X̂^n = X^n) → 1

Page 21: ISIT Tutorial Information theory and machine learning Part II

Recovery problem

[Timeline figure, 1983 to 2014: the recovery results above, together with detection results by Coja-Oghlan, Decelle-Krzakala-Moore-Zdeborová, Massoulié, Mossel-Neeman-Sly]

Detection problem

p = a/n,  q = b/n:   Detection iff (a-b)^2 > 2(a+b)   [Mossel-Neeman-Sly]

[Figure: (a, b) phase diagram showing the efficiently achievable region]


Detection:  ∃ε > 0 : P( d(X̂^n, X^n)/n < 1/2 - ε ) → 1

Some history for the 2-SBM

Recovery (P(X̂^n = X^n) → 1), with p = a log(n)/n, q = b log(n)/n:   Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

What about multiple/asymmetric communities? Conjecture: the detection threshold changes with 5 or more communities.

21
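The two 2-SBM thresholds just stated are easy to evaluate numerically; a minimal sketch (function names and the example values of a, b are my own, not from the tutorial):

```python
import numpy as np

def exact_recovery_possible(a, b):
    """2-SBM with p = a*log(n)/n, q = b*log(n)/n: exact recovery iff (a+b)/2 >= 1 + sqrt(a*b),
    i.e., (sqrt(a) - sqrt(b))**2 >= 2  [Abbe-Bandeira-Hall]."""
    return (a + b) / 2 >= 1 + np.sqrt(a * b)

def detection_possible(a, b):
    """2-SBM with p = a/n, q = b/n: detection iff (a - b)**2 > 2*(a + b)  [Mossel-Neeman-Sly]."""
    return (a - b) ** 2 > 2 * (a + b)

print(exact_recovery_possible(9, 1), detection_possible(9, 1))   # True, True
print(exact_recovery_possible(4, 1), detection_possible(4, 1))   # False, False
```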

Page 22: ISIT Tutorial Information theory and machine learning Part II

Recovery in the 2-SBM: IT limit   [Abbe-Bandeira-Hall ’14]

p = a log(n)/n,  q = b log(n)/n

[Figure: two communities of size n/2; for a node i, B_i = number of neighbors inside its own community (edges w.p. p), R_i = number of neighbors across (edges w.p. q)]

Converse: if (a+b)/2 - √(ab) < 1 then ML fails w.h.p.

What is ML? -> min-bisection. ML fails if two nodes can be swapped to reduce the cut.

P(B_1 ≤ R_1) = n^{-((a+b)/2 - √(ab)) + o(1)}
P(∃i : B_i ≤ R_i) = ?  ≍  n · P(B_1 ≤ R_1)   (weak correlations)

22

Page 23: ISIT Tutorial Information theory and machine learning Part II

Recovery in the 2-SBM: efficient algorithms   [Abbe-Bandeira-Hall ’14]

[Figure: two communities of size n/2 (labels ±1), intra-community probability p, inter-community probability q]

P(∃X̂^n : d(X̂^n, X^n)/n ∉ [1/2 - ε, 1/2 + ε]) → 1

ML-decoding (NP-hard):
    max_x  x^T A x   s.t.  x_i = ±1,  1^T x = 0

spectral relaxation:
    max_x  x^T A x   s.t.  ||x|| = 1,  1^T x = 0

lifting X = xx^T:
    max_X  tr(AX)   s.t.  X_ii = 1,  1^T X = 0,  X ⪰ 0,  rank(X) = 1

SDP: drop the rank constraint.

23
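A minimal sketch of the SDP relaxation above, written with cvxpy (the solver library is my choice here, not something the tutorial prescribes); it drops the rank constraint and rounds with the top eigenvector of the optimal X:

```python
import numpy as np
import cvxpy as cp

def sdp_cluster(A):
    """SDP relaxation sketch for 2-SBM recovery: max tr(AX) s.t. X_ii = 1, X 1 = 0, X PSD.
    Intended for small n only; SDP solvers are expensive (as noted on the next slide)."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1, cp.sum(X, axis=1) == 0]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    _, vecs = np.linalg.eigh(X.value)
    return np.sign(vecs[:, -1])        # ±1 community estimate from the top eigenvector
```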

Page 24: ISIT Tutorial Information theory and machine learning Part II

Note that SDP can be expensive...

-> Analyze the spectral norm of a random matrix:
[Abbe-Bandeira-Hall ’14] Bernstein: slightly loose
[Xu-Hajek-Wu ’15] Seginer bound
[Bandeira, Bandeira-Van Handel ’15] tight bound

Recovery in the 2-SBM: efficient algorithms

Abbe-Bandeira-Hall ’14 24

Theorem. The SDP solves recovery if 2 L_SBM + 11^T + I_n ⪰ 0, where L_SBM = D_{G^+} - D_{G^-} - A.

Page 25: ISIT Tutorial Information theory and machine learning Part II

The general SBM

25

Page 26: ISIT Tutorial Information theory and machine learning Part II

Quiz: If a node is in community i, how many neighbors does it have in expectation in community j?

1. n p_j        2. n p_j W_{ij}        3. n p_i W_{ij}        4. 7i

[Figure: SBM(n, p, W) with k = 4 communities of sizes np_1, ..., np_4 and connectivities W_{ij}, as before]

“degree profile matrix”:   n P W

26
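A quick Monte Carlo sanity check of the quiz answer (n p_j W_ij, i.e., entries of the degree profile matrix n P W): the parameters below are hypothetical toy values, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 2000, 3
p = np.array([0.5, 0.3, 0.2])
W = np.full((k, k), 0.005) + 0.025 * np.eye(k)

labels = rng.choice(k, size=n, p=p)
A = rng.random((n, n)) < W[labels][:, labels]     # edge indicators, pair (u, v) w.p. W[label_u, label_v]
A = np.triu(A, 1)
A = A | A.T                                        # symmetric adjacency, no self-loops

i, j = 0, 1
nodes_i = np.where(labels == i)[0]
empirical = A[np.ix_(nodes_i, labels == j)].sum() / len(nodes_i)
print(empirical, n * p[j] * W[i, j])               # both close to the (i, j) entry of n P W
```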

Page 27: ISIT Tutorial Information theory and machine learning Part II

Back to the information-theoretic view of clustering

[Figure: channel coding with code C and channel W (rate R = n/N) vs. the SBM with N = \binom{n}{2} pairwise observations through W (rate R = 2/n)]

Channel coding:  reliable comm. iff R < 1 - H(ε);  in general, reliable comm. iff R < max_p I(p, W)  (a KL-divergence).

2-SBM:  exact recovery iff 1 < (a+b)/2 - √(ab).

General SBM:  exact recovery iff 1 < J(p, W) ???

27

Page 28: ISIT Tutorial Information theory and machine learning Part II

Main results

[Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

    J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1,

where

    D_+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} ( t μ_ℓ + (1-t) ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} )  =  max_{t∈[0,1]} D_t(μ, ν).

We call D_+ the CH-divergence.

• D_{1/2}(μ, ν) = (1/2) ||√μ - √ν||_2^2 is the Hellinger divergence (distance)
• max_t ( -log Σ_i μ_i^t ν_i^{1-t} ) is the Chernoff divergence
• D_t is an f-divergence: Σ_i ν_i f(μ_i/ν_i) with f(x) = 1 - t + tx - x^t

For two symmetric communities this reduces to (1/2)(√a - √b)^2 ≥ 1.   [Abbe-Bandeira-Hall ’14]

28
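A small numerical sketch of the quantities in Theorem 1: a grid search over t for D_+ and the resulting J(p, Q). Function names are mine; the code assumes strictly positive profile entries.

```python
import numpy as np

def ch_divergence(mu, nu, grid=2001):
    """CH-divergence D_+(mu, nu) = max_{t in [0,1]} sum_l [t*mu_l + (1-t)*nu_l - mu_l^t * nu_l^(1-t)]."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    ts = np.linspace(0.0, 1.0, grid)[:, None]
    vals = (ts * mu + (1 - ts) * nu - mu**ts * nu**(1 - ts)).sum(axis=1)
    return vals.max()

def J(p, Q):
    """J(p, Q) = min_{i<j} D_+((PQ)_i, (PQ)_j); Theorem 1: recovery iff J(p, Q) >= 1."""
    p, Q = np.asarray(p, float), np.asarray(Q, float)
    profiles = Q @ np.diag(p)      # row i = (p_j * Q_ij)_j, the degree profile of community i
    k = len(p)
    return min(ch_divergence(profiles[i], profiles[j])
               for i in range(k) for j in range(i + 1, k))

# Two symmetric communities: J = (a+b)/2 - sqrt(a*b), the Abbe-Bandeira-Hall threshold.
a, b = 9.0, 1.0
print(J([0.5, 0.5], [[a, b], [b, a]]), (a + b) / 2 - np.sqrt(a * b))
```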

Page 29: ISIT Tutorial Information theory and machine learning Part II

Main results

[Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

    J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1,   where D_+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} ( t μ_ℓ + (1-t) ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} ).

Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES!

Theorem 2. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time.

29

Page 30: ISIT Tutorial Information theory and machine learning Part II


[Abbe-Sandon ’15]

When can we extract a specific community?

30

Theorem. If community i has a profile (PQ)_i at D_+-distance at least 1 from all other profiles (PQ)_j, j ≠ i, then it can be extracted w.h.p.

What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon ’15] (second paper)

Page 31: ISIT Tutorial Information theory and machine learning Part II

Proof techniques and algorithms

31

Page 32: ISIT Tutorial Information theory and machine learning Part II

A key step

Classify a node v by its degree profile d_v (number of neighbors in each of the k communities):

    Hypothesis 1: d_v ~ Poisson(log(n) (PQ)_1)    vs.    Hypothesis 2: d_v ~ Poisson(log(n) (PQ)_2)

Theorem. For any θ_1, θ_2 ∈ (R_+ \ {0})^k with θ_1 ≠ θ_2 and p_1, p_2 ∈ R_+ \ {0},

    Σ_{x ∈ Z_+^k} min( P_{ln(n)θ_1}(x) p_1, P_{ln(n)θ_2}(x) p_2 ) = Θ( n^{-D_+(θ_1, θ_2) - o(1)} ),

where D_+ is the CH-divergence and P_λ denotes the multivariate Poisson(λ) distribution.

[Abbe-Sandon ’15] 32

Page 33: ISIT Tutorial Information theory and machine learning Part II

Plan: put effort into recovering most of the nodes, then finish greedily with local improvements.

How to use this?

[Abbe-Sandon ’15] 33

Page 34: ISIT Tutorial Information theory and machine learning Part II

The degree-profiling algorithm   (capacity-achieving)

(1) Split G into two graphs: G' with loglog-degree and G'' with log-degree.
(2) Run Sphere-comparison on G'  ->  gets a fraction 1 - o(1) of the nodes right (see next).
(3) Take now G'' together with the clustering of G': re-classify each node by its degree profile d_v, testing
    Hypothesis 1: d_v ~ Poisson(log(n) (PQ)_1)   vs.   Hypothesis 2: d_v ~ Poisson(log(n) (PQ)_2),
    with error probability P_e = n^{-D_+((PQ)_1,(PQ)_2)+o(1)}.

[Abbe-Sandon ’15] 34
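A stripped-down sketch of the Poisson hypothesis test used in step (3): pick the community whose expected profile log(n)·(PQ)_i best explains the observed degree vector. This is an illustration of the test, not the Abbe-Sandon implementation; names and inputs are assumptions.

```python
import numpy as np

def classify_by_degree_profile(d_v, PQ, log_n):
    """Maximum-likelihood Poisson test: d_v[j] = number of neighbors of v currently classified
    in community j; PQ[i] is the profile (PQ)_i; assumes strictly positive profile entries."""
    d_v = np.asarray(d_v, float)
    best, best_ll = None, -np.inf
    for i, prof in enumerate(np.asarray(PQ, float)):
        lam = log_n * prof
        ll = np.sum(d_v * np.log(lam) - lam)   # Poisson log-likelihood up to the d_v! term
        if ll > best_ll:
            best, best_ll = i, ll
    return best
```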

Page 35: ISIT Tutorial Information theory and machine learning Part II

How do we get most nodes correctly?

35

Page 36: ISIT Tutorial Information theory and machine learning Part II

Other recovery requirements

Partial recovery: an algorithm solves partial recovery in the SBM with accuracy c if it produces a clustering which is correct on a fraction c of the nodes with high probability.

- Exact recovery: c = 1
- Almost exact recovery: c = 1 - o(1)
- Weak recovery or detection: c = 1/k + ε for some ε > 0 (for the symmetric k-SBM)

For all the above: what are the “efficient” vs. “information-theoretic” fundamental limits?

36

Page 37: ISIT Tutorial Information theory and machine learning Part II

Partial recovery in SBM(n, p, Q/n)

What is a good notion of SNR? Note that the SNR scales if Q scales!

Proposed notion of SNR:   |λ_min|^2 / λ_max,   where λ_min, λ_max are the eigenvalues of PQ of smallest and largest magnitude
    = (a-b)^2 / (2(a+b))        for 2 symmetric communities
    = (a-b)^2 / (k(a+(k-1)b))   for k symmetric communities

[Figure: for a node in the 2-SBM, the numbers of within- and across-community neighbors are Bin(n/2, a/n) and Bin(n/2, b/n)]

Theorem (informal). In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges.   [Abbe-Sandon ’15]

37
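The eigenvalue-based SNR is easy to compute; a short sketch (the helper name and the sanity-check parameters are mine) comparing it with the closed form for k symmetric communities:

```python
import numpy as np

def sbm_snr(p, Q):
    """Proposed SNR |lambda_min|^2 / lambda_max for the eigenvalues of PQ
    (smallest and largest in magnitude)."""
    PQ = np.diag(np.asarray(p, float)) @ np.asarray(Q, float)
    lam = np.abs(np.linalg.eigvals(PQ))
    return lam.min() ** 2 / lam.max()

# Sanity check against the symmetric formula (a-b)^2 / (k*(a+(k-1)*b)):
k, a, b = 3, 8.0, 2.0
p = np.ones(k) / k
Q = np.full((k, k), b) + (a - b) * np.eye(k)
print(sbm_snr(p, Q), (a - b) ** 2 / (k * (a + (k - 1) * b)))
```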

Page 38: ISIT Tutorial Information theory and machine learning Part II

A node neighborhood in SBM(n, p, Q/n)

[Figure: a node v with σ_v = i has an expected neighbor profile (p_1 Q_{i1}, ..., p_k Q_{ik}) = (PQ)_i across the communities of sizes np_1, ..., np_k; its depth-r neighborhood N_r(v) grows according to ((PQ)^r)_i]

Sphere-comparison

[Abbe-Sandon ’15] 38

Page 39: ISIT Tutorial Information theory and machine learning Part II

Sphere-comparison. Take now two nodes v and v' with σ_v = i and σ_{v'} = j, and compare them via the overlap of their depth-r and depth-r' neighborhoods:

    |N_r(v) ∩ N_{r'}(v')|     <- hard to analyze...

[Abbe-Sandon ’15] 39

Page 40: ISIT Tutorial Information theory and machine learning Part II

Sphere-comparison. Decorrelate: subsample G with probability c to get an edge set E, and compare v and v' (σ_v = i, σ_{v'} = j) via the number of crossing edges

    N_{r,r'[E]}(v · v')  ≈  N_{r[G\E]}(v) · (cQ/n) · N_{r'[G\E]}(v')
                         ≈  ((1-c)PQ)^r e_{σ_v} · (cQ/n) · ((1-c)PQ)^{r'} e_{σ_{v'}}
                         =  c (1-c)^{r+r'} e_{σ_v} · Q (PQ)^{r+r'} e_{σ_{v'}} / n

Additional steps:
1. look at several depths -> Vandermonde system
2. use “anchor nodes”

[Abbe-Sandon ’15] 40
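A rough illustration of the crossing-edge statistic (not the actual Abbe-Sandon implementation, which adds the Vandermonde and anchor-node steps); it assumes networkx and counts edges of a subsampled set E between the two spheres grown in G minus E:

```python
import random
import networkx as nx

def sphere(G, v, r):
    """Nodes at distance exactly r from v."""
    return {u for u, d in nx.single_source_shortest_path_length(G, v, cutoff=r).items() if d == r}

def crossing_edges(G, E_sub, v, vp, r, rp):
    """Grow spheres around v and v' in G \ E_sub, then count E_sub edges crossing between them."""
    H = G.copy()
    H.remove_edges_from(E_sub)
    Sv, Svp = sphere(H, v, r), sphere(H, vp, rp)
    return sum(1 for (x, y) in E_sub if (x in Sv and y in Svp) or (x in Svp and y in Sv))

# Toy usage on an Erdos-Renyi graph (stand-in for a sparse SBM sample):
G = nx.erdos_renyi_graph(300, 0.03, seed=0)
rnd = random.Random(0)
E_sub = [e for e in G.edges() if rnd.random() < 0.2]   # crude subsample with c ~ 0.2
print(crossing_edges(G, E_sub, 0, 1, r=2, rp=2))
```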

Page 41: ISIT Tutorial Information theory and machine learning Part II

A real data example

41

Page 42: ISIT Tutorial Information theory and machine learning Part II

our algorithm: ~60 errors out of 1222 blogs

The political blogs network

42

[Adamic and Glance ’05]

1222 blogs (left- and right-leaning); edge = hyperlink between blogs.

We can recover 95% of the nodes correctly. The CH-divergence is close to 1.

Page 43: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

43

Page 44: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM: a. Recovery

[Abbe-Sandon ’15] should extend to k = o(log(n))
[Chen-Xu ’14] k, p, q scale with n polynomially

Growing number of communities? Sub-linear communities?

44

Page 45: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees

[Mossel-Neeman-Sly ’13] Converse for detection in the 2-SBM with p = a/n, q = b/n:
the local neighborhood of a node is a Galton-Watson tree with offspring Poisson((a+b)/2), on which the community label is broadcast from the root X_r through a BSC(b/(a+b)).

[Figure: bits broadcast down the tree from the root; observed leaf values 1, 1, 0]

If (a+b)/2 ≤ 1 the tree dies w.p. 1; if (a+b)/2 > 1 the tree survives w.p. > 0.

Unorthodox broadcasting problem: when can we detect the root bit?

If and only if c > 1/(1-2ε)^2   [Evans-Kenyon-Peres-Schulman ’00]

With c = (a+b)/2 and ε = b/(a+b):   c > 1/(1-2ε)^2  ⇔  (a-b)^2/(2(a+b)) > 1

45
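A tiny simulation sketch of this broadcasting problem, under the 2-SBM local approximation; the function name, seed, and the values of a, b are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def broadcast_on_tree(depth, c, eps):
    """Galton-Watson tree with Poisson(c) offspring; each child copies its parent's bit
    flipped with probability eps (one BSC(eps) per edge). Returns (root bit, leaf bits)."""
    root = int(rng.integers(2))
    level = [root]
    for _ in range(depth):
        nxt = []
        for bit in level:
            for _ in range(rng.poisson(c)):
                nxt.append(bit ^ int(rng.random() < eps))
        level = nxt
    return root, level

# 2-SBM local approximation (p = a/n, q = b/n): c = (a+b)/2, eps = b/(a+b).
a, b = 6.0, 1.0
c, eps = (a + b) / 2, b / (a + b)
root, leaves = broadcast_on_tree(depth=6, c=c, eps=eps)
print("EKPS condition c*(1-2*eps)**2 > 1:", c * (1 - 2 * eps) ** 2 > 1)
```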

Page 46: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees

For k clusters?   SNR = (a-b)^2 / (k(a+(k-1)b))

[Figure: SNR axis with three regions, impossible below c_k, possible between c_k and 1, efficient above 1]   [Decelle-Krzakala-Zdeborova-Moore ’11]

Conjecture. For the symmetric k-SBM(n, a, b), there exists c_k s.t.
(1) if SNR < c_k, then detection cannot be solved,
(2) if c_k < SNR < 1, then detection can be solved information-theoretically but not efficiently,
(3) if SNR > 1, then detection can be solved efficiently.
Moreover c_k = 1 for k ∈ {2, 3, 4} and c_k < 1 for k ≥ 5.

46

Page 47: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees
c. Partial recovery and the SNR-distortion curve

[Figure: SNR-distortion curve, distortion δ = 1 - α vs. SNR, starting at 1/2; curves shown for k = 2 and k ≥ 5]

Conjecture. For the symmetric k-SBM(n, a, b) and α ∈ (1/k, 1), there exist two thresholds (an information-theoretic one and an efficient one) s.t. partial recovery of accuracy α is solvable if and only if the SNR exceeds the first, and efficiently solvable iff the SNR exceeds the second.

47

Page 48: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models
- Censored block models
- Labelled block models

2-CBM(n, p, ε)   [Abbe-Bandeira-Bracher-Singer ’14]
[Figure: two communities of size n/2; each edge of an Erdős–Rényi(n, p) graph carries a 0/1 label]

Related: “correlation clustering” [Bansal, Blum, Chawla ’04]; “LDGM codes” [Kumar, Pakzad, Salavati, Shokrollahi ’12]; “labelled block model” [Heimlicher, Lelarge, Massoulié ’12]; “soft CSPs” [Abbe, Montanari ’13]; “pairwise measurements” [Chen, Suh, Goldsmith ’14 and ’15]; “bounded-size correlation clustering” [Puleo, Milenkovic ’14]

2-CBM: [Abbe-Bandeira-Bracher-Singer, Xu-Lelarge-Massoulié, Saad-Krzakala-Lelarge-Zdeborová, Chin-Rao-Vu, Hajek-Wu-Xu]

48

Page 49: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models
- Censored block models
- Labelled block models
- Degree-corrected block models [Karrer-Newman ’11]
- Mixed-membership block models [Airoldi-Blei-Fienberg-Xing ’08]
- Overlapping block models [Abbe-Sandon ’15]

OSBM(n, p, f): each node u draws X_u ~ p on {0,1}^s, and (u, v) is connected with probability f(X_u, X_v), where f : {0,1}^s × {0,1}^s → [0,1].

49

Page 50: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (continued from above):
- Planted community [Deshpande-Montanari ’14, Montanari ’15]

50

Page 51: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (continued from above):
- Planted community
- Hypergraph models

51

Page 52: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (all of the above)

For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps?

Efficient algorithms?

52

Page 53: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

III. Beyond block models:
a. Exchangeable random arrays and graphons
b. Graphical channels
c. Low-rank matrix recovery

A general information-theoretic model?

Graphons: w : [0, 1]^2 → [0, 1],   P(E_ij = 1 | X_i = x_i, X_j = x_j) = w(x_i, x_j)

[Figure: draw u_i, u_j ~ Uniform[0,1] and connect i, j with probability w(u_i, u_j); sample graphs G_1, ..., G_{2T}]

Figure 1: [Left] Given a graphon w : [0, 1]2 → [0, 1], we draw i.i.d. samples ui, uj fromUniform[0,1] and assign Gt[i, j] = 1 with probability w(ui, uj), for t = 1, . . . , 2T . [Middle]Heat map of a graphon w. [Right] A random graph generated by the graphon shown in the middle.Rows and columns of the graph are ordered by increasing ui, instead of i for better visualization.

Wolfe [9] studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT [8]. However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours [28, 24].

The proposed approximation procedure requires w to be piecewise Lipschitz. The basic idea is to approximate w by a two-dimensional step function ŵ with diminishing intervals as n increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as the idea of using a two-dimensional step function for approximation is equivalent to using the stochastic block models [10, 22, 13, 7, 25]. The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as the consistency is measured w.r.t. the graphon that generates the graphs.

2 Stochastic blockmodel approximation: Procedure

In this section we present the proposed SBA algorithm and discuss its basic properties.

2.1 Assumptions on graphons

We assume that w is piecewise Lipschitz, i.e., there exists a sequence of non-overlapping intervals I_k = [α_{k−1}, α_k] defined by 0 = α_0 < . . . < α_K = 1, and a constant L > 0 such that, for any (x_1, y_1) and (x_2, y_2) ∈ I_ij = I_i × I_j,

|w(x1, y1)− w(x2, y2)| ≤ L (|x1 − x2|+ |y1 − y2|) .

For generality we assume w to be asymmetric, i.e., w(u, v) ≠ w(v, u), so that symmetric graphons can be considered as a special case. Consequently, a random graph G(n, w) generated by w is directed, i.e., G[i, j] ≠ G[j, i].

2.2 Similarity of graphon slices

The intuition of the proposed SBA algorithm is that if the graphon is smooth, neighboring cross-sections of the graphon should be similar. In other words, if two labels u_i and u_j are close, i.e., |u_i − u_j| ≈ 0, then the difference between the row slices |w(u_i, ·) − w(u_j, ·)| and the column slices |w(·, u_i) − w(·, u_j)| should also be small. To measure the similarity between two labels using the graphon slices, we define the following distance

d_ij = (1/2) [ ∫_0^1 ( w(x, u_i) − w(x, u_j) )^2 dx  +  ∫_0^1 ( w(u_i, y) − w(u_j, y) )^2 dy ].   (2)

[Lovasz] SBMs can approximate graphons [Choi-Wolfe, Airoldi-Costa-Chan]

53
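A minimal sketch of the graphon sampling step described above (the example graphon and function name are hypothetical; an SBM corresponds to w being a step function):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graphon(w, n):
    """Sample a graph from a graphon w : [0,1]^2 -> [0,1]:
    draw u_i ~ Uniform[0,1] i.i.d. and set G[i, j] = 1 with probability w(u_i, u_j)."""
    u = rng.random(n)
    P = w(u[:, None], u[None, :])                 # n x n matrix of edge probabilities
    G = (rng.random((n, n)) < P).astype(int)      # directed in general, as in the excerpt above
    np.fill_diagonal(G, 0)
    return u, G

# Example: a smooth symmetric graphon.
w = lambda x, y: 0.8 * np.exp(-3.0 * (x - y) ** 2)
u, G = sample_graphon(w, 200)
```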

Page 54: ISIT Tutorial Information theory and machine learning Part II

Graphical channels   [Abbe-Montanari ’13]

A family of channels motivated by inference-on-graphs problems:
• G = (V, E) a k-hypergraph
• Q : X^k → Y a channel (kernel)

[Figure: hidden node variables X_1, ..., X_n; one output Y_e per (hyper)edge of G, each produced by the kernel Q]

For x ∈ X^V, y ∈ Y^E:

    P(y|x) = ∏_{e ∈ E(G)} Q(y_e | x[e])

How much information do graphical channels carry?

54

Theorem. lim_{n→∞} (1/n) I(X^n; G) exists for Erdős–Rényi graphs and some kernels.

-> why not always? what is the limit?
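A short sketch of sampling from a graphical channel with k = 2 (pairwise edges); the SBM-like kernel below is an illustrative assumption, not the only choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphical_channel(x, edges, Q_sample):
    """One observation per (hyper)edge e, drawn from the kernel Q(.|x[e])."""
    return {e: Q_sample(tuple(x[v] for v in e)) for e in edges}

# Example kernel: for an edge (u, v), output Bernoulli(p) if x_u == x_v, else Bernoulli(q).
p, q = 0.6, 0.1
Q_sample = lambda xe: int(rng.random() < (p if xe[0] == xe[1] else q))

x = rng.integers(2, size=6)                                  # hidden node labels
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]  # complete graph on 6 nodes
y = graphical_channel(x, edges, Q_sample)
```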

Page 55: ISIT Tutorial Information theory and machine learning Part II

Connection: sparse PCA and clustering

55

Page 56: ISIT Tutorial Information theory and machine learning Part II

SBM and low-rank Gaussian model

Spiked Wigner model:

Y_λ = √(λ/n) X X^T + Z,   Z Gaussian symmetric (Wigner),   i.e., Y_ij = c X_i X_j + Z_ij

• X = (X_1, ..., X_n) i.i.d. Bernoulli(ε)  ->  sparse PCA   [Amini-Wainwright ’09, Deshpande-Montanari ’14]
• X = (X_1, ..., X_n) i.i.d. Rademacher(1/2)  ->  SBM???

    lim_{n→∞} (1/n) I(X; G(n, p_n, q_n))   ?=   lim_{n→∞} (1/n) I(X; Y_λ)

Theorem. If λ_n = n(p_n - q_n)^2 / (2(p_n + q_n)) → λ (finite SNR), with n p_n, n q_n → ∞ (large degrees), then

    lim_{n→∞} (1/n) I(X; G(n, p_n, q_n))  =  lim_{n→∞} (1/n) I(X; Y_λ)  =  λ/4 + λ_*^2/(4λ) - λ_*/2 + I(λ_*),

where λ_* solves λ_* = λ(1 - MMSE(λ_*)) for the single-letter channel Y_1(λ) = √λ X_1 + Z_1, via I-MMSE [Guo-Shamai-Verdú].

[Deshpande-Abbe-Montanari ’15]

56
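A minimal sketch of the spiked Wigner model with a Rademacher spike (the SBM-like case), together with the natural PCA estimate; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_wigner(x, lam):
    """Y = sqrt(lam/n) * x x^T + Z, with Z a symmetric Gaussian (Wigner) noise matrix."""
    n = len(x)
    Z = rng.normal(size=(n, n))
    Z = (Z + Z.T) / np.sqrt(2)                    # symmetrize the noise
    return np.sqrt(lam / n) * np.outer(x, x) + Z

n, lam = 500, 4.0
x = rng.choice([-1, 1], size=n)                   # Rademacher(1/2) spike
Y = spiked_wigner(x, lam)
_, vecs = np.linalg.eigh(Y)
x_hat = np.sign(vecs[:, -1])                      # PCA estimate from the top eigenvector
print("overlap:", abs(x_hat @ x) / n)
```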

Page 57: ISIT Tutorial Information theory and machine learning Part II

Conclusion

Community detection couples naturally with the channel view of information theory and more specifically with:

- graph-based codes
- f-divergences
- broadcasting problems
- I-MMSE
- ...
(unorthodox versions of these...)

More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.

57

Page 58: ISIT Tutorial Information theory and machine learning Part II

Documents related to the tutorial: www.princeton.edu/~eabbe

Questions?

58