
Page 1: ISIT Tutorial Information theory and machine learning Part II

Emmanuel AbbePrinceton University

ISIT TutorialInformation theory and machine learning

Part II

1

Martin WainwrightUC Berkeley

Page 2: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

A large variety of machine learning and data-mining problems are about inferring global properties on a collection of agents by observing local noisy interactions of these agents.

Examples:
- community detection in social networks

2

Page 3: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation

3

Page 4: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval

4

Page 5: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
- ...

5

Page 6: ISIT Tutorial Information theory and machine learning Part II

Inverse problems on graphs

In each case: observe information on the edges of a network that has been generated from hidden attributes on the nodes, and try to infer back these attributes.

Dual to the graphical model learning problem (previous part).

6

Page 7: ISIT Tutorial Information theory and machine learning Part II

[Figure: channel coding diagram; message X_1, ..., X_n encoded by a code C, sent through a channel W used N times, outputs Y_1, ..., Y_N]

What about graph-based codes?

Different: the code is a design parameter and typically uses specific non-local interactions of the bits (e.g., random, LDPC, polar codes).

(1) What are the relevant types of “codes” and “channels” behind machine learning problems?
(2) What are the fundamental limits for these?

7

Page 8: ISIT Tutorial Information theory and machine learning Part II

Outline of the talk

1. Community detection and clustering
2. Stochastic block models: fundamental limits and capacity-achieving algorithms
3. Open problems
4. Graphical channels and low-rank matrix recovery

8

Page 9: ISIT Tutorial Information theory and machine learning Part II

Community detection and clustering

9

Page 10: ISIT Tutorial Information theory and machine learning Part II

- social networks: “friendship”
- biological networks: “protein interactions”
- genome HiC networks: “DNA contacts”
- call graphs: “calls”

Networks provide local interactions among agents

10

Page 11: ISIT Tutorial Information theory and machine learning Part II


Networks provide local interactions among agents; one often wants to infer global similarity classes.

11

Page 12: ISIT Tutorial Information theory and machine learning Part II

Tutorial motto: Can one establish a clear line-of-sight, as in communications with the Shannon capacity?

[Figure: peak data efficiency vs. SNR for WAN wireless technologies (EV-DO, HSDPA, 802.16), plotted against the Shannon capacity and the infeasible region]

The challenges of community detection

A long-studied and notoriously hard problem:
- what is a good clustering? (assortative and disassortative relations) -> work with models
- how to get a good clustering? (computationally hard, many heuristics...)

12

Page 13: ISIT Tutorial Information theory and machine learning Part II

The Stochastic Block Model

13

Page 14: ISIT Tutorial Information theory and machine learning Part II

The stochastic block model

SBM(n, p, W)

p = (p_1, ..., p_k)  <- probability vector = relative sizes of the communities;  P = diag(p)
W  <- symmetric k x k matrix with entries in [0,1] = probabilities of connecting across communities:

    W = \begin{pmatrix} W_{11} & \cdots & W_{1k} \\ \vdots & \ddots & \vdots \\ W_{k1} & \cdots & W_{kk} \end{pmatrix}

[Figure: k = 4 communities of sizes np_1, ..., np_4, with connection probabilities W_{ij} within and across communities]

The DMC of clustering..?

14
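To make the model concrete, here is a minimal Python sketch of how one might sample a graph from SBM(n, p, W); the function name and the toy parameters are illustrative choices, not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, p, W):
    """Draw a graph from SBM(n, p, W): community labels i.i.d. from p, and an edge
    between u and v placed independently with probability W[label_u, label_v]."""
    p, W = np.asarray(p, float), np.asarray(W, float)
    labels = rng.choice(len(p), size=n, p=p)
    probs = W[labels][:, labels]                    # n x n matrix of pairwise edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, 1)  # sample only the upper triangle
    A = upper | upper.T                             # symmetric adjacency, no self-loops
    return labels, A

# Toy example with k = 4 communities (hypothetical parameters):
labels, A = sample_sbm(500, [0.4, 0.3, 0.2, 0.1],
                       np.full((4, 4), 0.01) + 0.04 * np.eye(4))
```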

Page 15: ISIT Tutorial Information theory and machine learning Part II

The (exact) recovery problem

We will see weaker recovery requirements later

Starting point: progress in science often comes from understanding special cases...

15

Let X^n = [X_1, ..., X_n] represent the community variables of the nodes (drawn under p).

Definition. An algorithm X̂^n(·) solves (exact) recovery in the SBM if, for a random graph G under the model, lim_{n→∞} P(X^n = X̂^n(G)) = 1.

Page 16: ISIT Tutorial Information theory and machine learning Part II

SBM with 2 symmetric communities: 2-SBM

16

Page 17: ISIT Tutorial Information theory and machine learning Part II

2-SBM

p_1 = p_2 = 1/2,   W_{11} = W_{22} = p,   W_{12} = q

[Figure: two communities of size n/2 each, with intra-community edge probability p and inter-community edge probability q]

17

Page 18: ISIT Tutorial Information theory and machine learning Part II

Some history for the 2-SBM

Recovery problem

[Timeline figure, 1983 to 2014: Holland-Laskey-Leinhardt; Bui-Chaudhuri-Leighton-Sipser; Boppana; Dyer-Frieze; Snijders-Nowicki; Jerrum-Sorkin; Condon-Karp; Carson-Impagliazzo; McSherry; Bickel-Chen; Rohe-Chatterjee-Yu]

Bui, Chaudhuri, Leighton, Sipser ’84 | maxflow-mincut | p = Ω(1/n), q = o(n^{-1-4/((p+q)n)})
Boppana ’87 | spectral methods | (p-q)/√(p+q) = Ω(√(log(n)/n))
Dyer, Frieze ’89 | min-cut via degrees | p-q = Ω(1)
Snijders, Nowicki ’97 | EM algorithm | p-q = Ω(1)
Jerrum, Sorkin ’98 | Metropolis algorithm | p-q = Ω(n^{-1/6+ε})
Condon, Karp ’99 | augmentation algorithm | p-q = Ω(n^{-1/2+ε})
Carson, Impagliazzo ’01 | hill-climbing algorithm | p-q = Ω(n^{-1/2} log^4(n))
McSherry ’01 | spectral methods | (p-q)/√p ≥ Ω(√(log(n)/n))
Bickel, Chen ’09 | N-G modularity | (p-q)/√(p+q) = Ω(log(n)/√n)
Rohe, Chatterjee, Yu ’11 | spectral methods | p-q = Ω(1)

Instead of ‘how’: when can we recover the clusters (information-theoretically)?

P(X̂^n = X^n) → 1

18

algorithm-driven...

Page 19: ISIT Tutorial Information theory and machine learning Part II

Information-theoretic view of clustering (über alles)

[Figure: channel coding (message X_1, ..., X_n, code C, channel W used N times, outputs Y_1, ..., Y_N) vs. the 2-SBM (community labels X_1, ..., X_n and N = \binom{n}{2} pairwise observations through W)]

Channel coding (BSC):  W = \begin{pmatrix} 1-ε & ε \\ ε & 1-ε \end{pmatrix},  rate R = n/N:  reliable comm. iff R < 1 - H(ε).

2-SBM:  W = \begin{pmatrix} 1-p & p \\ 1-q & q \end{pmatrix},  N = \binom{n}{2},  rate R = 2/n:  reliable comm. iff ???

exact recovery  ⇔  reliable communication, with an unorthodox code!

19

Page 20: ISIT Tutorial Information theory and machine learning Part II

Recovery problem

[Timeline figure, 1983 to 2014, as on the previous slide]

[Figure: (a, b) phase diagram showing the infeasible and efficiently achievable regions]


Some history for the 2-SBM

p = a log(n)/n,  q = b log(n)/n:   Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

20

P(X̂^n = X^n) → 1

Page 21: ISIT Tutorial Information theory and machine learning Part II

Recovery problem

[Timeline figure, 1983 to 2014: the recovery results above, together with detection results by Coja-Oghlan, Decelle-Krzakala-Moore-Zdeborová, Massoulié, Mossel-Neeman-Sly]

Detection problem

p = a/n,  q = b/n:   Detection iff (a-b)^2 > 2(a+b)   [Mossel-Neeman-Sly]

[Figure: (a, b) phase diagram showing the efficiently achievable region]


Detection:  ∃ε > 0 : P( d(X̂^n, X^n)/n < 1/2 - ε ) → 1

Some history for the 2-SBM

Recovery (P(X̂^n = X^n) → 1), with p = a log(n)/n, q = b log(n)/n:   Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

What about multiple/asymmetric communities? Conjecture: the detection threshold changes with 5 or more communities.

21
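The two 2-SBM thresholds just stated are easy to evaluate numerically; a minimal sketch (function names and the example values of a, b are my own, not from the tutorial):

```python
import numpy as np

def exact_recovery_possible(a, b):
    """2-SBM with p = a*log(n)/n, q = b*log(n)/n: exact recovery iff (a+b)/2 >= 1 + sqrt(a*b),
    i.e., (sqrt(a) - sqrt(b))**2 >= 2  [Abbe-Bandeira-Hall]."""
    return (a + b) / 2 >= 1 + np.sqrt(a * b)

def detection_possible(a, b):
    """2-SBM with p = a/n, q = b/n: detection iff (a - b)**2 > 2*(a + b)  [Mossel-Neeman-Sly]."""
    return (a - b) ** 2 > 2 * (a + b)

print(exact_recovery_possible(9, 1), detection_possible(9, 1))   # True, True
print(exact_recovery_possible(4, 1), detection_possible(4, 1))   # False, False
```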

Page 22: ISIT Tutorial Information theory and machine learning Part II

Recovery in the 2-SBM: IT limit   [Abbe-Bandeira-Hall ’14]

p = a log(n)/n,  q = b log(n)/n

[Figure: two communities of size n/2; for a node i, B_i = number of neighbors inside its own community (edges w.p. p), R_i = number of neighbors across (edges w.p. q)]

Converse: if (a+b)/2 - √(ab) < 1 then ML fails w.h.p.

What is ML? -> min-bisection. ML fails if two nodes can be swapped to reduce the cut.

P(B_1 ≤ R_1) = n^{-((a+b)/2 - √(ab)) + o(1)}
P(∃i : B_i ≤ R_i) = ?  ≍  n · P(B_1 ≤ R_1)   (weak correlations)

22

Page 23: ISIT Tutorial Information theory and machine learning Part II

Recovery in the 2-SBM: efficient algorithms   [Abbe-Bandeira-Hall ’14]

[Figure: two communities of size n/2 (labels ±1), intra-community probability p, inter-community probability q]

P(∃X̂^n : d(X̂^n, X^n)/n ∉ [1/2 - ε, 1/2 + ε]) → 1

ML-decoding (NP-hard):
    max_x  x^T A x   s.t.  x_i = ±1,  1^T x = 0

spectral relaxation:
    max_x  x^T A x   s.t.  ||x|| = 1,  1^T x = 0

lifting X = xx^T:
    max_X  tr(AX)   s.t.  X_ii = 1,  1^T X = 0,  X ⪰ 0,  rank(X) = 1

SDP: drop the rank constraint.

23
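A minimal sketch of the SDP relaxation above, written with cvxpy (the solver library is my choice here, not something the tutorial prescribes); it drops the rank constraint and rounds with the top eigenvector of the optimal X:

```python
import numpy as np
import cvxpy as cp

def sdp_cluster(A):
    """SDP relaxation sketch for 2-SBM recovery: max tr(AX) s.t. X_ii = 1, X 1 = 0, X PSD.
    Intended for small n only; SDP solvers are expensive (as noted on the next slide)."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1, cp.sum(X, axis=1) == 0]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    _, vecs = np.linalg.eigh(X.value)
    return np.sign(vecs[:, -1])        # ±1 community estimate from the top eigenvector
```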

Page 24: ISIT Tutorial Information theory and machine learning Part II

Note that SDP can be expensive...

-> Analyze the spectral norm of a random matrix:
[Abbe-Bandeira-Hall ’14] Bernstein: slightly loose
[Xu-Hajek-Wu ’15] Seginer bound
[Bandeira, Bandeira-Van Handel ’15] tight bound

Recovery in the 2-SBM: efficient algorithms

Abbe-Bandeira-Hall ’14 24

Theorem. The SDP solves recovery if 2 L_SBM + 11^T + I_n ⪰ 0, where L_SBM = D_{G^+} - D_{G^-} - A.

Page 25: ISIT Tutorial Information theory and machine learning Part II

The general SBM

25

Page 26: ISIT Tutorial Information theory and machine learning Part II

Quiz: If a node is in community i, how many neighbors does it have in expectation in community j?

1. n p_j        2. n p_j W_{ij}        3. n p_i W_{ij}        4. 7i

[Figure: SBM(n, p, W) with k = 4 communities of sizes np_1, ..., np_4 and connectivities W_{ij}, as before]

“degree profile matrix”:   n P W

26
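A quick Monte Carlo sanity check of the quiz answer (n p_j W_ij, i.e., entries of the degree profile matrix n P W): the parameters below are hypothetical toy values, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 2000, 3
p = np.array([0.5, 0.3, 0.2])
W = np.full((k, k), 0.005) + 0.025 * np.eye(k)

labels = rng.choice(k, size=n, p=p)
A = rng.random((n, n)) < W[labels][:, labels]     # edge indicators, pair (u, v) w.p. W[label_u, label_v]
A = np.triu(A, 1)
A = A | A.T                                        # symmetric adjacency, no self-loops

i, j = 0, 1
nodes_i = np.where(labels == i)[0]
empirical = A[np.ix_(nodes_i, labels == j)].sum() / len(nodes_i)
print(empirical, n * p[j] * W[i, j])               # both close to the (i, j) entry of n P W
```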

Page 27: ISIT Tutorial Information theory and machine learning Part II

Back to the information-theoretic view of clustering

[Figure: channel coding with code C and channel W (rate R = n/N) vs. the SBM with N = \binom{n}{2} pairwise observations through W (rate R = 2/n)]

Channel coding:  reliable comm. iff R < 1 - H(ε);  in general, reliable comm. iff R < max_p I(p, W)  (a KL-divergence).

2-SBM:  exact recovery iff 1 < (a+b)/2 - √(ab).

General SBM:  exact recovery iff 1 < J(p, W) ???

27

Page 28: ISIT Tutorial Information theory and machine learning Part II

Main results

[Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

    J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1,

where

    D_+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} ( t μ_ℓ + (1-t) ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} )  =  max_{t∈[0,1]} D_t(μ, ν).

We call D_+ the CH-divergence.

• D_{1/2}(μ, ν) = (1/2) ||√μ - √ν||_2^2 is the Hellinger divergence (distance)
• max_t ( -log Σ_i μ_i^t ν_i^{1-t} ) is the Chernoff divergence
• D_t is an f-divergence: Σ_i ν_i f(μ_i/ν_i) with f(x) = 1 - t + tx - x^t

For two symmetric communities this reduces to (1/2)(√a - √b)^2 ≥ 1.   [Abbe-Bandeira-Hall ’14]

28
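A small numerical sketch of the quantities in Theorem 1: a grid search over t for D_+ and the resulting J(p, Q). Function names are mine; the code assumes strictly positive profile entries.

```python
import numpy as np

def ch_divergence(mu, nu, grid=2001):
    """CH-divergence D_+(mu, nu) = max_{t in [0,1]} sum_l [t*mu_l + (1-t)*nu_l - mu_l^t * nu_l^(1-t)]."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    ts = np.linspace(0.0, 1.0, grid)[:, None]
    vals = (ts * mu + (1 - ts) * nu - mu**ts * nu**(1 - ts)).sum(axis=1)
    return vals.max()

def J(p, Q):
    """J(p, Q) = min_{i<j} D_+((PQ)_i, (PQ)_j); Theorem 1: recovery iff J(p, Q) >= 1."""
    p, Q = np.asarray(p, float), np.asarray(Q, float)
    profiles = Q @ np.diag(p)      # row i = (p_j * Q_ij)_j, the degree profile of community i
    k = len(p)
    return min(ch_divergence(profiles[i], profiles[j])
               for i in range(k) for j in range(i + 1, k))

# Two symmetric communities: J = (a+b)/2 - sqrt(a*b), the Abbe-Bandeira-Hall threshold.
a, b = 9.0, 1.0
print(J([0.5, 0.5], [[a, b], [b, a]]), (a + b) / 2 - np.sqrt(a * b))
```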

Page 29: ISIT Tutorial Information theory and machine learning Part II

Main results

[Abbe-Sandon ’15]

Theorem 1. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

    J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1,   where D_+(μ, ν) := max_{t∈[0,1]} Σ_{ℓ∈[k]} ( t μ_ℓ + (1-t) ν_ℓ - μ_ℓ^t ν_ℓ^{1-t} ).

Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES!

Theorem 2. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time.

29

Page 30: ISIT Tutorial Information theory and machine learning Part II


[Abbe-Sandon ’15]

When can we extract a specific community?

30

Theorem. If community i has a profile (PQ)_i at D_+-distance at least 1 from all other profiles (PQ)_j, j ≠ i, then it can be extracted w.h.p.

What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon ’15] (second paper)

Page 31: ISIT Tutorial Information theory and machine learning Part II

Proof techniques and algorithms

31

Page 32: ISIT Tutorial Information theory and machine learning Part II

A key step

Classify a node v by its degree profile d_v (number of neighbors in each of the k communities):

    Hypothesis 1: d_v ~ Poisson(log(n) (PQ)_1)    vs.    Hypothesis 2: d_v ~ Poisson(log(n) (PQ)_2)

Theorem. For any θ_1, θ_2 ∈ (R_+ \ {0})^k with θ_1 ≠ θ_2 and p_1, p_2 ∈ R_+ \ {0},

    Σ_{x ∈ Z_+^k} min( P_{ln(n)θ_1}(x) p_1, P_{ln(n)θ_2}(x) p_2 ) = Θ( n^{-D_+(θ_1, θ_2) - o(1)} ),

where D_+ is the CH-divergence and P_λ denotes the multivariate Poisson(λ) distribution.

[Abbe-Sandon ’15] 32

Page 33: ISIT Tutorial Information theory and machine learning Part II

Plan: put effort into recovering most of the nodes, then finish greedily with local improvements.

How to use this?

[Abbe-Sandon ’15] 33

Page 34: ISIT Tutorial Information theory and machine learning Part II

The degree-profiling algorithm   (capacity-achieving)

(1) Split G into two graphs: G' with loglog-degree and G'' with log-degree.
(2) Run Sphere-comparison on G'  ->  gets a fraction 1 - o(1) of the nodes right (see next).
(3) Take now G'' together with the clustering of G': re-classify each node by its degree profile d_v, testing
    Hypothesis 1: d_v ~ Poisson(log(n) (PQ)_1)   vs.   Hypothesis 2: d_v ~ Poisson(log(n) (PQ)_2),
    with error probability P_e = n^{-D_+((PQ)_1,(PQ)_2)+o(1)}.

[Abbe-Sandon ’15] 34
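A stripped-down sketch of the Poisson hypothesis test used in step (3): pick the community whose expected profile log(n)·(PQ)_i best explains the observed degree vector. This is an illustration of the test, not the Abbe-Sandon implementation; names and inputs are assumptions.

```python
import numpy as np

def classify_by_degree_profile(d_v, PQ, log_n):
    """Maximum-likelihood Poisson test: d_v[j] = number of neighbors of v currently classified
    in community j; PQ[i] is the profile (PQ)_i; assumes strictly positive profile entries."""
    d_v = np.asarray(d_v, float)
    best, best_ll = None, -np.inf
    for i, prof in enumerate(np.asarray(PQ, float)):
        lam = log_n * prof
        ll = np.sum(d_v * np.log(lam) - lam)   # Poisson log-likelihood up to the d_v! term
        if ll > best_ll:
            best, best_ll = i, ll
    return best
```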

Page 35: ISIT Tutorial Information theory and machine learning Part II

How do we get most nodes correctly?

35

Page 36: ISIT Tutorial Information theory and machine learning Part II

Other recovery requirements

Partial recovery: an algorithm solves partial recovery in the SBM with accuracy c if it produces a clustering which is correct on a fraction c of the nodes with high probability.

- Exact recovery: c = 1
- Almost exact recovery: c = 1 - o(1)
- Weak recovery or detection: c = 1/k + ε for some ε > 0 (for the symmetric k-SBM)

For all the above: what are the “efficient” vs. “information-theoretic” fundamental limits?

36

Page 37: ISIT Tutorial Information theory and machine learning Part II

Partial recovery in SBM(n, p, Q/n)

What is a good notion of SNR? Note that the SNR scales if Q scales!

Proposed notion of SNR:   |λ_min|^2 / λ_max,   where λ_min, λ_max are the eigenvalues of PQ of smallest and largest magnitude
    = (a-b)^2 / (2(a+b))        for 2 symmetric communities
    = (a-b)^2 / (k(a+(k-1)b))   for k symmetric communities

[Figure: for a node in the 2-SBM, the numbers of within- and across-community neighbors are Bin(n/2, a/n) and Bin(n/2, b/n)]

Theorem (informal). In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges.   [Abbe-Sandon ’15]

37
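The eigenvalue-based SNR is easy to compute; a short sketch (the helper name and the sanity-check parameters are mine) comparing it with the closed form for k symmetric communities:

```python
import numpy as np

def sbm_snr(p, Q):
    """Proposed SNR |lambda_min|^2 / lambda_max for the eigenvalues of PQ
    (smallest and largest in magnitude)."""
    PQ = np.diag(np.asarray(p, float)) @ np.asarray(Q, float)
    lam = np.abs(np.linalg.eigvals(PQ))
    return lam.min() ** 2 / lam.max()

# Sanity check against the symmetric formula (a-b)^2 / (k*(a+(k-1)*b)):
k, a, b = 3, 8.0, 2.0
p = np.ones(k) / k
Q = np.full((k, k), b) + (a - b) * np.eye(k)
print(sbm_snr(p, Q), (a - b) ** 2 / (k * (a + (k - 1) * b)))
```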

Page 38: ISIT Tutorial Information theory and machine learning Part II

A node neighborhood in SBM(n, p, Q/n)

[Figure: a node v with σ_v = i has an expected neighbor profile (p_1 Q_{i1}, ..., p_k Q_{ik}) = (PQ)_i across the communities of sizes np_1, ..., np_k; its depth-r neighborhood N_r(v) grows according to ((PQ)^r)_i]

Sphere-comparison

[Abbe-Sandon ’15] 38

Page 39: ISIT Tutorial Information theory and machine learning Part II

Sphere-comparison. Take now two nodes v and v' with σ_v = i and σ_{v'} = j, and compare them via the overlap of their depth-r and depth-r' neighborhoods:

    |N_r(v) ∩ N_{r'}(v')|     <- hard to analyze...

[Abbe-Sandon ’15] 39

Page 40: ISIT Tutorial Information theory and machine learning Part II

Sphere-comparison. Decorrelate: subsample G with probability c to get an edge set E, and compare v and v' (σ_v = i, σ_{v'} = j) via the number of crossing edges

    N_{r,r'[E]}(v · v')  ≈  N_{r[G\E]}(v) · (cQ/n) · N_{r'[G\E]}(v')
                         ≈  ((1-c)PQ)^r e_{σ_v} · (cQ/n) · ((1-c)PQ)^{r'} e_{σ_{v'}}
                         =  c (1-c)^{r+r'} e_{σ_v} · Q (PQ)^{r+r'} e_{σ_{v'}} / n

Additional steps:
1. look at several depths -> Vandermonde system
2. use “anchor nodes”

[Abbe-Sandon ’15] 40
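A rough illustration of the crossing-edge statistic (not the actual Abbe-Sandon implementation, which adds the Vandermonde and anchor-node steps); it assumes networkx and counts edges of a subsampled set E between the two spheres grown in G minus E:

```python
import random
import networkx as nx

def sphere(G, v, r):
    """Nodes at distance exactly r from v."""
    return {u for u, d in nx.single_source_shortest_path_length(G, v, cutoff=r).items() if d == r}

def crossing_edges(G, E_sub, v, vp, r, rp):
    """Grow spheres around v and v' in G \ E_sub, then count E_sub edges crossing between them."""
    H = G.copy()
    H.remove_edges_from(E_sub)
    Sv, Svp = sphere(H, v, r), sphere(H, vp, rp)
    return sum(1 for (x, y) in E_sub if (x in Sv and y in Svp) or (x in Svp and y in Sv))

# Toy usage on an Erdos-Renyi graph (stand-in for a sparse SBM sample):
G = nx.erdos_renyi_graph(300, 0.03, seed=0)
rnd = random.Random(0)
E_sub = [e for e in G.edges() if rnd.random() < 0.2]   # crude subsample with c ~ 0.2
print(crossing_edges(G, E_sub, 0, 1, r=2, rp=2))
```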

Page 41: ISIT Tutorial Information theory and machine learning Part II

A real data example

41

Page 42: ISIT Tutorial Information theory and machine learning Part II

our algorithm: ~60 errors out of 1222 blogs

The political blogs network

42

[Adamic and Glance ’05]

1222 blogs (left- and right-leaning); edge = hyperlink between blogs.

We can recover 95% of the nodes correctly. The CH-divergence is close to 1.

Page 43: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

43

Page 44: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM: a. Recovery

[Abbe-Sandon ’15] should extend to k = o(log(n))
[Chen-Xu ’14] k, p, q scale with n polynomially

Growing number of communities? Sub-linear communities?

44

Page 45: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees

[Mossel-Neeman-Sly ’13] Converse for detection in the 2-SBM with p = a/n, q = b/n:
the local neighborhood of a node is a Galton-Watson tree with offspring Poisson((a+b)/2), on which the community label is broadcast from the root X_r through a BSC(b/(a+b)).

[Figure: bits broadcast down the tree from the root; observed leaf values 1, 1, 0]

If (a+b)/2 ≤ 1 the tree dies w.p. 1; if (a+b)/2 > 1 the tree survives w.p. > 0.

Unorthodox broadcasting problem: when can we detect the root bit?

If and only if c > 1/(1-2ε)^2   [Evans-Kenyon-Peres-Schulman ’00]

With c = (a+b)/2 and ε = b/(a+b):   c > 1/(1-2ε)^2  ⇔  (a-b)^2/(2(a+b)) > 1

45
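A tiny simulation sketch of this broadcasting problem, under the 2-SBM local approximation; the function name, seed, and the values of a, b are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def broadcast_on_tree(depth, c, eps):
    """Galton-Watson tree with Poisson(c) offspring; each child copies its parent's bit
    flipped with probability eps (one BSC(eps) per edge). Returns (root bit, leaf bits)."""
    root = int(rng.integers(2))
    level = [root]
    for _ in range(depth):
        nxt = []
        for bit in level:
            for _ in range(rng.poisson(c)):
                nxt.append(bit ^ int(rng.random() < eps))
        level = nxt
    return root, level

# 2-SBM local approximation (p = a/n, q = b/n): c = (a+b)/2, eps = b/(a+b).
a, b = 6.0, 1.0
c, eps = (a + b) / 2, b / (a + b)
root, leaves = broadcast_on_tree(depth=6, c=c, eps=eps)
print("EKPS condition c*(1-2*eps)**2 > 1:", c * (1 - 2 * eps) ** 2 > 1)
```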

Page 46: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees

For k clusters?   SNR = (a-b)^2 / (k(a+(k-1)b))

[Figure: SNR axis with three regions, impossible below c_k, possible between c_k and 1, efficient above 1]   [Decelle-Krzakala-Zdeborova-Moore ’11]

Conjecture. For the symmetric k-SBM(n, a, b), there exists c_k s.t.
(1) if SNR < c_k, then detection cannot be solved,
(2) if c_k < SNR < 1, then detection can be solved information-theoretically but not efficiently,
(3) if SNR > 1, then detection can be solved efficiently.
Moreover c_k = 1 for k ∈ {2, 3, 4} and c_k < 1 for k ≥ 5.

46

Page 47: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

I. The SBM:

a. Recovery
b. Detection and broadcasting on trees
c. Partial recovery and the SNR-distortion curve

[Figure: SNR-distortion curve, distortion δ = 1 - α vs. SNR, starting at 1/2; curves shown for k = 2 and k ≥ 5]

Conjecture. For the symmetric k-SBM(n, a, b) and α ∈ (1/k, 1), there exist two thresholds (an information-theoretic one and an efficient one) s.t. partial recovery of accuracy α is solvable if and only if the SNR exceeds the first, and efficiently solvable iff the SNR exceeds the second.

47

Page 48: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models
- Censored block models
- Labelled block models

2-CBM(n, p, ε)   [Abbe-Bandeira-Bracher-Singer ’14]
[Figure: two communities of size n/2; each edge of an Erdős–Rényi(n, p) graph carries a 0/1 label]

Related: “correlation clustering” [Bansal, Blum, Chawla ’04]; “LDGM codes” [Kumar, Pakzad, Salavati, Shokrollahi ’12]; “labelled block model” [Heimlicher, Lelarge, Massoulié ’12]; “soft CSPs” [Abbe, Montanari ’13]; “pairwise measurements” [Chen, Suh, Goldsmith ’14 and ’15]; “bounded-size correlation clustering” [Puleo, Milenkovic ’14]

2-CBM: [Abbe-Bandeira-Bracher-Singer, Xu-Lelarge-Massoulié, Saad-Krzakala-Lelarge-Zdeborová, Chin-Rao-Vu, Hajek-Wu-Xu]

48

Page 49: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models
- Censored block models
- Labelled block models
- Degree-corrected block models [Karrer-Newman ’11]
- Mixed-membership block models [Airoldi-Blei-Fienberg-Xing ’08]
- Overlapping block models [Abbe-Sandon ’15]

OSBM(n, p, f): each node u draws X_u ~ p on {0,1}^s, and (u, v) is connected with probability f(X_u, X_v), where f : {0,1}^s × {0,1}^s → [0,1].

49

Page 50: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (continued from above):
- Planted community [Deshpande-Montanari ’14, Montanari ’15]

50

Page 51: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (continued from above):
- Planted community
- Hypergraph models

51

Page 52: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

II. Other block models (all of the above)

For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps?

Efficient algorithms?

52

Page 53: ISIT Tutorial Information theory and machine learning Part II

Some open problems in community detection

III. Beyond block models:
a. Exchangeable random arrays and graphons
b. Graphical channels
c. Low-rank matrix recovery

A general information-theoretic model?

Graphons: w : [0, 1]^2 → [0, 1],   P(E_ij = 1 | X_i = x_i, X_j = x_j) = w(x_i, x_j)

[Figure: draw u_i, u_j ~ Uniform[0,1] and connect i, j with probability w(u_i, u_j); sample graphs G_1, ..., G_{2T}]

Figure 1: [Left] Given a graphon w : [0, 1]2 → [0, 1], we draw i.i.d. samples ui, uj fromUniform[0,1] and assign Gt[i, j] = 1 with probability w(ui, uj), for t = 1, . . . , 2T . [Middle]Heat map of a graphon w. [Right] A random graph generated by the graphon shown in the middle.Rows and columns of the graph are ordered by increasing ui, instead of i for better visualization.

Wolfe [9] studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT [8]. However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours [28, 24].

The proposed approximation procedure requires w to be piecewise Lipschitz. The basic idea is to approximate w by a two-dimensional step function ŵ with diminishing intervals as n increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as the idea of using a two-dimensional step function for approximation is equivalent to using the stochastic block models [10, 22, 13, 7, 25]. The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as the consistency is measured w.r.t. the graphon that generates the graphs.

2 Stochastic blockmodel approximation: Procedure

In this section we present the proposed SBA algorithm and discuss its basic properties.

2.1 Assumptions on graphons

We assume that w is piecewise Lipschitz, i.e., there exists a sequence of non-overlapping intervals I_k = [α_{k−1}, α_k] defined by 0 = α_0 < . . . < α_K = 1, and a constant L > 0 such that, for any (x_1, y_1) and (x_2, y_2) ∈ I_ij = I_i × I_j,

|w(x1, y1)− w(x2, y2)| ≤ L (|x1 − x2|+ |y1 − y2|) .

For generality we assume w to be asymmetric, i.e., w(u, v) ≠ w(v, u), so that symmetric graphons can be considered as a special case. Consequently, a random graph G(n, w) generated by w is directed, i.e., G[i, j] ≠ G[j, i].

2.2 Similarity of graphon slices

The intuition of the proposed SBA algorithm is that if the graphon is smooth, neighboring cross-sections of the graphon should be similar. In other words, if two labels u_i and u_j are close, i.e., |u_i − u_j| ≈ 0, then the difference between the row slices |w(u_i, ·) − w(u_j, ·)| and the column slices |w(·, u_i) − w(·, u_j)| should also be small. To measure the similarity between two labels using the graphon slices, we define the following distance

d_ij = (1/2) [ ∫_0^1 ( w(x, u_i) − w(x, u_j) )^2 dx  +  ∫_0^1 ( w(u_i, y) − w(u_j, y) )^2 dy ].   (2)

[Lovasz] SBMs can approximate graphons [Choi-Wolfe, Airoldi-Costa-Chan]

53
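A minimal sketch of the graphon sampling step described above (the example graphon and function name are hypothetical; an SBM corresponds to w being a step function):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graphon(w, n):
    """Sample a graph from a graphon w : [0,1]^2 -> [0,1]:
    draw u_i ~ Uniform[0,1] i.i.d. and set G[i, j] = 1 with probability w(u_i, u_j)."""
    u = rng.random(n)
    P = w(u[:, None], u[None, :])                 # n x n matrix of edge probabilities
    G = (rng.random((n, n)) < P).astype(int)      # directed in general, as in the excerpt above
    np.fill_diagonal(G, 0)
    return u, G

# Example: a smooth symmetric graphon.
w = lambda x, y: 0.8 * np.exp(-3.0 * (x - y) ** 2)
u, G = sample_graphon(w, 200)
```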

Page 54: ISIT Tutorial Information theory and machine learning Part II

Graphical channels   [Abbe-Montanari ’13]

A family of channels motivated by inference-on-graphs problems:
• G = (V, E) a k-hypergraph
• Q : X^k → Y a channel (kernel)

[Figure: hidden node variables X_1, ..., X_n; one output Y_e per (hyper)edge of G, each produced by the kernel Q]

For x ∈ X^V, y ∈ Y^E:

    P(y|x) = ∏_{e ∈ E(G)} Q(y_e | x[e])

How much information do graphical channels carry?

54

Theorem. lim_{n→∞} (1/n) I(X^n; G) exists for Erdős–Rényi graphs and some kernels.

-> why not always? what is the limit?
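A short sketch of sampling from a graphical channel with k = 2 (pairwise edges); the SBM-like kernel below is an illustrative assumption, not the only choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphical_channel(x, edges, Q_sample):
    """One observation per (hyper)edge e, drawn from the kernel Q(.|x[e])."""
    return {e: Q_sample(tuple(x[v] for v in e)) for e in edges}

# Example kernel: for an edge (u, v), output Bernoulli(p) if x_u == x_v, else Bernoulli(q).
p, q = 0.6, 0.1
Q_sample = lambda xe: int(rng.random() < (p if xe[0] == xe[1] else q))

x = rng.integers(2, size=6)                                  # hidden node labels
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]  # complete graph on 6 nodes
y = graphical_channel(x, edges, Q_sample)
```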

Page 55: ISIT Tutorial Information theory and machine learning Part II

Connection: sparse PCA and clustering

55

Page 56: ISIT Tutorial Information theory and machine learning Part II

SBM and low-rank Gaussian model

Spiked Wigner model:

Y_λ = √(λ/n) X X^T + Z,   Z Gaussian symmetric (Wigner),   i.e., Y_ij = c X_i X_j + Z_ij

• X = (X_1, ..., X_n) i.i.d. Bernoulli(ε)  ->  sparse PCA   [Amini-Wainwright ’09, Deshpande-Montanari ’14]
• X = (X_1, ..., X_n) i.i.d. Rademacher(1/2)  ->  SBM???

    lim_{n→∞} (1/n) I(X; G(n, p_n, q_n))   ?=   lim_{n→∞} (1/n) I(X; Y_λ)

Theorem. If λ_n = n(p_n - q_n)^2 / (2(p_n + q_n)) → λ (finite SNR), with n p_n, n q_n → ∞ (large degrees), then

    lim_{n→∞} (1/n) I(X; G(n, p_n, q_n))  =  lim_{n→∞} (1/n) I(X; Y_λ)  =  λ/4 + λ_*^2/(4λ) - λ_*/2 + I(λ_*),

where λ_* solves λ_* = λ(1 - MMSE(λ_*)) for the single-letter channel Y_1(λ) = √λ X_1 + Z_1, via I-MMSE [Guo-Shamai-Verdú].

[Deshpande-Abbe-Montanari ’15]

56
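A minimal sketch of the spiked Wigner model with a Rademacher spike (the SBM-like case), together with the natural PCA estimate; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def spiked_wigner(x, lam):
    """Y = sqrt(lam/n) * x x^T + Z, with Z a symmetric Gaussian (Wigner) noise matrix."""
    n = len(x)
    Z = rng.normal(size=(n, n))
    Z = (Z + Z.T) / np.sqrt(2)                    # symmetrize the noise
    return np.sqrt(lam / n) * np.outer(x, x) + Z

n, lam = 500, 4.0
x = rng.choice([-1, 1], size=n)                   # Rademacher(1/2) spike
Y = spiked_wigner(x, lam)
_, vecs = np.linalg.eigh(Y)
x_hat = np.sign(vecs[:, -1])                      # PCA estimate from the top eigenvector
print("overlap:", abs(x_hat @ x) / n)
```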

Page 57: ISIT Tutorial Information theory and machine learning Part II

Conclusion

Community detection couples naturally with the channel view of information theory and more specifically with:

- graph-based codes
- f-divergences
- broadcasting problems
- I-MMSE
- ...
(unorthodox versions of these...)

More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.

57

Page 58: ISIT Tutorial Information theory and machine learning Part II

Documents related to the tutorial: www.princeton.edu/~eabbe

Questions?

58