
Graph Neural Networks

Alejandro Ribeiro, Electrical and Systems Engineering, University of Pennsylvania

[email protected]

Thanks: Joan Bruna, Luiz Chamon, Fernando Gama, Daniel Lee, Antonio Garcia Marques, Mark Eisen, Elvin Isufi, Vijay Kumar, Geert Leus, George Pappas, Jimmy Paulos, Luana Ruiz, Kate Tolstaya, Clark Zhang

Support: ARO W911NF1710438, Intel ISTC-WAS, NSF CCF 1717120, ARL DCIST CRA W911NF-17-2-0181

June 7, 2019

A. Ribeiro Graph Neural Networks 1/49

Machine Learning for Graph Signals

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 2/49

Machine Learning for Everything

I The promise of Machine Learning is boundless. Should we use it to solve everything?

I Machine Learning (ML) is a set of tools for solving optimization problems with data

⇒ Other names for ML are regression, pattern recognition, or statistical signal processing

I Thus, the use of ML is justified when models are unavailable or inaccurate

I Or when they are too complex, so that we are better off using them as generators of simulated data

A. Ribeiro Graph Neural Networks 3/49


Machine Learning for Everything

I The promise of Machine Learning is boundless. Can we use it to solve everything?

I In 2019 Machine Learning ≡ Deep Learning ≡ Convolutional Neural Networks (CNNs)

⇒ Since they rely on convolutions, we expect CNNs to work for Euclidean signals only (time, images)

⇒ Recent remarkable successes of Neural Networks are for image and speech processing, indeed

I Fully connected neural networks do not scale. They work for small scale problems only

I In fact, no ML method works in high dimensions if we can’t exploit signal structure

A. Ribeiro Graph Neural Networks 6/49


Machine Learning on Graphs

I Graphs are generic models of structure that can help to learn in several practical problems

I There is overwhelming empirical and theoretical justification to choose a neural network (NN)

I Challenge is we want to run a NN over this

I And we are good at running NNs over this

I But one of the central purposes of graph signal processing is to generalize convolutions to graphs

⇒ Compose graph filters with pointwise nonlinearities ⇒ (convolutional) graph neural network (GNN)

A. Ribeiro Graph Neural Networks 7/49


Authorship Attribution

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 8/49

Authorship Attribution with Word Adjacency Networks

I Function words are those that don’t carry meaning

I Their use depends on the language’s grammar

I Different authors use slightly different grammar

I Capture with a word adjacency network (WAN)

⇒ How often pairs of words appear together

Segarra-Eisen-Ribeiro, Authorship Attribution through Function Word Adjacency Networks, TSP 2015

[Figure: word adjacency network built over the full list of function words]

A. Ribeiro Graph Neural Networks 9/49
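
As a concrete illustration of the WAN idea above (how often pairs of function words appear together), the following is a minimal sketch, not the paper's construction: it counts co-occurrences of function words inside a sliding window, and both the window length and the unordered counting are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def word_adjacency_network(tokens, function_words, window=10):
    """Count co-occurrences of function words within a sliding window.

    tokens         : list of lower-cased words of a text, in order.
    function_words : iterable of function words (the node set of the WAN).
    window         : co-occurrence window length (an assumed choice).

    Returns a dict mapping (word_i, word_j) pairs to counts, which can be
    arranged into the weighted adjacency matrix S of the WAN.
    """
    fset = set(function_words)
    counts = Counter()
    for start in range(len(tokens)):
        seen = [w for w in tokens[start:start + window] if w in fset]
        for wi, wj in combinations(sorted(set(seen)), 2):
            counts[(wi, wj)] += 1
    return counts
```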


Henry VI is a Collaborative Effort by Shakespeare and Marlowe

I Shakespeare’s and Marlowe’s WANs are sufficiently different to ascertain their collaboration in Henry VI

[Figure: word adjacency networks of Shakespeare and Marlowe built over the same set of function words]

Segarra-Eisen-Egan-Ribeiro, Attributing the Authorship of the Henry VI Plays by Word Adjacency, Shakespeare Quarterly 2016

A. Ribeiro Graph Neural Networks 10/49


Authorship Attribution with a GNN

I When texts are long we can attribute by comparing word networks of different texts

I When texts are short comparing networks is unreliable

⇒ Compare histograms of different texts defined as graph signals over WANs

I Pick out pages (1K words) written by E. Bronte or J. Austen from a pool of 22 contemporaries

[Figure: Emily Bronte (left) and Jane Austen (right)]

I Different GNN architectures all achieve good error rates ⇒ ∼ 12% (Bronte) and ∼ 4% (Austen)

Ruiz-Gama-Marques-Ribeiro, Invariance-Preserving Localized Activation Functions for Graph Neural Networks, arxiv:1903.12575, 2019

A. Ribeiro Graph Neural Networks 11/49

Learning Decentralized Controllers in Distributed Systems

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 12/49

Decentralized Coordination of a Robot Swarm

I We want the team to coordinate on their individual velocities without colliding with each other

I This is a very easy problem to solve if we allow for centralized coordination ⇒ ui = ∑_{i=1}^{N} vi

I But it is very difficult to solve if we do not allow for centralized coordination ⇒ ui = . . .

A. Ribeiro Graph Neural Networks 13/49


Information Structure on Distributed Systems

I The challenge in designing behaviors for distributed systems is the partial information structure

xi(n),   xj(n−1) for j ∈ N_i^1,   xj(n−2) for j ∈ N_i^2,   xj(n−3) for j ∈ N_i^3

I Node i has access to its own local information at time n ⇒ xi(n)

I And the information of its 1-hop neighbors at time n − 1 ⇒ xj(n−1) for all j ∈ N_i^1

I And the information of its 2-hop neighbors at time n − 2 ⇒ xj(n−2) for all j ∈ N_i^2

I And the information of its 3-hop neighbors at time n − 3 ⇒ xj(n−3) for all j ∈ N_i^3

A. Ribeiro Graph Neural Networks 14/49

Learning to Imitate the Centralized Solution

I Control actions can only depend on information history ⇒ H_in = ⋃_{k=0}^{K−1} { xj(n−k) : j ∈ N_i^k }

I Optimal controller is famously difficult to find. Even for very simple linear systems

⇒ Witsenhausen, H. “A counterexample in stochastic optimum control” (February 1968)

I When optimal solutions are out of reach we resort to heuristics ⇒ data driven heuristics

I The centralized optimal control policy π∗(xn) can be computed during training time

I Introduce parametrization and learn decentralized policy that imitates centralized policy

H∗ = argmin_H E_{π∗} [ L( π(H_in, H), π∗(xn) ) ]

I Need parametrization H adapted to the information structure H_in ⇒ Aggregation GNN

A. Ribeiro Graph Neural Networks 15/49
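
A minimal numerical sketch of the imitation step above, under simplifying assumptions: the decentralized policy is a linear readout of the local aggregation sequence z_i (defined on the next slide), shared across nodes, and the loss L is squared error against the centralized expert actions collected at training time.

```python
import numpy as np

def fit_imitation_taps(Z, u_star):
    """Least-squares imitation of a centralized expert.

    Z      : (num_samples, K) stacked local aggregation sequences z_i(n),
             one row per (node, time) pair in the training set.
    u_star : (num_samples,) centralized expert actions pi*(x_n) at the same pairs.

    Returns shared filter taps h minimizing sum ||Z h - u_star||^2, i.e. the
    linear special case of H* = argmin E[ L(pi(H_in, H), pi*(x_n)) ].
    """
    h, *_ = np.linalg.lstsq(Z, u_star, rcond=None)
    return h
```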

The Aggregation Sequence (Diffusion Sequence)

y0n = xn,   y1n = S y0(n−1),   y2n = S y1(n−2),   y3n = S y2(n−3)

I Aggregate information at nodes through successive averaging with graph adjacency S

ykn = S y(k−1)(n−1)  ⇒  [ykn]_i = [S y(k−1)(n−1)]_i = ∑_{j ∈ N_i} [S]_ij [y(k−1)(n−1)]_j

I Computed with local operations that respect information structure of distributed system
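
A minimal NumPy sketch of this aggregation (diffusion) sequence: the recursion ykn = S y(k−1)(n−1) unrolls to yk(n) = S^k x(n−k), which is what the loop computes. The array shapes and names are assumptions for illustration only.

```python
import numpy as np

def aggregation_sequence(S, x_history):
    """Delayed aggregation sequence y_k(n) = S y_{k-1}(n-1), i.e. y_k(n) = S^k x(n-k).

    S         : (N, N) graph shift operator (adjacency or Laplacian).
    x_history : (K, N) array with rows [x(n), x(n-1), ..., x(n-K+1)].

    Returns Z of shape (N, K); row i is the local sequence z_i(n) that node i can
    assemble from exchanges with k-hop neighbors that are k time steps old.
    """
    K, N = x_history.shape
    Z = np.zeros((N, K))
    Z[:, 0] = x_history[0]                 # y_0(n) = x(n)
    Sk = np.eye(N)
    for k in range(1, K):
        Sk = S @ Sk                        # S^k
        Z[:, k] = Sk @ x_history[k]        # y_k(n) = S^k x(n-k)
    return Z
```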

A. Ribeiro Graph Neural Networks 16/49

Aggregation Graph Neural Networks

y0n = xn,   y1n = S y0(n−1),   y2n = S y1(n−2),   y3n = S y2(n−3)

[Figure: the aggregation sequence zin collected at node i is fed to a CNN that outputs the control action uin]

I Aggregation sequence zin = [ [y0n]_i ; [y1n]_i ; . . . ; [y(K−1)n]_i ] has a temporal structure

I Having a temporal structure it can be processed with a regular CNN
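
Since each zin is an ordinary time sequence, one layer of the aggregation GNN can be sketched as a standard 1-D convolution plus a pointwise nonlinearity with taps shared by all nodes; this is a schematic single-feature version, not the trained architecture.

```python
import numpy as np

def aggregation_gnn_layer(Z, taps, sigma=np.tanh):
    """Apply the same 1-D convolutional filter and pointwise nonlinearity to the
    local aggregation sequence of every node (rows of Z), as a regular CNN would."""
    return np.stack([sigma(np.convolve(z_i, taps, mode="valid")) for z_i in Z])
```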

Gama-Marques-Leus-Ribeiro, Convolutional Neural Network Architectures for Signals Supported on Graphs, TSP, 2019

A. Ribeiro Graph Neural Networks 17/49

Permutation Covariance

I The Aggregation Graph Neural Network is defined by a filter tensor H

I This tensor is shared by all the nodes in the graph = agents in the system

⇒ If two agents observe the same input

⇒ Their K -hop neighbors observe the same inputs

⇒ And the local structures of the graph are the same

⇒ The output of the control policy is the same. As it should. Or not.

I Aggregation GNN is permutation covariant ⇒ Permute graph and input = permute output

I Permutation covariance is not a choice. It is a necessity for offline training

A. Ribeiro Graph Neural Networks 18/49


Offline Training vs Online Execution

I If we want to train offline and execute online we can’t assume the graph is the same

I Train offline on a graph like this

I And execute online on a graph like this

I CNN runs on aggregation sequence ⇒ It can be run independently of the graph’s structure.

I Permutation covariance says that if graphs are similar CNN outputs will be similar.

I Thereby producing similar policies and ensuring generalization across different graphs

I Aggregation sequence is the only linear processing that generates permutation covariance.

A. Ribeiro Graph Neural Networks 19/49

Coordinating a Robot Swarm

I The GNN learns to imitate the central policy. Outperforms existing distributed control methods

Tolstaya-Gama-Paulos-Pappas-Kumar-Ribeiro, Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks, arxiv:1903.10527, 2019

A. Ribeiro Graph Neural Networks 20/49

Learning Optimal Resource Allocations in Wireless Communications Networks

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 21/49

Interference Channel

I Groups of communicating pairs from infrastructure base stations to mobile user terminals

A. Ribeiro Graph Neural Networks 22/49

Interference Channel

I Groups of communicating pairs from infrastructure base stations to mobile user terminals

I There is interference crosstalk from other communicating pairs that we can describe with a graph

A. Ribeiro Graph Neural Networks 23/49

Interference Channel

I Pairs communicate over a time varying fading channel. Pair i is interfered by neighboring pairs j

⇒ Channel is hi for communicating pair i. Transmitter allocates power pi(h), with h = [h1; . . . ; hn]

⇒ Channel crosstalk from pair j to the receiver of pair i is hij. Nonzero when j ∈ n(i)

I We want to select a power allocation that maximizes communication rates in some sense

p∗(h) = argmax_{p(h)} E_h [ ∑_i log( 1 + hi pi(h) / ( 1 + ∑_{j∈n(i)} hij pj(h) ) ) ]

I This is a problem we know well. We can’t solve it exactly but we can approximate it (WMMSE)

A. Ribeiro Graph Neural Networks 24/49

Model Uncertainty and Model Complexity

I There are two drawbacks to the use of WMMSE and other model based solutions

⇒ We can model the rate function but there is a mismatch between reality and model

ci = log( 1 + hi pi(h) / ( 1 + ∑_{j∈n(i)} hij pj(h) ) ) + f(h, p(h))

⇒ We modularize design but in reality we can have ∼ 10^3 base stations with ∼ 10^5 active users

I To some extent, both drawbacks can be ameliorated with an ML parametrization

⇒ The function f is unknown but it is possible to probe the environment to evaluate f (x, u)

⇒ The optimization problem is too difficult (large scale). The parametrization may make it easier

I ML parametrizations are justified in large scale wireless communications with uncertain models

A. Ribeiro Graph Neural Networks 25/49

Interference Channel as a Statistical Learning Problem

I Unsupervised statistical learning maps data to actions that minimize an expected loss

u∗(x) = argmin_{u(x)} E_x [ f( x, u(x) ) ]

I Given fading channel we search for a power allocation that maximizes expected sum rates

p∗(h) = argmax_{p(h)} E_h [ ∑_i log( 1 + hi pi(h) / ( 1 + ∑_{j∈n(i)} hij pj(h) ) ) ]

I Fading channel ∼ Data. Power allocation ∼ classifier. Capacity function ∼ loss.

I Parametrize with a Neural Network. Or parametrize with a Graph Neural Network. Either way

θ∗ = argmax_θ E_h [ ∑_i log( 1 + hi pi(h, θ) / ( 1 + ∑_{j∈n(i)} hij pj(h, θ) ) ) ]

Eisen-Zhang-Chamon-Lee-Ribeiro, Learning Optimal Resource Allocations in Wireless Systems, TSP, 2019
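
A schematic view of the resulting training loop, under assumptions: p(h; θ) is some differentiable parametrization (a GNN over the interference graph being one choice), channels are sampled rather than modeled, and `sample_fading` and `grad_sum_rate` are hypothetical helpers standing in for probing the environment and for the gradient of the sampled sum rate.

```python
def train_power_policy(theta, sample_fading, grad_sum_rate, steps=10_000, lr=1e-3):
    """Stochastic gradient ascent on the expected sum rate E_h[ sum_i log(1 + SINR_i) ].

    sample_fading : hypothetical helper returning one channel realization h.
    grad_sum_rate : hypothetical helper returning the gradient w.r.t. theta of the
                    sum rate evaluated at h with powers p(h; theta).
    """
    for _ in range(steps):
        h = sample_fading()                           # probe the environment
        theta = theta + lr * grad_sum_rate(theta, h)  # ascend the sampled objective
    return theta
```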

A. Ribeiro Graph Neural Networks 26/49

Learning at Scale with Graph Neural Networks

Fully connected neural network ⇒ 8 pairs


Graph Neural Network ⇒ 50 pairs


I A GNN parametrized resource allocation scales to large networks ⇒ The only solution that scales

Eisen-Ribeiro, Large Scale Wireless Power Allocation with Graph Neural Networks, SPAWC 2019

A. Ribeiro Graph Neural Networks 27/49

Transferring for Scale

I GNN built for 50 pairs generalizes to larger networks

⇒ No need for retraining ⇒ exploits permutation invariance of graph convolution

⇒ Performance of GNN trained with 50 pairs (top) compared to one trained for the larger network (bottom)

[Figure: sum-rate histograms of the REGNN trained on 50 pairs (top) and of REGNNs trained on the larger networks (bottom); (left) 75 and (right) 100 pairs]

A. Ribeiro Graph Neural Networks 28/49

Transferring for Scale

I GNN built for 50 pairs generalizes to larger networks

⇒ No need for retraining ⇒ exploits permutation invariance of graph convolution

[Figure: performance as the number of pairs grows from 50 to 500]

A. Ribeiro Graph Neural Networks 29/49

Invariance and Stability Properties of Graph Neural Networks

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 30/49

Signals Supported on Graphs (Graph Signals)

I Describe graph with adjacency matrix S

⇒ Sij = Proximity between i and j

I Define a signal x on top of the graph

⇒ xi = Signal value at node i

[Figure: a five-node graph with a signal value xi attached to each node i]

I Element Sij encodes notion of proximity between signal values xi and xj

I Graph Signal Processing (Neural Networks) ⇒ Exploit structure encoded in S to process x

I Why do we expect the graph structure S to be useful in processing x?

A. Ribeiro Graph Neural Networks 31/49

The importance of signal structure in time

I Signal and Information Processing is always about exploiting the structure of the signal

⇒ We don’t always make it explicit. But it is always implicit

I Discrete time described by cyclic graph

⇒ Time n follows time n − 1

⇒ Signal value xn similar to xn−1

I Formalized with the notion of frequency

[Figure: a cyclic graph representing discrete time, with signal values x1, . . . , x6 on its nodes]

A. Ribeiro Graph Neural Networks 32/49

Exploiting Time Structure in Machine Learning

I Given a training set of input-output example pairs T = {(x, y)}

I Find the best arbitrary linear regressor for some loss function f ⇒ H∗ = argmin_H ∑_{(x,y)∈T} f( Hx, y )

I Or, find the best convolution (linear) regressor for the same loss f ⇒ h∗ = argmin_h ∑_{(x,y)∈T} f( h ∗ x, y )

where h ∗ x denotes the signal with components zn = ∑_k hk x_{n−k}

I The linear regressor is better than the convolution by definition. Both are linear. One is generic

⇒ This is true in the training set. In the test set, the convolution does better. It generalizes.

A. Ribeiro Graph Neural Networks 33/49

Convolutional Neural Networks (CNN)

I The convolution is covariant to time shifts ⇒ S(h ∗ x) = h ∗ S(x) .

⇒ It respects and exploits the internal symmetries of the input signal

I What should we do if we want a nonlinear processing architecture?

⇒ Compose convolution with a pointwise nonlinearity

I And if we want a more descriptive architecture?

I Stack layers composing pointwise nonlinearities with convolutional filters (input x0 = x, output y = xL)

x1 = σ1( h1 ∗ x ),  . . . ,  xℓ = σℓ( hℓ ∗ xℓ−1 ),  . . . ,  xL = σL( hL ∗ xL−1 )

I This is a CNN. It is covariant to time shifts

A. Ribeiro Graph Neural Networks 34/49
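
A single-feature sketch of the layer recursion above using ordinary time convolutions; the filter lengths, the tanh nonlinearity, and the "same" padding are illustrative assumptions.

```python
import numpy as np

def cnn_forward(x, filters, sigma=np.tanh):
    """x_l = sigma(h_l * x_{l-1}) for a list of tap vectors filters = [h_1, ..., h_L]."""
    for h in filters:
        x = sigma(np.convolve(x, h, mode="same"))
    return x
```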

Graph Convolutions

I Interpret the graph adjacency or Laplacian as a shift operator S

I Define graph convolution as a polynomial on the shift operator

y = h0 S^0 x + h1 S^1 x + h2 S^2 x + . . . + h_{K−1} S^{K−1} x = ∑_{k=0}^{K−1} hk S^k x := h ∗S x

I Generalizes regular (time) convolution ⇒ It’s recovered if we use the adjacency of a cycle graph
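
A minimal sketch of the graph convolution just defined, accumulating the powers S^k x recursively; with S the adjacency matrix of a directed cycle this reduces to an ordinary (circular) time convolution.

```python
import numpy as np

def graph_convolution(S, x, h):
    """y = sum_k h[k] S^k x for taps h = [h_0, ..., h_{K-1}]."""
    y = np.zeros_like(x, dtype=float)
    Skx = np.asarray(x, dtype=float)      # S^0 x
    for hk in h:
        y += hk * Skx
        Skx = S @ Skx                     # next power applied to x
    return y
```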

A. Ribeiro Graph Neural Networks 35/49

Graph Neural Networks (GNNs)

I The graph convolution is covariant to permutations (relabelings)

h ∗_{S′} x′ = ∑_{k=0}^{K−1} hk (PᵀSP)^k Pᵀx = Pᵀ( ∑_{k=0}^{K−1} hk S^k x ) = Pᵀ( h ∗S x ),   where S′ = PᵀSP and x′ = Pᵀx

I Graph convolutions exploit the internal symmetry of the graph signal

I And if we want a more descriptive architecture?

I Stack layers composing pointwise nonlinearities with graph convolutional filters (input x0 = x and output y = xL)

x1 = σ1( h1 ∗S x ),  . . . ,  xℓ = σℓ( hℓ ∗S xℓ−1 ),  . . . ,  xL = σL( hL ∗S xL−1 )

I This is a GNN. It is covariant to permutations (relabelings)

A. Ribeiro Graph Neural Networks 36/49
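
The layer recursion above, in a minimal single-feature sketch (one input and one output feature per layer; real GNNs use banks of filters):

```python
import numpy as np

def gnn_forward(S, x, filters, sigma=np.tanh):
    """x_l = sigma(h_l *_S x_{l-1}) for tap vectors filters = [h_1, ..., h_L]."""
    for h in filters:
        x = sigma(sum(hk * np.linalg.matrix_power(S, k) @ x for k, hk in enumerate(h)))
    return x
```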

Covariance to Permutations

I GNN output depends on input signal, graph shift, and filter tensor

Φ(x; S, H) = xL = σL( hL ∗S xL−1 ),  . . . ,  x1 = σ1( h1 ∗S x )

Theorem (Gama, Ribeiro, Bruna)

For shift operators S′ = PᵀSP and graph signals x′ = Pᵀx it holds

Φ(x′; S′, H) = Pᵀ Φ(x; S, H)

A. Ribeiro Graph Neural Networks 37/49
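
A quick numerical check of the covariance statement for a single graph filter, with a random symmetric shift operator and a random permutation (all choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 4
S = rng.random((N, N)); S = (S + S.T) / 2             # symmetric shift operator
x = rng.standard_normal(N)
h = rng.standard_normal(K)
P = np.eye(N)[rng.permutation(N)]                      # permutation matrix

def gconv(S, x, h):                                    # h *_S x = sum_k h_k S^k x
    return sum(hk * np.linalg.matrix_power(S, k) @ x for k, hk in enumerate(h))

# h *_{S'} x' = P^T (h *_S x) for S' = P^T S P, x' = P^T x
assert np.allclose(gconv(P.T @ S @ P, P.T @ x, h), P.T @ gconv(S, x, h))
```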

Covariance to Permutations is More Valuable than Apparent

I Invariance to node relabelings allows GNNs to exploit internal symmetries of graph signals

I Although different, the signals in (a) and (b) are permutations of one another

⇒ Permutation invariance means that the GNN can learn to classify (b) from seeing (a)

[Figure: 12-node graph signals shown in panels (a), (b), and (c)]

A. Ribeiro Graph Neural Networks 38/49

Covariance to Permutations is a Property of Graph Filters

I Permutation covariance = property of graph convolutions inherited by GNNs

⇒ Q1: What is good about pointwise nonlinearities?

⇒ Q2: What is wrong with linear graph convolutions?

I A2: They can be unstable to perturbations of the graph if we push their discriminative power

I A1: They can be made stable to perturbations while retaining discriminability

I Beautifully, these questions can only be answered with an analysis in the GFT domain

A. Ribeiro Graph Neural Networks 39/49

Graph Convolutions in the Frequency Domain

I Graph convolution is a polynomial on the shift operator ⇒ y = ∑_{k=0}^{K−1} hk S^k x

I Decompose the operator as S = V Λ Vᴴ to write the spectral representation of the graph convolution

Vᴴ y = ∑_{k=0}^{K−1} hk (Vᴴ S V)^k Vᴴ x  ⇒  ŷ = ∑_{k=0}^{K−1} hk Λ^k x̂

I where we have used the graph Fourier transform (GFT) definitions x̂ = Vᴴx and ŷ = Vᴴy

I Graph convolution is a pointwise operation in the spectral domain

⇒ Determined by the (graph) frequency response h(λ) = ∑_{k=0}^{K−1} hk λ^k

A. Ribeiro Graph Neural Networks 40/49
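
A small numerical check that the graph convolution acts pointwise in the GFT domain, using a symmetric shift operator so that the eigenvector basis is real, S = V Λ Vᵀ (an assumption made only for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 3
S = rng.random((N, N)); S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)                       # S = V diag(lam) V^T
x = rng.standard_normal(N)
h = rng.standard_normal(K)

y = sum(hk * np.linalg.matrix_power(S, k) @ x for k, hk in enumerate(h))
x_hat, y_hat = V.T @ x, V.T @ y                  # GFTs
h_of_lam = sum(hk * lam**k for k, hk in enumerate(h))   # frequency response h(lambda)
assert np.allclose(y_hat, h_of_lam * x_hat)      # pointwise in the spectral domain
```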

Graph Frequency Response

I We can reinterpret the frequency response as a polynomial on a continuous variable λ ⇒ h(λ) = ∑_{k=0}^{K−1} hk λ^k

[Figure: frequency response h(λ) evaluated at the graph eigenvalues λ0, λ1, . . . , λk, . . . , λn−1]

I Frequency response is the same no matter the graph ⇒ It’s instantiated on its particular spectrum

A. Ribeiro Graph Neural Networks 41/49


Measuring Graph Perturbations

I To measure graph perturbations introduce a relative perturbation model ⇒ S′ = S + EᴴS + SE

I Since graphs S and S′ are identical if they are permutations of each other, define the set

E = { E : PᵀS′P = S + EᴴS + SE for some P ∈ P }

P = set of permutation matrices

I Smallest E matrix that maps S into a permutation of S′ is the relative distance between S and S′

d(S, S′) = min_{E ∈ E} ‖E‖

I This distance measures relative differences between graphs modulo permutations

⇒ In particular, it is small if the graphs are close to being permutations of each other

A. Ribeiro Graph Neural Networks 42/49

Stablity of Integral Lipschitz Filters to Relative Perturbations

I Let h(λ) be the frequency response of filter h. We say h is integral Lipschitz if |λh′(λ)| ≤ C

Theorem

Consider an integral Lipschitz filter h and graphs S and S′ whose relative distance satisfies d(S, S′) ≤ ε/2, and suppose the matrix E that achieves the minimum distance is such that ‖E/‖E‖ − I‖ ≤ ε. Then, for all signals x,

‖ h ∗S x − Pᵀ( h ∗S′ x ) ‖ ≤ C ε + O(ε²)

I Not all filters are stable to graph perturbations (more later). But integral Lipschitz filters are

I Requires validity of the structural perturbation constraint ‖E/‖E‖ − I‖ ≤ ε
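
A rough numerical estimate of the integral Lipschitz constant C = max |λ h′(λ)| for a polynomial filter, evaluated on a grid covering the spectrum of interest (the example taps and the grid below are arbitrary):

```python
import numpy as np

def integral_lipschitz_constant(h, lam_grid):
    """Estimate C = max |lambda * h'(lambda)| for h(lambda) = sum_k h[k] lambda^k."""
    coeffs = np.asarray(h, dtype=float)[::-1]          # highest degree first for polyval
    hprime = np.polyval(np.polyder(coeffs), lam_grid)  # h'(lambda) on the grid
    return float(np.max(np.abs(lam_grid * hprime)))

C = integral_lipschitz_constant([1.0, -0.5, 0.1], np.linspace(-1.0, 1.0, 1001))
```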

Gama-Bruna-Ribeiro, Stability Properties of Graph Neural Networks, arxiv:1905.04497, 2019

A. Ribeiro Graph Neural Networks 43/49

Stability of Graph Neural Networks with Integral Lipschitz Filters

I The nonlinearity σ is said to be normalized Lipschitz if ‖σ(x) − σ(x′)‖ ≤ ‖x − x′‖

Theorem

Consider a GNN with L layers having integral Lipschitz filters hℓ and normalized Lipschitz nonlinearities σℓ. Graphs S and S′ are such that their relative distance satisfies d(S, S′) ≤ ε/2 and the matrix E that achieves the minimum distance satisfies ‖E/‖E‖ − I‖ ≤ ε. Then, for all signals x,

‖ Φ(x′; S′, H) − Pᵀ Φ(x; S, H) ‖ ≤ C L ε + O(ε²)

I GNNs can be made stable to graph perturbations if filters are integral Lipschitz

Gama-Bruna-Ribeiro, Stability Properties of Graph Neural Networks, arxiv:1905.04497, 2019

A. Ribeiro Graph Neural Networks 44/49

Proof of Stability Theorems

I The theorems are elementary to prove for edge dilation ⇒ multiply edges by α ≈ 1

I Graph spectrum is dilated ⇒ If S = VΛVH then S′ = V(αΛ)VH

[Figure: frequency response evaluated on the dilated spectrum λ0, λ1, . . . , λn−1]

I Small deformations may result in large filter variations for large λ

A. Ribeiro Graph Neural Networks 45/49


Discriminative Graph Filter Banks are Unstable

I Q2: What is wrong with linear graph convolutions?

I Graph convolutions can’t simultaneously be stable to deformations and discriminate features at high frequencies

[Figure: filter frequency responses over the spectrum λ0, λ1, . . . , λn−1]

I This limits their value in several machine learning problems

A. Ribeiro Graph Neural Networks 46/49

Nonlinearities Create Low Frequency Components

I Q1: What is good about pointwise nonlinearities?

I Preserves permutation covariance while generating low graph frequency components that we can discriminate with stable filters

[Figure: GFT spectrum of the rectified signal over λ0, λ1, . . . , λn−1]

I Showing spectrum of rectified graph signal ⇒ xrelu = max(x, 0)

⇒ Low frequency components are garbage. But stable garbage

A. Ribeiro Graph Neural Networks 47/49

Concluding Remarks

Machine Learning for Graph Signals

Authorship Attribution

Learning Decentralized Controllers in Distributed Systems

Learning Optimal Resource Allocations in Wireless Communications Networks

Invariance and Stability Properties of Graph Neural Networks

Concluding Remarks

A. Ribeiro Graph Neural Networks 48/49

Concluding Remarks

I Graph Neural Networks (GNN) generalize Convolutional Neural Networks (CNN) to graph signals

⇒ They leverage structure ⇒ Machine learning in high dimensions necessitates structure

I GNNs show particular promise in distributed collaborative intelligent systems

⇒ Data needed for their execution respects the information structure of distributed systems

I GNNs can be made discriminative and stable to deformations of the graph

⇒ A property that linear filters can’t have. Explains their better performance

⇒ Analogous to stability of CNNs versus instability of convolutional filters

A. Ribeiro Graph Neural Networks 49/49