
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering

M. Defferrard¹, X. Bresson², P. Vandergheynst¹

¹EPFL, Lausanne, Switzerland

²Nanyang Technological University, Singapore

Itay Boneh, Asher Kabakovitch
Tel-Aviv University, Deep Learning Seminar

2017

Spectral Filtering: Non-parametric

From the convolution theorem:

$x \ast_{\mathcal{G}} g = U\left(U^\top g \odot U^\top x\right) = U\left(\hat{g} \odot U^\top x\right)$

Or algebraically:

$x \ast_{\mathcal{G}} g = U \operatorname{diag}(\hat{g}_1, \dots, \hat{g}_n)\, U^\top x$

- Not localized
- O(n) parameters to train
- O(n²) multiplications (no FFT; sketched below)
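For contrast, here is a minimal sketch of this non-parametric filtering computed literally from the formula above (the function name and arguments are my own, not the paper's code). It needs a full eigendecomposition of the Laplacian and two dense multiplications with the Fourier basis U, which is exactly the cost the paper sets out to remove:

```python
import numpy as np

def spectral_filter(L, x, g_hat):
    """x *_G g = U diag(g_hat) U^T x, with one learned coefficient per eigenvalue."""
    lam, U = np.linalg.eigh(L)         # full EVD of the graph Laplacian
    return U @ (g_hat * (U.T @ x))     # two dense O(n^2) basis multiplications
```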


Spectral Filtering: Polynomial Parametrization

g is a continuous 1-D function, parametrized by θ:

$x \ast_{\mathcal{G}} g = U\left(g_\theta(\lambda) \odot U^\top x\right) = U \operatorname{diag}\big(g_\theta(\lambda_1), \dots, g_\theta(\lambda_n)\big)\, U^\top x = U g_\theta(\Lambda) U^\top x = g_\theta(L)\, x$

The last step uses the fact that a function of a diagonalizable matrix acts on its eigenvalues: $T\varphi = \varphi\Lambda \;\Rightarrow\; f(T) = \varphi\, f(\Lambda)\, \varphi^{-1}$


Polynomial parametrization:

$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$

- K-localized: each application of L reaches one hop further (derivation below)
- K parameters to train, independent of the graph's size
- Still O(n²) multiplications (multiplications with the basis U)
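A short derivation of the K-localization claim, which the slide states without proof: $L$ has the sparsity pattern of the adjacency matrix, so $(L^k)_{i,j} = 0$ whenever $d_{\mathcal{G}}(i,j) > k$. Hence $\big(g_\theta(L)\, x\big)_i = \sum_{k=0}^{K-1} \theta_k\, (L^k x)_i$ depends only on the values $x_j$ of vertices within $K-1$ hops of $i$.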

Spectral Filtering: Polynomial Parametrization

[Figure: impulse response of the filter on a 2-D Euclidean domain vs. on a graph]

Spectral Filtering: Recursive Polynomial Parametrization

Chebyshev polynomials: $T_k(\lambda) = 2\lambda\, T_{k-1}(\lambda) - T_{k-2}(\lambda)$, with $T_0(\lambda) = 1$, $T_1(\lambda) = \lambda$

Parametrization:

$g_\theta(\tilde{L}) = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})$

$\tilde{L} = 2L\lambda_n^{-1} - I$ (rescales the spectrum to $[-1, 1]$, where the Chebyshev polynomials form an orthogonal basis)

Spectral Filtering: Recursive Polynomial Parametrization

Filtering:

$y = g_\theta(\tilde{L})\, x = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x = \sum_{k=0}^{K-1} \theta_k\, \bar{x}_k$

Recurrence: $\bar{x}_k = T_k(\tilde{L})\, x = 2\tilde{L}\, \bar{x}_{k-1} - \bar{x}_{k-2}$, with $\bar{x}_0 = x$, $\bar{x}_1 = \tilde{L} x$

- K-localized: each application of L reaches one hop further
- K parameters to train, independent of the graph's size
- O(Kn) multiplications (more precisely, $O(K|\mathcal{E}|)$; see the sketch below)
- No EVD of L needed
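A minimal sketch of this recurrence (my own helper, not the authors' reference code), assuming a sparse rescaled Laplacian with spectrum in [−1, 1]. Each step costs one sparse matrix-vector product, which is where the $O(K|\mathcal{E}|)$ count comes from:

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(L_tilde, x, theta):
    """y = sum_k theta[k] * T_k(L_tilde) x, via the Chebyshev recurrence (K >= 2)."""
    x_prev, x_curr = x, L_tilde @ x        # T_0(L)x = x,  T_1(L)x = Lx
    y = theta[0] * x_prev + theta[1] * x_curr
    for theta_k in theta[2:]:
        # T_k(L)x = 2 L T_{k-1}(L)x - T_{k-2}(L)x
        x_prev, x_curr = x_curr, 2 * (L_tilde @ x_curr) - x_prev
        y = y + theta_k * x_curr
    return y

# Toy usage on a 3-node path graph (assumed example, K = 3):
W = sp.csr_matrix(np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]))
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # combinatorial Laplacian
L_tilde = 2 * L / 3.0 - sp.eye(3)                     # lambda_max = 3 for this graph
y = chebyshev_filter(L_tilde, np.array([1., 0., 0.]), [0.5, 0.3, 0.2])
```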


Graph CNN

Graph CNN: Learning Filters

$y_{s,j} = \sum_{i=1}^{F_{\mathrm{in}}} g_{\theta_{i,j}}(L)\, x_{s,i}$

$\frac{\partial L}{\partial \theta_{i,j}} = \sum_{s=1}^{S} \left[\bar{x}_{s,i,0}, \dots, \bar{x}_{s,i,K-1}\right]^\top \frac{\partial L}{\partial y_{s,j}}$

$\frac{\partial L}{\partial x_{s,i}} = \sum_{j=1}^{F_{\mathrm{out}}} g_{\theta_{i,j}}(L)\, \frac{\partial L}{\partial y_{s,j}}$

where:

- $s = 1, \dots, S$: sample index
- $i = 1, \dots, F_{\mathrm{in}}$: input feature map index
- $j = 1, \dots, F_{\mathrm{out}}$: output feature map index
- $\theta_{i,j}$: the $F_{\mathrm{in}} \times F_{\mathrm{out}}$ Chebyshev coefficient vectors, each of order K
- $L$ (in the derivatives): the loss over a mini-batch of S samples
- $\bar{x}_{s,i,k} = T_k(\tilde{L})\, x_{s,i}$: the Chebyshev basis signals from the recurrence above


Cost: $O(K\, |\mathcal{E}|\, F_{\mathrm{in}} F_{\mathrm{out}} S)$ operations

- Easily parallelized (a single-sample sketch follows)
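A single-sample sketch of the forward pass of such a layer (names and shapes are my own choices, not the paper's code). The K Chebyshev basis signals are computed once and shared across all $F_{\mathrm{in}} \times F_{\mathrm{out}}$ filters:

```python
import numpy as np

def graph_conv_layer(L_tilde, X, theta):
    """X: (n, F_in) node features; theta: (F_in, F_out, K) Chebyshev coefficients.
    Returns Y: (n, F_out) with Y[:, j] = sum_i g_{theta[i, j]}(L_tilde) X[:, i]."""
    F_in, F_out, K = theta.shape
    Xbar = [X, L_tilde @ X]                       # T_0(L)X, T_1(L)X
    for _ in range(2, K):
        Xbar.append(2 * (L_tilde @ Xbar[-1]) - Xbar[-2])
    Xbar = np.stack(Xbar[:K])                     # (K, n, F_in)
    return np.einsum('kni,ijk->nj', Xbar, theta)  # mix basis signals per (i, j) pair
```

Batching over S samples just adds a leading sample axis; the heavy work remains the K sparse products, matching the operation count above.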

Graph Pooling: Coarsening

- Multilevel clustering algorithm
- Reduces the size of the graph by a specified factor (here 2)
- Does all this efficiently

Graclus multilevel clustering algorithm:

- Maximizes the local normalized cut.
- Greedily picks an unmarked vertex i and matches it with an unmarked neighbour j that maximizes the local normalized cut $W_{i,j}(1/d_i + 1/d_j)$ (see the sketch after this list).
- Extremely fast.
- Divides the number of nodes by approximately 2.
- Might generate singleton (unmatched) nodes; this is solved by adding fake nodes.
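A toy sketch of one such greedy matching pass; this is a simplification under my own assumptions (dense weights, arbitrary visit order), not the actual Graclus implementation:

```python
import numpy as np

def greedy_coarsening_match(W):
    """W: dense (n, n) symmetric weight matrix. Returns matched pairs and
    singletons from one greedy local-normalized-cut matching pass."""
    n = W.shape[0]
    d = W.sum(axis=1)                      # weighted vertex degrees
    marked = np.zeros(n, dtype=bool)
    clusters = []
    for i in range(n):                     # visit order is arbitrary in this sketch
        if marked[i]:
            continue
        marked[i] = True
        nbrs = [j for j in np.flatnonzero(W[i]) if not marked[j]]
        if nbrs:                           # match with the best unmarked neighbour
            j = max(nbrs, key=lambda j: W[i, j] * (1.0 / d[i] + 1.0 / d[j]))
            marked[j] = True
            clusters.append((i, j))
        else:
            clusters.append((i,))          # singleton: later paired with a fake node
    return clusters
```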


Graph Pooling: Fast Pooling

Example: pooling by 4

- Graclus generates singletons: $n_0 = 8 \to n_1 = 5 \to n_2 = 3$
- By adding fake nodes we get: $n_2 = 3 \to n_1 = 6 \to n_0 = 12$
- $z = [\max(x_0, x_1), \max(x_4, x_5, x_6), \max(x_8, x_9, x_{10})]$
- Balanced binary trees ⇒ efficient on GPUs (see the toy example below)
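A toy numeric illustration of why this works (the values are made up): once nodes are reordered as leaves of the balanced binary tree and fake nodes are filled with −∞, graph pooling by 4 becomes plain 1-D max pooling:

```python
import numpy as np

FAKE = -np.inf                          # fake nodes never win the max
x = np.array([5., 3., FAKE, FAKE,       # x0, x1 + two fake nodes
              1., 9., 2., FAKE,         # x4, x5, x6 + one fake node
              4., 7., 6., FAKE])        # x8, x9, x10 + one fake node
z = x.reshape(-1, 4).max(axis=1)        # regular 1-D max pooling, stride 4
print(z)                                # [5. 9. 7.]
```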

Experiments: MNIST

$W_{i,j} = e^{-\|x_i - x_j\|_2^2 / \sigma^2}$

- 28×28 pixels + 192 fake nodes ⇒ $n = |\mathcal{V}| = 976$
- 8-NN graph ⇒ $|\mathcal{E}| = 3198$ (construction sketched after this list)
- Based on LeNet-5 ⇒ K = 5


- Isotropic filters (no notion of orientation)
- Optimizations and initializations were left uninvestigated
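A sketch of the 8-NN graph construction on the pixel grid; parameter choices such as the bandwidth heuristic are my assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_graph(coords, k=8, sigma2=None):
    d2 = cdist(coords, coords, 'sqeuclidean')   # pairwise squared distances
    if sigma2 is None:
        sigma2 = d2.mean()                      # bandwidth heuristic (assumption)
    W = np.exp(-d2 / sigma2)
    far = np.argsort(d2, axis=1)[:, k + 1:]     # everything beyond the k nearest
    np.put_along_axis(W, far, 0.0, axis=1)
    np.fill_diagonal(W, 0.0)                    # no self-loops
    return np.maximum(W, W.T)                   # symmetrize

rows, cols = np.meshgrid(np.arange(28), np.arange(28), indexing='ij')
coords = np.stack([rows.ravel(), cols.ravel()], axis=1)  # the 784 pixel positions
W = knn_graph(coords, k=8)
```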

Experiments: 20NEWS

- 18,846 text documents associated with 20 classes
- 10K most common words out of the 94K unique words ⇒ $n = |\mathcal{V}| = 10\mathrm{K}$
- 16-NN graph with $W_{i,j} = e^{-\|z_i - z_j\|_2^2 / \sigma^2}$ ($z_i$: word2vec embeddings) ⇒ $|\mathcal{E}| = 132{,}834$
- $x$: bag-of-words representation of each document

Experiments: 20NEWS

Comparison with other methods (K = 5):

- Slightly worse than a multinomial naive Bayes classifier.
- Beats fully-connected networks with far fewer parameters.

Total training time divided by the number of gradient steps:

- Scales as O(n), as opposed to the O(n²) of non-localized spectral filtering.


Experiments: Graph Quality

⇒ The structure of the data is important

⇒ A well-constructed graph is important

⇒ Proper approximations (e.g., LSHForest) can be used for larger databases

Conclusions and Future Work

- Introduced a model with linear computational complexity.
- The quality of the input graph is of paramount importance.
- Local stationarity and compositionality are verified for text documents, as long as the graph is well constructed.
