Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
M. Defferrard¹  X. Bresson²  P. Vandergheynst¹
¹EPFL, Lausanne, Switzerland
²Nanyang Technological University, Singapore

Presented by Itay Boneh and Asher Kabakovitch
Tel-Aviv University, Deep Learning Seminar, 2017
Spectral Filtering: Non-parametric

From the convolution theorem:

  x ∗_G g = U(U^T g ⊙ U^T x) = U(ĝ ⊙ U^T x)

or, algebraically:

  x ∗_G g = U diag(ĝ_1, …, ĝ_n) U^T x

- Not localized
- O(n) parameters to train
- O(n²) multiplications (no FFT)
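A minimal NumPy sketch of this non-parametric filtering; the small random graph and the filter values ĝ are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # toy symmetric weights
L = np.diag(W.sum(axis=1)) - W                                     # combinatorial Laplacian

lam, U = np.linalg.eigh(L)        # graph Fourier basis: L = U diag(lam) U^T
x = rng.standard_normal(n)        # signal on the vertices
g_hat = rng.standard_normal(n)    # one free value per eigenvalue -> O(n) parameters

y = U @ (g_hat * (U.T @ x))       # x *_G g = U (g_hat ⊙ U^T x)
# Both multiplications with the dense basis U cost O(n^2); there is no FFT on a general graph.
```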
Spectral Filtering: Polynomial Parametrization

g is a continuous 1-D function, parametrized by θ:

  x ∗_G g = U(g_θ(λ) ⊙ U^T x) = U diag(g_θ(λ_1), …, g_θ(λ_n)) U^T x = U g_θ(Λ) U^T x = g_θ(L) x

(functional calculus: if TΦ = ΦΛ, then f(T) = Φ f(Λ) Φ⁻¹)

Polynomial parametrization:

  g_θ(Λ) = Σ_{k=0}^{K−1} θ_k Λ^k

- K-localized: one hop per application of L
- K parameters to train, independent of the graph's size
- Still O(n²) multiplications (multiplications with the basis U)
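A sketch of the polynomial filter applied directly through powers of L, on a made-up graph with made-up coefficients θ; it shows why the result is K-localized:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 8, 3
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

theta = rng.standard_normal(K)    # K parameters, independent of n
x = rng.standard_normal(n)

# g_theta(L) x = sum_k theta_k L^k x, built up by repeated applications of L
y, Lkx = np.zeros(n), x.copy()
for k in range(K):
    y += theta[k] * Lkx
    Lkx = L @ Lkx                 # each application of L spreads one hop further
# (L^k)_{ij} = 0 whenever i and j are more than k hops apart, so the
# filter's support is exactly the K-hop neighbourhood of each vertex.
```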
Spectral Filtering: Polynomial Parametrization

[Figure: impulse responses of the learned filter, on a 2D Euclidean domain and on a graph]
Spectral Filtering: Recursive Polynomial Parametrization

Chebyshev polynomials: T_k(λ) = 2λ T_{k−1}(λ) − T_{k−2}(λ)
  T_0(λ) = 1
  T_1(λ) = λ

Parametrization:

  g_θ(L̃) = Σ_{k=0}^{K−1} θ_k T_k(L̃)

where L̃ = 2L λ_n⁻¹ − I rescales the spectrum into [−1, 1], the interval on which the Chebyshev polynomials form an orthogonal basis.
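One possible way to form L̃ with SciPy, assuming a placeholder sparse random graph; only the largest eigenvalue is computed, not a full eigendecomposition:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 100
A = sp.random(n, n, density=0.05, random_state=0)
W = (A + A.T) / 2                                    # symmetric non-negative weights
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W  # combinatorial Laplacian

# Only the largest eigenvalue is needed: no full EVD of L.
lam_max = float(eigsh(L, k=1, return_eigenvectors=False)[0])
L_tilde = (2.0 / lam_max) * L - sp.identity(n)       # spectrum mapped into [-1, 1]
```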
Spectral Filtering: Recursive Polynomial Parametrization

Filtering:

  y = g_θ(L̃) x = Σ_{k=0}^{K−1} θ_k T_k(L̃) x = Σ_{k=0}^{K−1} θ_k x̄_k

Recurrence: x̄_k = T_k(L̃) x = 2L̃ x̄_{k−1} − x̄_{k−2}
  x̄_0 = x
  x̄_1 = L̃ x

- K-localized: one hop per application of L
- K parameters to train, independent of the graph's size
- O(Kn) multiplications (actually O(K|E|))
- No EVD of L
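A minimal sketch of this recurrence; the 3-node path graph and the coefficients θ at the bottom are assumptions for illustration (λ_max = 3 for that graph):

```python
import numpy as np

def cheb_filter(L_tilde, x, theta):
    """y = sum_{k<K} theta_k T_k(L_tilde) x via the two-term recurrence."""
    x_prev, y = x, theta[0] * x                       # x_bar_0 = x
    if len(theta) > 1:
        x_curr = L_tilde @ x                          # x_bar_1 = L~ x
        y = y + theta[1] * x_curr
        for t in theta[2:]:
            x_next = 2 * (L_tilde @ x_curr) - x_prev  # x_bar_k = 2 L~ x_bar_{k-1} - x_bar_{k-2}
            y = y + t * x_next
            x_prev, x_curr = x_curr, x_next
    return y
# Each step is one mat-vec with L~, so a sparse L~ gives O(K|E|) filtering.

# Toy usage on a dense 3-node path graph:
L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
L_tilde = 2 * L / 3.0 - np.eye(3)
y = cheb_filter(L_tilde, np.array([1., 0., 0.]), theta=np.array([0.5, 0.3, 0.2]))
```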
Graph CNN
Graph CNN: Learning Filters

  y_{s,j} = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_{s,i}

  ∂L/∂θ_{i,j} = Σ_{s=1}^{S} [x̄_{s,i,0}, …, x̄_{s,i,K−1}]^T ∂L/∂y_{s,j}

  ∂L/∂x_{s,i} = Σ_{j=1}^{F_out} g_{θ_{i,j}}(L) ∂L/∂y_{s,j}

s = 1, …, S - sample index
i = 1, …, F_in - input feature map index
j = 1, …, F_out - output feature map index
θ_{i,j} - one of the F_in × F_out Chebyshev coefficient vectors, each of order K
L - loss over a mini-batch of S samples

Cost: O(K |E| F_in F_out S). Easily parallelized (a NumPy sketch follows).
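A dense NumPy sketch of the forward pass y_{s,j} = Σ_i g_{θ_{i,j}}(L̃) x_{s,i}; the random symmetric matrix standing in for L̃ and all shapes are illustrative assumptions:

```python
import numpy as np

def gconv_forward(L_tilde, X, Theta):
    """X: (S, n, F_in) signals, Theta: (F_in, F_out, K) -> Y: (S, n, F_out)."""
    K = Theta.shape[-1]
    lap = lambda Z: np.einsum('mn,snf->smf', L_tilde, Z)  # apply L~ to every feature map
    Xb = [X]                                              # x_bar_0
    if K > 1:
        Xb.append(lap(X))                                 # x_bar_1
    for _ in range(2, K):
        Xb.append(2 * lap(Xb[-1]) - Xb[-2])               # Chebyshev recurrence
    Xb = np.stack(Xb, axis=-1)                            # (S, n, F_in, K)
    # y_{s,j} = sum_i g_{theta_{i,j}}(L~) x_{s,i}, all (i, j) pairs at once
    return np.einsum('snik,ijk->snj', Xb, Theta)

# Toy shapes: S=4 samples, n=10 vertices, F_in=2 -> F_out=3, K=5.
rng = np.random.default_rng(0)
L_tilde = rng.standard_normal((10, 10)); L_tilde = (L_tilde + L_tilde.T) / 2
Y = gconv_forward(L_tilde, rng.standard_normal((4, 10, 2)),
                  rng.standard_normal((2, 3, 5)))
assert Y.shape == (4, 10, 3)
```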
Graph Pooling: Coarsening

- Multilevel clustering algorithm
- Reduces the size of the graph by a specified factor (2)
- Does all of this efficiently

Graclus multilevel clustering algorithm (a toy sketch follows below):

- Maximizes the local normalized cut.
- Greedily picks an unmarked vertex i and matches it with an unmatched vertex j that maximizes the local normalized cut W_{i,j}(1/d_i + 1/d_j).
- Extremely fast.
- Divides the number of nodes by approximately 2.
- Might generate singleton (non-matched) nodes; solved by adding fake nodes.
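A toy sketch of one greedy matching pass in this spirit; the dense random weight matrix is a placeholder, and real Graclus runs this step inside a multilevel scheme:

```python
import numpy as np

def greedy_match(W, rng):
    """One Graclus-style coarsening pass on a dense weight matrix W."""
    n = W.shape[0]
    d = W.sum(axis=1)
    matched = np.zeros(n, dtype=bool)
    clusters = []
    for i in rng.permutation(n):
        if matched[i]:
            continue
        matched[i] = True
        # Score unmatched neighbours by the local normalized cut W_ij (1/d_i + 1/d_j).
        cand = [(W[i, j] * (1 / d[i] + 1 / d[j]), j)
                for j in range(n) if not matched[j] and W[i, j] > 0]
        if cand:
            j = max(cand)[1]
            matched[j] = True
            clusters.append((i, j))   # matched pair -> one coarse node
        else:
            clusters.append((i,))     # singleton, later paired with a fake node
    return clusters

rng = np.random.default_rng(0)
W = rng.random((8, 8)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
print(greedy_match(W, rng))           # roughly halves the node count
```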
Graph Pooling: Fast Pooling

Example: pooling by 4

- Graclus generates singletons: n_0 = 8 → n_1 = 5 → n_2 = 3
- By adding fake nodes we get: n_2 = 3 → n_1 = 6 → n_0 = 12
- z = [max(x_0, x_1), max(x_4, x_5, x_6), max(x_8, x_9, x_10)]
- Balanced binary trees ⇒ efficient on GPUs (sketch below).
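A sketch of why the reordering makes pooling cheap: with siblings adjacent and fake nodes padded with a neutral value (−∞ is one choice; the signal values below are invented), pooling by 4 becomes a strided 1-D max:

```python
import numpy as np

NEG = -np.inf  # neutral element for max: fake nodes never win
# 12 ordered slots matching the slide's example (n_0 = 12 after padding):
x = np.array([0.3, 1.2, NEG, NEG,   # x_0, x_1, two fake nodes
              0.5, 2.0, 0.7, NEG,   # x_4, x_5, x_6, one fake node
              1.1, 0.2, 0.9, NEG])  # x_8, x_9, x_10, one fake node

z = x.reshape(-1, 4).max(axis=1)    # pooling by 4 == 1-D max pool, stride 4
# z == [max(x_0, x_1), max(x_4, x_5, x_6), max(x_8, x_9, x_10)]
```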
Experiments: MNIST

  W_{i,j} = exp(−‖x_i − x_j‖₂² / σ²)

- 28×28 pixels + 192 fake nodes ⇒ n = |V| = 976
- 8-NN graph ⇒ |E| = 3,198 (construction sketched below)
- Based on LeNet-5 ⇒ K = 5
- Isotropic filters (no orientation)
- Uninvestigated optimizations and initializations
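A possible construction of such a weighted 8-NN graph over the pixel grid with scikit-learn; the σ choice and the omission of the 192 fake nodes are simplifications:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Vertices are the 28x28 pixel coordinates (the 192 fake nodes are omitted here).
coords = np.array([(r, c) for r in range(28) for c in range(28)], dtype=float)

dist = kneighbors_graph(coords, n_neighbors=8, mode='distance')  # sparse, 784 x 784
sigma2 = dist.data.mean() ** 2                  # an arbitrary bandwidth choice
dist.data = np.exp(-dist.data ** 2 / sigma2)    # W_ij = exp(-||x_i - x_j||^2 / sigma^2)
W = dist.maximum(dist.T)                        # symmetrize the kNN relation
```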
Experiments: 20NEWS

- 18,846 text documents associated with 20 classes
- 10K most common words out of the 94K unique words ⇒ n = |V| = 10,000
- 16-NN graph with W_{i,j} = exp(−‖z_i − z_j‖₂² / σ²), where z_i is the word2vec embedding of word i ⇒ |E| = 132,834 (see the sketch below)
- x - bag-of-words model
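A hypothetical version of this setup; the two-document corpus and the random embeddings standing in for word2vec are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import kneighbors_graph

docs = ["graph convolution on text", "spectral filters on graphs"]  # placeholder corpus
bow = CountVectorizer(max_features=10_000).fit_transform(docs)      # x: bag-of-words signals

n_words = bow.shape[1]
z = np.random.default_rng(0).standard_normal((n_words, 32))         # stand-in for word2vec
dist = kneighbors_graph(z, n_neighbors=min(16, n_words - 1), mode='distance')
dist.data = np.exp(-dist.data ** 2 / dist.data.mean() ** 2)         # Gaussian weights
W = dist.maximum(dist.T)                                            # one vertex per word
```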
Experiments: 20NEWS

Comparison with other methods (K = 5):

- Slightly worse than a multinomial naive Bayes classifier.
- Beats fully-connected networks with far fewer parameters.

Total training time divided by the number of gradient steps:

- Scales as O(n), as opposed to O(n²).
Experiments: Graph Quality

⇒ The structure of the data is important.
⇒ A well-constructed graph is important.
⇒ Proper approximations (e.g., LSHForest) can be used for larger databases.
Conclusions and Future Work

- Introduced a model with linear complexity.
- The quality of the input graph is of paramount importance.
- Local stationarity and compositionality are verified for text documents, as long as the graph is well constructed.