CSE 6240: Web Search and Text Mining. Spring 2020
Graph Neural Networks
Prof. Srijan Kumar, http://cc.gatech.edu/~srijan
Today’s Lecture
• Introduction to deep graph embeddings
• Graph convolution networks
• GraphSAGE
Goal: Node Embeddings
Goal: $\text{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$ (the similarity function needs to be defined!)

[Figure: map nodes from the input network into a d-dimensional embedding space.]
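To make the goal concrete, here is a tiny NumPy sketch (the embedding values are made up for illustration): the dot product of two embedding vectors acts as the decoder that should approximate node similarity in the original graph.

```python
import numpy as np

# Hypothetical learned embeddings for nodes u and v in d = 4 dimensions.
z_u = np.array([0.1, 0.9, 0.3, 0.5])
z_v = np.array([0.2, 0.8, 0.4, 0.4])

# The decoder: similarity(u, v) is approximated by the dot product z_v^T z_u.
approx_similarity = z_v @ z_u
print(approx_similarity)  # should be high if u and v are "similar" in the graph
```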
Deep Graph Encoders
• Encoder: map a node to a low-dimensional vector: $\text{enc}(v) = z_v$
• Deep encoder methods are based on graph neural networks: $\text{enc}(v)$ = multiple layers of non-linear transformations of the graph structure
• The graph encoder idea is inspired by CNNs on images
[Figure: a single CNN layer with a 3x3 filter on an image grid, and the analogous operation on a graph. Animation by Vincent Dumoulin; slide credit: Thomas Kipf, "End-to-end learning on graphs with GCNs".]
Idea from Convolutional Networks
• In a CNN, a pixel's representation is created by transforming the representations of its neighboring pixels
– In a GNN, a node's representation is created by transforming the representations of its neighboring nodes
• But graphs are irregular, unlike images
– So we generalize convolutions beyond simple lattices, and leverage node features/attributes
• Solution: deep graph encoders
Deep Graph Encoders
• Once an encoder is defined, multiple layers of encoders can be stacked
• Output: node embeddings; the same machinery can also embed larger network structures, subgraphs, and whole graphs
Graph Encoder: A Naïve Approach
• Join the adjacency matrix and the node features, and feed them into a deep neural network
• Issues with this idea:
– 𝑂(|𝑉|) parameters
– Not applicable to graphs of different sizes
– Not invariant to node ordering
[Figure: a 5-node graph (A-E); its adjacency matrix is concatenated with the node features and fed into a fully connected neural network. Slide credit: Thomas Kipf, "End-to-end learning on graphs with GCNs".]

The concatenated input [A, X] (adjacency columns A-E, then two feature columns):

A: 0 1 1 1 0 | 1 0
B: 1 0 0 1 1 | 0 0
C: 1 0 0 1 0 | 0 1
D: 1 1 1 0 1 | 1 1
E: 0 1 0 1 0 | 1 0

Steps on the slide: take the adjacency matrix and the feature matrix, concatenate them, feed them into a deep (fully connected) neural net. Done? Problems: a huge number of parameters, and no inductive learning is possible.
Graph Encoders: Two Instantiations
1. Graph convolution networks (GCN): one of the first frameworks to learn node embeddings in an end-to-end manner
– Different from random walk methods, which are not end-to-end
2. GraphSAGE: generalizes GCNs to various neighborhood aggregations
Today’s Lecture
• Introduction to deep graph embeddings
• Graph convolution networks (GCN)
• GraphSAGE
Main paper: “Semi-Supervised Classification with Graph Convolutional Networks”, Kipf and Welling, ICLR 2017
Content
• Local network neighborhoods:
– Describe aggregation strategies
– Define computation graphs
• Stacking multiple layers:
– Describe the model, parameters, and training
– How to fit the model?
– A simple example of unsupervised and supervised training
Setup
• Assume we have a graph 𝐺:
– 𝑉 is the vertex set
– 𝑨 is the adjacency matrix (assume binary)
– $X \in \mathbb{R}^{m \times |V|}$ is a matrix of node features
• Node features, e.g.:
– Social networks: user profile, user image
– Biological networks: gene expression profiles
– If there are no features, use:
» Indicator vectors (one-hot encoding of a node)
» Vector of constant 1: [1, 1, …, 1]
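As a small illustration (the toy graph below is an assumption, not from the slides), the setup objects might look like this in NumPy, including both fallbacks for the no-feature case:

```python
import numpy as np

# Toy graph with |V| = 4 nodes and a binary adjacency matrix A.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

num_nodes = A.shape[0]

# If no node features exist, fall back to one-hot indicator vectors
# (each column of X identifies one node), or one constant-1 feature per node.
X_onehot = np.eye(num_nodes)        # m = |V| one-hot features
X_const = np.ones((1, num_nodes))   # m = 1 constant feature
```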
Graph Convolutional Networks
• Idea: generate node embeddings based on local network neighborhoods
– A node's neighborhood defines its computation graph
• Learn how to aggregate information from the neighborhood to produce node embeddings
– Transform information from the neighbors and combine it:
• Transform "messages" $h_i$ from the neighbors ($W_i h_i$) and add them up ($\sum_i W_i h_i$)
Idea: Aggregate Neighbors
• Intuition: Generate node embeddings based on local network neighborhoods
• Nodes aggregate information from their neighbors using neural networks
[Figure: the input graph and the computation graph for target node A; each of A's neighbors (B, C, D) in turn aggregates information from its own neighbors via small neural networks.]
Idea: Aggregate Neighbors
• Intuition: The network neighborhood defines a computation graph
• Every node defines a computation graph based on its neighborhood
Deep Model: Many Layers
• The model can be of arbitrary depth:
– Nodes have embeddings at each layer
– The layer-0 embedding of node 𝒖 is its input feature vector 𝒙𝒖
– The layer-K embedding gets information from nodes that are at most K hops away
[Figure: two-layer computation graph for target node A. Layer-0 holds the input features $x_A, x_B, x_C, x_E, x_F$; layer-1 aggregates them; layer-2 outputs the embedding of A.]
Neighborhood Aggregation
• The key distinctions among approaches lie in how they aggregate information across the layers
[Figure: computation graph for target node A with each aggregation operator drawn as an empty box. What is in the box?]
Neighborhood Aggregation
• Basic approach: Average information from the neighbors and apply a neural network
[Figure: computation graph for target node A where each box (1) averages the messages from the neighbors and (2) applies a neural network.]
The Math: Deep Encoder
• Basic approach: Average neighbor messages and apply a neural network
– Note: Apply L2 normalization to each node embedding at every layer
$$h_v^0 = x_v \qquad \text{(initial layer-0 embeddings equal the node features)}$$

$$h_v^k = \sigma\Big(W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k\, h_v^{k-1}\Big), \quad \forall k \in \{1, \ldots, K\}$$

$$z_v = h_v^K \qquad \text{(embedding after } K \text{ layers of neighborhood aggregation)}$$

Here $\sigma$ is a non-linearity (e.g., ReLU), the sum term is the average of the neighbors' previous-layer embeddings, and $h_v^{k-1}$ is the previous-layer embedding of $v$ itself.
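A minimal NumPy sketch of this update rule, using a made-up toy graph and random weights (and a row-per-node convention, so the weight matrices multiply on the right):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def gcn_embeddings(A, X, Ws, Bs):
    """Compute z_v = h_v^K for all nodes.
    A: (n, n) binary adjacency; X: (n, m) node features (one row per node);
    Ws, Bs: lists of K weight matrices W_k and B_k."""
    H = X.astype(float)                      # h_v^0 = x_v
    for W, B in zip(Ws, Bs):                 # k = 1, ..., K
        deg = A.sum(axis=1, keepdims=True)   # |N(v)| for each node
        neigh_avg = (A @ H) / deg            # sum_{u in N(v)} h_u^{k-1} / |N(v)|
        H = relu(neigh_avg @ W + H @ B)      # sigma(W_k * avg + B_k * h_v^{k-1})
    return H                                 # z_v = h_v^K

# Toy example: 4 nodes, 3 input features, two layers (hidden size 5, output 2).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
X = rng.normal(size=(4, 3))
Ws = [rng.normal(size=(3, 5)), rng.normal(size=(5, 2))]
Bs = [rng.normal(size=(3, 5)), rng.normal(size=(5, 2))]
Z = gcn_embeddings(A, X, Ws, Bs)   # (4, 2) node embeddings
```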
GCN: Matrix Form
• $H^{(l)}$ is the matrix of node representations in the $l$-th layer
• $W_0^{(l)}$ and $W_1^{(l)}$ are the matrices to be learned for each layer
• $A$ = adjacency matrix, $D$ = diagonal degree matrix
• GCN rewritten in matrix form:

$$H^{(l+1)} = \sigma\big(H^{(l)} W_0^{(l)} + \tilde{A} H^{(l)} W_1^{(l)}\big), \qquad \tilde{A} = D^{-1/2} A D^{-1/2}$$
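A sketch of one layer in this matrix form (shapes and data are assumptions; this mirrors the equation above, not a particular library's API):

```python
import numpy as np

def gcn_layer_matrix(A, H, W0, W1):
    """One GCN layer in matrix form:
    H_next = sigma(H @ W0 + A_norm @ H @ W1),
    where A_norm = D^{-1/2} A D^{-1/2} is the degree-normalized adjacency.
    A: (n, n) adjacency; H: (n, d_in); W0, W1: (d_in, d_out)."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt
    return np.maximum(0, H @ W0 + A_norm @ H @ W1)   # ReLU non-linearity
```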
Training the Model
• How do we train the model?
– We need to define a loss function on the embeddings
Model Parameters
• We can feed these embeddings into any loss function and run stochastic gradient descent to train the weight parameters
– Once we have the weight matrices, we can calculate the node embeddings
Trainable weight matrices (i.e., what we learn): $W_k$ and $B_k$

$$h_v^0 = x_v$$

$$h_v^k = \sigma\Big(W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k\, h_v^{k-1}\Big), \quad \forall k \in \{1, \ldots, K\}$$

$$z_v = h_v^K$$
Unsupervised Training
• Training can be unsupervised or supervised
• Unsupervised training:
– Uses only the graph structure: "similar" nodes should have similar embeddings
– A common unsupervised loss is edge existence: embeddings of connected nodes should score higher than those of unconnected nodes (see the sketch below)
• The unsupervised loss can be anything from the previous section, e.g., a loss based on:
– Node proximity in the graph
– Random walks
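A hedged sketch of one possible edge-existence loss (binary cross-entropy on embedding dot products, with sampled non-edges as negatives; the exact formulation varies across papers):

```python
import numpy as np

def edge_existence_loss(Z, pos_edges, neg_edges):
    """Binary cross-entropy on edge existence.
    Z: (n, d) node embeddings; pos_edges/neg_edges: lists of (u, v) pairs."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    loss = 0.0
    for u, v in pos_edges:                   # connected nodes: push score up
        loss -= np.log(sigmoid(Z[u] @ Z[v]))
    for u, v in neg_edges:                   # sampled non-edges: push score down
        loss -= np.log(1.0 - sigmoid(Z[u] @ Z[v]))
    return loss / (len(pos_edges) + len(neg_edges))
```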
Supervised Training
• Train the model directly for a supervised task (e.g., node classification: is a node normal or anomalous?)
• Two ways to define the loss:
– Total loss = supervised loss
– Total loss = supervised loss + unsupervised loss
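A sketch of the supervised piece, assuming embeddings Z produced by the encoder and a hypothetical linear classifier W_clf on top (softmax cross-entropy over node classes):

```python
import numpy as np

def supervised_node_loss(Z, labels, W_clf):
    """Softmax cross-entropy over node classes.
    Z: (n, d) embeddings; labels: (n,) integer class ids; W_clf: (d, c)."""
    logits = Z @ W_clf
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()
```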
Model Design: Overview
(1) Define a neighborhood aggregation function
(2) Define a loss function on the embeddings
Model Design: Overview
(3) Train on a set of nodes, i.e., a batch of compute graphs
Model Design: Overview
(4) Generate embeddings for nodes as needed
Even for nodes we never trained on!
GCN: Inductive Capability
• The same aggregation parameters are shared across all nodes:
– The number of model parameters is sublinear in |𝑉|, so we can generalize to unseen nodes
[Figure: compute graphs for nodes A and B in the same input graph; the parameters $W_k$ and $B_k$ are shared between them.]
Inductive Capability: New Nodes
[Figure: train with a snapshot of the graph; when a new node arrives, generate its embedding $z_u$ on the fly.]
• Many application settings constantly encounter previously unseen nodes, e.g., Reddit, YouTube, Google Scholar
• We need to generate new embeddings "on the fly"
Inductive Capability: New Graphs
• Inductive node embeddings generalize to entirely unseen graphs
• E.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B

[Figure: train on one graph, then generalize to a new graph and produce embeddings $z_u$ for its nodes.]
Summary So Far
• Recap: Generate node embeddings by aggregating neighborhood information
– We saw a basic variant of this idea
– Key distinctions lie in how different approaches aggregate information across the layers
• Next: the GraphSAGE graph neural network architecture
Today’s Lecture
• Introduction to deep graph embeddings
• Graph convolution networks
• GraphSAGE
• Main paper: Inductive Representation Learning on Large Graphs. William L. Hamilton, Rex Ying, Jure Leskovec. NeurIPS 2017.
GraphSAGE Idea
• In GCN, we aggregated the neighbors' messages as the (weighted) average of all neighbors. How can we generalize this?
[Figure: computation graph for target node A with each aggregation operator drawn as an empty box.]

[Hamilton et al., NeurIPS 2017]
GraphSAGE Idea
[Figure: computation graph for target node A; each box applies a generalized aggregation.]

$$h_v^k = \sigma\Big(\big[\,W_k \cdot \text{AGG}\big(\{h_u^{k-1}, \forall u \in N(v)\}\big),\; B_k\, h_v^{k-1}\,\big]\Big)$$

Here AGG can be any differentiable function that maps the set of vectors $\{h_u^{k-1}, \forall u \in N(v)\}$ to a single vector.
Neighborhood Aggregation
• Simple neighborhood aggregation:

$$h_v^k = \sigma\Big(W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k\, h_v^{k-1}\Big)$$

• GraphSAGE: concatenate the generalized neighbor aggregation with the self embedding:

$$h_v^k = \sigma\Big(\big[\,W_k \cdot \text{AGG}\big(\{h_u^{k-1}, \forall u \in N(v)\}\big),\; B_k\, h_v^{k-1}\,\big]\Big)$$
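A minimal sketch of one GraphSAGE layer with mean aggregation (toy shapes, illustrative only); note that the neighbor part and the self part are concatenated rather than summed, so the output dimension doubles:

```python
import numpy as np

def graphsage_layer(A, H, W, B):
    """One GraphSAGE layer with mean aggregation:
    h_v^k = ReLU([W-transformed mean of neighbors, B-transformed self]).
    A: (n, n) adjacency; H: (n, d_in); W, B: (d_in, d_out) each;
    output: (n, 2 * d_out) due to the concatenation."""
    deg = A.sum(axis=1, keepdims=True)
    neigh_mean = (A @ H) / deg                 # AGG = mean of neighbor vectors
    neigh_part = neigh_mean @ W
    self_part = H @ B
    return np.maximum(0, np.concatenate([neigh_part, self_part], axis=1))
```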
Neighbor Aggregation: Variants
• Mean: Take a weighted average of the neighbors:

$$\text{agg} = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|}$$

• Pool: Transform the neighbor vectors and apply a symmetric vector function $\gamma$, e.g., element-wise mean or max:

$$\text{agg} = \gamma\big(\{Q\, h_u^{k-1}, \forall u \in N(v)\}\big)$$

• LSTM: Apply an LSTM to a random permutation $\pi$ of the neighbors:

$$\text{agg} = \text{LSTM}\big([h_u^{k-1}, \forall u \in \pi(N(v))]\big)$$
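Hedged sketches of the mean and pool aggregators (the LSTM variant needs a recurrent network and is omitted); here Q and the element-wise max are stand-ins for the learned transform and the symmetric function:

```python
import numpy as np

def agg_mean(neighbor_H):
    """Mean aggregator: average of the neighbors' previous-layer embeddings."""
    return neighbor_H.mean(axis=0)

def agg_pool(neighbor_H, Q):
    """Pool aggregator: transform each neighbor vector by Q (with a ReLU),
    then apply a symmetric function (element-wise max over neighbors)."""
    return np.maximum(0, neighbor_H @ Q).max(axis=0)

# Usage: neighbor_H stacks h_u^{k-1} for all u in N(v), one row per neighbor.
rng = np.random.default_rng(1)
neighbor_H = rng.normal(size=(3, 4))   # 3 neighbors, d = 4
Q = rng.normal(size=(4, 4))
print(agg_mean(neighbor_H), agg_pool(neighbor_H, Q))
```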
Experiments: Datasets
• Dynamic datasets:
– Citation network: predict the paper category
• Data from 2000-2005
• 302,424 nodes
• Train: data through 2004; test: 2005 data
– Reddit post network: predict the subreddit of a post
• Nodes = posts
• Edges connect posts on which the same user has commented
• 232,965 posts
• Train: 20 days of data; test: the next 10 days
Experiments: Results

[Results table from Hamilton et al., NeurIPS 2017: GraphSAGE aggregator variants compared against baseline embedding methods on the citation and Reddit datasets.]
Summary: GCN and GraphSAGE
• Key idea: Generate node embeddings based on local neighborhoods
– Nodes aggregate "messages" from their neighbors using neural networks
• Graph convolutional networks:
– Basic variant: average neighborhood information and stack neural network layers
• GraphSAGE:
– Generalized neighborhood aggregation
Today’s Lecture
• Introduction to deep graph embeddings
• Graph convolution networks
• GraphSAGE