Machine Learning on Sequences
I GNNs are architectures specialized in learning from data defined over graph supports
I Several processes have a sequential nature. To learn from them, we need dedicated architectures
Machine Learning on Sequences
I Often, we want to learn properties of a sequence ⇒ Is the particle entering the forbidden area?
[Figure: a sampled particle trajectory x1, x2, . . . , x8 approaching a forbidden area]
I This problem is not just a simple sequence of classifications ⇒ yt = φ(xt)
I It is a (sequence of) classifications of a sequence ⇒ yt = φ(x1:t) = φ(xt , xt−1, . . . , x1)
Unbounded Memory Growth
I Predictions on a sequence depend on observation histories ⇒ yt = Φ(xt , xt−1, . . . , x1)
[Diagram: at each step t, the prediction yt depends on the entire growing history x1, . . . , xt]
I Recurrent neural networks (RNNs) estimate a hidden state to avoid this unbounded memory growth
Markov Random Processes
I A stochastic process (random sequence) is said to be Markov or memoryless if
p(xt+1 | x1:t) = p(xt+1 | xt)
I It is the same to condition on the current value xt or on the whole trajectory x1:t
⇒ The future, given the present, is independent of the past
⇒ For predicting the future, knowledge of the past is irrelevant
I Outputs (e.g., trajectory categories) are conditionally independent ⇒ p(yt | xt) = p(yt | x1:t)
Learning in a Markov Process
I In a memoryless Markov process, learning is equivalent to a sequence of learning problems
I State evolution is a chain of memoryless transitions, and outputs depend on the current state only
[Diagram: Markov chain xt → xt+1 with transition p(xt+1 | xt); outputs yt and yt+1 are drawn from p(yt | xt) and p(yt+1 | xt+1), which the AI replaces with learned maps Φ(xt)]
I An AI that predicts yt mimics the conditional distribution of the observations. The past is irrelevant
Recurrent Neural Networks
I Machine Learning on stochastic processes that are not Markov ⇒ The past is relevant in learning
All Processes are Markov, However Disguised
I The evolution of the trajectory is not a Markov Process if we observe positions only
I But it is Markov if we have access to velocities and accelerations ⇒ Hidden (unobserved) states
[Figure: observed positions x1, . . . , x8 along the trajectory, with unobserved velocities and accelerations (v1, u1), . . . , (v8, u8) underneath]
I All systems are Markov ⇒ We often lack enough information to observe their Markov structure
Hidden Markov Model
I Stochastic process xt follows a hidden Markov model if there exists a process zt such that
p(zt+1 | z1:t) = p(zt+1 | zt)      and      p(xt | zt) = p(xt | z1:t)
I The hidden state zt is a memoryless Markov stochastic process
I The observed state xt is conditionally independent of the past. It depends only on the current hidden state zt
I Outputs are also conditionally independent ⇒ p(yt | zt) = p(yt | z1:t)
Machine Learning on Hidden Markov Models
I In a hidden Markov model, learning is not equivalent to a sequence of learning problems
I The AI can try to mimic the conditional distribution p(yt | zt). But we don’t have access to zt
[Diagram: hidden Markov chain zt → zt+1 with transition p(zt+1 | zt); outputs yt ∼ p(yt | zt) and observations xt ∼ p(xt | zt); the AI replaces the output distribution with a learned map Φ(zt)]
I Recurrent Neural Network (RNN) ⇒ Use observed state xt to estimate hidden state zt
Recurrent Neural Networks
I A recurrent neural network is made up of two separate learning parametrizations
⇒ Φ1(xt , zt−1) ⇒ maps observed state xt and hidden state zt−1 to the hidden state update zt
⇒ Φ2(zt) ⇒ maps the updated hidden state zt to the output estimate yt
[Block diagram: xt and the fed-back state zt−1 enter Φ1(xt , zt−1), which produces zt ; zt then enters Φ2(zt), which produces yt]
I It is a recurrent neural network because hidden states are fed back as inputs for the next time step
Hidden State Update AI
I Use a perceptron for the AI that updates the hidden state ⇒ Φ1(xt , zt−1) = σ(Axt + Bzt−1)
[Block diagram: the hidden state update is the perceptron σ(Axt + Bzt−1), producing zt ; the output AI Φ2(zt) produces yt]
I Number of learnable parameters ≡ Entries of A and B ⇒ Does not depend on the time index t
Output Prediction AI
I Use another perceptron for the AI that predicts the output ⇒ Φ2(zt) = σ(Czt)
[Block diagram: the state update perceptron σ(Axt + Bzt−1) produces zt ; the output perceptron σ(Czt) produces yt]
I We can also use a multi-layer neural network for the output prediction AI ⇒ Φ2(zt) = Φ(zt ; H)
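I A minimal NumPy sketch of this recurrence (the dimensions, random weights, and the choice of σ as a sigmoid are illustrative assumptions, not the lecture's):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative dimensions and random weights (not from the lecture).
rng = np.random.default_rng(0)
n, h, m = 4, 8, 2                      # input, hidden, and output dimensions
A = 0.1 * rng.normal(size=(h, n))      # input-to-state weights
B = 0.1 * rng.normal(size=(h, h))      # state-to-state (recurrent) weights
C = 0.1 * rng.normal(size=(m, h))      # state-to-output weights

def rnn_step(x_t, z_prev):
    """One recurrence: Phi1 updates the hidden state, Phi2 predicts the output."""
    z_t = sigmoid(A @ x_t + B @ z_prev)    # zt = sigma(A xt + B zt-1)
    y_t = sigmoid(C @ z_t)                 # yt = sigma(C zt)
    return z_t, y_t

# The parameter count (entries of A, B, C) is fixed regardless of sequence length.
z = np.zeros(h)
for x in rng.normal(size=(10, n)):
    z, y = rnn_step(x, z)
```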
Time Gating
I We discuss the problem of vanishing/exploding gradients in recurrent neural networks
I We introduce gating mechanisms in the form of long short-term memories and gated recurrent units
Vanishing/Exploding Gradients for Long Term Dependencies
I In some tasks, the RNN may have to learn how to model long term dependencies of length T
I This poses a challenge ⇒ the Jacobian ∂zT/∂B will depend on a chain of multiplications by B
[Diagram: unrolled recurrence from xt+T−1 back through zt+T−1; the dependency chain multiplies by B a total of T times, i.e., by B^T]
I If the eigenvalues of B ≪ 1, the gradients tend to vanish, shrinking exponentially with the dependency length
I If the eigenvalues of B ≫ 1, the gradients tend to explode, growing exponentially with the dependency length
Vanishing/Exploding Gradients for Long Term Dependencies (Example)
I Consider a simplification of the RNN where we omit the nonlinear function σ(·) and the inputs xt
zt = Bzt−1
I At time t = T , the state variable zT depends on the T-th power of the matrix B
zT = B^T z0
I If B admits an eigendecomposition B = QΛQ⊤, the recurrence can be rewritten as
zT = QΛ^T Q⊤z0
⇒ Eigenvalues less than one will vanish and eigenvalues greater than one will explode
⇒ Any component of z0 not aligned with the largest eigenvalues will be discarded
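I A quick numerical check of this effect (the diagonal B and its eigenvalues 0.9 and 1.1 are illustrative assumptions):

```python
import numpy as np

# Diagonal B makes the eigenvalues explicit: one below 1, one above 1.
B = np.diag([0.9, 1.1])
z0 = np.array([1.0, 1.0])

for T in (10, 50, 100):
    zT = np.linalg.matrix_power(B, T) @ z0   # zT = B^T z0
    print(T, zT)

# The 0.9-eigenvalue component vanishes (0.9^100 ≈ 2.7e-5)
# while the 1.1-eigenvalue component explodes (1.1^100 ≈ 1.4e4).
```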
Gating Mechanism
I To address the issue of vanishing gradients, we add a gating mechanism to RNNs
I Gates are scalars in [0, 1] acting on the current input and on the previous state
⇒ Control how much of the input and past time information should be taken into account
I The value of each gate is updated at every step of the sequence
⇒ Allows creating paths through time with derivatives that neither vanish nor explode
⇒ Creates dependency paths that allow encoding both short and long term dependencies
Long Short-Term Memory (LSTM)
I The most popular gated RNN architecture is the Long Short-Term Memory (LSTM) cell
I Three gates: a forget gate ft ∈ [0, 1], an input gate gt ∈ [0, 1], and a cell output gate qt ∈ [0, 1]
I Let xt be the input, zt the state, and define the internal memory st of the LSTM cell
I Memory st updated by applying the forget gate to st−1 and the input gate to the state update
st = ft st−1 + gt σ(Axt + Bzt−1)
I State zt updated by applying the cell output gate to the internal cell memory st
zt = qtσ(st)
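I A sketch of one LSTM step following the equations above, with the gates assumed given (their computation is covered on the gate computation slide); names and the sigmoid state nonlinearity are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, z_prev, s_prev, f_t, g_t, q_t, A, B):
    """One step of the LSTM cell as written above. The scalar gates
    f_t, g_t, q_t in [0, 1] are assumed given here; they run their own
    RNNs, sketched later."""
    s_t = f_t * s_prev + g_t * sigmoid(A @ x_t + B @ z_prev)   # memory update
    z_t = q_t * sigmoid(s_t)                                   # state update
    return z_t, s_t
```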
Gated Recurrent Unit (GRU)
I The Gated Recurrent Unit (GRU) is a second popular gated version of the RNN
I Slight variation of LSTM ⇒ single gate ut ∈ [0, 1] plays the role of input and forget gates
zt = ut zt−1 + (1 − ut) σ(Axt + rt Bzt−1)
⇒ Reset gate rt ∈ [0, 1] controls contribution of previous state zt−1 to updated state
I Besides the LSTM and the GRU, many more variants of gating mechanisms for RNNs exist
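I A matching sketch of one GRU step (again with the gates ut and rt assumed given):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, z_prev, u_t, r_t, A, B):
    """One GRU step as written above: u_t plays the role of both input and
    forget gates, and r_t scales the contribution of the previous state."""
    candidate = sigmoid(A @ x_t + r_t * (B @ z_prev))
    return u_t * z_prev + (1.0 - u_t) * candidate
```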
Gate Computation
I In the LSTM and the GRU, the gates themselves are calculated as the outputs of RNNs
I For example, the forget gate ft of the LSTM has its own state variable z′t (as do all the other gates)
z′t = σ(A′xt + B′z′t−1)
I The forget gate ft is then calculated from the input xt and the state z′t as
ft = sigmoid(Uxt + Wz′t)
⇒ With U and W linear layers mapping the input and state features to a single scalar
⇒ And the sigmoid activation function ensuring gate values in the [0, 1] interval
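I A sketch of this forget gate computation; the parameter names Ap, Bp, U, W stand in for A′, B′, U, W and are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forget_gate(x_t, zp_prev, Ap, Bp, U, W):
    """The forget gate runs its own RNN with private state z't (zp_prev here).
    U and W have a single output row, so U @ x_t + W @ zp_t is one scalar,
    and the sigmoid squashes it into [0, 1]."""
    zp_t = sigmoid(Ap @ x_t + Bp @ zp_prev)      # z't = sigma(A' xt + B' z't-1)
    f_t = sigmoid(U @ x_t + W @ zp_t).item()     # ft = sigmoid(U xt + W z't)
    return f_t, zp_t
```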
Graph Recurrent Neural Networks
I We define Graph Recurrent Neural Networks (GRNNs) as particular cases of RNNs
From RNNs to GRNNs
I Consider a time varying process xt in which each of the signals is supported on a shift operator S
[Figure: successive graph signals xt−2, xt−1, xt supported on the same graph]
I A graph recurrent neural network (GRNN) combines
⇒ A GNN because xt is supported on a graph. ⇒ An RNN because xt is a sequence
A Recurrent Neural Network for Graph Signals
I An RNN has a hidden state zt updated with the perceptron ⇒ zt = σ(Axt + Bzt−1)
I And it has an output prediction yt given by the perceptron ⇒ yt = σ(Czt)
[Block diagram: Axt and Bzt−1 are summed and passed through σ(·) to produce zt ; Czt is passed through σ(·) to produce yt]
I The observed state xt and the output yt are graph signals supported on the graph shift operator S
⇒ The hidden state zt is constructed to be a graph signal supported on the graph shift operator S
Graph Recurrent Neural Networks
I Hidden and observed state are propagated through graph filters to update the hidden state
A = A(S) = ∑_{k=0}^{K−1} ak S^k      B = B(S) = ∑_{k=0}^{K−1} bk S^k
I The state update is ⇒ zt = σ[ A(S)xt + B(S)zt−1 ] = σ[ ∑_{k=0}^{K−1} ak S^k xt + ∑_{k=0}^{K−1} bk S^k zt−1 ]
[Block diagram: the filters ∑_{k=0}^{K−1} ak S^k xt and ∑_{k=0}^{K−1} bk S^k zt−1 are summed and passed through σ(·) to produce zt ; the output filter ∑_{k=0}^{K−1} ck S^k zt followed by σ(·) produces yt]
Graph Recurrent Neural Networks
I The hidden state zt is propagated through a graph filter to make a prediction of the output yt
C = C(S) = ∑_{k=0}^{K−1} ck S^k
I The prediction of the output is given by ⇒ yt = σ[ C(S)zt ] = σ[ ∑_{k=0}^{K−1} ck S^k zt ]
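I A NumPy sketch of one GRNN recurrence; the filter-tap values and the choice σ = tanh are illustrative assumptions:

```python
import numpy as np

def graph_filter(S, taps, x):
    """Polynomial graph filter sum_k taps[k] S^k x, computed with repeated
    shifts S @ x rather than explicit matrix powers."""
    out = np.zeros_like(x)
    shifted = x.copy()
    for a_k in taps:
        out += a_k * shifted
        shifted = S @ shifted        # next power of the shift operator
    return out

def grnn_step(S, x_t, z_prev, a, b, c, sigma=np.tanh):
    """One GRNN recurrence: the dense matrices A, B, C of the RNN are replaced
    by graph filters A(S), B(S), C(S) with taps a, b, c (length-K arrays)."""
    z_t = sigma(graph_filter(S, a, x_t) + graph_filter(S, b, z_prev))
    y_t = sigma(graph_filter(S, c, z_t))
    return z_t, y_t
```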
Multiple Feature GRNNs
I A GRNN is made up of a hidden state update perceptron and an output prediction perceptron
zt = σ[ ∑_{k=0}^{K−1} ak S^k xt + ∑_{k=0}^{K−1} bk S^k zt−1 ]      yt = σ[ ∑_{k=0}^{K−1} ck S^k zt ]
I Each of these filters can be replaced by a MIMO filter to yield a GRNN with multiple features
Zt = σ[ ∑_{k=0}^{K−1} S^k Xt Ak + ∑_{k=0}^{K−1} S^k Zt−1 Bk ]      Yt = σ[ ∑_{k=0}^{K−1} S^k Zt Ck ]
I Multiple-feature hidden state Zt permits larger dimensionality relative to observed states
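I The same sketch in its multiple-feature (MIMO) form, with tap tensors of shape (K, F_in, F_out) as an assumed layout:

```python
import numpy as np

def mimo_graph_filter(S, X, taps):
    """MIMO graph filter sum_k S^k X H_k with X of shape (N, F_in) and
    taps of shape (K, F_in, F_out)."""
    out = np.zeros((X.shape[0], taps.shape[2]))
    shifted = X.copy()
    for H_k in taps:
        out += shifted @ H_k
        shifted = S @ shifted
    return out

def mimo_grnn_step(S, X_t, Z_prev, A_taps, B_taps, C_taps, sigma=np.tanh):
    """Multi-feature GRNN step: Zt = sigma(sum_k S^k Xt Ak + sum_k S^k Zt-1 Bk)
    and Yt = sigma(sum_k S^k Zt Ck)."""
    Z_t = sigma(mimo_graph_filter(S, X_t, A_taps) + mimo_graph_filter(S, Z_prev, B_taps))
    Y_t = sigma(mimo_graph_filter(S, Z_t, C_taps))
    return Z_t, Y_t
```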
Spatial Gating
I We extend time gating to GRNNs to handle the problem of vanishing/exploding gradients
I We discuss long range graph dependencies and introduce node and edge gating
Gating in GRNNs
I Like RNNs, GRNNs may also experience the problem of vanishing/exploding gradients
⇒ Happens when eigenvalues of B(S) are much smaller/larger than 1
I Similarly to what we did for RNNs, we address it by adding gating operators to GRNNs
Zt = σ( Q̂{AS(Xt)} + Q̌{BS(Zt−1)} )
I Input gate operator Q̂ : R^{N×H} → R^{N×H} ⇒ controls the importance of the input Xt at time t
I Forget gate operator Q̌ : R^{N×H} → R^{N×H} ⇒ controls the importance of the state Zt−1 at time t
Time-Gated GRNNs
I First type of gating for GRNNs is time gating ⇒ simple extension of input and forget gates of RNNs
I In the Time-Gated GRNN, the input and forget gate operators are expressed as
Q̂{AS(Xt)} = q̂t AS(Xt),      Q̌{BS(Zt)} = q̌t BS(Zt)
I Time gating multiplies the input and the state by scalar gates q̂t ∈ [0, 1] and q̌t ∈ [0, 1]
I A single scalar gate is applied to the whole graph signal ⇒ same gate value for all nodes
Long Range Spatial Dependencies
I Even if the eigenvalues of B(S) ∼ 1, spatial imbalances can cause gradients to vanish in space
⇒ Some nodes/paths might get assigned more importance than others in long range exchanges
I Example: graphs with community structure, where some nodes are highly connected within clusters
⇒ Gradients of ZT depend on successive products of B(S) ⇒ successive products of S
⇒ For large T , the entries of S^T associated with highly connected nodes become densely populated
⇒ Overshadows community structure ⇒ can’t encode long processes that are local on the graph
Spatial Gating
I Node and edge structure of the graph allows for other forms of gating ⇒ spatial gating
I Node gating ⇒ one input and one forget gate for each node of the graph
[Figure: a 12-node graph in which each node is assigned its own gate]
I Edge gating ⇒ one input and one forget gate for each edge of the graph
[Figure: the same 12-node graph in which each edge is assigned its own gate]
I Spatial gating strategies help encode long range spatial dependencies in graph processes
Node-Gated GRNNs
I In the Node-Gated GRNN, the input gate and forget gate operators are expressed as
Q̂{AS(Xt)} = diag(q̂t) AS(Xt),      Q̌{BS(Zt)} = diag(q̌t) BS(Zt)
I Gating operators correspond to multiplication of the input and state by diagonal matrices
⇒ The diagonals are the input and forget vector gates q̂t ∈ [0, 1]^N and q̌t ∈ [0, 1]^N
I A scalar gate applied to each nodal component of the signal ⇒ different gate values for each node
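I A sketch of the node-gated update, with the gate vectors assumed given (it reuses mimo_graph_filter from the MIMO sketch above):

```python
import numpy as np

def node_gated_update(S, X_t, Z_prev, A_taps, B_taps, q_in, q_forget, sigma=np.tanh):
    """Node-gated GRNN state update. q_in and q_forget are vectors in [0, 1]^N
    with one gate per node; multiplying each row by its gate is exactly the
    diag(q) operation."""
    gated_input = q_in[:, None] * mimo_graph_filter(S, X_t, A_taps)         # diag(q̂t) AS(Xt)
    gated_state = q_forget[:, None] * mimo_graph_filter(S, Z_prev, B_taps)  # diag(q̌t) BS(Zt-1)
    return sigma(gated_input + gated_state)
```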
Edge-Gated GRNNs
I In the Edge-Gated GRNN, the input gate and forget gate operators are expressed as
Q̂{AS(Xt)} = AS∘Q̂t(Xt),      Q̌{BS(Zt)} = BS∘Q̌t(Zt)
I Gating operators correspond to elementwise multiplication of the shift operator by gate matrices
⇒ The matrices multiplying the GSOs are the input and forget matrix gates Q̂t ∈ [0, 1]^{N×N} and Q̌t ∈ [0, 1]^{N×N}
I Separate gate for each edge ⇒ control the amount of information transmitted across edges
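I A matching sketch of the edge-gated update, again with the gate matrices assumed given:

```python
import numpy as np

def edge_gated_update(S, X_t, Z_prev, A_taps, B_taps, Q_in, Q_forget, sigma=np.tanh):
    """Edge-gated GRNN state update. Q_in and Q_forget are N x N matrices in
    [0, 1] with one gate per edge; they multiply the shift operator
    elementwise, so each edge controls how much information it transmits."""
    S_in = S * Q_in          # S ∘ Q̂t: gated GSO used by the input filter
    S_forget = S * Q_forget  # S ∘ Q̌t: gated GSO used by the forget filter
    return sigma(mimo_graph_filter(S_in, X_t, A_taps)
                 + mimo_graph_filter(S_forget, Z_prev, B_taps))
```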
Gate Computation
I Parameters of input and forget gate operators are the outputs of GRNNs themselves
I Input and forget gate states are expressed as
Ẑt = σ( ÂS(Xt) + B̂S(Ẑt−1) )      Žt = σ( ǍS(Xt) + B̌S(Žt−1) )
⇒ Gate computation takes different forms depending on the type of gating
I In the case of time gating, the gates are calculated as
q̂t = sigmoid(ĉ⊤vec(Ẑt))      q̌t = sigmoid(č⊤vec(Žt))
⇒ Where ĉ ∈ R^{HN} and č ∈ R^{HN} are fully connected layers and the sigmoid ensures gates in [0, 1]
I In the case of node gating, the gates are calculated as
q̂t = sigmoid( ĈS(Ẑt) )      q̌t = sigmoid( ČS(Žt) )
⇒ Where ĈS and ČS are graph convolutions and the sigmoid ensures gates in [0, 1]^N
I In the case of edge gating, the gates are calculated as
[Q̂t]ij = sigmoid( ĉ⊤[ δ⊤i Ẑt Ĉ ‖ δ⊤j Ẑt Ĉ ]⊤ )      [Q̌t]ij = sigmoid( č⊤[ δ⊤i Žt Č ‖ δ⊤j Žt Č ]⊤ )
⇒ Where δi and δj are N-dimensional Dirac deltas; Ĉ ∈ R^{H×H′} and Č ∈ R^{H×H′} are linear layers
⇒ And ĉ ∈ R^{2H′×1} and č ∈ R^{2H′×1} are fully connected layers applied to the concatenation ‖ of the features of nodes i and j
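I A sketch of the node-gating case: the gate GRNN state Ẑt is updated with its own filters and mapped to per-node gates by a graph convolution plus sigmoid (taps and shapes are illustrative; it reuses mimo_graph_filter from the earlier sketch):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def node_gate(S, X_t, Zhat_prev, A_hat, B_hat, C_hat):
    """Node-gate computation: the gate runs its own GRNN with state Zhat, and
    a graph convolution followed by a sigmoid maps that state to one gate per
    node. C_hat maps the H state features to a single feature, so the gates
    lie in [0, 1]^N."""
    Zhat_t = np.tanh(mimo_graph_filter(S, X_t, A_hat)
                     + mimo_graph_filter(S, Zhat_prev, B_hat))
    q_t = sigmoid(mimo_graph_filter(S, Zhat_t, C_hat))[:, 0]   # gates in [0, 1]^N
    return q_t, Zhat_t
```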
Stability of GRNNs
I GRNNs can be seen as a time extension of GNNs, therefore they inherit their stability properties
Relative Perturbation Model
Definition (Relative perturbation matrices)
Given GSOs S and Ŝ, we define the set of relative perturbation matrices modulo permutation as
E(S, Ŝ) = { E ∈ R^{N×N} : P⊤ŜP = S + ES + SE⊤, P ∈ P }      (1)
where P = { P ∈ {0, 1}^{N×N} : P1 = 1, P⊤1 = 1 }.
I We consider that the distance between two graphs S and Ŝ is given by d(S, Ŝ) = min_{E∈E(S,Ŝ)} ‖E‖
I Notice that if Ŝ is a permutation of the shift matrix S, then we have d(S, Ŝ) = 0
Integral Lipschitz filters
Definition (Integral Lipschitz filters)
A filter A(S) = ∑_{k=0}^{K−1} ak S^k is integral Lipschitz if there exists C > 0 such that a(λ) = ∑_{k=0}^{K−1} ak λ^k satisfies
|a(λ2) − a(λ1)| ≤ C |λ2 − λ1| / (|λ1 + λ2| / 2)      (2)
for all λ1, λ2 ∈ R.
I Integral Lipschitz filters also satisfy |λa′(λ)| ≤ C , where a′(λ) is the derivative of a(λ)
I Recall that the frequency response of integral Lipschitz filters becomes flat for large λ
Assumptions
I We consider a GRNN with FX = 1 input feature, FZ = 1 state feature, and FY = 1 output feature
zt = σ( A(S)xt + B(S)zt−1 )      yt = ρ( C(S)zt )
(A1) A, B and C are integral Lipschitz with constants CA, CB and CC and ‖A‖ = ‖B‖ = ‖C‖ = 1
(A2) Nonlinearities σ and ρ satisfy: |σ(b)− σ(a)| ≤ |b − a| for all a, b ∈ R, σ(0) = ρ(0) = 0
(A3) Initial hidden state is identically zero, i.e., z0 = 0, and the xt satisfy ‖xt‖ ≤ ‖x‖ = 1 for all t
GRNN stability theorem
Theorem (Stability of GRNNs)
Let S = VΛV^H and Ŝ be the GSOs of the original and perturbed graph, and let E = UMU^H ∈ E(S, Ŝ) be such that d(S, Ŝ) ≤ ‖E‖ ≤ ε. Let yt and ŷt be the outputs of the GRNNs running on S and Ŝ respectively, and satisfying assumptions (A1)-(A3). Then,
min_{P∈P} ‖yt − P⊤ŷt‖ ≤ C(1 + √N δ)(t² + 3t) ε + O(ε²)      (3)
where C = max{CA, CB, CC} and δ = (‖U − V‖ + 1)² − 1.
Discussion
I GRNNs are stable to relative perturbations with stability constant C(1 + √N δ)(T² + 3T), where T is the process length
I C could be set at a fixed value or learned from data through A, B and C ⇒ design parameter
I The term (1 + √N δ) is a property of the graph perturbation ⇒ cannot be controlled by design
I Eigenvector misalignment δ = (‖U− V‖+ 1)2 − 1 measures commutativity of matrices S and E
I Polynomial dependence on T ⇒ due to the recurrence relationship in the computation of zt
Epidemic Modeling with GRNNs
I We use a GRNN, a GNN, and an RNN to track an epidemic on a high school friendship network
Epidemic Modeling
I Model the spread of an infectious disease over a friendship network as a graph process
I Graph is a symmetric friendship network corresponding to a high school in France
I Model the spread of the disease on the graph using Susceptible-Infectious-Removed (SIR) model
I Compare the performance of a GRNN, an RNN, and a GNN in predicting infections after 8 days
Friendship Network
I Real-world friendship network corresponding to 134 students from a high school in Marseille
I Each node of the graph represents a student
I Friendships are modeled as symmetric unweighted edges
I Isolated nodes are removed to make the graph connected
I Assumption: friends are likely to be in contact with each other
Susceptible-Infectious-Removed (SIR) Disease Model
I Process starts with random seed infections on day 0 ⇒ probability pseed = 0.05
I Each person is in one of the three SIR states ⇒ updated each day with the following rules
I Susceptible: can get the disease from an infected friend with probability pinf = 0.3
I Infectious: can spread the disease for 4 days after being infected, after which they recover
I Removed: have overcome the disease and can no longer spread it or contract it
[Diagram: SIR chain. Susceptible → Infectious with p = 0.3 per day per infected friend; Infectious → Removed after 4 days]
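I A sketch of this SIR process on a given adjacency matrix; the independent per-friend transmission model is an assumption about how the per-day probability is applied:

```python
import numpy as np

def simulate_sir(adj, p_seed=0.05, p_inf=0.3, days_infectious=4, horizon=16, rng=None):
    """Sketch of the SIR process with the rules above. adj is a symmetric 0/1
    adjacency (friendship) matrix; states are 0 (susceptible), 1 (infectious),
    2 (removed). Returns the (horizon, N) sequence of node states."""
    if rng is None:
        rng = np.random.default_rng()
    n = adj.shape[0]
    state = (rng.random(n) < p_seed).astype(int)      # day-0 seed infections
    days_sick = np.where(state == 1, 1, 0)
    history = [state.copy()]
    for _ in range(horizon - 1):
        # each infectious friend transmits independently with probability p_inf
        n_sick_friends = adj @ (state == 1).astype(float)
        p_catch = 1.0 - (1.0 - p_inf) ** n_sick_friends
        new_inf = (state == 0) & (rng.random(n) < p_catch)
        days_sick[state == 1] += 1
        state[(state == 1) & (days_sick > days_infectious)] = 2   # recover
        state[new_inf] = 1
        days_sick[new_inf] = 1
        history.append(state.copy())
    return np.array(history)
```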
Problem Setup
I Problem: given the node states, the goal is to predict whether each node will be infected in 8 days
I Input: graph process xt where, at each time t, [xt]i = 0 if student i is susceptible, 1 if student i is infected, and 2 if student i is removed
I Output: binary graph process yt ⇒ our goal is only to track infections
[yt]i = 0 if student i is susceptible or removed, and [yt]i = 1 if student i is infected
I Given xt , xt+1, . . . , xt+7, we want to predict yt+8, yt+9, . . . , yt+15 ⇒ binary node classification
Objective Function
I Accuracy is not a good performance metric ⇒ does not distinguish true positives and true negatives
I In epidemic tracking, true positives are more important than true negatives ⇒ maximize F1 score
F1 = 2 · Precision · Recall / (Precision + Recall)
I Precision = True Positives / Predicted Positives ⇒ Proportion of correct positive predictions
I Recall = True Positives / Actual Positives ⇒ Proportion of correctly predicted positives
[Table: confusion matrix. Rows: predicted positive / predicted negative; columns: actual positive / actual negative. Predicted positive row: true positive, false positive; predicted negative row: false negative, true negative]
I The loss function we minimize is 1 − F1 ⇒ trades off minimizing false positives and false negatives
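I A sketch of a soft (differentiable) version of this loss; the exact smoothing used in the lecture's experiments may differ:

```python
import numpy as np

def f1_loss(y_prob, y_true, eps=1e-8):
    """1 - F1 computed with soft counts, a common differentiable surrogate.
    y_prob: predicted infection probabilities in [0, 1]; y_true: binary labels."""
    tp = np.sum(y_prob * y_true)                 # soft true positives
    precision = tp / (np.sum(y_prob) + eps)      # TP / predicted positives
    recall = tp / (np.sum(y_true) + eps)         # TP / actual positives
    f1 = 2.0 * precision * recall / (precision + recall + eps)
    return 1.0 - f1
```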
Results
I We compare a GRNN with a GNN and a RNN, all with roughly the same number of parameters
⇒ In the GNN, the time instants become input features ⇒ parameters depend on T
⇒ In the RNN, the nodal components become input features ⇒ parameters depend on N
I GRNN improves upon RNN and GNN ⇒ exploits both spatial and temporal structure of the data