Machine Learning on Sequences
I GNNs are architectures specialized in learning from data defined over graph supports
I Several processes have a sequential nature. To learn from them, we need dedicated architectures
Machine Learning on Sequences
I Often, we want to learn properties of a sequence ⇒ Is the particle entering the forbidden area?
[Figure: a sampled particle trajectory x1, x2, . . . , x8 approaching a forbidden area]
I This problem is not just a simple sequence of classifications ⇒ yt = φ(xt)
I It is a (sequence of) classifications of a sequence ⇒ yt = φ(x1:t) = φ(xt , xt−1, . . . , x1)
Unbounded Memory Growth
I Predictions on a sequence depend on observation histories ⇒ yt = Φ(xt , xt−1, . . . , x1)
[Diagram: at each step t, the prediction yt depends on the entire growing history x1, . . . , xt]
I Recurrent neural networks (RNNs) estimate a hidden state to avoid this unbounded memory growth
Markov Random Processes
I A stochastic process (random sequence) is said to be Markov or memoryless if
p(xt+1 | x1:t) = p(xt+1 | xt)
I It is the same to condition on the current value xt or on the whole trajectory x1:t
⇒ The future, given the present, is independent of the past
⇒ For predicting the future, knowledge of the past is irrelevant
I Outputs (e.g., trajectory categories) are conditionally independent ⇒ p(yt | xt) = p(yt | x1:t)
Learning in a Markov Process
I In a memoryless Markov process, learning is equivalent to a sequence of learning problems
I State evolution is a chain of memoryless transitions, and outputs depend on the current state only
[Diagram: Markov chain xt → xt+1 with transition p(xt+1 | xt); outputs yt and yt+1 are drawn from p(yt | xt) and p(yt+1 | xt+1), which the AI replaces with learned maps Φ(xt)]
I An AI that predicts yt mimics the conditional distribution of the observations. The past is irrelevant
Recurrent Neural Networks
I Machine Learning on stochastic processes that are not Markov ⇒ The past is relevant in learning
All Processes are Markov, However Disguised
I The evolution of the trajectory is not a Markov Process if we observe positions only
I But it is Markov if we have access to velocities and accelerations ⇒ Hidden (unobserved) states
[Figure: observed positions x1, . . . , x8 along the trajectory, with unobserved velocities and accelerations (v1, u1), . . . , (v8, u8) underneath]
I All systems are Markov ⇒ We often lack enough information to observe their Markov structure
Hidden Markov Model
I Stochastic process xt follows a hidden Markov model if there exists a process zt such that
p(zt+1 | z1:t) = p(zt+1 | zt)      and      p(xt | zt) = p(xt | z1:t)
I The hidden state zt is a memoryless Markov stochastic process
I The observed state xt is conditionally independent of the past. It depends only on the current hidden state zt
I Outputs are also conditionally independent ⇒ p(yt | zt) = p(yt | z1:t)
Machine Learning on Hidden Markov Models
I In a hidden Markov model, learning is not equivalent to a sequence of learning problems
I The AI can try to mimic the conditional distribution p(yt | zt). But we don’t have access to zt
[Diagram: hidden Markov chain zt → zt+1 with transition p(zt+1 | zt); outputs yt ∼ p(yt | zt) and observations xt ∼ p(xt | zt); the AI replaces the output distribution with a learned map Φ(zt)]
I Recurrent Neural Network (RNN) ⇒ Use observed state xt to estimate hidden state zt
Recurrent Neural Networks
I A recurrent neural network is made up of two separate learning parametrizations
⇒ Φ1(xt , zt−1) ⇒ maps observed state xt and hidden state zt−1 to the hidden state update zt
⇒ Φ2(zt) ⇒ maps the updated hidden state zt to the output estimate yt
[Block diagram: xt and the fed-back state zt−1 enter Φ1(xt , zt−1), which produces zt ; zt then enters Φ2(zt), which produces yt]
I It is a recurrent neural network because hidden states are fed back as inputs for the next time step
Hidden State Update AI
I Use a perceptron for the AI that updates the hidden state ⇒ Φ1(xt , zt−1) = σ(Axt + Bzt−1)
[Block diagram: the hidden state update is the perceptron σ(Axt + Bzt−1), producing zt ; the output AI Φ2(zt) produces yt]
I Number of learnable parameters ≡ Entries of A and B ⇒ Does not depend on the time index t
Output Prediction AI
I Use another perceptron for the AI that predicts the output ⇒ Φ2(zt) = σ(Czt)
[Block diagram: the state update perceptron σ(Axt + Bzt−1) produces zt ; the output perceptron σ(Czt) produces yt]
I We can also use a multi-layer neural network for the output prediction AI ⇒ Φ2(zt) = Φ(zt ; H)
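I A minimal NumPy sketch of this recurrence (the dimensions, random weights, and the choice of σ as a sigmoid are illustrative assumptions, not the lecture's):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative dimensions and random weights (not from the lecture).
rng = np.random.default_rng(0)
n, h, m = 4, 8, 2                      # input, hidden, and output dimensions
A = 0.1 * rng.normal(size=(h, n))      # input-to-state weights
B = 0.1 * rng.normal(size=(h, h))      # state-to-state (recurrent) weights
C = 0.1 * rng.normal(size=(m, h))      # state-to-output weights

def rnn_step(x_t, z_prev):
    """One recurrence: Phi1 updates the hidden state, Phi2 predicts the output."""
    z_t = sigmoid(A @ x_t + B @ z_prev)    # zt = sigma(A xt + B zt-1)
    y_t = sigmoid(C @ z_t)                 # yt = sigma(C zt)
    return z_t, y_t

# The parameter count (entries of A, B, C) is fixed regardless of sequence length.
z = np.zeros(h)
for x in rng.normal(size=(10, n)):
    z, y = rnn_step(x, z)
```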
Time Gating
I We discuss the problem of vanishing/exploding gradients in recurrent neural networks
I We introduce gating mechanisms in the form of long short-term memories and gated recurrent units
Vanishing/Exploding Gradients for Long Term Dependencies
I In some tasks, the RNN may have to learn how to model long term dependencies of length T
I This poses a challenge ⇒ the Jacobian ∂zT/∂B will depend on a chain of multiplications by B
[Diagram: unrolled recurrence from xt+T−1 back through zt+T−1; the dependency chain multiplies by B a total of T times, i.e., by B^T]
I If the eigenvalues of B ≪ 1, the gradients tend to vanish, shrinking exponentially with the dependency length
I If the eigenvalues of B ≫ 1, the gradients tend to explode, growing exponentially with the dependency length
Vanishing/Exploding Gradients for Long Term Dependencies (Example)
I Consider a simplification of the RNN where we omit the nonlinear function σ(·) and the inputs xt
zt = Bzt−1
I At time t = T , the state variable zT depends on the T-th power of the matrix B
zT = B^T z0
I If B admits an eigendecomposition B = QΛQ⊤, the recurrence can be rewritten as
zT = QΛ^T Q⊤z0
⇒ Eigenvalues less than one will vanish and eigenvalues greater than one will explode
⇒ Any component of z0 not aligned with the largest eigenvalues will be discarded
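I A quick numerical check of this effect (the diagonal B and its eigenvalues 0.9 and 1.1 are illustrative assumptions):

```python
import numpy as np

# Diagonal B makes the eigenvalues explicit: one below 1, one above 1.
B = np.diag([0.9, 1.1])
z0 = np.array([1.0, 1.0])

for T in (10, 50, 100):
    zT = np.linalg.matrix_power(B, T) @ z0   # zT = B^T z0
    print(T, zT)

# The 0.9-eigenvalue component vanishes (0.9^100 ≈ 2.7e-5)
# while the 1.1-eigenvalue component explodes (1.1^100 ≈ 1.4e4).
```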
Gating Mechanism
I To address the issue of vanishing gradients, we add a gating mechanism to RNNs
I Gates are scalars in [0, 1] acting on the current input and on the previous state
⇒ Control how much of the input and past time information should be taken into account
I The value of each gate is updated at every step of the sequence
⇒ Allows creating paths through time with derivatives that neither vanish nor explode
⇒ Creates dependency paths that allow encoding both short and long term dependencies
Long Short-Term Memory (LSTM)
I The most popular gated RNN architecture is the Long Short-Term Memory (LSTM) cell
I Three gates: a forget gate ft ∈ [0, 1], an input gate gt ∈ [0, 1], and a cell output gate qt ∈ [0, 1]
I Let xt be the input, zt the state, and define the internal memory st of the LSTM cell
I Memory st updated by applying the forget gate to st−1 and the input gate to the state update
st = ft st−1 + gt σ(Axt + Bzt−1)
I State zt updated by applying the cell output gate to the internal cell memory st
zt = qtσ(st)
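I A sketch of one LSTM step following the equations above, with the gates assumed given (their computation is covered on the gate computation slide); names and the sigmoid state nonlinearity are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, z_prev, s_prev, f_t, g_t, q_t, A, B):
    """One step of the LSTM cell as written above. The scalar gates
    f_t, g_t, q_t in [0, 1] are assumed given here; they run their own
    RNNs, sketched later."""
    s_t = f_t * s_prev + g_t * sigmoid(A @ x_t + B @ z_prev)   # memory update
    z_t = q_t * sigmoid(s_t)                                   # state update
    return z_t, s_t
```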
Gated Recurrent Unit (GRU)
I The Gated Recurrent Unit (GRU) is a second popular gated version of the RNN
I Slight variation of LSTM ⇒ single gate ut ∈ [0, 1] plays the role of input and forget gates
zt = ut zt−1 + (1 − ut) σ(Axt + rt Bzt−1)
⇒ Reset gate rt ∈ [0, 1] controls contribution of previous state zt−1 to updated state
I Besides the LSTM and the GRU, many more variants of gating mechanisms for RNNs exist
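I A matching sketch of one GRU step (again with the gates ut and rt assumed given):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, z_prev, u_t, r_t, A, B):
    """One GRU step as written above: u_t plays the role of both input and
    forget gates, and r_t scales the contribution of the previous state."""
    candidate = sigmoid(A @ x_t + r_t * (B @ z_prev))
    return u_t * z_prev + (1.0 - u_t) * candidate
```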
Gate Computation
I In the LSTM and the GRU, the gates themselves are calculated as the outputs of RNNs
I For example, the forget gate ft of the LSTM has its own state variable z′t (as do all the other gates)
z′t = σ(A′xt + B′z′t−1)
I The forget gate ft is then calculated from the input xt and the state z′t as
ft = sigmoid(Uxt + Wz′t)
⇒ With U and W linear layers mapping the input and state features to a single scalar
⇒ And the sigmoid activation function ensuring gate values in the [0, 1] interval
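I A sketch of this forget gate computation; the parameter names Ap, Bp, U, W stand in for A′, B′, U, W and are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forget_gate(x_t, zp_prev, Ap, Bp, U, W):
    """The forget gate runs its own RNN with private state z't (zp_prev here).
    U and W have a single output row, so U @ x_t + W @ zp_t is one scalar,
    and the sigmoid squashes it into [0, 1]."""
    zp_t = sigmoid(Ap @ x_t + Bp @ zp_prev)      # z't = sigma(A' xt + B' z't-1)
    f_t = sigmoid(U @ x_t + W @ zp_t).item()     # ft = sigmoid(U xt + W z't)
    return f_t, zp_t
```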
Graph Recurrent Neural Networks
I We define Graph Recurrent Neural Networks (GRNNs) as particular cases of RNNs
From RNNs to GRNNs
I Consider a time varying process xt in which each of the signals is supported on a shift operator S
[Figure: successive graph signals xt−2, xt−1, xt supported on the same graph]
I A graph recurrent neural network (GRNN) combines
⇒ A GNN because xt is supported on a graph. ⇒ An RNN because xt is a sequence
A Recurrent Neural Network for Graph Signals
I An RNN has a hidden state zt updated with the perceptron ⇒ zt = σ(Axt + Bzt−1)
I And it has an output prediction yt given by the perceptron ⇒ yt = σ(Czt)
[Block diagram: Axt and Bzt−1 are summed and passed through σ(·) to produce zt ; Czt is passed through σ(·) to produce yt]
I The observed state xt and the output yt are graph signals supported on the graph shift operator S
⇒ The hidden state zt is constructed to be a graph signal supported on the graph shift operator S
Graph Recurrent Neural Networks
I Hidden and observed state are propagated through graph filters to update the hidden state
A = A(S) = ∑_{k=0}^{K−1} ak S^k      B = B(S) = ∑_{k=0}^{K−1} bk S^k
I The state update is ⇒ zt = σ[ A(S)xt + B(S)zt−1 ] = σ[ ∑_{k=0}^{K−1} ak S^k xt + ∑_{k=0}^{K−1} bk S^k zt−1 ]
[Block diagram: the filters ∑_{k=0}^{K−1} ak S^k xt and ∑_{k=0}^{K−1} bk S^k zt−1 are summed and passed through σ(·) to produce zt ; the output filter ∑_{k=0}^{K−1} ck S^k zt followed by σ(·) produces yt]
Graph Recurrent Neural Networks
I The hidden state zt is propagated through a graph filter to make a prediction of the output yt
C = C(S) = ∑_{k=0}^{K−1} ck S^k
I The prediction of the output is given by ⇒ yt = σ[ C(S)zt ] = σ[ ∑_{k=0}^{K−1} ck S^k zt ]
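I A NumPy sketch of one GRNN recurrence; the filter-tap values and the choice σ = tanh are illustrative assumptions:

```python
import numpy as np

def graph_filter(S, taps, x):
    """Polynomial graph filter sum_k taps[k] S^k x, computed with repeated
    shifts S @ x rather than explicit matrix powers."""
    out = np.zeros_like(x)
    shifted = x.copy()
    for a_k in taps:
        out += a_k * shifted
        shifted = S @ shifted        # next power of the shift operator
    return out

def grnn_step(S, x_t, z_prev, a, b, c, sigma=np.tanh):
    """One GRNN recurrence: the dense matrices A, B, C of the RNN are replaced
    by graph filters A(S), B(S), C(S) with taps a, b, c (length-K arrays)."""
    z_t = sigma(graph_filter(S, a, x_t) + graph_filter(S, b, z_prev))
    y_t = sigma(graph_filter(S, c, z_t))
    return z_t, y_t
```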
Multiple Feature GRNNs
I A GRNN is made up of a hidden state update perceptron and an output prediction perceptron
zt = σ[ ∑_{k=0}^{K−1} ak S^k xt + ∑_{k=0}^{K−1} bk S^k zt−1 ]      yt = σ[ ∑_{k=0}^{K−1} ck S^k zt ]
I Each of these filters can be replaced by a MIMO filter to yield a GRNN with multiple features
Zt = σ[ ∑_{k=0}^{K−1} S^k Xt Ak + ∑_{k=0}^{K−1} S^k Zt−1 Bk ]      Yt = σ[ ∑_{k=0}^{K−1} S^k Zt Ck ]
I Multiple-feature hidden state Zt permits larger dimensionality relative to observed states
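I The same sketch in its multiple-feature (MIMO) form, with tap tensors of shape (K, F_in, F_out) as an assumed layout:

```python
import numpy as np

def mimo_graph_filter(S, X, taps):
    """MIMO graph filter sum_k S^k X H_k with X of shape (N, F_in) and
    taps of shape (K, F_in, F_out)."""
    out = np.zeros((X.shape[0], taps.shape[2]))
    shifted = X.copy()
    for H_k in taps:
        out += shifted @ H_k
        shifted = S @ shifted
    return out

def mimo_grnn_step(S, X_t, Z_prev, A_taps, B_taps, C_taps, sigma=np.tanh):
    """Multi-feature GRNN step: Zt = sigma(sum_k S^k Xt Ak + sum_k S^k Zt-1 Bk)
    and Yt = sigma(sum_k S^k Zt Ck)."""
    Z_t = sigma(mimo_graph_filter(S, X_t, A_taps) + mimo_graph_filter(S, Z_prev, B_taps))
    Y_t = sigma(mimo_graph_filter(S, Z_t, C_taps))
    return Z_t, Y_t
```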
Spatial Gating
I We extend time gating to GRNNs to handle the problem of vanishing/exploding gradients
I We discuss long range graph dependencies and introduce node and edge gating
Gating in GRNNs
I Like RNNs, GRNNs may also experience the problem of vanishing/exploding gradients
⇒ Happens when eigenvalues of B(S) are much smaller/larger than 1
I Similarly to what we did for RNNs, we address it by adding gating operators to GRNNs
Zt = σ( Q̂{AS(Xt)} + Q̌{BS(Zt−1)} )
I Input gate operator Q̂ : R^{N×H} → R^{N×H} ⇒ controls the importance of the input Xt at time t
I Forget gate operator Q̌ : R^{N×H} → R^{N×H} ⇒ controls the importance of the state Zt−1 at time t
Time-Gated GRNNs
I First type of gating for GRNNs is time gating ⇒ simple extension of input and forget gates of RNNs
I In the Time-Gated GRNN, the input and forget gate operators are expressed as
Q̂{AS(Xt)} = q̂t AS(Xt),      Q̌{BS(Zt)} = q̌t BS(Zt)
I Time gating multiplies the input and the state by scalar gates q̂t ∈ [0, 1] and q̌t ∈ [0, 1]
I A single scalar gate is applied to the whole graph signal ⇒ same gate value for all nodes
Long Range Spatial Dependencies
I Even if the eigenvalues of B(S) ∼ 1, spatial imbalances can cause gradients to vanish in space
⇒ Some nodes/paths might get assigned more importance than others in long range exchanges
I Example: graphs with community structure, where some nodes are highly connected within clusters
⇒ Gradients of ZT depend on successive products of B(S) ⇒ successive products of S
⇒ For large T , the entries of S^T associated with highly connected nodes become densely populated
⇒ Overshadows community structure ⇒ can’t encode long processes that are local on the graph
Spatial Gating
I Node and edge structure of the graph allows for other forms of gating ⇒ spatial gating
I Node gating ⇒ one input and one forget gate for each node of the graph
[Figure: a 12-node graph in which each node is assigned its own gate]
I Edge gating ⇒ one input and one forget gate for each edge of the graph
[Figure: the same 12-node graph in which each edge is assigned its own gate]
I Spatial gating strategies help encode long range spatial dependencies in graph processes
Node-Gated GRNNs
I In the Node-Gated GRNN, the input gate and forget gate operators are expressed as
Q̂{AS(Xt)} = diag(q̂t) AS(Xt),      Q̌{BS(Zt)} = diag(q̌t) BS(Zt)
I Gating operators correspond to multiplication of the input and state by diagonal matrices
⇒ The diagonals are the input and forget vector gates q̂t ∈ [0, 1]^N and q̌t ∈ [0, 1]^N
I A scalar gate applied to each nodal component of the signal ⇒ different gate values for each node
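I A sketch of the node-gated update, with the gate vectors assumed given (it reuses mimo_graph_filter from the MIMO sketch above):

```python
import numpy as np

def node_gated_update(S, X_t, Z_prev, A_taps, B_taps, q_in, q_forget, sigma=np.tanh):
    """Node-gated GRNN state update. q_in and q_forget are vectors in [0, 1]^N
    with one gate per node; multiplying each row by its gate is exactly the
    diag(q) operation."""
    gated_input = q_in[:, None] * mimo_graph_filter(S, X_t, A_taps)         # diag(q̂t) AS(Xt)
    gated_state = q_forget[:, None] * mimo_graph_filter(S, Z_prev, B_taps)  # diag(q̌t) BS(Zt-1)
    return sigma(gated_input + gated_state)
```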
Edge-Gated GRNNs
I In the Edge-Gated GRNN, the input gate and forget gate operators are expressed as
Q̂{AS(Xt)} = AS∘Q̂t(Xt),      Q̌{BS(Zt)} = BS∘Q̌t(Zt)
I Gating operators correspond to elementwise multiplication of the shift operator by gate matrices
⇒ The matrices multiplying the GSOs are the input and forget matrix gates Q̂t ∈ [0, 1]^{N×N} and Q̌t ∈ [0, 1]^{N×N}
I Separate gate for each edge ⇒ control the amount of information transmitted across edges
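I A matching sketch of the edge-gated update, again with the gate matrices assumed given:

```python
import numpy as np

def edge_gated_update(S, X_t, Z_prev, A_taps, B_taps, Q_in, Q_forget, sigma=np.tanh):
    """Edge-gated GRNN state update. Q_in and Q_forget are N x N matrices in
    [0, 1] with one gate per edge; they multiply the shift operator
    elementwise, so each edge controls how much information it transmits."""
    S_in = S * Q_in          # S ∘ Q̂t: gated GSO used by the input filter
    S_forget = S * Q_forget  # S ∘ Q̌t: gated GSO used by the forget filter
    return sigma(mimo_graph_filter(S_in, X_t, A_taps)
                 + mimo_graph_filter(S_forget, Z_prev, B_taps))
```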
Gate Computation
I Parameters of input and forget gate operators are the outputs of GRNNs themselves
I Input and forget gate states are expressed as
Ẑt = σ( ÂS(Xt) + B̂S(Ẑt−1) )      Žt = σ( ǍS(Xt) + B̌S(Žt−1) )
⇒ Gate computation takes different forms depending on the type of gating
I In the case of time gating, the gates are calculated as
q̂t = sigmoid(ĉ⊤vec(Ẑt))      q̌t = sigmoid(č⊤vec(Žt))
⇒ Where ĉ ∈ R^{HN} and č ∈ R^{HN} are fully connected layers and the sigmoid ensures gates in [0, 1]
I In the case of node gating, the gates are calculated as
q̂t = sigmoid( ĈS(Ẑt) )      q̌t = sigmoid( ČS(Žt) )
⇒ Where ĈS and ČS are graph convolutions and the sigmoid ensures gates in [0, 1]^N
I In the case of edge gating, the gates are calculated as
[Q̂t]ij = sigmoid( ĉ⊤[ δ⊤i Ẑt Ĉ ‖ δ⊤j Ẑt Ĉ ]⊤ )      [Q̌t]ij = sigmoid( č⊤[ δ⊤i Žt Č ‖ δ⊤j Žt Č ]⊤ )
⇒ Where δi and δj are N-dimensional Dirac deltas; Ĉ ∈ R^{H×H′} and Č ∈ R^{H×H′} are linear layers
⇒ And ĉ ∈ R^{2H′×1} and č ∈ R^{2H′×1} are fully connected layers applied to the concatenation ‖ of the features of nodes i and j
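I A sketch of the node-gating case: the gate GRNN state Ẑt is updated with its own filters and mapped to per-node gates by a graph convolution plus sigmoid (taps and shapes are illustrative; it reuses mimo_graph_filter from the earlier sketch):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def node_gate(S, X_t, Zhat_prev, A_hat, B_hat, C_hat):
    """Node-gate computation: the gate runs its own GRNN with state Zhat, and
    a graph convolution followed by a sigmoid maps that state to one gate per
    node. C_hat maps the H state features to a single feature, so the gates
    lie in [0, 1]^N."""
    Zhat_t = np.tanh(mimo_graph_filter(S, X_t, A_hat)
                     + mimo_graph_filter(S, Zhat_prev, B_hat))
    q_t = sigmoid(mimo_graph_filter(S, Zhat_t, C_hat))[:, 0]   # gates in [0, 1]^N
    return q_t, Zhat_t
```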
Stability of GRNNs
I GRNNs can be seen as a time extension of GNNs, therefore they inherit their stability properties
Relative Perturbation Model
Definition (Relative perturbation matrices)
Given GSOs S and Ŝ, we define the set of relative perturbation matrices modulo permutation as
E(S, Ŝ) = { E ∈ R^{N×N} : P⊤ŜP = S + ES + SE⊤, P ∈ P }      (1)
where P = { P ∈ {0, 1}^{N×N} : P1 = 1, P⊤1 = 1 }.
I We consider that the distance between two graphs S and Ŝ is given by d(S, Ŝ) = min_{E∈E(S,Ŝ)} ‖E‖
I Notice that if Ŝ is a permutation of the shift matrix S, then we have d(S, Ŝ) = 0
Integral Lipschitz filters
Definition (Integral Lipschitz filters)
A filter A(S) = ∑_{k=0}^{K−1} ak S^k is integral Lipschitz if there exists C > 0 such that a(λ) = ∑_{k=0}^{K−1} ak λ^k satisfies
|a(λ2) − a(λ1)| ≤ C |λ2 − λ1| / (|λ1 + λ2| / 2)      (2)
for all λ1, λ2 ∈ R.
I Integral Lipschitz filters also satisfy |λa′(λ)| ≤ C , where a′(λ) is the derivative of a(λ)
I Recall that the frequency response of integral Lipschitz filters becomes flat for large λ
Assumptions
I We consider a GRNN with FX = 1 input feature, FZ = 1 state feature, and FY = 1 output feature
zt = σ( A(S)xt + B(S)zt−1 )      yt = ρ( C(S)zt )
(A1) A, B and C are integral Lipschitz with constants CA, CB and CC and ‖A‖ = ‖B‖ = ‖C‖ = 1
(A2) Nonlinearities σ and ρ satisfy: |σ(b)− σ(a)| ≤ |b − a| for all a, b ∈ R, σ(0) = ρ(0) = 0
(A3) Initial hidden state is identically zero, i.e., z0 = 0, and the xt satisfy ‖xt‖ ≤ ‖x‖ = 1 for all t
GRNN stability theorem
Theorem (Stability of GRNNs)
Let S = VΛV^H and Ŝ be the GSOs of the original and perturbed graph, and let E = UMU^H ∈ E(S, Ŝ) be such that d(S, Ŝ) ≤ ‖E‖ ≤ ε. Let yt and ŷt be the outputs of the GRNNs running on S and Ŝ respectively, and satisfying assumptions (A1)-(A3). Then,
min_{P∈P} ‖yt − P⊤ŷt‖ ≤ C(1 + √N δ)(t² + 3t) ε + O(ε²)      (3)
where C = max{CA, CB, CC} and δ = (‖U − V‖ + 1)² − 1.
Discussion
I GRNNs are stable to relative perturbations with stability constant C(1 + √N δ)(T² + 3T), where T is the process length
I C could be set at a fixed value or learned from data through A, B and C ⇒ design parameter
I The term (1 + √N δ) is a property of the graph perturbation ⇒ cannot be controlled by design
I Eigenvector misalignment δ = (‖U− V‖+ 1)2 − 1 measures commutativity of matrices S and E
I Polynomial dependence on T ⇒ due to the recurrence relationship in the computation of zt
Epidemic Modeling with GRNNs
I We use a GRNN, a GNN, and an RNN to track an epidemic on a high school friendship network
Epidemic Modeling
I Model the spread of an infectious disease over a friendship network as a graph process
I Graph is a symmetric friendship network corresponding to a high school in France
I Model the spread of the disease on the graph using Susceptible-Infectious-Removed (SIR) model
I Compare the performance of a GRNN, an RNN, and a GNN in predicting infections after 8 days
Friendship Network
I Real-world friendship network corresponding to 134 students from a high school in Marseille
I Each node of the graph represents a student
I Friendships are modeled as symmetric unweighted edges
I Isolated nodes are removed to make the graph connected
I Assumption: friends are likely to be in contact with each other
Susceptible-Infectious-Removed (SIR) Disease Model
I Process starts with random seed infections on day 0 ⇒ probability pseed = 0.05
I Each person is in one of the three SIR states ⇒ updated each day with the following rules
I Susceptible: can get the disease from an infected friend with probability pinf = 0.3
I Infectious: can spread the disease for 4 days after being infected, after which they recover
I Removed: have overcome the disease and can no longer spread it or contract it
[Diagram: SIR chain. Susceptible → Infectious with p = 0.3 per day per infected friend; Infectious → Removed after 4 days]
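I A sketch of this SIR process on a given adjacency matrix; the independent per-friend transmission model is an assumption about how the per-day probability is applied:

```python
import numpy as np

def simulate_sir(adj, p_seed=0.05, p_inf=0.3, days_infectious=4, horizon=16, rng=None):
    """Sketch of the SIR process with the rules above. adj is a symmetric 0/1
    adjacency (friendship) matrix; states are 0 (susceptible), 1 (infectious),
    2 (removed). Returns the (horizon, N) sequence of node states."""
    if rng is None:
        rng = np.random.default_rng()
    n = adj.shape[0]
    state = (rng.random(n) < p_seed).astype(int)      # day-0 seed infections
    days_sick = np.where(state == 1, 1, 0)
    history = [state.copy()]
    for _ in range(horizon - 1):
        # each infectious friend transmits independently with probability p_inf
        n_sick_friends = adj @ (state == 1).astype(float)
        p_catch = 1.0 - (1.0 - p_inf) ** n_sick_friends
        new_inf = (state == 0) & (rng.random(n) < p_catch)
        days_sick[state == 1] += 1
        state[(state == 1) & (days_sick > days_infectious)] = 2   # recover
        state[new_inf] = 1
        days_sick[new_inf] = 1
        history.append(state.copy())
    return np.array(history)
```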
Problem Setup
I Problem: given the node states, the goal is to predict whether each node will be infected in 8 days
I Input: graph process xt where, at each time t, [xt]i = 0 if student i is susceptible, 1 if student i is infected, and 2 if student i is removed
I Output: binary graph process yt ⇒ our goal is only to track infections
[yt]i = 0 if student i is susceptible or removed, and [yt]i = 1 if student i is infected
I Given xt , xt+1, . . . , xt+7, we want to predict yt+8, yt+9, . . . , yt+15 ⇒ binary node classification
Objective Function
I Accuracy is not a good performance metric ⇒ does not distinguish true positives and true negatives
I In epidemic tracking, true positives are more important than true negatives ⇒ maximize F1 score
F1 = 2 · Precision · Recall / (Precision + Recall)
I Precision = True Positives / Predicted Positives ⇒ Proportion of correct positive predictions
I Recall = True Positives / Actual Positives ⇒ Proportion of correctly predicted positives
[Table: confusion matrix. Rows: predicted positive / predicted negative; columns: actual positive / actual negative. Predicted positive row: true positive, false positive; predicted negative row: false negative, true negative]
I The loss function we minimize is 1 − F1 ⇒ trades off minimizing false positives and false negatives
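I A sketch of a soft (differentiable) version of this loss; the exact smoothing used in the lecture's experiments may differ:

```python
import numpy as np

def f1_loss(y_prob, y_true, eps=1e-8):
    """1 - F1 computed with soft counts, a common differentiable surrogate.
    y_prob: predicted infection probabilities in [0, 1]; y_true: binary labels."""
    tp = np.sum(y_prob * y_true)                 # soft true positives
    precision = tp / (np.sum(y_prob) + eps)      # TP / predicted positives
    recall = tp / (np.sum(y_true) + eps)         # TP / actual positives
    f1 = 2.0 * precision * recall / (precision + recall + eps)
    return 1.0 - f1
```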
Results
I We compare a GRNN with a GNN and a RNN, all with roughly the same number of parameters
⇒ In the GNN, the time instants become input features ⇒ parameters depend on T
⇒ In the RNN, the nodal components become input features ⇒ parameters depend on N
I GRNN improves upon RNN and GNN ⇒ exploits both spatial and temporal structure of the data