
Reinforcement Learning and Network Design


Page 1: Reinforcement Learning and Network Design

Reinforcement Learning and Network Design

Yuki Oyama, Shibaura Institute of Technology

Activity Landscape Design Lab., September 17, 2021

Behavior Modeling in Transportation Networks, Lecture Series #3-1 (16:00-16:30)

Page 2: Reinforcement Learning and Network Design

A road network example

The planner who aims to maximize efficiency wants to answer:
• whether a new road should be constructed (topology)
• where and how large parking spaces should be placed (capacity)
• on which roads and how much toll should be charged (control)
• etc.

These decisions will impact travelers’ behavior.

Page 3: Reinforcement Learning and Network Design

Let’s generalize the framework: an example of pricing (on which road a toll is installed).

Planner (Supply)

Users (Demand)

Objective function / Design variables

Behavior distribution / Network traffic

New toll installed on road A

Drivers change route choices to avoid road A

The alternative route to road A becomes congested

The planner wants to minimize total travel time

Page 4: Reinforcement Learning and Network Design

Let’s generalize the framework: an example of pricing (on which road a toll is installed).

Planner (Supply)

Users (Demand)

Objective function / Design variables

Behavior distribution / Network traffic

New toll installed on road A

Drivers change route choices to avoid road A

The alternative route to road A becomes congested

Total travel time in the network increases as a result

Page 5: Reinforcement Learning and Network Design

Network design is demand-based planning of network topology & systems.

Planner (Supply)

Users (Demand)

• Objective function [indicators]: what the planner wants to maximize; [review] what you (the planner) want to achieve through the project
• Design variables [network topology & conditions]: what the planner decides under some constraints
• Behavior distribution: what travelers decide/change in reaction to the design of the network
• Network traffic [aggregation]: the behavior distribution that appears as a collective state

How to predict the impact on behavior in a network?

Follows Magnanti and Wong (1984); Farahani et al. (2013)

Page 6: Reinforcement Learning and Network Design

Modeling behavior in a network

Given an OD pair:

Path choice model (logit) and traffic flow on paths (in the static case), with the choice set defined as the set of paths.

Not as easy as it looks…
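The slide's formulas did not survive extraction; a standard static logit path-choice form, which the text appears to refer to (notation assumed here), would read:

\[
P(r) = \frac{\exp(V_r / \mu)}{\sum_{r' \in \mathcal{R}} \exp(V_{r'} / \mu)}, \qquad
f_r = q \, P(r), \quad r \in \mathcal{R},
\]

where \(V_r\) is the deterministic utility of path \(r\), \(\mu\) the Gumbel scale, \(\mathcal{R}\) the choice set of paths, \(q\) the OD demand, and \(f_r\) the static path flow.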

Page 7: Reinforcement Learning and Network Design

Networks are generally complex…

[Figure: several example networks, each with an origin O and a destination D, illustrating how many possible paths exist]

The path set is almost impossible to define!

* A description of more complex choices (e.g., time) needs additional dimensions of the network, which further increases the network size.

This is because a path is a combination of links in the network

Page 8: Reinforcement Learning and Network Design

Not a combination but a SEQUENCE

Sequence of choices

[Figure: a path expressed as a sequence of choices; at each stage (Choice A, B, C) one alternative is selected (e.g., a/b/c, 1/2, #/@/%)]

One approach is modeling based on Reinforcement Learning (RL), which models the sequential decisions of agents.

This presentation shows a special case of RL for network path choice modeling.

Page 9: Reinforcement Learning and Network Design

How to model a sequence?

⇒ What is the link choice probability, exactly?

A path r can be described as a sequence of links.

Path choice probability: the product of link choice probabilities, each conditional on the previous link.
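A sketch of this decomposition, with indexing assumed here:

\[
r = (a_0, a_1, \dots, a_J), \qquad
P(r) = \prod_{j=0}^{J-1} P(a_{j+1} \mid a_j),
\]

where \(a_0\) is a (dummy) link at the origin, \(a_J\) the final link at the destination, and \(P(a_{j+1} \mid a_j)\) the link choice probability conditional on the previous link.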

Page 10: Reinforcement Learning and Network Design

What should be considered is ...

the outcome given by the product of link choice probabilities, which must be consistent with the original* path choice model, i.e., the logit probability over all paths.

This is achieved by considering a forward-looking mechanism.

*when assuming the logit model
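In the assumed notation above, the consistency requirement is:

\[
\prod_{j=0}^{J-1} P(a_{j+1} \mid a_j)
= \frac{\exp(V_r / \mu)}{\sum_{r' \in \Omega} \exp(V_{r'} / \mu)},
\qquad V_r = \sum_{j=0}^{J-1} v(a_{j+1} \mid a_j),
\]

where \(\Omega\) is the universal (unrestricted) set of paths and \(v(a \mid k)\) the link choice utility.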

Page 11: Reinforcement Learning and Network Design

Value function

1. Myopic vs. 2. Forward-looking

[Figure: two diagrams of link choice toward the destination D, contrasting a myopic and a forward-looking traveler]

Legend: link choice utility; value function

The goal is to model the mechanisms of behavior.

cf. the Shortest Path (SP) problem: the value function is the SP cost from link a_j to the destination.

Random utility: a generalization of the SP value function
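The two mechanisms contrasted here can be written as follows (a reconstruction with assumed notation; \(A(k)\) is the set of links that can follow link \(k\), and \(d\) the destination). For the deterministic shortest path problem,

\[
V(k) = \min_{a \in A(k)} \bigl[\, c(a \mid k) + V(a) \,\bigr], \qquad V(d) = 0,
\]

and its random utility generalization, the forward-looking value function,

\[
V(k) = \mathbb{E}\Bigl[\max_{a \in A(k)} \bigl( v(a \mid k) + V(a) + \mu\,\varepsilon_a \bigr)\Bigr],
\]

with link cost \(c\), link choice utility \(v\), i.i.d. Gumbel errors \(\varepsilon_a\), and scale \(\mu\).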

Page 12: Reinforcement Learning and Network Design

The Gumbel distribution has a nice property:

The value function is the solution to a system of linear equations (a recurrence relation), expressed in terms of the value function vector, the weight incidence matrix, and a unit vector.
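A reconstruction of the formulas this slide refers to, following the formulation in Fosgerau et al. (2013) (notation assumed here): the Gumbel (logsum) property gives

\[
\mathbb{E}\Bigl[\max_{a \in A(k)} \bigl(v(a \mid k) + V(a) + \mu\,\varepsilon_a\bigr)\Bigr]
= \mu \ln \sum_{a \in A(k)} \exp\!\Bigl(\tfrac{v(a \mid k) + V(a)}{\mu}\Bigr)
= V(k)
\quad \text{(recurrence relation)},
\]

and substituting \(z_k = \exp(V(k)/\mu)\) and \(M_{ka} = \delta(a \mid k)\,\exp(v(a \mid k)/\mu)\) (the weight incidence matrix, with \(\delta(a \mid k) = 1\) if link \(a\) can follow link \(k\)) turns the recurrence into the linear system

\[
z = M z + b,
\]

where \(b\) is the unit vector with entry 1 at the destination.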

Page 13: Reinforcement Learning and Network Design

Let’s check the consistency!

The link choice probability is given by:

*like logit, by assuming a new deterministic utility (the link choice utility plus the downstream value function)

Then we have:

The path utility is the sum of the link utilities, and the denominator is the expected maximum over all possible paths.

⇒ Consistent with the logit model with the universal path set (and the path choice probabilities sum to 1).
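The check can be reconstructed as follows (same assumed notation): the link choice probability

\[
P(a \mid k) = \frac{\exp\bigl((v(a \mid k) + V(a))/\mu\bigr)}{\exp\bigl(V(k)/\mu\bigr)}
\]

makes the product telescope,

\[
\prod_{j=0}^{J-1} P(a_{j+1} \mid a_j)
= \frac{\exp\bigl(\sum_j v(a_{j+1} \mid a_j)/\mu\bigr)\,\exp\bigl(V(a_J)/\mu\bigr)}{\exp\bigl(V(a_0)/\mu\bigr)}
= \frac{\exp(V_r/\mu)}{\exp\bigl(V(a_0)/\mu\bigr)},
\]

since \(V(a_J) = 0\) at the destination. The path utility is the sum of link utilities, and \(\exp(V(a_0)/\mu) = \sum_{r' \in \Omega} \exp(V_{r'}/\mu)\) is the expected-maximum term over all possible paths, so the probabilities coincide with the universal-choice-set logit and sum to 1.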

Page 14: Reinforcement Learning and Network Design

What’s the point?

Now you can model path choice behavior without explicitly defining the choice set.

1. Decompose path choice into sequential link choices:

2. Describe the forward-looking behavioral mechanism by the value function:

Recursively computed

This (efficient) computational method of modeling is called the “Recursive Logit (RL) model”, named by Fosgerau et al. (2013).
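As an illustration only: a minimal Python sketch of the computation just described, assuming the linear-system formulation above, a single destination, and names invented here (not code from the presentation):

```python
import numpy as np

def recursive_logit(adj, util, dest, mu=1.0):
    """Minimal Recursive Logit sketch; notation and names are assumptions.

    adj  : (n, n) 0/1 adjacency of the link (state) graph; adj[k, a] = 1 if
           link a can be chosen after link k.
    util : (n, n) deterministic link choice utilities v(a | k) (negative costs).
    dest : index of the absorbing destination state, where V(dest) = 0.
    mu   : scale of the Gumbel error.

    Returns (V, P): value function per state and link choice probabilities P(a | k).
    """
    n = adj.shape[0]
    M = adj * np.exp(util / mu)              # weight incidence matrix
    M[dest, :] = 0.0                         # destination is absorbing
    b = np.zeros(n)
    b[dest] = 1.0                            # unit vector picking out the destination
    z = np.linalg.solve(np.eye(n) - M, b)    # z = Mz + b  <=>  (I - M) z = b
    V = mu * np.log(z)                       # since z_k = exp(V(k) / mu)
    # P(a | k) = exp((v(a|k) + V(a)) / mu) / exp(V(k) / mu) = M[k, a] * z[a] / z[k]
    # (states that cannot reach the destination have z = 0; ignored in this sketch)
    P = M * z[None, :] / z[:, None]
    return V, P
```

Solving one linear system per destination replaces any enumeration of paths, which is the computational point of the model.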

Page 15: Reinforcement Learning and Network Design

Markov Decision Process (MDP)

To generalize further, define:

• Action: choice behavior (what the agent does)
• State: situation (where the agent is) that changes as a result of the action

*In path choice (recursive) modeling, an action is directly the choice of the next state.

[Diagram: initial state, action, state transition; state transition probability and discount factor]
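One way to write this specialization (a reconstruction with symbols assumed here): an MDP is a tuple \((S, A, p, r, \beta)\) with states \(s \in S\), actions \(a \in A(s)\), state transition probability \(p(s' \mid s, a)\), reward \(r(s, a)\), an initial state, and discount factor \(\beta\). In (recursive) path choice modeling, the state is the current link, the action is the next link, the transition is deterministic, \(p(s' \mid s, a) = \mathbb{1}\{s' = a\}\) (the action is directly the choice of the next state), the reward is the link choice utility \(v(a \mid s)\), the initial state is a dummy origin link, and \(\beta = 1\).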

Page 16: Reinforcement Learning and Network Design

Reinforcement Learning approaches

• MDP: a general modeling framework of sequential decisions.
• Reinforcement Learning: learns the optimal action probability from a number of trials and their results.
• Recursive Logit (a Dynamic Discrete Choice Model): models the action probability under a RUM assumption; in RL terms it is a stochastic policy with action = state, Gumbel errors, and a Boltzmann policy (temperature = 1/scale).
• Inverse Reinforcement Learning: learns the utility (reward) function from the actions of experts, i.e., parameter estimation.

See also: Mai and Jaillet (2020)

Page 17: Reinforcement Learning and Network Design

Now we have the link transition probabilities.

Given the OD demand q, we compute the network traffic flows:

Definitions: link flow (on link a); transition flow (from link k to link a).

[Diagram: a transition from link k to link a]

Equations [1] and [2] reduce to a linear system that can be efficiently computed!
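A minimal sketch of this computation, assuming that [1] and [2] are the usual flow conservation and transition flow relations (an assumption; the exact equations were lost in extraction), and that P comes from a transition-probability computation such as the one sketched earlier:

```python
import numpy as np

def markovian_loading(P, q, origin):
    """Minimal Markovian network loading sketch; names are assumptions.

    P      : (n, n) link transition probabilities P(a | k) for one destination.
    q      : OD demand entering the network at the origin (dummy) state.
    origin : index of the origin (dummy) state.

    Returns (x, y): expected link flows x[a] and transition flows y[k, a].
    """
    n = P.shape[0]
    d = np.zeros(n)
    d[origin] = q                              # demand injected at the origin
    # assumed [1]: flow conservation  x = P^T x + d  <=>  (I - P^T) x = d
    x = np.linalg.solve(np.eye(n) - P.T, d)
    # assumed [2]: transition flow    y[k, a] = x[k] * P(a | k)
    y = x[:, None] * P
    return x, y
```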

Page 18: Reinforcement Learning and Network Design

Another dimension may be needed: e.g., a planner may expect changes in visitors’ time-use in a city center.

A time-structured network allows for integrated modeling of route, activity place, and duration.

[Figure: (a) space (x, y) and time (t) axes; (b) a time-structured network over a grid of nodes 1-25 and time steps t = 1,...,5 (legend: node, link); (c) an example path: move from 8 to 9, move from 9 to 14, stay at 14, move from 14 to 13, move from 13 to 8]

Network traffic: the number of people who visited space a, and the total time spent at space a (which represents multiple activities).

Page 19: Reinforcement Learning and Network Design

Calculate indicators based on traffic

Examples:
• total travel time experienced [min.]
• total revenue the manager gains [JPY]
• consumer surplus (welfare)
• total CO2 emission [g CO2]

Remark (again): the choice of objective reflects what you (the planner) want to achieve through the project.

Is minimizing negative indicators enough? What do you think a better/ideal city is?
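For concreteness, such indicators are typically linear in the link (or state) flows; a sketch with assumed symbols (\(x_a\) flow, \(t_a\) travel time, \(\tau_a\) toll, \(e_a\) emission factor):

\[
\text{Total travel time} = \sum_a t_a x_a \ [\text{min}], \qquad
\text{Total revenue} = \sum_a \tau_a x_a \ [\text{JPY}], \qquad
\text{Total CO}_2 = \sum_a e_a x_a \ [\text{g CO}_2],
\]

while consumer surplus (welfare) can be measured by the expected maximum utility, e.g., the logsum value function at the origin multiplied by the demand.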

Page 20: Reinforcement Learning and Network Design

A public project entails trade-offs among goals.

A road closure may increase travel time of the network.

But the space can be utilized as a park that is good for activities, health and environment.

Of course, it requires a large capital cost, and the budget is limited.

[Photo: Barcelona superblock. Credit: @Bcomu Global]

Is a weighted sum enough?

• Often, there is a clear trade-off between two objectives.
• Weight selection may lead to a biased policy decision.

Page 21: Reinforcement Learning and Network Design

Multi-objective design

Social welfare (to be maximized)

Capital cost (to be minimized)

Ideal but never achieved

Reality

Another option

Dominated solutions

Pareto frontier (the set of non-dominated solutions)

Budget constraint
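As an illustration of how the Pareto frontier in this figure could be extracted from a set of evaluated candidate designs (two objectives: social welfare to maximize, capital cost to minimize; a sketch with invented names):

```python
def pareto_frontier(solutions):
    """Return the non-dominated (welfare, cost) pairs; a minimal sketch.

    solutions : list of (welfare, cost) tuples, where welfare is to be
                maximized and capital cost is to be minimized.

    A solution is dominated if some other solution is at least as good in
    both objectives and strictly better in at least one of them.
    """
    frontier = []
    for i, (w_i, c_i) in enumerate(solutions):
        dominated = any(
            w_j >= w_i and c_j <= c_i and (w_j > w_i or c_j < c_i)
            for j, (w_j, c_j) in enumerate(solutions)
            if j != i
        )
        if not dominated:
            frontier.append((w_i, c_i))
    return frontier

# A budget constraint simply filters the candidates before the check,
# e.g., [s for s in solutions if s[1] <= budget].
```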

Page 22: Reinforcement Learning and Network Design

Case study | A pedestrian activity network design

[Map: a pedestrian activity network in the city center, with nodes 1-21 and start/end nodes 1o-4o; landmarks include department stores, the Gintengai and Okaido malls, the city hall, and a park; scale 100 m, north up]

Legend: node for move/stay; node for move only; o: start/end node

City center of Matsuyama city

• Design: expansion of walking space on each street [m.]

• Expectation: resistance decreases, and more places are visited

Page 23: Reinforcement Learning and Network Design

Case study | A pedestrian activity network design

[Figure: accepted and rejected design solutions for the two goals, plotted against the total area of widened sidewalk [m2]. (a) Goal I: sojourn time maximization, horizontal axis: average sojourn time [*60 s], with solution A at (82.60, 6000). (b) Goal II: expected utility maximization, horizontal axis: total expected utility, with solution B at (105.30, 6000).]

• Clear trade-offs between the goals and the budget are observed.
• The Pareto frontier offers a variety of policies based on the investment level.

Page 24: Reinforcement Learning and Network Design

Case study | A pedestrian activity network design

[Figure: (1) the original network and the design solutions for (2) Goal I: sojourn time (Solution A) and (3) Goal II: expected utility (Solution B). Upper-level problem: the network solution, i.e., the increased sidewalk width [m] on each street. Lower-level problem: the resulting link flows (legend: 100, 250, 500, 1000) and the activity duration at the staying nodes [*60 sec./person], with differences from the original network highlighted (diff > +10 / < -10 and diff > +0.5 / < -0.5).]

Page 25: Reinforcement Learning and Network Design

Summary & Remarks

• Reinforcement Learning is a general framework for modeling sequential decisions in networks.
  • You can model any “state-action network”; “state = action = space” is just an example.
• Network design is a mathematical problem of planning based on behavior in a network.
  • Be thoughtful when you set an objective.
  • Multi-objective design may fit public projects.

Page 27: Reinforcement Learning and Network Design

References

• Magnanti, T.L., Wong, R.T., 1984. Network design and transportation planning: models and algorithms. Transportation Science 18(1), 1–55.
• Farahani, R.Z., Miandoabchi, E., Szeto, W.Y., Rashidi, H., 2013. A review of urban transportation network design problems. European Journal of Operational Research 229(2), 281–302.
• Fosgerau, M., Frejinger, E., Karlstrom, A., 2013. A link based network route choice model with unrestricted choice set. Transportation Research Part B: Methodological 56, 70–80.
• Mai, T., Jaillet, P., 2020. A Relation Analysis of Markov Decision Process Frameworks. arXiv:2008.07820.
• Oyama, Y., 2017. A Markovian route choice analysis for trajectory-based urban planning. PhD thesis, The University of Tokyo.
• Oyama, Y., Hato, E., 2019. Prism-based path set restriction for solving Markovian traffic assignment problem. Transportation Research Part B: Methodological 122, 528–546.
• Oyama, Y., Hato, E., 2017. Pedestrian activity network design based on multi-objective optimization. Journal of the City Planning Institute of Japan 52(3), 810–817. (in Japanese)

Page 28: Reinforcement Learning and Network Design

Appendix | Design levels and examples

Figure 1 in Farahani et al. (2013)

Page 29: Reinforcement Learning and Network Design

Appendix | Solution algorithms (metaheuristics)

Figure 2 in Farahani et al. (2013)

SA: Simulated Annealing; GA: Genetic Algorithm; TS: Tabu Search; AC: Ant Colony; PSO: Particle Swarm Optimization; SS: Scatter Search; H: Hybrid metaheuristics