Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Reinforcement Learningand Network Design
Yuki OyamaShibaura Institute of Technology
Activity Landscape Design Lab.September 17, 2021
B e h a v i o r M o d e l i n g i n T r a n s p o r t a t i o n N e t w o r k sL e c t u r e S e r i e s # 3 - 1 ( 1 6 : 0 0 - 1 6 : 3 0 )
A road network example
The planner who aims to maximize efficiency wants to answer:• if a new road should be constructed• where and how large parking spaces should be placed• on which road and how much tolls should be charged• etc.
topology
capacity
control
These decisions will impact on travelers’ behavior
Let’s generalize the frameworkAn example of pricing (on which road a toll is installed)
Planner (Supply)
Users (Demand)
Objective functionDesign variables
Behavior distribution Network traffic
New toll installedon road A
Drivers changeroute choicesto avoid road A
Alternative routeof using road Ais congested
Wants to minimize total travel time
Let’s generalize the frameworkAn example of pricing (on which road a toll is installed)
Planner (Supply)
Users (Demand)
Objective functionDesign variables
Behavior distribution Network traffic
New toll installedon road A
Drivers changeroute choicesto avoid road A
Alternative routeof using road Ais congested
Total travel timein the network increased as a result
Network designis a demand-based planning of network topology & systems
Planner (Supply)
Users (Demand)
the planner wants to maximizeObjective function
the planner decides under some constraints
Design variables
travelers decide/changein reaction to the design of network
Behavior distributionthat appears as a collective state
Network traffic
[network topology & conditions]
[aggregation]
[indicators]
[review]What you (the planner) want to achieve by the project
How to predict the impact on behavior in a network?
Follows Magnanti and Wong (1984); Farahani et al. (2013)
Modeling behavior in a network
OD
Path choice model (logit) Traffic flow on paths(in the static case)
: choice set (set of paths) Not as easy as it looks…
Networks are generally complex…
O
D
O
D
O
D
O
D
The path set is almost impossible to define !!
* A description of more complex choices (e.g., time) needs additionaldimensions of network, which further increases the network size.
This is because a path is a combination of links in the network
Not combination but SEQUENCE
Sequence of choices
a
b
c
1
2
#
@
%
Choice
A
B
C
…
An approach is modeling based on Reinforcement Learning (RL)
This presentation shows a special case of RL for network path choice modeling
that models sequential decisions of agents.
How to model a sequence ?
⇒ what is link choice probability exactly?
A path r can be described as:
a sequence of links
Path choice probability:
: Link choice probability conditional on the previous link
What should be considered is ...
the outcome given by the product of link choice probabilities to be consistent with the original model, i.e.,
This is achieved by considering forward-looking mechanism
*when assuming logit model
Value function
1. Myopic2. Forward-looking
D
D
: Link choice utility: Value function
Goal is modeling mechanisms of behavior
c.f. Shortest Path (SP) problem:
Value function is the SP cost from aj to destination
Random utility
Generalization
Gumbel distribution has a nice property:
Value function is the solution to:
a system of linear equations.
Value function Weight incidence matrix Unit vector
(Recurrence relation)
Let’s check the consistency!
Link choice probability is given by:
*like logit by assuming
New deterministic utility
Then we have:
Path utility issum of link utilities
Exp. Max. of all possible paths
⇒ Consistent with logit model with the universal path set
= 1
What’s the point ?
Now you can model path choice behavior without explicitly defining choice set
1. Decompose path choice into sequential link choices:
2. Describe forward-looking behavioral mechanism by value function:
Recursively computed
This (efficient) computational method of modeling is called:
“Recursive Logit (RL) model”Named by Fosgerau et al. (2013)
Markov Decision Process (MDP)
To more generalize, define
• Action: choice behavior (what agent does)• State: situation (where agent is) that changes as result of action
*In path choice (recursive) modeling: Action is directly choice of State
Action
Statetransition
Initialstate
State transition probability
Discount factor
Reinforcement Learning approaches
MDP Reinforcement Learning
Recursive LogitInverse Reinforcement
Learning
Learn optimal action probability by a number of trials & results
General modeling framework of sequential decisions
Modeling action probability under RUM assumption Learn utility (reward) function
from actions of expertsParameter estimation
Dynamic Discrete Choice Model
Stochastic
Action = StateGumbel
Boltzmann policy(temp. = 1/scale)
See also: Mai and Jaillet (2020)
Now, we have link transition probabilities
Given OD demand q, we compute network traffic flows
: link flow (on a)
: transition flow (from link k to link a)
k
a
[1]
[2]
[1] and [2] reduces to:
Can be efficiently computed!
Another dimension may be needede.g., a planner may expect changes of visitors’ time-use in a city center
allows for integrated modeling of route, activity place and duration.
y
Space
x
tTime
yt
x
: Node
(a) (b) (c)
: Link
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
t=1
t=2
t=3
t=4
t=5
move from 8 to 9
from 9 to 14
at 14
from 13 to 8
move
from 14 to 13
move
move
stay
Time-structured network
A path
Network traffic:: no. people who visited space a
: total time spent at space a
represents multiple activities
Calculate indicators based on traffic
: total travel time experienced [min.]
Examples:
: total revenue the manager gains [JPY]
: consumer surplus (welfare)
Remark (again):
The choice of objective reflectswhat you (the planner) want to achieve through the project
Minimizing negative indicators is enough? What is a better/ideal city you think?
: total CO2 emission [g CO2]
A public project entails trade-offs of goals
A road closure may increase travel time of the network.
But the space can be utilized as a park that is good for activities, health and environment.
Of course, it requires a large capital cost, and the budget is limited.
Barcelona superblock@Bcomu Global
Weighted sum is enough ?
• Often, there is a clear trade-off between two objectives• Weight selection may lead to a biased policy decision
Multi-objective design
Social welfare(to be maximized)
Capital cost (to be minimized)
Ideal but never achieved
Reality
Another option
dominated solutions
Pareto frontier(set of non-dominated
solutions)
Budget constraint
Case study | A pedestrian activity network design
Department store Gintengai mall
Okaido mallCity hall
Department store
Park
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
1o
3o
4o
2o
n : Node for move/stayn : Node for only moveo : Start/End node
100mN
418
19-20 1
City center of Matsuyama city
• Design: expansion of walking space on each street [m.]
• Expectation: resistance decreases, and more places are visited
Case study | A pedestrian activity network design
78.0 79.0 80.0 81.0 82.0 83.0 84.0
010
000
2000
030
000
4000
050
000
Average sojourn time
AcceptedRejected
Tota
l are
a of w
iden
ed si
dewa
lk
[*60s]
[m2 ]
(b)(a)
104 105 106 107 108 10910
000
2000
030
000
4000
050
000
Tota
l are
a of w
iden
ed si
dewa
lk[m
2 ]
AcceptedRejected
Total expected utility0
A(82.60, 6000) (105.30, 6000)
B
Goal I:Sojourn time maximization
• Clear trade-offs between goals and budget are observed.• Pareto frontier offers a variety of policies based on the investment level
Goal II:Expected utility maximization
Case study | A pedestrian activity network design
+4
+4
+4
+x
+6
+3 +4+1
+1+3
+1+4 +3
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
5
0
15
20
10
1 4 7 13 15 17 18 19 20 21
: 100Link !ow
Network (solution)
[Upper level problem]
[Lower level problem]
: 250: 500: 1000
: di!. > +10: di!. < -10
: di!. > +0.5
: di!. < -0.5
: increased width [m]
[*60sec./person]
(2) Solution A Solution B(3)Staying node number
Activity duration
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
Original network(1)1 4 7 13 15 17 18 19 20 21 1 4 7 13 15 17 18 19 20 21
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
1 2 3 4
5 6 7
8 9
10 11 12 13
14 15 16 17
18 19 20 21
Goal I:Sojourn time
Goal II:Expected utility
• Reinforcement Learning is a general framework of modeling sequential decisions in networks.• You can model any “state-action network”• “State = action = space” is just an example
• Network design is a mathematical problem of behavior (in a network) based planning• Be thoughtful when you set an objective• Multi-objective design may fit in public projects
Summary & Remarks
References• Magnanti, T.L., Wong, R.T., 1984. Network design and transportation planning:
models and algorithms. Transportation Science 18 (1), 1–55.• Farahani, R. Z., Miandoabchi, E., Szeto, W. Y., & Rashidi, H. 2013. A review of
urban transportation network design problems. European Journal of Operational Research, 229(2), 281-302.
• Fosgerau, M., Frejinger, E., Karlstrom, A., 2013. A link based network route choice model with unrestricted choice set. Transportation Research Part B: Methodological 56, 70–80.
• Mai, T., & Jaillet, P., 2020. A Relation Analysis of Markov Decision Process Frameworks. arXiv:2008.07820.
• Oyama, Y., 2017. A Markovian route choice analysis for trajectory-based urban planning. PhD thesis, The University of Tokyo.
• Oyama, Y., Hato, E., 2019. Prism-based path set restriction for solving Markovian traffic assignment problem. Transportation Research Part B: Methodological 122, 528–546.
• ⼤⼭雄⼰, ⽻藤英⼆, 2017. 多⽬的最適化に基づく歩⾏者の活動ネットワークデザイン. 都市計画論⽂集 52(3): 810-817.
Appendix | Design levels and examples
Figure 1 in Farahani et al. (2013)
Appendix | Solution algorithms (metaheuristics)
Figure2 in Farahani et al. (2013)
SA: Simulated Annealing; GA: Genetic Algorithm; TS: Tabu Search; AC: Ant Colony; PSO: Particle Swarm Optimization; SS: Scatter Search; H: Hybrid metaheuristics