Abstractions for devising
compact controllers for
MDPs
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Kolman Vornovitsky
Submitted to the Senate of
the Technion — Israel Institute of Technology
Av 5770 Haifa July 2011
The research thesis was carried out under the supervision of Associate Professor Carmel Domshlak, Faculty of Industrial Engineering and Management, Technion.
The generous financial support of the Technion is gratefully acknowledged.
Contents
Abstract
Abbreviations and Notations
1 Introduction
2 Background
2.1 Value iteration algorithm
2.2 (Modified) Policy iteration algorithm
2.3 Planning for Factored MDPs
2.4 Deterministic planning
3 Merge-and-Shrink Compression of MDPs
3.1 Stochastic transition graphs
3.2 Abstractions
3.3 Generic abstraction algorithm
3.3.1 Using abstraction as a controller for MDP
3.4 Merge and shrink strategies
3.4.1 Shrink strategies
3.4.2 Merge strategies
3.5 Edge abstraction
4 Experiments
4.1 Domains
4.1.1 Blocksworld
4.1.2 Exploding blocksworld
4.1.3 Boxworld
4.1.4 Triangle tireworld
4.1.5 Rectangle tireworld
4.1.6 Schedule
4.1.7 Search and rescue
4.1.8 Sysadmin
4.2 Setup and environment
4.3 Results
5 Summary and Future work
Bibliography
Abstract in Hebrew
List of Figures
2.1 All instantiations of action AC in the tireworld example. The labels on the transitions are the effect (partial variable assignment), its probability, and its reward, in that order.
2.2 Simulation of the value iteration algorithm on the running Tireworld example. V0 is a set of arbitrary initial values (not necessarily 0). Vi and πi are the iterating value and policy. V5 = V6 means the algorithm has converged and that V6 = V ∗.
3.1 Transition graph of the tireworld problem. State 〈A, 1, 0〉 means location = A, has spare = 1 and flat tire = 0. 〈A, 0, 0〉 is the initial state of the problem. An edge label AC1, 1, 0.6 refers to effect 1 of action AC with reward 1 and probability 0.6.
3.2 Abstraction of transition graph from left to right where s(s1) = s(s2) = s(s3) and s(t1) = s(t2). Note that action a becomes two different actions as1 and as3. Rewards and probabilities are preserved.
3.3 Atomic projection of the tireworld problem on variable location (πlocation) on the left side and variable flat tire (πflat tire) on the right side.
3.4 Synchronized product of two atomic projections of the tireworld problem πlocation ⊗ πflat tire.
3.5 Abstraction of TA = πlocation ⊗ πflat tire, first case.
3.6 Abstraction of TA = πlocation ⊗ πflat tire, second case.
3.7 Final abstraction of full tireworld problem, first case.
3.8 Final abstraction of full tireworld problem, second case.
3.9 Example of dominated actions. B = {b, c} dominates a.
3.10 Merge-and-shrink algorithm extended with dominated actions elimination.
4.1 Results for the blocksworld domain. *Average was taken on 4/8 planners.
4.2 Results for the exploding blocksworld domain. *Average was taken on 5/8 planners.
4.3 Results for the boxworld domain. *Average was taken on 2/8 planners.
4.4 Results for the joint Tireworld domains as presented and tested in the competition. Instances 1 to 10 are Triangular and 11 to 15 are Rectangular. *Average was taken on 4/8 planners.
4.5 Results for the schedule domain. *Average was taken on 4/8 planners.
4.6 Results for the search and rescue domain. *Average was taken on 2/8 planners.
Abstract
The ability to plan a course of action is crucial for intelligent systems. It
involves the representation of actions and world models, reasoning about
the effects of actions, and techniques for efficiently searching the space of
possible plans. Planning under uncertainty is captured by decision-theoretic
planning (DTP), where the actions have stochastic effects, and the goal is to
devise a policy of acting with a high expected utility. This is in contrast to
deterministic planning, where a sequence of actions that translates an agent
from an initial state to a goal state is sought. The classic mathematical
framework for decision-theoretic planning is that of the Markov decision
process (MDP).
Research on AI planning, reasoning under uncertainty, and decision anal-
ysis and operations research has given rise to the interesting insight that real-world
DTP problems typically exhibit considerable structure. One of the most
popular structures is exploited by the use of variable-based representations
to describe problems, as is common practice in planning. Such variable
based representations allow for compact description of the huge state space,
but cast doubt upon the viability of most standard solutions for MDPs, i.e.,
algorithms which compute the optimal policy assuming an explicit state
space.
Over the last two decades, some works have presented solutions for MDPs
with variable based representation. Factored MDPs (Boutilier et al. [10])
use implicit variable based state space and a dynamic Bayesian network
(DBN) as a compact representation of the transition model. Most works
have approximated the solution using an approximate value function with
compact representation given by a linear combination of possibly non-linear
basis functions (Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis et al.
1997 [35]). Guestrin et al. 2003 [16] proposed a similar solution, built on
the idea of Koller & Parr (1999, 2000) [23, 24] and using factored (linear)
functions, where each basis function is restricted to some small subset of
the state variables. Dean & Givan 1997 [13] offered a somewhat different
approach in which the model is compressed much the same way that a finite
state machine is reduced to its equivalent minimal finite state machine.
While previous works on factored MDPs approximate the value function
of the original MDP, in this work we explore a different approach of calcu-
lating the exact value function of an approximated MDP. We exploit and
extend a technique known as over-approximating abstractions to approxi-
mately solve the exponential state space MDP problem.
An abstraction can, in general, be seen as a mapping that reduces the
size of the state space by compacting several states into one. If the ab-
stract state space is made small enough, the standard solutions for explicit
state space become feasible for it as well. Over-approximating abstractions,
adapted to the context of deterministic planning by Helmert, Haslum &
Hoffmann [17], are based on the merge-and-shrink methodology introduced
by Drager, Finkbeiner & Podelski [20]. We depart from these methodologies,
adapting the merge-and-shrink abstraction technique to devise compact con-
trollers for MDPs. We introduce the notion of action abstractions to extend
the merge-and-shrink abstraction technique for MDPs and for deterministic
planning. Finally, we provide a clear testbed evaluation for our methods
and compare them to other state-of-the-art approaches.
Abbreviations and Notations
V = {v1, ..., vn} — Set of variables
Dv — Finite domain of variable v
S — Set of all states
A — Set of actions, where a ∈ A is a pair 〈pre, E〉
pre — Partial assignment over V (an action's precondition)
E — Effects of an action; a set of partial assignments over V
fs,a : E → S — Transition function; fs,a(e) = s′ calculates the destination state s′ of effect e of action a at state s
Pa — Probability of effect e, denoted by Pa(e)
Ra — Reward of effect e, denoted by Ra(e)
Π = 〈V,A, s0,P,R〉 — PSAS+ MDP problem
P — Probability distribution for all actions; P(e) = Pa(e) where e ∈ E and a = 〈pre, E〉 ∈ A
R — Reward function for all actions; R(e) = Ra(e) where e ∈ E and a = 〈pre, E〉 ∈ A
π : S → A — Policy for a PSAS+ MDP problem; π(s) = a is the action the agent should choose when at state s
V π(s) — Expected value at state s using policy π
V ∗(s) — Optimal expected value at state s
T = 〈S,L,A, s0, S∗〉 — Transition graph
T = 〈T,R, P 〉 — Stochastic transition graph with rewards
T (Π) — Stochastic transition graph with rewards for a PSAS+ MDP problem
α : S → S′ — Abstraction function
TA — Abstract stochastic transition graph with rewards
N — Limit on number of states
M — Limit on number of edges
Chapter 1
Introduction
The ability to plan a course of action is crucial for intelligent systems, in-
creasing their autonomy and flexibility through the construction of sequences
of actions to achieve their goals. Planning has been studied in the context
of artificial intelligence for over three decades [27]. Planning techniques
have been applied to a variety of tasks, including robotics, process plan-
ning, Web-based information gathering, autonomous agents, and spacecraft
mission control. Planning involves representation of actions and world mod-
els, reasoning about actions’ effects, and devising techniques for efficiently
searching the space of possible plans [1, 36].
Deterministic planning focuses on translating the agent from one state to
another while assuming only one outcome for each action [1, 36]. A solution
for deterministic planning is a sequence of applicable actions from the initial
state to a goal state. An optimal solution minimizes the number of actions
needed to achieve the goal. A more general formalization of planning, which
includes planning under uncertainty, is decision-theoretic planning (DTP)
[7]. The aim of DTP is to form courses of action (plans or policies) that
have high expected utility rather than plans that are guaranteed to achieve
certain goals with a minimal number of actions. Most sequential decision
problems can be captured semantically by the Markov decision process
(MDP) model [2, 19, 4, 25].
The classic approaches to solving MDPs are well-known dynamic pro-
gramming algorithms such as value iteration [2] and policy iteration [19].
Those algorithms compute the optimal decision an agent has to make for
every state to obtain the highest expected value possible. Many other robust
methods for optimal policy construction have been developed in the opera-
tions research (OR) community, including modified policy iteration [26] and
asynchronous versions of the well-known value and policy iteration algo-
rithms [4, 6]. Common to all these methods is that they require explicit
enumeration of the underlying state space of the MDP.
Using MDPs as a model for solving planning problems has illuminated
a number of interesting connections between techniques for solving decision
problems. Those techniques came from AI planning, reasoning under un-
certainty, decision analysis and OR. One of the most interesting insights
emerging from this body of work is that real-world DTP problems typically
exhibit considerable structure, and thus can be solved using special-purpose
methods that recognize and exploit that structure. Variable based problem
representation is one of the most popular of these structures, and its use
is common practice in planning. While variable based representation high-
lights the problem’s special structure and allows it to be exploited computa-
tionally, it casts doubt on the viability of standard solutions for MDPs. The
standard solutions usually assume explicit state space, whereas for variable-
based problem representations the state space becomes exponential in the
number of variables.
Over the last two decades, some works have presented solutions for MDPs
with variable based representation. Factored MDPs (Boutilier et al. [10])
use implicit, variable-based state space and a dynamic Bayesian network
(DBN) that allow a compact representation of the transition model. Some
works have approximated the solution by approximating the value function
using a linear combination of potentially non-linear basis functions. This
technique was used by Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis
et al. 1997 [35]. Guestrin et al. 2003 [16] used basis functions, with each
such function taking a small subset of the state variables as parameters.
Dean & Givan 1997 [13] proposed a somewhat different approach that is
based on minimizing the model in the same way that a finite state machine
is reduced to its equivalent minimal finite state machine. Their algorithm
takes as input an implicit MDP model in factored form and tries to produce
an explicit, reduced model whose size is within a polynomial factor of the
size of the factored representation. The algorithm cannot guarantee that
the model’s size will always be reduced as required.
This work proposes a different approach for handling the exponential
state space. Instead of approximating the value function on the original,
large MDP as previous works do, we propose calculating the exact value
function of an approximated MDP, which we construct using all the state
variables of the large original MDP. The state space of the resulting ap-
proximated MDP is explicitly upper-bounded by a predefined parameter, in
order to enable the standard MDP algorithms to solve it optimally.
This idea was inspired by some recent advances in heuristic search for
deterministic planning. The standard, generally viable approach to solv-
ing deterministic planning problems is search in one form or another, with
heuristics being the most important general method for improving search ef-
ficiency. Heuristics are functions that approximately estimate the distance
to a goal state in the search space. These functions help in navigating
the search process. One method for devising a good heuristic is over-
approximating abstractions [21, 12, 14]. Very roughly, an abstraction is a
mapping that reduces the size of the state space by contracting several states
into one. By making the abstract space small enough, it becomes feasible to
perform various reachability-analysis tasks on it by using explicit methods
such as breadth-first search or Dijkstra’s algorithm. Those analyses on the
abstractions allow building viable heuristics for the full exponential state
space.
While abstractions in deterministic planning are used mainly (if not ex-
clusively) for deriving informative admissible estimates of the distances from
a state to the goal, our goal is to use abstractions to “compress” the state
space while preserving as much as possible those original problem proper-
ties that have the greatest influence on the solution. The compression is
parametrized by the resource limitations. At the extremes, this abstraction
will correspond to abstracting all the states to one state, and to abstracting
nothing at all. The interesting cases are, of course, in the middle, when
we do have some realistically small but still non-negligible memory, and the
task is to use it the best way possible.
On the technical side, we generalize the merge-and-shrink abstraction
technique introduced by Drager, Finkbeiner & Podelski [20] in the con-
text of verification of systems of concurrent automata, and further extended
and adapted to the context of deterministic planning by Helmert, Haslum
& Hoffmann [17]. The computational feasibility of this approach rests on
interleaving the composition of various system properties (a.k.a. state vari-
ables) with abstraction of the intermediate composites. As Helmert et al.
[17] show, it allows very accurate heuristics to be obtained from relatively
compact abstractions. The greater flexibility offered by not restricting ab-
stractions solely to projections on system properties is, however, a mixed
blessing. The already-hard problem of selecting a good abstraction from the vast number of possible ones becomes even harder.
The contributions of our work are as follows. First, we provide rigorous
semantics for the merge and shrink operators on structured MDPs. Second,
we suggest effective and semantically justifiable strategies for both state
contraction (shrink) and state-space refinement (merge). We analyze the
relative attractiveness of the proposed strategies on different problems.
Sometimes abstracting the state space is not enough. In deterministic
planning, the number of possible effects is S · A, where S is the number of
states and A is the number of different actions. In DTP, each action may
have a number of outcomes, and thus the number of effects is even larger.
In addition to adopting state abstraction, we have extended the merge-and-
shrink technique both for MDPs and for deterministic planning with action
abstraction techniques. Those techniques allow us to cope more efficiently
with resource limitations by merging or even removing some effects from the
abstract model, sometimes without loss of any viable information.
Finally, we provide a clear testbed evaluation for our methods and com-
pare them to other state-of-the-art approaches. This empirical study of the
effectiveness of (approximately) solving structured MDPs is the main focus
of our work. In order to evaluate our algorithm, we used DTP tasks from
the fully observable probabilistic track of the 2008 international planning
competition [11]. Six domains were tested against eight algorithms from the
same competition. All the domains are MDPs, but some exhibit planning-
like goals and structure. In three of them, our approach exhibited better
performance than other state-of-the-art algorithms. Two planning-like do-
mains were very problematic for our approach. The other domains had
fair results. Overall, our approach appears comparable to state-of-the-art
algorithms.
Chapter 2
Background
A planning problem is usually given by a description of states and condi-
tioned transitions of some system. An initial state and set of goal states
are usually given. Each state has its own feasible set of actions. An action
can be applied only if it is in the feasible set of the system’s current state.
Those actions translate the system from one state to another. A solution to
a planning problem is a sequence of actions that translates the system from
the initial state into one of the goal states.
This work focuses on decision-theoretic planning (DTP). The goal of
DTP is to form courses of action (plans or policies) in stochastic environ-
ments that have high expected utility rather than plans that translate the
system from its initial state to the goal state. Most sequential decision prob-
lems with full observability can be viewed as instances of Markov decision
processes (MDPs).
Definition 1 State space
1. V = {v1, ..., vn} is a set of state variables. Dv is a finite domain of
a variable v.
2. A partial variable assignment over V is a function s on a subset
of V such that s(v) ∈ Dv wherever s(v) is defined.
3. If s(v) is defined for all v ∈ V, s is called a state or full variable
assignment. The set of all states is denoted by S.
4. We say that a partial variable assignment p1 agrees with another par-
tial variable assignment p2 iff for every variable v such that p1(v) and
p2(v) are defined, p1(v) = p2(v).
An action is instigated by an agent in order to change the system’s state.
We assume that the agent has control over what actions are taken and when,
though the effects of taking an action might not be perfectly predictable.
We also assume that not all actions can be applied to every state.
Definition 2 Actions and transition function
A is a set of actions, where an action a is a tuple 〈pre, E,P〉
1. pre is a partial variable assignment called a precondition.
2. E is a set of partial variable assignments called effects.
3. P : E → [0, 1] is a probability distribution over E.
4. a is said to be applicable in state s if pre agrees with s. By As we
denote the set of actions applicable in state s.
Let there be a state s such that an action a = 〈pre, E,P〉 is applicable at s. Then fs,a : E → S is a transition function such that fs,a(e) = s′, where s′ agrees with e, and for every variable v on which e is not defined, s′(v) = s(v).
An agent can choose an applicable action at the current state, (possibly)
changing it to some other state. As mentioned, the likelihood of this change
is given by a stochastic transition function. This function uses the Markov
assumption, which says that knowledge of the present state renders infor-
mation about past states and agent choices irrelevant. Thus, the stochastic
transition function only depends on the state and the action chosen by the
agent and not on the previous states or choices.
In theory, the number of effects for each action is limited by |S| but in
practice we assume that the number of effects per action is limited by some
constant. Thus we define the stochastic transition function over the effects
of an action and not the destination states.
Rewards are used to evaluate agent performance. Generally the rewards
are defined over the entire state space but we will assume rewards which
are state independent. The reward function R associates a reward with the
outcome of performing an action a (which is one of a’s effects).
Definition 3 PSAS+ Markov decision process.
A PSAS+ Markov decision process or PSAS+ MDP for short is a
tuple Π = 〈V,A, s0,R〉 with the following components:
1. V is a set of state variables.
2. A is a set of actions
3. s0 ∈ S is an initial state.
4. R is a real-valued function from the union of all actions’ effects.
Example 1 Let us consider the following problem (based on the IPC 2008
tireworld domain, from the planning-under-uncertainty category [11]). You
have a car and you need to get from location A to C, which can be done
directly or through another place called B. Each time you travel you have a
chance of getting a flat tire. You may also carry a spare tire, which you
can use to replace your flat tire, and you can pick up spare tires at place B.
You have to get to C without any flat tires. Getting to C from A will give
you 1 reward point, and getting to C from B will give you 1.5 points. You
cannot travel if you have a flat tire.
The appropriate PSAS+ MDP Π = 〈V,A, s0,R〉 is as follows:
1. V = {location, has spare, flat tire} where the last two are binary
variables and the location is one of {A,B,C}. We will use the (ℓ, s, f)
notation, where f, s ∈ {0, 1, ?} and ℓ ∈ {A,B,C, ?}, to specify states
and partial variable assignments of the state space. For example, the
state (C, 0, 1) means you are at location C, you have no spare, and
you do have a flat tire. (C, ?, ?) is a partial assignment meaning that
the variable location is set to C and the others are not defined.
2. A = {AB,AC,BC,LT,CT} is the set of actions. AB, AC and BC
denote three travel actions, LT denotes the action of loading the tire
into the car, and CT denotes the action of changing a flat tire. Each
action a is a tuple 〈pre, E, P〉:
AB = 〈(A, ?, 0), {(B, ?, 1), (B, ?, 0)}, {P(B, ?, 1) = 0.4, P(B, ?, 0) = 0.6}〉
AC = 〈(A, ?, 0), {(C, ?, 1), (C, ?, 0)}, {P(C, ?, 1) = 0.4, P(C, ?, 0) = 0.6}〉
BC = 〈(B, ?, 0), {(C, ?, 1), (C, ?, 0)}, {P(C, ?, 1) = 0.4, P(C, ?, 0) = 0.6}〉
LT = 〈(B, 0, ?), {(B, 1, ?)}, {P(B, 1, ?) = 1}〉
CT = 〈(?, 1, 1), {(?, 0, 0)}, {P(?, 0, 0) = 1}〉.
3. s0 = 〈A, 0, 0〉.
4. RAB(B, ?, 1) = 0 and RAB(B, ?, 0) = 0
RAC(C, ?, 1) = 0 and RAC(C, ?, 0) = 1
RBC(C, ?, 1) = 0 and RBC(C, ?, 0) = 1.5
RLT (B, 1, ?) = 0
RCT (?, 0, 0) = 0.
Let us consider the action AC. The precondition of action AC is a
partial variable assignment (A, ?, 0), meaning that this action is applicable
in two states, (A, 0, 0) and (A, 1, 0). There are two effects for this action,
(C, ?, 1) and (C, ?, 0), meaning that its outcome has two possible destination
states. Using the transition function we see that
f(A,0,0),AC(C, ?, 1) = (C, 0, 1) and
f(A,0,0),AC(C, ?, 0) = (C, 0, 0).
The chance of the (C, ?, 1) transition/outcome is 0.4 and the reward gained
is RAC(C, ?, 1) = 0.
See figure 2.1 for an illustration of all possible outcomes of the AC action.
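To make the formalism concrete, here is a small Python sketch (illustrative only, not part of the thesis; all names are ours) encoding the tireworld PSAS+ MDP of Example 1 together with the transition function fs,a:

```python
# Illustrative encoding of the tireworld PSAS+ MDP (Example 1).
# States are triples (location, has_spare, flat_tire); "?" marks an
# undefined variable in partial assignments.

# Each action: (precondition, {effect: probability}, {effect: reward})
ACTIONS = {
    "AB": (("A", "?", 0), {("B", "?", 1): 0.4, ("B", "?", 0): 0.6},
                          {("B", "?", 1): 0.0, ("B", "?", 0): 0.0}),
    "AC": (("A", "?", 0), {("C", "?", 1): 0.4, ("C", "?", 0): 0.6},
                          {("C", "?", 1): 0.0, ("C", "?", 0): 1.0}),
    "BC": (("B", "?", 0), {("C", "?", 1): 0.4, ("C", "?", 0): 0.6},
                          {("C", "?", 1): 0.0, ("C", "?", 0): 1.5}),
    "LT": (("B", 0, "?"), {("B", 1, "?"): 1.0}, {("B", 1, "?"): 0.0}),
    "CT": (("?", 1, 1),   {("?", 0, 0): 1.0},  {("?", 0, 0): 0.0}),
}

def agrees(partial, state):
    """A partial assignment agrees with a state if all defined values match."""
    return all(p == "?" or p == s for p, s in zip(partial, state))

def applicable(state):
    """The set A_s of actions applicable in state s."""
    return [a for a, (pre, _, _) in ACTIONS.items() if agrees(pre, state)]

def f(state, effect):
    """Transition function f_{s,a}: variables not set by the effect keep
    their value in s."""
    return tuple(s if e == "?" else e for s, e in zip(state, effect))

s0 = ("A", 0, 0)
print(applicable(s0))          # ['AB', 'AC']
print(f(s0, ("C", "?", 1)))    # ('C', 0, 1), as in the AC example above
```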
Our system evolves in stages, where a choice of the agent marks the
transition from one stage, t, to the next stage, t + 1. This is analogous to
the passage of time. The problem facing the decision maker is to select an
action to be performed at each stage.
Our objective is to maximize the total expected reward associated with
the course of actions, and we would like to evaluate the agent’s performance
over an unlimited number of stages. In this case the total reward may be
unbounded, meaning that any infinite sequence of effects could be arbitrarily
Figure 2.1: All instantiations of action AC in the tireworld example. The labels on the transitions are the effect (partial variable assignment), its probability, and its reward, in that order.
good or bad if it is executed for long enough. In this case it may be necessary
to adopt a different means of evaluation. The most common practice in this
respect is to introduce a discount factor. The discount factor ensures that
rewards gained at later stages are worth less than those gained at earlier
stages. Similarly to Bellman [2], we will define the expected total discounted reward value function:

V(seq) = \sum_{t=0}^{\infty} \gamma^t R(eff_t)

where seq = (eff_1, eff_2, ...) is a sequence of effects to occur and γ is a fixed
discount factor (0 < γ < 1). This formulation is a particularly simple and
elegant way to ensure a bounded measure of value over an infinite number
of stages.
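As a quick numeric illustration of the discounting (the effect sequence below is hypothetical):

```python
# Discounted value of a finite prefix of an effect sequence, gamma = 0.9.
# With rewards bounded by R_max, the infinite sum is bounded by R_max / (1 - gamma).
gamma = 0.9
rewards = [0.0, 0.0, 1.5]   # e.g. travel to B (0), load tire (0), reach C from B (1.5)
value = sum(gamma**t * r for t, r in enumerate(rewards))
print(value)                # 0.9**2 * 1.5 = 1.215
```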
Note that in classical planning, a solution corresponds to a sequence
of actions. This is not possible in the DTP case because the outcome of
performing an action is not known. Thus, the solution to our problems will
be a function from all states to actions.
Definition 4 Policy
A policy is a function π : S → A. If the current state is s, the agent is
prescribed by the policy to perform the action π(s).
Putting things together, in what follows we address the problem of building
a policy that maximizes the discounted sum of expected rewards over an
infinite number of stages. It is known that there always exists an optimal
policy for such problems [19]. Intuitively, we can see that this is the case
because no matter what stage the process is in, an infinite number of stages
remain. Thus the optimal action at any state is independent of the stage. In
the case of an infinite horizon, Howard showed [19] that the value function
of any policy π satisfies the following recurrence:
Definition 5 Value function of policy π
A value function V π : S → R for policy π is defined recursively:
V^\pi(s) = \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^\pi(f_{s,a}(e)) \big)
where π(s) = a, E is a set of effects, P is the probability distribution on
effects of that action a, and 0 < γ < 1.
The optimal value function satisfies a very similar recurrence:
Definition 6 Optimal value function
V^*(s) = \max_{a \in A_s} \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^*(f_{s,a}(e)) \big)
We would like the agent to adopt a policy that either maximizes this expected value or, in a satisficing context, guarantees an acceptably high expected value.
2.1 Value iteration algorithm
Bellman showed [2] that the value of a fixed policy π can be evaluated using successive approximations. This method follows directly from the recursive value function. His algorithm begins with an arbitrary assignment of values to V π0 (s), after which it calculates the next step using the following recurrence:
V^\pi_{t+1}(s) = \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^\pi_t(f_{s,a}(e)) \big).
The sequence of functions V πt converges linearly to the true value function
V π.
This is called the value iteration algorithm. This algorithm can also
be altered slightly so that it builds optimal policies. The optimal version of
Bellman’s algorithm starts with a value function V0 that assigns an arbitrary
value to each s ∈ S. Given value estimate Vt(s) for each state s, Vt+1(s) is
calculated as:
V_{t+1}(s) = \max_{a \in A_s} \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V_t(f_{s,a}(e)) \big).
The sequence of functions Vt converges linearly to the optimal value function
V ∗(s). After some finite number of iterations n, the choice of maximizing
action for each s forms an optimal policy π, and Vn approximates its value.
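For concreteness, the optimal variant can be sketched in a few lines of Python (illustrative code, not from the thesis; the explicit model format model[s][a] = list of (probability, reward, successor) triples is our assumption):

```python
# A minimal value-iteration sketch over an explicit MDP. It mirrors the
# update V_{t+1}(s) = max_a sum_e P(e)(R_a(e) + gamma * V_t(f_{s,a}(e))).

def value_iteration(model, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in model}                      # arbitrary initial values V_0
    while True:
        delta, V_new, policy = 0.0, {}, {}
        for s, actions in model.items():
            if not actions:                          # dead end: A_s is empty
                V_new[s] = 0.0
                continue
            q = {a: sum(p * (r + gamma * V[t]) for p, r, t in outcomes)
                 for a, outcomes in actions.items()}
            best = max(q, key=q.get)                 # maximizing action for s
            V_new[s], policy[s] = q[best], best
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < eps:                              # converged to (approx.) V*
            return V, policy
```

On an explicit encoding of the tireworld model, this converges to the values of Figure 2.2, e.g. V ∗((A, 0, 0)) ≈ 0.748.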
Example 2 Figure 2.2 illustrates the application of the value iteration al-
gorithm on the tireworld MDP from Example 1. We used γ = 0.9 as our
discount factor. As described in the algorithm above, Vi is a value function over the set of states; this function converges to V ∗. In this example we chose the initial value of all states to be 0. πi is the policy which maximizes the value, and πi(s) is the maximizing action of the recurrence above at stage i. Some states don't have an optimal action because they are dead ends, i.e., As = ∅. Each line of the table corresponds to one state.
2.2 (Modified) Policy iteration algorithm
As an alternative to value iteration, Howard [19] introduced the policy-
iteration algorithm. Rather than iteratively improving the estimated value
function, the new algorithm modifies the policies directly. It begins with an
arbitrary policy π0, then iterates, computing πi+1 from πi. Each iteration of
the algorithm comprises two steps, policy evaluation and policy improvement:
1. (Policy evaluation) For each s ∈ S, compute the value function V πi(s)
based on the current policy πi.
2. (Policy improvement) For each s ∈ S, find the action a∗ that maximizes
State    V0  π0  V1   π1  V2   π2  V3    π3  V4    π4  V5    π5  V6 = V ∗
(A,0,0)  0   AC  0.6  AC  0.6  AC  0.6   AB  0.748 AB  0.748 AB  0.748
(A,0,1)  0   -   0    -   0    -   0     -   0     -   0     -   0
(A,1,0)  0   AC  0.6  AC  0.6  AB  0.777 AB  0.777 AB  0.777 AB  0.777
(A,1,1)  0   CT  0    CT  0.54 CT  0.54  CT  0.54  CT  0.673 CT  0.673
(B,0,0)  0   BC  0.9  BC  0.9  BC  0.9   BC  0.9   BC  0.9   BC  0.9
(B,0,1)  0   LT  0    LT  0    LT  0.729 LT  0.729 LT  0.729 LT  0.729
(B,1,0)  0   BC  0.9  BC  0.9  BC  0.9   BC  0.9   BC  0.9   BC  0.9
(B,1,1)  0   CT  0    CT  0.81 CT  0.81  CT  0.81  CT  0.81  CT  0.81
(C,0,0)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,0,1)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,1,0)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,1,1)  0   CT  0    CT  0    CT  0     CT  0     CT  0     CT  0

Figure 2.2: Simulation of the value iteration algorithm on the running Tireworld example. V0 is a set of arbitrary initial values (not necessarily 0). Vi and πi are the iterating value and policy. V5 = V6 means the algorithm has converged and that V6 = V ∗.
Q_{i+1}(a, s) = \sum_{e \in E} \mathcal{P}_{s,a}(e) \big( R_a(e) + \gamma V^{\pi_i}(f_{s,a}(e)) \big).
If Qi+1(a∗, s) > V πi(s), then πi+1(s) = a∗; otherwise πi+1(s) = πi(s).
The algorithm iterates until πi+1(s) = πi(s) for all states s.
The policy evaluation phase can be carried out in different ways. The
standard approach is by solving a system of linear equations. Another ap-
proach is to compute some (usually small) number of iterations of successive
approximations (i.e., value iteration for fixed policy π). Then, during the
policy improvement phase, the algorithm updates the policy according to the
scheme above. A generalization of this algorithm is called modified policy
iteration (Puterman & Shin, 1978) [26]. Both value-iteration and policy-
iteration are special cases of modified policy iteration, corresponding to the
number of iterations t in the policy evaluation phase, with the algorithm
performing value iteration at t = 0 and policy iteration at t = ∞.
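A corresponding sketch of modified policy iteration, using the same assumed model format as the value-iteration sketch above; k is the number of partial policy-evaluation sweeps per iteration (with the greedy backup included in the improvement step, k = 0 behaves like value iteration, while exact evaluation recovers policy iteration):

```python
# Illustrative modified policy iteration over an explicit MDP:
# model[s][a] = list of (probability, reward, successor) triples.

def modified_policy_iteration(model, gamma=0.9, k=20, eps=1e-6):
    states = [s for s in model if model[s]]              # states with non-empty A_s
    policy = {s: next(iter(model[s])) for s in states}   # arbitrary initial policy pi_0
    V = {s: 0.0 for s in model}

    def backup(s, a, values):
        return sum(p * (r + gamma * values[t]) for p, r, t in model[s][a])

    while True:
        for _ in range(k):                               # partial policy evaluation
            V = {s: backup(s, policy[s], V) if s in policy else 0.0
                 for s in model}
        changed = False                                  # policy improvement
        for s in states:
            q = {a: backup(s, a, V) for a in model[s]}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + eps:
                policy[s], changed = best, True
            V[s] = q[best]                               # one greedy Bellman backup
        if not changed:
            return policy, V
```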
2.3 Planning for Factored MDPs
The field of MDPs was formalized by Bellman [2] in the 1950’s. The im-
portance of solving large MDPs was recognized by Bellman when he sug-
gested value function approximation [3]. Within the AI community, value
function approximation developed concomitantly with the notion of value
function representations for Markov chains. Sutton’s seminal paper on tem-
poral difference learning [31], which addressed the use of value functions for
prediction but not planning, assumed a very general representation of the
value function and noted the connection to general function approximators
such as neural networks. Several important developments gave the AI com-
munity deeper insight into the relationship between function approximation
and dynamic programming.
Tsitsiklis et al. [35] and, independently, Gordon [15] popularized the
analysis of approximate MDP methods via the contraction properties of
the dynamic programming operator and function approximator. Tsitsiklis
established a general convergence result for linear value function approx-
imators and TD(λ), and Bertsekas and Tsitsiklis [5] unified a large body
of work on approximate dynamic programming under the name of Neuro-
dynamic Programming, also providing many novel and general error analy-
ses. Approximate linear programming for MDPs using linear value function
approximation was introduced by Schweitzer and Seidmann [29].
Tatman and Shachter [33] considered a factored approach of additive de-
composition of value nodes in influence diagrams. A number of approaches
to factoring of general MDPs have been explored in the literature. The use of
factored representations such as dynamic Bayesian networks was pioneered
by Boutilier et al. [9] and has developed steadily in recent years. These
methods rely on the use of context-specific structures such as decision trees
or algebraic decision diagrams (ADDs) (Hoey et al., [18]) to represent both
the transition dynamics of the DBN and the value function. The algorithms
use dynamic programming to partition the state space, representing the par-
tition using a tree-like structure that branches on state variables and assigns
values at the leaves. The tree is grown dynamically as part of the dynamic
programming process and the algorithm creates new leaves as needed: A
leaf is split by the application of a DP operator when two states associ-
ated with that leaf turn out to have different values in the backprojected
value function. This process can also be interpreted as a form of model
minimization (Dean & Givan, [13]). The number of leaves in a tree used to
represent a value function determines the computational complexity of the
algorithm. It also limits the number of distinct values that can be assigned
to states: since the leaves represent a partitioning of the state space, every
state maps to exactly one leaf. However, as was recognized early on, there
are trivial MDPs which require exponentially large value functions. This ob-
servation led to a line of approximation algorithms aimed at limiting the tree
size (Boutilier & Dearden, [8]) and, later, limiting the ADD size (St-Aubin,
Hoey, & Boutilier, [30]). Kim and Dean (2001) also explored techniques for
discovering tree-structured value functions for factored MDPs. While these
methods permit good approximate solutions to some large MDPs, their com-
plexity is still determined by the number of leaves in the representation and
the number of distinct values that can be assigned to states is still limited
as well. Tadepalli and Ok [32] were the first to apply linear value function
approximation to Factored MDPs. Linear value function approximation is
a potentially more expressive approximation method because it can assign
unique values to every state in an MDP without requiring storage space that
is exponential in the number of state variables. Schuurmans and Patrascu [28], building on the earlier work of Guestrin et al. on max-norm projection using cost networks and linear programs, independently developed an alternative approach to approximate linear programming using a cost network. Later, Guestrin et al. [16] embedded a cost network inside a single linear program. By contrast, the method of Schuurmans and Patrascu is based on a constraint generation approach, using a cost network to detect constraint violations. When constraint violations are found, a new constraint is added, repeatedly generating and attempting to solve LPs until a feasible solution is found.
Our approach is different: instead of approximating the value function
on the original, large MDP as previous works do, we propose calculating the
exact value function of an approximated MDP, which we construct using all
the state variables of the large original MDP. The state space of the resulting
approximated MDP is explicitly upper-bounded by a predefined parameter,
in order to enable the standard MDP algorithms to solve it optimally.
2.4 Deterministic planning
The value iteration and (modified) policy iteration algorithms for solving
MDPs both require explicit enumeration of the underlying state space. How-
ever, our state space grows exponentially with the number of state variables,
making the above algorithms virtually useless. Deterministic planning usu-
ally has the same property of exponential state space. A deterministic plan-
ning problem is usually given by variable based representation of the state
space of some system, initial state, and set of goal states. Each state defines
its own feasible set of actions. Actions translate the system from one state
to another, always assuming a single outcome for each action. A solution
to a deterministic planning problem is a sequence of actions that can be
performed to translate the system from the initial state into one of the goal
states. An optimal solution minimizes the total cost of the actions along the
goal-achieving action sequence.
Definition 7 Deterministic planning task
A deterministic planning task is a tuple Π = 〈V,A, s0, s∗〉 with the
following components:
1. V is a set of state variables.
2. A is a set of actions, such that each action a is a pair 〈pre,eff〉 where
both are partial variable assignments.
3. s0 ∈ S is an initial state.
4. s∗ is a partial variable assignment, such that a state s is a goal state
if it agrees with s∗.
Solutions to planning problems are paths from the initial state to a goal
state in the transition graph.
Definition 8 Transition graph
A transition graph is a tuple T = 〈S,L,A, s0, S∗〉 where
1. S is a finite set of states.
2. L is a finite set of transition labels.
3. A ⊆ S × L× S is a set of (labeled) transitions also called edges.
4. S∗ is a set of goal states.
A path from s0 to a goal state in S∗ is a plan. A plan is optimal iff its length is minimal.
One way to build heuristics that more efficiently estimate the distance
to a goal state is to use optimal solutions to a relaxation of the problem,
which is easier to solve than the original.
Problem relaxation may mean ignoring some of the problem constraints.
Helmert, Haslum & Hoffmann [17] showed a general algorithm for creating
consistent heuristics for deterministic planning. Their heuristic is the op-
timal cost of the solution to an abstraction generated by generalizing the
“merge-and-shrink” abstraction technique introduced by Drager, Finkbeiner
& Podelski [20].
A general abstraction is a mapping that reduces the size of the state
space by "collapsing" several states into one abstract state. Projections
are a form of abstraction heuristics that ignore completely all but a subset
of the state variables of the planning task. States that do not differ on the
chosen variables are "collapsed" together in the abstract space. The merge-
and-shrink technique allows composition of general state abstractions, which
include but are not limited to projections.
Definition 9 An abstraction of a transition graph T = 〈S,L,A, s0, S∗〉 is
a pair 〈T ′, α〉, where T ′ = 〈S′, L′, A′, s′0, S′∗〉 and α : S → S′, such that:
1. L′ = L.
2. 〈α(s), e, α(s′)〉 ∈ A′ for all 〈s, e, s′〉 ∈ A.
3. s′0 = α(s0).
4. α(s∗) ∈ S′∗ for all s∗ ∈ S∗.
Definition 10 Synchronized product - composition of abstractions.
Let there be two abstractions of transition graphs T ′ = 〈〈S′, L,A′, s′0, S′∗〉, α′〉 and T ′′ = 〈〈S′′, L,A′′, s′′0, S′′∗〉, α′′〉. The synchronized product T ′ ⊗ T ′′ is
T = 〈〈S,L,A, s0, S∗〉, α〉, where
1. S = S′ × S′′.
2. 〈(s′, s′′), e, (t′, t′′)〉 ∈ A iff 〈s′, e, t′〉 ∈ A′, 〈s′′, e, t′′〉 ∈ A′′.
3. s0 = (s′0, s′′0).
4. α(s) = (α′(s), α′′(s)).
5. (s′∗, s′′∗) ∈ S∗ for every s′∗ ∈ S′∗ and s′′∗ ∈ S′′∗ .
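The construction can be sketched directly in code. The following illustrative Python (the tuple representation of graphs and mappings is our assumption, not from the thesis) builds the synchronized product of two abstractions of transition graphs:

```python
# Synchronized product of two abstract transition graphs (Definition 10).
# A graph is a tuple (states, edges, s0, goals), edges are (s, label, t)
# triples, and alpha1 / alpha2 are the abstraction mappings.

def synchronized_product(G1, G2, alpha1, alpha2):
    S1, E1, s01, goals1 = G1
    S2, E2, s02, goals2 = G2
    states = {(u, v) for u in S1 for v in S2}
    # transitions synchronize on a shared label e
    edges = {((u1, u2), e, (t1, t2))
             for (u1, e, t1) in E1 for (u2, e2, t2) in E2 if e == e2}
    s0 = (s01, s02)
    goals = {(g1, g2) for g1 in goals1 for g2 in goals2}
    alpha = lambda s: (alpha1(s), alpha2(s))   # combined abstraction mapping
    return (states, edges, s0, goals), alpha
```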
Helmert et al. [17] formalized projections and abstractions for SAS+
representation of planning tasks. They proved that the composition of orthog-
onal projections of all variables is isomorphic to the transition graph of the
entire planning task. Their generalized model may produce as a special case
projections of the planning task but this model also allows composition of
partial abstractions. The greater flexibility offered by not restricting ab-
stractions to projections has pros and cons. While very accurate and relevant heuristics can be generated from small abstractions with this approach, the problem of selecting a good abstraction from a huge number of candidates becomes even harder.
The algorithm of Helmert et al. computes abstractions. It maintains
a pool of (orthogonal) abstractions, which initially consists of all atomic
projections. Atomic projections are projections that ignore all but one state
variable. Starting with this pool, the algorithm performs one of two possible
operations until a single abstraction remains: it merges (i.e., composes) two
abstractions by replacing them with their synchronized product (similar to
the product of automata) or it shrinks an abstraction by replacing it with
a homomorphism of itself ("collapsing" several states into one).
To keep time and space requirements under control, the algorithm enforces an explicit
limit on the size of the computed abstractions, which is specified as an
input parameter N . Before a product of two abstractions is computed, the
algorithm shrinks one or both of the abstractions such that their product
will not exceed N . Assuming that N is polynomially bounded by the input
size, and that the abstraction and merging strategy are computed efficiently,
the algorithm requires only polynomial time and space.
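The overall loop can be outlined as follows (an illustrative sketch; pick, shrink_one and product stand for the merging order, shrinking strategy and synchronized product discussed here, and graphs are (states, edges, s0, ...) tuples as in the sketch above):

```python
# Generic merge-and-shrink loop with an explicit size bound N.

def merge_and_shrink(pool, N, pick, shrink_one, product):
    """pool: atomic projections; pick: merge-order strategy returning an
    index; shrink_one: combines two abstract states of an abstraction;
    product: synchronized product of two abstractions."""
    current = pool.pop(pick(pool))                 # linear strategy: one composite
    while pool:
        nxt = pool.pop(pick(pool))
        while len(current[0]) * len(nxt[0]) > N:   # keep the product within N
            current = shrink_one(current)
        current, _ = product(current, nxt)         # merge step
    return current
```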
The merging strategy suggested and evaluated by Helmert et al. in [17] is
called linear merging. It maintains a single current abstraction, which is
initially an atomic projection. In each iteration
this abstraction is merged with a different atomic projection. The strategy
defines an order in which projections are merged using two rules, both of
which are based on a causal graph, which is a directed graph derived from
a planning task. The nodes of this graph are variables and the semantics
of an edge from variable v to v′ is that some value of v is required to cause
some changes to v′. The first rule is to choose, if possible, a variable from
which there is an edge in the causal graph to one of the previously added
variables. The second rule, applied when the first one cannot be, is to add
a variable for which a goal value is defined.
The shrinking strategy is used to keep the size of the synchronized prod-
uct of two abstractions A⊗ πv below the bound N . Because the algorithm
uses a linear merging strategy, the current abstraction A will always be
shrunk. The current abstraction size should be set to N/size(πv). The current abstraction may be shrunk to this size by a sequence of combinations, each combining two abstract states s and s′ into one state {s, s′}.
Combining pairs of states at random may create non-existing shortcuts, e.g.,
when state s, which is close to an initial state, is combined with state s′,
which is close to a goal state. Thus, Helmert et al. use the following shrink-
ing strategy in their algorithm: they define the h-value of an abstract state
s to be the length of a shortest path from s to the closest abstract goal
state, in the abstract transition graph. Similarly, the g-value of s is defined
to be the length of a shortest path from the abstract initial state to s, and
the f-value is the sum of both. Helmert’s algorithm tries to preserve h and
g values because the f-value of an abstract state is a lower bound on the
f-value associated with the corresponding node in A∗ search. The A∗ al-
gorithm expands all search nodes n with f(n) < L∗ and no search node
with f(n) > L∗, where L∗ is the optimal solution length for the task. Thus,
abstract states with high f-values are expected to be encountered less often
during search. Therefore, combining them is less likely to lead to a loss of
important information.
In keeping with the above intuition, the algorithm of Helmert et al.
uses the following shrinking strategy. First, it partitions all abstract states
into buckets. Two states are placed in the same bucket iff their g and h
values are identical. We say that bucket B is more important than bucket
B′ iff the states in B have a lower f-value than the states in B′. For tie
breaking, the algorithm uses a higher h-value. The algorithm selects the least important bucket and combines two of its states, chosen uniformly at random. If all buckets contain exactly one state, it instead combines the states of the two least important buckets.
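A sketch of one step of this bucket-based shrinking strategy, assuming precomputed g and h values per abstract state and a combine function that merges two states (the tie-breaking direction and the singleton fallback are simplifications of the description above, and at least two abstract states are assumed to remain):

```python
import random

def shrink_step(states, g, h, combine):
    buckets = {}
    for s in states:                      # bucket states by identical (g, h)
        buckets.setdefault((g[s], h[s]), []).append(s)
    # least important bucket: highest f = g + h, ties broken by higher h
    key = max(buckets, key=lambda gh: (gh[0] + gh[1], gh[1]))
    if len(buckets[key]) >= 2:
        a, b = random.sample(buckets[key], 2)   # two states, uniformly at random
    else:
        # fallback: combine the states of the two least important buckets
        k1, k2 = sorted(buckets, key=lambda gh: (gh[0] + gh[1], gh[1]))[-2:]
        a, b = buckets[k1][0], buckets[k2][0]
    return combine(a, b)
```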
The algorithm of Helmert et al. has been evaluated empirically against
state-of-the-art optimal planners [17, 22]. The domains were taken from the
International Planning Competition (IPC) from different years [17, 22]. The
results show that this approach is more than competitive with the current
state-of-the-art cost-optimal planners.
Chapter 3
Merge-and-Shrink
Compression of MDPs
Value iteration and policy iteration algorithms compute the optimal decision
an agent has to make for every state to obtain the highest expected value.
Those algorithms require explicit enumeration of the underlying MDP state
space. In structured MDPs, state space grows exponentially with the num-
ber of variables; for as few as 40 Boolean variables we will have to store 2^40
values in order to evaluate the value function with the algorithms described
above. This does not include the time complexity, which directly depends on
the number of transitions—limited in our case by k|A||S|, where k = O(1)
is an upper bound on the number of effects per action.
Another approach is thus required to solve structured MDPs with vari-
able based representation. Over the last decade, approaches in that direction
have been proposed. Factored MDPs (Boutilier et al. [10]) use implicit vari-
able based state space. A dynamic Bayesian network (DBN) can then allow
a compact representation of the transition model. Some works have ap-
proximated the solution using an approximate value function with compact
representation. A common choice is to use the linear value function as an
approximation value function that is a linear combination of possibly non-
linear basis functions (Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis
et al. 1997 [35]).
Guestrin et al. 2003 [16] proposed an interesting variant of that ap-
proach. Built on the idea of Koller & Parr (1999, 2000) [23, 24], this
approach uses factored (linear) functions, where each basis function is re-
stricted to some small subset of the state variables. Those functions are
assumed to be part of the input, together with the problem definition. They
proposed two algorithms for solving such MDPs and introduced a novel lin-
ear programming decomposition technique used by both. This technique
reduces a structured linear programming problem with exponentially many
constraints to equivalent, polynomially-sized ones.
Dean & Givan 1997 [13] offered a somewhat different approach that
is based on minimizing the model much the same way that a finite state
machine is reduced to its equivalent minimal finite state machine. Their
algorithm takes as input an implicit MDP model in factored form and tries to
produce an explicit, reduced model whose size is within a polynomial factor
of the size of the factored representation. The algorithm cannot guarantee
that the model’s size will always be reduced to the required size, but an
optimal solution to the reduced model is the same as for the original model.
The aforementioned works try to solve factored MDPs by approximating
the value function on the basis of various assumptions about its structure.
Dean & Givan 1997 [13] reduce the MDP model so it can be evaluated
optimally, but they cannot guarantee success. Our approach, outlined in
what follows, takes a different course: we calculate the exact value function
of an approximated MDP, using all the state variables, in polynomial time.
In particular,
• we don’t assume any structure of the value function;
• we can abstract our model to any constant, thus allowing us to solve
the abstracted model optimally;
• we guarantee that the abstraction process is polynomial in the size of
the structured problem description;
• the approximated value function is guaranteed to be an admissible
estimate of the true value function.
Our approach adapts the merge-and-shrink abstraction idea to the se-
mantics of MDPs by
• providing semantics for merge and shrink operators for the MDP tran-
sition graph;
• generalizing the synchronized product operation;
• defining abstraction for stochastic transition graphs with rewards;
• providing meaningful strategies for merging and shrinking stochastic
transition graphs with rewards.
3.1 Stochastic transition graphs
The semantics of a PSAS+ MDP is given by mapping it to a stochastic
transition graph with rewards.
Definition 11 A stochastic transition graph with rewards is a tuple
〈T,R, P 〉 where
1. T = 〈S,L,A, s0〉 is a transition graph.
2. R : A→ R is a real-valued reward function from transitions.
3. P : A→ R is a real-valued function from transitions such that Ps(e) =
P (s, e, s′) is a probability function, where s is a state and e ∈ E, with E being the set of effects of one action a.
Note that we omit the set of goal states S∗ from the transition graph
because it is irrelevant for MDPs. We denote the stochastic transition graph
with rewards associated with a PSAS+ MDP Π = 〈V,A, s0,R〉 by T (Π) =
〈T,R, P 〉, where T = 〈S,L,A, s0〉 such that:
1. S = S, i.e., the states of the graph are the states of the MDP.
2. L = {e|〈pre, E〉 ∈ A, e ∈ E} is the set of all effect labels. Similar effects
of distinct actions will have different effect labels.
3. 〈s, e, s′〉 ∈ A iff e ∈ E, 〈pre, E,P〉 ∈ A, pre agrees with s and fs,a(e) =
s′.
4. P (〈s, e, s′〉) = P(e), such that 〈pre, E,P〉 ∈ A and e ∈ E.
5. R(〈s, e, s′〉) = R(e).
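As an illustration, this mapping can be realized by forward enumeration from s0; the sketch below assumes the action format and the agrees and f helpers from the earlier tireworld sketch (all hypothetical names):

```python
# Build the stochastic transition graph with rewards T(Pi) by forward
# enumeration. Effect labels are (action, effect) pairs, so similar effects
# of distinct actions get distinct labels.

def build_transition_graph(actions, s0, agrees, f):
    states, edges, P, R = {s0}, set(), {}, {}
    frontier = [s0]
    while frontier:
        s = frontier.pop()
        for name, (pre, probs, rewards) in actions.items():
            if not agrees(pre, s):        # action not applicable in s
                continue
            for e, p in probs.items():
                t = f(s, e)               # destination state of effect e
                edge = (s, (name, e), t)
                edges.add(edge)
                P[edge], R[edge] = p, rewards[e]
                if t not in states:
                    states.add(t)
                    frontier.append(t)
    return states, edges, s0, P, R
```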
Figure 3.1: Transition graph of the tireworld problem. State 〈A, 1, 0〉 means location = A, has spare = 1 and flat tire = 0. 〈A, 0, 0〉 is the initial state of the problem. An edge label AC1, 1, 0.6 refers to effect 1 of action AC with reward 1 and probability 0.6.
See Figure 3.1 for the transition graph of the tireworld problem.
Note that we can similarly denote a PSAS+ MDP associated with a
stochastic transition graph with rewards T by Π(T ).
Definition 12 Let there be a stochastic transition graph with rewards T .
The value of state s of T is the value of the same state s of PSAS+ MDP
Π(T ).
Note that one state can be a source state of two transitions which have
identical labels (effects of an action) but two different target states. We
exploit this flexibility later in our shrinking mechanism. In practice, the
number of transitions substantially dominates the number of states, and
thus maintaining transitions consumes most of the memory.
3.2 Abstractions
Abstractions of transition graphs are the core of our approach to construct-
ing a small approximation of our MDP. Abstraction implies that some in-
formation or some constraints will be ignored in order to obtain a smaller
representation of the transition graph. Formally, we define abstractions of
transition graphs as follows:
Definition 13 Abstraction.
An abstraction of a stochastic transition graph with rewards T = 〈T,R, P 〉 is a tuple TA = 〈T ′, R′, P ′, α〉, where α : S → S′ is a function called the abstraction mapping, such that:
1. T ′ is the abstraction of the transition graph T with different labeling:
〈α(s), es, α(s′)〉 ∈ A′ for all 〈s, e, s′〉 ∈ A;
2. R′(α(s), es, α(t)) = R(s, e, t);
3. P ′(α(s), es, α(t)) = P (s, e, t).
Definition 14 Abstract value function
Let Π be a PSAS+ task with state set S, and let TA = 〈T,R, P, α〉 be
an abstraction of its stochastic transition graph with rewards T (Π). The
abstract value function VT (s) is the function which assigns to each state
s ∈ S the value of α(s) of T.
When two states are merged and the same action is applicable in both, the action is treated as two different actions in the abstraction. The effects will have the same probabilities and rewards, but will be considered to belong to two different actions. See Figure 3.2 for an example. When both actions have the same source and all their effects lead to the same destination states accordingly, we can dispose of one of the actions; this will not change the value function.
Note that our notion of abstraction is transitive.
Our abstraction mechanism is based on projection abstractions, formally
defined as follows:
Definition 15 Projection
Let Π = 〈V,A, s0,R〉 be a PSAS+ MDP with state set S, and let V ⊆ V be a subset of its variables.
A homomorphism on a stochastic transition graph with rewards T defined by a mapping α such that α(s) = α(s′) iff s(v) = s′(v) for all v ∈ V is called the projection onto the variable set V, denoted by πV .
If V is a singleton set, π{v} is called an atomic projection, also denoted
by πv.
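As a small illustration, a projection mapping can be realized by restricting each state to the positions of the projected variables. The following Python sketch (with illustrative names; an assumption of ours, not thesis code) shows πlocation for the tireworld state encoding (location, has spare, flat tire):

    def projection_mapping(variables, V):
        # alpha for pi_V: alpha(s) = alpha(s') iff s and s' agree on all v in V.
        idx = [i for i, v in enumerate(variables) if v in V]
        return lambda state: tuple(state[i] for i in idx)

    # pi_location keeps only the first component of a tireworld state:
    alpha = projection_mapping(("location", "has spare", "flat tire"), {"location"})
    assert alpha(("A", 0, 0)) == alpha(("A", 1, 1))   # both map to ("A",)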
The abstractions on which we base our approximation are constructed by interleaving the composition of abstractions with further abstraction of the composites.
Figure 3.2: Abstraction of a transition graph, from left to right, where α(s1) = α(s2) = α(s3) and α(t1) = α(t2). Note that action a becomes two different actions a^{s1} and a^{s3}. Rewards and probabilities are preserved.
Composing here means extending, by means of probabilities and rewards, the standard operation of forming the synchronized product.
The extended synchronized product is defined as follows:
Definition 16 Synchronized product
Let there be two abstractions of stochastic transition graphs with rewards,
T′A = 〈T′, R′, P′, α′〉 and T″A = 〈T″, R″, P″, α″〉. The synchronized product of T′A and T″A is defined as T′A ⊗ T″A = 〈T, R, P, α〉, where α(s) = (α′(s), α″(s)) and
1. T = T ′ ⊗ T ′′;
2. R(〈(s′, s′′), e, (t′, t′′)〉) = R′(s′, e, t′);
3. P (〈(s′, s′′), e, (t′, t′′)〉) = P ′(s′, e, t′).
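The following Python sketch illustrates the synchronized product, assuming that both input graphs carry all effect labels (a label irrelevant to a projection appears there as self-loops) and that the inputs are orthogonal, so the rewards and probabilities stored in the two graphs coincide; the encoding and names are our assumptions.

    def synchronized_product(trans1, trans2):
        # trans1, trans2: lists of (source, label, target, prob, reward).
        # Paired states move together on a shared label; for orthogonal inputs
        # the probability and reward stored in both graphs coincide, so we can
        # take them from the first graph (items 2 and 3 of Definition 16).
        by_label = {}
        for (s2, e, t2, p2, r2) in trans2:
            by_label.setdefault(e, []).append((s2, t2))
        result = []
        for (s1, e, t1, p, r) in trans1:
            for (s2, t2) in by_label.get(e, []):
                result.append(((s1, s2), e, (t1, t2), p, r))
        return result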
Definition 17 Relevant variables, orthogonal abstractions
Let Π be a PSAS+ MDP with variable set V, and let TA be an abstraction
of T (Π). We say that TA depends on variable v ∈ V iff there exist states s
and s′ with α(s) ≠ α(s′) and s(v′) = s′(v′) for all v′ ∈ V \ {v}. The set of relevant variables for TA, denoted by varset(TA), is the set of variables in V on which TA depends.
Abstractions TA and T ′A are orthogonal iff varset(TA) ∩ varset(T ′A) = ∅.
Figure 3.3: Atomic projections of the tireworld problem onto the variable location (πlocation, left) and the variable flat tire (πflat tire, right).
Figure 3.4: Synchronized product of two atomic projections of the tireworld problem, πlocation ⊗ πflat tire.
The synchronized product of two orthogonal abstract stochastic transi-
tion graphs with rewards of T (Π) is an abstract stochastic transition graph
with rewards of T (Π). Clearly, projections satisfy varset(πV ) = V , so projec-
tions onto disjoint variable sets are orthogonal. Moreover, varset(TA⊗T ′A) =
varset(TA) ∪ varset(T ′A).
In particular, the synchronized product of all atomic projections {πv | v ∈ V} of a PSAS+ task Π is equal to the full transition graph T(Π).
3.3 Generic abstraction algorithm
Atomic projections and synchronized products can fully capture the state
transition semantics of PSAS+ MDP. However, for interesting problems we
cannot compute the products of all atomic projections, as the size of the
Generic algorithm compute-abstraction(Π, N):
    abs ← {πv | v ∈ V \ Vinit}
    TA ← πVinit
    while |abs| > 0 do
        Select refinement projection πv ∈ abs
        Contract TA until size(TA) · size(πv) ≤ N
        abs ← abs \ {πv}
        Refine: TA ← TA ⊗ πv
    end
    return TA

Algorithm 1: Algorithm for computing an abstraction for a PSAS+ MDP Π, with a bound N on the number of abstract states
transition graph grows exponentially in the number of atomic projections in
the synchronized product. When the graph becomes too large to be stored
in memory, we must shrink the created abstraction by replacing it with a
homomorphism of itself, in order to create an abstraction that includes all
the problem variables.
The goal of the algorithm is to bound the abstract transition graph to a
constant size. At the same time, values calculated on the abstract transition
graph should be as similar as possible to the original transition graph values
(values of original transition graphs are mapped to the abstract transition
graph using the α functions as described in the next section). This depends
in turn on the merging and shrinking strategies.
• The merging strategy corresponds to the order in which the cross-
product operations are carried out. We would like to choose those
variables which will help us contract states while decreasing the accu-
racy of our solution as little as possible.
• The shrinking strategy corresponds to choosing which states should be
combined (i.e., abstracted). We would like to choose states to contract
such that the solution of the abstraction will be as accurate as possible.
Example 3 Figure 3.1 depicts the transition graph of the tireworld PSAS+
MDP. Each label corresponds to one effect of the PSAS+ MDP. The edge
labels have 3 parts: the action name with the effect number, the reward of
Figure 3.5: Abstraction of TA = πlocation ⊗ πflat tire, first case.
that effect, and its probability of occurring. Figure 3.3 depicts abstract transition graphs of the original PSAS+ MDP: the atomic projections onto the variables flat tire and location.
Now we will simulate the generic algorithm. The algorithm inputs are
the tireworld problem MDP Π from Example 1 and N = 10 as the limit on
the number of states.
As the first abstraction we select TA = πlocation; see Figure 3.3. The two other atomic projections, πflat tire and πhas spare, are members of the abs set. Then we select πflat tire ∈ abs; we don't contract any states, because 3 × 3 ≤ 10. Next we remove πflat tire from abs and update TA by computing the synchronized product TA ⊗ πflat tire; the result is shown in Figure 3.4.
Now we select the last projection, πhas spare ∈ abs, which empties abs. But before we update TA, we must perform a contraction (because 6 × 2 > 10). We calculate an abstract transition graph using one of the following abstraction functions α:

1. First case: α(〈B, ?, 0〉) = α(〈B, ?, 1〉) = 〈B, ?, ?〉;
2. Second case: α(〈A, ?, 0〉) = α(〈C, ?, 1〉), i.e., the states 〈A, ?, 0〉 and 〈C, ?, 1〉 are merged into a single abstract state.
We obtain as a result the abstract graphs shown in Figures 3.5 and 3.6.
Now we update TA by computing the synchronized product with the last
projection πhas spare and obtain the final abstract transition graph of our
PSAS+ MDP, shown in Figures 3.7 and 3.8.
Figure 3.6: Abstraction of TA = πlocation ⊗ πflat tire, second case.
Figure 3.7: Final abstraction of the full tireworld problem, first case.
3.3.1 Using abstraction as a controller for MDP
What remains is to calculate the value function for the abstract transition
graph. This can be done using standard value calculation methods similar
to the ones in Example 2. Then, using the abstract value function, we can
approximate the value of any state of the original MDP.
This is done by storing all the α mapping functions we used to abstract
our transition graph. Each merge operation requires α, the abstraction
mapping function. We represent it as a table of size O(N). There are |V| variables and thus at most |V| merge operations. Thus the mapping can
be stored in O(N |V|) space. Every concrete state in the original MDP has
its mapping in the abstraction. The tables allow us to retrieve the value
Figure 3.8: Final abstraction of the full tireworld problem, second case.
of the abstract state for every concrete state mapped to it in O(|V|) time.
Furthermore, given a concrete state, we can estimate the best action in the following way: for each action applicable in the concrete state, we retrieve the values of all its successor states and calculate the expected reward. The action with the highest expected reward is considered best. A sketch of this controller appears below.
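The controller just described can be sketched as follows. We assume concrete states are dictionaries from variables to values and that each merge step stored one lookup table, already composed with that step's shrink mapping; the table format, the names, and the omission of discounting are our simplifying assumptions, not the thesis code.

    def abstract_state(concrete, merge_order, alpha_tables):
        # Fold the concrete state through one table per merge step; each table
        # maps (running abstract state, value of the newly merged variable) to
        # the next abstract state, composed with that step's shrink mapping.
        a = concrete[merge_order[0]]
        for v, table in zip(merge_order[1:], alpha_tables):
            a = table[(a, concrete[v])]
        return a                                     # O(|V|) lookups in total

    def best_action(state, applicable, merge_order, alpha_tables, value):
        # applicable(state) yields pairs (action, [(successor, prob, reward), ...]);
        # we score each action by its expected reward against the abstract values
        # (discounting omitted for brevity) and return the best one.
        def q(outcomes):
            return sum(p * (r + value[abstract_state(t, merge_order, alpha_tables)])
                       for (t, p, r) in outcomes)
        return max(applicable(state), key=lambda act_out: q(act_out[1]))[0]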
3.4 Merge and shrink strategies
The two key parameters of the general merge-and-shrink framework are the strategy determining which states are to be abstracted, and when, and the strategy determining the order in which the partial abstractions are combined. The general guidelines
for these two strategies in the case of MDPs are based on the following two
properties of our merge and shrink operators:
1. State contraction at a shrink step only extends the node-to-node stochastic reachability within the abstract transition system, and
2. Stochastic paths eliminated from the abstract transition system at
merge correspond to shortcuts that are not present in the original
transition system.
These two properties are exemplified in Examples 4 and 5.
Example 4 In this example we assume that all transitions correspond to
different actions, each having a single effect. Contraction of states s3 and
s4, resulting in replacing them with a new abstract state s′, creates a new
path in the transition system from s1 to s5. In particular, that means that
the value of s′ will be the maximum of the original values of s3 and s4, and the values of s1 and s2 can, as a result of this contraction, only increase.
[Diagram: on the left, the original transition system over the states s1, . . . , s6; on the right, the system after contracting s3 and s4 into s′, which creates a new path from s1 to s5.]
Example 5 In the tireworld Example 1, consider the projection on the
location variable (Figure 3.3, left side). Consider now the state (A, ?, ?),
abstracting four original states, and the edge from (A, ?, ?) to (B, ?, ?) la-
beled with AB (that is, with the action “move from A to B”). Note that this
action is inapplicable in the state (A, 0, 1) because one cannot drive with a
flat tire. Thus, this edge corresponds to a shortcut with respect to the origi-
nal transition system. After the synchronized product of this projection and
the atomic projection πflat tire, we distinguish between the states (A, ?, 1)
and (A, ?, 0) (see Figure 3.4), and the edge labeled with AB is now outgoing
only from (A, ?, 0).
The two aforementioned properties of our merge and shrink operators
ensure that, at each step of the merge and shrink processes, the values of
states in the created abstractions always upper-bound (or over-approximate)
the values of their prototype states in the original problem. That is, for
each state s of our MDP, and each intermediate abstraction T created by
our merge and shrink operators, we have VT(α(s)) ≥ V*(s). A construc-
tive conclusion from that is that the merge and shrink operators should be
chosen with the goal of lowering the abstract value function. Likewise, if
MDP solving is devoted to a specific initial state, then, at any stage of the
merge-and-shrink process, we can safely eliminate all the components of the
abstract transition systems that are not reachable from the abstract initial
state.
3.4.1 Shrink strategies
Given a pair of an abstract transition system TA and an atomic projection πv to be merged, the purpose of the shrinking stage is to reduce the size of TA so that the result of the synchronized product between TA and πv will
fit the memory bound, expressed via a fixed bound N on the number of abstract states. Hence, the process will involve n = |TA| − ⌊N/|πv|⌋ contractions
of some pairs of states. The high-level objective for the shrinking process
explained above brought us to suggest and evaluate two simple strategies for
selecting pairs of states for contraction, called below minimum-value con-
traction and minimum-difference contraction. As a baseline for evaluation
we used random shrinking in which pairs of states for contraction are se-
lected purely at random; both our not random strategies outperformed the
random shrinking by a vast margin.
1. Minimum value contraction. The rationale behind this strategy is
related to the “high-f-values-first” strategy of Helmert et al. for deter-
ministic planning. As our abstractions consistently over-approximate
the state values in the original system, abstract states with (relatively)
low values abstract the states of the original system that are not likely
to be visited along the optimal policy. Hence, contracting states with
lower values corresponds to coarsening the abstractions in the regions
of the transition system that are less likely to be relevant at the ex-
ecution stage. The minimum-value strategy thus simply corresponds
to
(a) computing the value function of the abstraction TA, and
(b) contracting n pairs of states with the lowest values.
In our experiments, even though minimum-value contraction substan-
tially outperformed the random contraction baseline, it resulted in con-
tracting many pairs of states with (relatively low but) quite different
values. As combining states with different values implies propagation
of higher values to the predecessors of these states, such contractions
may substantially increase the values of some abstract states in the
system. This aspect of the shrinking process in MDPs brought us to
consider an alternative strategy.
2. Minimum difference contraction. The intuitive assumption be-
hind the minimum-difference strategy is that abstract states with sim-
ilar values are more likely to represent states with similar values in the
original system. In particular, combining abstract states with identi-
cal values will not increase our estimate of the value function. Given
that, the minimum-difference strategy comprises
(a) computing the value function of the abstraction TA,
(b) contracting n pairs of states s, s′ with the lowest difference |VTA(s) − VTA(s′)|, with ties broken towards the lower values of the value function.
As we expected, in our experiments this strategy outperformed not only the purely random baseline, but also the minimum-value strategy inspired by the decisions of Helmert et al. for deterministic planning. A sketch of both selection rules follows.
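The two selection rules can be sketched as follows, assuming values maps each abstract state to its current value VTA. The greedy, disjoint pairing used here is one simple reading of “contracting n pairs” and is our assumption rather than the thesis implementation.

    def minimum_value_pairs(values, n):
        # Pair off the 2n abstract states with the lowest current values
        # (assumes at least 2n states).
        ordered = sorted(values, key=values.get)
        return [(ordered[2 * i], ordered[2 * i + 1]) for i in range(n)]

    def minimum_difference_pairs(values, n):
        # Candidate pairs are adjacent states in value order, ranked by value
        # difference with ties broken toward lower values; pick n disjoint pairs.
        ordered = sorted(values, key=values.get)
        candidates = sorted(zip(ordered, ordered[1:]),
                            key=lambda p: (abs(values[p[0]] - values[p[1]]),
                                           values[p[0]]))
        chosen, used = [], set()
        for s, t in candidates:
            if s not in used and t not in used:
                chosen.append((s, t))
                used.update((s, t))
                if len(chosen) == n:
                    break
        return chosen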
3.4.2 Merge strategies
A merge step of the merge-and-shrink algorithm corresponds to replacing the
running abstraction TA and an atomic projection πv with the synchronized
product of the two. While the running abstraction is unique, πv should
be selected among the different atomic projections, and this choice may
dramatically affect the quality of the final abstraction. In these terms,
the overall merge-and-shrink algorithm can be seen as parametrized with
a merging strategy, corresponding to the order in which the atomic projec-
tions are merged with the incrementally constructed running abstraction.
Here as well, the high-level expectations from a successful merge operation
brought us to suggest and evaluate two simple merging strategies, called
below variable lookahead and symmetric variable refinement. As a baseline
for evaluation we used random selection of the atomic projections which, as
expected, leads to very poor abstractions of the original MDPs.
1. Variable lookahead. This strategy performs a simple myopic optimization of the choice of atomic projection: the chosen projection yields the best (that is, the lowest over-approximated) value of the initial state in the resulting synchronized product.

(a) For each atomic projection πv ∈ abs, compute T^v_A = TA ⊗ πv.
(b) Select the atomic projection πv minimizing V_{T^v_A}(α(s0)), the value of the abstract initial state.

The minimization criterion in step (b) stems from the over-approximating nature of our abstractions. Note that, overall, this projection selection
strategy is computationally quite involved, as it requires Θ(|V|) computations of the synchronized product and the value function.
2. Symmetric variable refinement
Consider the boxworld domain with variables and their domains denoted
as follows.
(a) V = {box1, ..., boxk, truck1, ..., truckl, plane1, ..., planem}
(b) Dboxi = {city1, ..., cityn, truck1, ..., truckl, plane1, ..., planem}
(c) Dtrucki = Dplanej = {city1, ..., cityn}
(d) A = {load box i on truck j at city k, unload box i from truck j at city k, load box i on plane j at city k, unload box i from plane j at city k, drive truck i, fly plane j}
Note that each of those actions is instantiated for every box, city, plane and truck accordingly, e.g., load box 7 on truck 2 at city 4.
The goal is to move boxes from their initial locations (=cities) to their
destination locations. Each box can be loaded onto any truck or plane.
Trucks and planes move between cities. Reward is gained when a box
gets to its destination city. The box stays there and can never be moved.
Now we have to choose the order of all variables, boxes, planes and
trucks. But there is no a priori reason why we should choose one truck
variable trucki over the other truckj or why we should choose boxi over
boxj . In general, an abstract transition graph of this problem starts to
make sense when the variables of at least one box and one vehicle (truck or
plane) are in the abstraction. Thus any two box variables are considered symmetric; the same is true for the truck variables and for the plane variables. Therefore a logical order of variables for the linear merge strategy is a round robin over the groups of symmetric variables. For this domain, the order will be box1, truck1, plane1, box2, truck2, plane2, ...
To operationalize this intuition of symmetric variables we used a simple heuristic based on two counts per variable: the number of action preconditions in which the variable appears, and the number of action effects in which it appears. Any two variables having the same pair of counts are considered symmetric (see the sketch below).
In practice, this heuristic partitioned the variables into semantically meaningful equivalence groups in all domains. The resulting strategy was faster than, and similarly effective to, the aforementioned variable lookahead merging strategy.
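The following sketch illustrates the grouping heuristic and the resulting round-robin merge order; the action encoding (pairs of a precondition and a list of effects, each a partial assignment) is an illustrative assumption.

    from collections import defaultdict

    def symmetric_groups(variables, actions):
        # actions: iterable of (precondition, effects); precondition is a partial
        # assignment (dict), effects is a list of partial assignments.
        pre_count, eff_count = defaultdict(int), defaultdict(int)
        for pre, effects in actions:
            for v in pre:
                pre_count[v] += 1
            for eff in effects:
                for v in eff:
                    eff_count[v] += 1
        groups = defaultdict(list)
        for v in variables:      # same (precondition, effect) counts => symmetric
            groups[(pre_count[v], eff_count[v])].append(v)
        return list(groups.values())

    def round_robin_order(groups):
        # Merge order cycling through the groups: box1, truck1, plane1, box2, ...
        order, i = [], 0
        while any(i < len(g) for g in groups):
            order.extend(g[i] for g in groups if i < len(g))
            i += 1
        return order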
3.5 Edge abstraction
In stochastic planning, the edge density in the state model is substantially
higher than in deterministic, classical planning where at each state, each
action induces just a single edge, corresponding to the single effect of that
action. Empirically, one of the most problematic aspects of the merge-and-
shrink abstractions for MDPs is that, even if the number of states in the
abstract state space is reasonably bounded, abstracting a typical PSAS+
MDP usually yields a huge number of transitions, a few orders of magni-
tude more than the number of abstract states. This is problematic for two
reasons. First, the space complexity of the abstractions grows very quickly,
allowing us to work only with abstract state spaces having a relatively small
number of states. Second, the time complexity of computing value functions
for the intermediate abstractions grows at a similar rate as it is linear in the
number of transitions in the abstraction.
Considering this bottleneck of the number of edges in the abstract transi-
tion graphs, we aimed at identifying conditions under which we could safely
reduce the number of transitions. Below we describe our findings, and note
that all the conditions/properties described below are applicable, and thus
potentially useful, to deterministic planning as well.
Definition 18 Fixed action, effect and transition
Let TA be the current abstraction at an iteration of the merge-and-shrink algorithm with a
linear merge strategy on a PSAS+ task Π = 〈V,A, s0,R〉. Let V ′ ⊆ V be
the subset of all state variables v such that πv has already been merged into
TA. For each action a = 〈pre, E〉 ∈ A, and each effect e ∈ E,
1. e is called fixed if, for each variable v for which e is defined, v ∈ V ′.
2. a is called partially fixed if, for all e ∈ E, e is fixed.
3. a is called fixed if a is partially fixed and, for each variable v for which
pre is defined, v ∈ V ′.
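Definition 18 translates directly into code. The sketch below assumes partial assignments are represented as dictionaries keyed by variable; the names are illustrative only.

    def effect_is_fixed(effect, merged):
        # merged is V', the set of variables already merged into the abstraction.
        return all(v in merged for v in effect)

    def action_is_partially_fixed(action, merged):
        pre, effects = action
        return all(effect_is_fixed(e, merged) for e in effects)

    def action_is_fixed(action, merged):
        pre, effects = action
        return (action_is_partially_fixed(action, merged)
                and all(v in merged for v in pre))

    # In Example 6 below, with merged = {"x"}: action a is fixed, b is
    # partially fixed (its precondition mentions y), and c is not even
    # partially fixed (its effect mentions y).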
A simple and yet very helpful observation is that, as the process of merge
and shrink continues, all merges will keep duplicating each fixed transition
because all the variables involved in the preconditions and effects of the
respective actions are already represented in the current abstraction.
Example 6 Consider the following example over boolean variables x =
{x0, x1}, y = {y0, y1} and z = {z0, z1}, and three simple deterministic ac-
tions
1. a = 〈{x = x0}, {x = x1}〉
2. b = 〈{y = y0}, {x = x1}〉
3. c = 〈{x = x0}, {y = y0}〉
Let us simulate the merge-and-shrink process, starting from the projection πx onto the variable x as the current abstraction:
[Diagram: abstract transition system of πx, with states x0 and x1; transitions labeled a and b lead from x0 to x1, b also self-loops on x1, and c self-loops on x0.]
Notice that the transition labeled with a is fixed, because all of its pre-
condition and effect variables are in the current abstraction. Transitions
labeled b are not fixed because their precondition is not fixed but their effect
is fixed. Finally, the transition labeled c is not fixed, because its effect is not
fixed, though its precondition is fixed. Now, suppose that the next merge step
brings in the atomic abstraction πy:
[Diagram: abstract transition system of πx ⊗ πy over the states x0y0, x0y1, x1y0, x1y1; the fixed a-transitions are duplicated across both values of y, the b-transitions appear only where y = y0, and the c-transitions set y to y0.]
It is easy to see that all the fixed transitions labeled a are copied as is. The fixed effects labeled b are also copied, but only to the y = y0 part, because the precondition of b was not fixed. Observing that all the transitions are now
fixed, consider the outcome of the next merge with the atomic abstraction
πz:
[Diagram: abstract transition system of πx ⊗ πy ⊗ πz over the eight states xiyjzk; every transition is now fixed and is duplicated across both values of z.]
The example above illustrates that the evolution of fixed transitions
within the merge-and-shrink process is fully predictable, that is, we can
fully characterize their source and destination states after all future merge
and shrink steps. With care, this information can be used to eliminate or
merge transitions that are known not to influence the value function of the
final abstraction.
Coming back to the example above, consider the transitions from state x0y0z0 to state x1y0z0 labeled with a and b, keeping in mind that our example corresponds to a special case in which there is exactly one effect for each action.
Figure 3.9: Example of dominated actions: B = {b, c} dominates a.
Given that both these transitions are fixed, if they are associated with the same reward, we can safely discard one of them
because this will not change the value function. Following this intuition,
below we formalize a criterion for a safe elimination of transitions from the
abstraction.
Definition 19 Dominated actions
Let TA be a current abstraction at an iteration of the merge-and-shrink
process, s be a state in TA, a be an action, and {s1, . . . , sn} be a set of all
states s′ in TA such that there is a transition labeled with a from s to s′. We
say that a set of actions B dominates an action a on state s if and only if
1. all the transitions from s labeled with B are to {s1, . . . , sn}, and
2. for any possible value assignment to states {s1, . . . , sn}, the expected
value of the action a will be lower than that of at least one action from
B.
Generic algorithm compute-abstraction(Π, N):
    abs ← {πv | v ∈ V \ Vinit}
    TA ← πVinit
    while |abs| > 0 do
        Remove dominated actions from TA
        Select merging projection πv ∈ abs
        Shrink TA until size(TA) · size(πv) ≤ N
        abs ← abs \ {πv}
        Merge: TA ← TA ⊗ πv
    end
    return TA

Figure 3.10: Merge-and-shrink algorithm extended with dominated-action elimination.
Example 7 Dominated actions
Consider the abstraction depicted in Figure 3.9. The action set B =
{b, c} dominates the action a at state s because (i) if V (t1) > V (t2) then b
has a higher value than a, and (ii) if V (t1) < V (t2), then c has a higher
value than a. Hence, if all the aforementioned transitions are fixed, then
we can safely remove the transition induced by a without affecting the value
function.
Given the notion of action domination at a state, we can now enhance the algorithm by removing dominated actions. Specifically, if action a is dominated at state s by a set of actions B, and also (i) a is partially fixed and (ii) all actions in B are fixed, we can safely erase all transitions of action a from state s without affecting the value function at all. The extended algorithm appears in Figure 3.10; a sketch of the domination test for the two-successor case follows.
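For the two-successor, equal-reward case of Figure 3.9, the expected value of each action is linear in (V(t1), V(t2)), so it suffices to check the two extreme orderings of the successor values. The sketch below encodes this special-case test; the probability-pair encoding and the concrete numbers assumed for Figure 3.9 are our own illustration.

    def dominates(B, a):
        # B: probability pairs (p_t1, p_t2) of the candidate dominating actions;
        # a: the probability pair of the action to be eliminated. With equal
        # rewards, B dominates a iff some action in B puts at least as much
        # mass on t1 (covering the case V(t1) >= V(t2)) and some action in B
        # puts at least as much mass on t2 (covering V(t1) <= V(t2)).
        return (any(p1 >= a[0] for (p1, p2) in B)
                and any(p2 >= a[1] for (p1, p2) in B))

    # Figure 3.9, assuming a = (0.3, 0.7), b = (0.5, 0.5), c = (0.1, 0.9):
    assert dominates([(0.5, 0.5), (0.1, 0.9)], (0.3, 0.7))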
Chapter 4
Experiments
We have evaluated our framework on a set of benchmark domains from
the fully observable probabilistic track of the 2008 international planning
competition [11]. Most domains presented at the competition are planning-
like problems, which means that the reward is nonzero only for the single final action leading to the “goal” state. Such planning-like problems fall under the definition of MDP tasks, but are atypical of standard MDPs, which generally have a richer reward function with many opportunities to gain reward. We compare our abstraction algorithm to
other state-of-the-art planners that participated in that competition.
The competition consisted of seven domains, each with fifteen problem
instances. Those domains were presented in the PPDDL language [37]. Our
PSAS+ language is MVStrips-based, thus requiring us to convert the domains. A simple converter was written to translate each domain instance automatically.
We next describe the actions, using only general names for each. Each such action name can be instantiated for each different variable value. For example, the action pick up from table refers to all the different instantiations pick up boxi from table for every box variable i, and unload box on truck at city can be instantiated to unload boxi on truckj at cityk for all possible variable values i, j, k. Thus, the number of instantiated actions is polynomial in the size of the variable domains; in one domain it is exponential in the number of variables.
4.1 Domains
4.1.1 Blocksworld
Variables:
V = {b1, ..., bn, hand, clearb1, ..., clearbn, goal}
Dbi = {b1, ..., bn, on-table, on-hand}
Dhand = {b1, ..., bn, empty}
Dcleari = Dgoal = {0, 1}
Actions: pick up block from, pick up block from table, put block on, put block on table, pick up tower, put tower down on block, put tower down on table, goal.
This domain is similar to the classical Blocksworld domain. Besides the
standard actions of picking up blocks and putting them down, one could
pick up and put down a tower of two blocks. There is a chance that the
blocks will be scattered on the table. The likelihood of that is higher when
putting a tower down rather than a single block on another block or the
table. The actions incur a cost (a negative reward), and the only positive reward is received upon reaching one specific configuration of blocks. For our algorithm this domain turned out to be extremely hard due to its “classical planning-like” nature, induced by just a single reward at the very end.
4.1.2 Exploding blocksworld
Variables:
V = {b1, ..., bn, clearb1, ..., clearbn, detonatedb1, ..., detonatedbn, destroyedb1, ..., destroyedbn, hand, table ok, goal}
Dbi = {b1, ..., bn, on-table, on-hand}
Dcleari = Ddetonatedi = Ddestroyedi = {0, 1}
Dhand = {b1, ..., bn, empty}
Dtable ok = {0, 1}
Dgoal = {0, 1}
Actions: pick up block, pick up block from table, put down block on, put down block on table.
This domain is similar to the Blocksworld domain but without towers.
Actions might result in detonating a block. This will result in a dead end be-
cause positive rewards are not reachable. This is a very classical “planning-
like” domain, with one positive reward at the end and no negative rewards
at all.
4.1.3 Boxworld
Variables:
V = {box1, ..., boxk, truck1, ..., truckl, plane1, ..., planem}
Dboxi = {city1, ..., cityn, truck1, ..., truckl, plane1, ..., planem}
Dtrucki = Dplanej = {city1, ..., cityn}
Actions: load box on truck at city, unload box from truck at city, load box on plane at city, unload box from plane at city, drive truck, fly plane, goal.
Any box can be loaded to any truck or plane. Trucks and planes move
between cities. Trucks may arrive at the wrong city. Each time a box is
placed at its destination city, a reward is gained. At the end, when all boxes
have reached their destinations, an additional reward is gained. All rewards
are positive.
4.1.4 Triangle tireworld
Variables:
V = {location, flat tire, has spare, spare atloc1, ..., spare atlocm}
Dlocation = {1, ..., n}
Dflat tire = Dhas spare = Dspare atloci = {0, 1}
Actions: move car, load tire, change tire.
This domain is described in detail in Examples 1 and 2. In short, a car has to get from one location to another, moving one location at a time. The locations form a triangle-like map and some locations have spare tires; see Figure 3.1. There is a chance of getting a flat tire while driving. One positive reward is gained at the end. This problem is sensitive to the value-iteration discount factor γ; to solve it optimally we used γ = 0.995.
4.1.5 Rectangle tireworld
Variables:
V = {x, y, dead, goal}
Dx = Dy = {1, ..., n}
Ddead = Dgoal = {0, 1}
Actions: move car, move car diagonal, ghost teleport, goal.
This problem is similar to the triangle tireworld problem, but the loca-
tion is described as two variables x and y. Those locations form a rectangle-
like map. Additionally, some locations are not passable and some are deadly.
Driving diagonally is dangerous. The goal is to cross the map to the other
side. The rectangle-like problem has only four variables and is not interest-
ing (no abstraction is needed).
4.1.6 Schedule
Variables:
V = {C0, ..., Cn, C0 served, ..., Cn served, P0, ..., Pm, P0 dropped, ..., Pm dropped, Phase, Alive, Goal}
DCi = {P1, ..., Pm, None}
DPi = {U0, ..., Uk, Available}
DPhase = {C0, ..., Cn, P0, ..., Pm, Cleanup}
DCi served = DPi dropped = DAlive = DGoal = {0, 1}
Actions: process arrivals, time update, reclaim packet, packet serve, goal.
This is a very complex domain with 7 different groups of variables. The
goal is to serve all the classes (the Ci variables). A class is served if it serves a
packet (Pi variable). The problem simulates a state machine, such that the
state of the machine is determined by the Phase variable. The Phase vari-
able always changes in the same way: C0, C1, ..., P0, P1, ..., Cleanup,C0, C1, ...
and so on. In each round of the phase variable (state C0, ..., Cn of the state
machine), packets are assigned to classes stochastically. Usually there is a
small number of classes and a larger number of packets, and the system must
choose which packet to serve out of several. Thus, some packets have to wait
to be served (U0, ..., Uk are actually time limits after which the packets are
dropped). The time limit of the packets is updated at the P0, ..., Pm states.
The Cleanup state actually changes the Ci served variables, which lead to
the goal state. Only one class may serve one packet each phase round. All
packets may be dropped after some time. When a packet’s value is set to U0
it is reclaimed, and this operation may shut down the system. This
is a hard domain, with one positive reward at the end when all classes are
served and no negative rewards.
4.1.7 Search and rescue
Variables:
V = {at, explored1, ..., exploredn, landable1, ..., landablen, human-onboard, human-alive, human-rescued, on-ground, mission-ended}
Dat = {base, zone1, ..., zonen}
Dlandablei = Dexploredi = Dhuman-onboard = Dhuman-alive = Dhuman-rescued = Don-ground = Dmission-ended = {0, 1}
Actions: go to, explore, land, take off, end mission.
The goal of this problem is to rescue a person by choosing a good loca-
tion for a rescue helicopter to land. The problem is described by possible
locations where the helicopter may land, which are the domain values of the
at variable. The possible action is exploring those locations to check whether
they are landable. When the helicopter finds a place to land, the human is rescued and brought by the helicopter to the base. There is a chance that
the human will die on the way. This is a very symmetric and easy-to-solve
problem, with one positive reward at the end and some negative rewards if
the human dies on the way.
4.1.8 Sysadmin
This domain is a bit different from the others. A network of computers
has to be managed. They have a chance of crashing. At each phase, any
computer might fail, depending on network topology. This domain has a
small number of variables (the number of computers), but converting it to PSAS+ requires an exponential number of effects. We did not test this domain at all. Had the number of effects been small enough not to exceed the 4GB memory limit, we could have solved this domain optimally, because it has a small number of states (no abstraction is needed).
4.2 Setup and environment
We used an environment similar to that of the competition. Forty minutes
were given for each domain instance. This includes creating the mapping of
an instance and running simulations to evaluate performance. We used one
core of a Core(TM)2 Quad CPU Q8200 @2.33GHz computer with a 4GB
memory limit.
Two domains (Search and Rescue and Triangle Tireworld) were tested
with the MDPSim 2.2.2 used in the competition. We wrote an appropriate
planner to read our abstraction. This planner determines the best expected
action using our abstraction. Finally, it converts states and actions from
PPDDL to PSAS+ and vice versa. Additionally we simulated these two
domains with our own PSAS+ simulator to validate the correctness of our
simulator. The other domains were tested with the PSAS+ simulator to
save programming time.
4.3 Results
The results of the six domains are depicted in Figures 4.1-4.6. The x-axis
in the graphs spans the different problem instances within the respective
domain, where a higher instance number corresponds to larger and more
complex problems, while the y-axis corresponds to the reward achieved by
different planners, averaged over 100 runs per problem instance.
1. RFF-BG/PG [34] - The winner of the fully observable probabilistic
track of the 2008 international planning competition
2. Among the planners that participated in the competition, the best-
performing planner for the domain in question
3. The average performance of the planners from the competition that
were capable of dealing with the domain in question.
4. Our abstraction algorithm with minimum-difference value contraction (MSVC), symmetric variable refinement (SVRS), and an abstraction size of 500.
5. MSVC, SVRS and abstraction size of 2000.
6. MSVC, SVRS and abstraction size of 8000.

Figure 4.1: Results for the blocksworld domain. *Average was taken over 4/8 planners.
As mentioned, the Blocksworld domain has one reward at the end. Each instance has a different goal reward and some action costs; that is why many results are negative. Our algorithm did well on the first, smaller instances but could not solve any of the harder ones. In the exploding Blocksworld domain, the reward was always 1 and there was no penalty for actions. But similarly to the Blocksworld domain, there was one reward at the end, which made this domain even harder for our algorithm.
While the Boxworld domain is similar to Blocksworld in that moving
the boxes changes the system state, it is a bit more symmetric and, more
importantly, rewards are gained in intermediate phases, meaning that each
time one box is in its place, we get a reward. Each instance had different
rewards per box, starting from 1 and ending with 100 per box and an addi-
tional 1000 points for all boxes. Our algorithm performed better than the
others on this domain: it was the only one capable of solving the largest
instances, 14 and 15.
Figure 4.2: Results for the exploding blocksworld domain. *Average was taken over 5/8 planners.
The two Tireworld domains are presented together as in the competition.
The first 10 instances span from rewards of 0 to 100 and the last 5 from 0
to 1000. The Triangle Tireworld domain is similar to Blocksworld in that it has one reward at the end. However, because of the close-to-1 discount factor γ, we could propagate the goal reward to other states, and the algorithm chose correctly which states to merge. Thus our results are optimal in most cases.
Rectangle Tireworld is a bad example with which to compare our algo-
rithm because it has only 4 variables, each with huge domains. Thus, if
we can read the input file, we can solve the problem optimally; otherwise,
we cannot. Instance number 15 was too large (having more than 1 million
possible actions) for anyone to solve; the PSAS+ representation of that
problem was 0.5 GB in size. The schedule domain was limited to 100 reward
points when all classes were served, and our algorithm exhibited average
performance on that domain. The search and rescue domain had a poten-
tial of 2000 reward points. This domain was easy, but many planners could not solve it, probably due to advanced PPDDL language constructs. All planners that managed to parse the problem could solve it.
Figure 4.3: Results for the boxworld domain. *Average was taken over 2/8 planners.
Figure 4.4: Results for the joint Tireworld domains as presented and tested in the competition. Instances 1 to 10 are Triangular and 11 to 15 are Rectangular. *Average was taken over 4/8 planners.
Figure 4.5: Results for the schedule domain. *Average was taken over 4/8 planners.
Figure 4.6: Results for the search and rescue domain. *Average was taken over 2/8 planners.
Chapter 5
Summary and Future work
Over the last two decades, numerous works have presented algorithmic ap-
proaches for factored MDPs. While those works on factored MDPs approx-
imate the value function of the original MDP, in this work we explored a
different approach of calculating the exact value function of an approximated
MDP. We exploited and extended a technique known as over-approximating
abstractions to approximately solve the exponential state space MDP prob-
lem.
An abstraction can, in general, be seen as a mapping that reduces the
size of the state space by compacting several states into one. If the abstract
state space is made small enough, the standard solutions for explicit state
space become feasible for it as well. We have adapted the merge-and-shrink
abstraction technique to devise compact controllers for MDPs, by suggest-
ing effective and semantically justifiable strategies for both state contraction
(shrink) and state-space refinement (merge). To cope with the huge number of effects in factored MDPs, we introduced the notion of action abstractions
to extend the merge-and-shrink abstraction technique both for MDPs and
for deterministic planning. This technique allows us to cope more efficiently
with resource limitations by merging or even removing some effects from
the abstract model, sometimes without loss of any valuable information. Fi-
nally, we provided a clear testbed evaluation for our methods and compared
them to other state-of-the-art approaches. The evaluation was carried out
on planning-like MDP domains, which gave our algorithm a natural disad-
vantage because its focus is on the general case of factored MDPs rather
than on planning-like domains. This disadvantage exhibited itself mostly in
the reward structure, as planning-like domains have only a single reward at the goal state. Having said that, our approach appears comparable to state-of-
the-art algorithms.
Future work:
1. To address the natural disadvantage of our algorithm on the planning-
like domains, we need new heuristics that improve the propagation of the goal-state value.
2. Introduce new “safe” ways of reducing the abstraction. Goal-unreachable states are those which cannot lead to other states that gain a reward. An example of a safe way of reducing the abstraction is to combine all goal-unreachable states into one dead-end state.
3. New edge abstraction techniques should be developed to address the largest pitfall of the merge-and-shrink framework for MDPs: the hundreds of millions of transitions. Such techniques would reduce both memory consumption and running time. Example 6 gives a feel for the problem: it shows fixed transitions, as defined in Definition 18. Because the precondition and effect variables of those transitions are already inside the abstraction, the transitions are duplicated over and over again. Note that in this example we only see transition duplication induced by one boolean variable; domains with multi-valued variables and many different effects greatly increase the number of transitions. One way to handle this is to statically analyze fixed actions.
Bibliography
[1] J. Allen. Readings in planning. Morgan Kaufmann Publishers
Inc. San Francisco, CA, USA, 1994.
[2] R. Bellman. Dynamic Programming. Princeton University Press,
Princeton, NJ., 1957.
[3] R. Bellman, R. Kalaba, and B. Kotkin. Polynomial
Approximation–A New Computational Technique in Dynamic
Programming: Allocation Processes, volume 17. American Math-
ematical Society, 1963.
[4] D.P. Bertsekas. Dynamic programming. Prentice-Hall Englewood
Cliffs, NJ, 1987.
[5] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming:
an overview. In Decision and Control, 1995., Proceedings of the
34th IEEE Conference on, volume 1, pages 560–564. IEEE, 1996.
[6] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming:
an overview. In AIChE Symposium Series, pages 92–96. Citeseer,
2002.
[7] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning:
Structural assumptions and computational leverage. Journal of
Artificial Intelligence Research, 11(1):94, 1999.
[8] C. Boutilier and R. Dearden. Approximating value trees in
structured dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), pages 54–62, 1996.
[9] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting struc-
ture in policy construction. In International Joint Conference on
Artificial Intelligence, volume 14, pages 1104–1113, 1995.
[10] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dy-
namic programming with factored representations. Artificial In-
telligence, 121(1-2):49–107, 2000.
[11] D. Bryce and O. Buffet. International planning competition, uncertainty
part: Benchmarks and results. In IPPC, 2008.
[12] J. Culberson and J. Schaeffer. Pattern databases. Computational
Intelligence, 14(4):318–334, 1998.
[13] T. Dean and R. Givan. Model minimization in Markov decision
processes. In Proceedings of the National Conference on Artificial
Intelligence, pages 106–111. Citeseer, 1997.
[14] S. Edelkamp. Planning with pattern databases. In Proceedings of
the European Conference on Planning (ECP), pages 13–34, 2001.
[15] G.J. Gordon. Stable function approximation in dynamic program-
ming. In Twelfth International Conference on Machine Learning,
1995.
[16] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient
solution algorithms for factored MDPs. Journal of Artificial In-
telligence Research, 19(10):399–468, 2003.
[17] M. Helmert, P. Haslum, and J. Hoffmann. Flexible abstraction
heuristics for optimal sequential planning. In Proc. ICAPS, pages
176–183, 2007.
[18] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. Spudd: Stochastic
planning using decision diagrams. In Proceedings of the Fifteenth
Conference on Uncertainty in Artificial Intelligence, pages 279–
288. Citeseer, 1999.
[19] R.A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[20] K. Dräger, B. Finkbeiner, and A. Podelski. Directed model checking with distance-preserving abstractions. In Model Checking Software (SPIN 2006), Lecture Notes in Computer Science, pages 19–34. Springer, 2006.
[21] M. Katz and C. Domshlak. Structural patterns heuristics via
fork decomposition. In Proceedings of the 18th International Con-
ference on Automated Planning and Scheduling (ICAPS), pages
182–189, 2008.
[22] M. Katz and C. Domshlak. Structural-pattern databases. In
Proceedings of the 19th International Conference on Automated
Planning and Scheduling (ICAPS), pages 186–193, 2009.
[23] D. Koller and R. Parr. Computing factored value functions for
policies in structured MDPs. In International Joint Conference
on Artificial Intelligence, volume 16, pages 1332–1339. Citeseer,
1999.
[24] D. Koller and R. Parr. Policy iteration for factored MDPs. In In
Proceedings of the Sixteenth Conference on Uncertainty in Artifi-
cial Intelligence UAI-00, pages 326–334, 2000.
[25] M.L. Puterman. Markov decision processes. Wiley-Interscience,
2005.
[26] M.L. Puterman and M.C. Shin. Modified policy iteration algo-
rithms for discounted Markov decision problems. Management
Science, 24(11):1127–1137, 1978.
[27] S.J. Russell and P. Norvig. Artificial intelligence: a modern ap-
proach. Prentice hall, 2009.
[28] D. Schuurmans and R. Patrascu. Direct value-approximation for
factored MDPs. In Proc. NIPS, volume 14, 2001.
[29] P.J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.
[30] R. St-Aubin, J. Hoey, and C. Boutilier. Apricodd: Approximate
policy construction using decision diagrams. Advances in Neural
Information Processing Systems, pages 1089–1096, 2001.
[31] R.S. Sutton. Learning to predict by the methods of temporal
differences. Machine learning, 3(1):9–44, 1988.
[32] P. Tadepalli and D.K. Ok. Scaling up average reward rein-
forcement learning by approximating the domain models and the
value function. In Proceedings of the International Conference on Machine Learning (ICML), pages 471–479, 1996.
[33] J.A. Tatman and R.D. Shachter. Dynamic programming and in-
fluence diagrams. Systems, Man and Cybernetics, IEEE Trans-
actions on, 20(2):365–379, 1990.
[34] F. Teichteil-Konigsbuch, U. Kuter, and G. Infantes. Incremental
plan aggregation for generating policies in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), volume 1, pages 1231–1238. In-
ternational Foundation for Autonomous Agents and Multiagent
Systems, 2010.
[35] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference
learning with function approximation. IEEE Transactions on Au-
tomatic Control, 42(5):674–690, 1997.
[36] Q. Yang. Intelligent Planning: A Decomposition and Abstraction
Based Approach (Artificial Intelligence). Springer–Verlag Berlin
Heidelberg, 1997.
[37] H.L.S. Younes and M.L. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proc. International Planning
Competition, pages 70–73. Citeseer, 2004.
Abstraction methods for devising compact controllers for Markov decision processes

Kolman Vornovitsky
Abstraction methods for devising compact controllers for Markov decision processes

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Kolman Vornovitsky

Submitted to the Senate of the Technion — Israel Institute of Technology
Adar 5770, Haifa, July 2011
The research was carried out under the supervision of Professor Carmel Domshlak of the Faculty of Industrial Engineering and Management.

I thank the Technion for its generous financial support during my studies.
Abstract

The ability to plan behavior is essential for autonomous systems. This ability increases the flexibility and independence of a system by letting it construct a sequence of actions for achieving its goal. Behavior planning has been studied within the field of artificial intelligence for more than three decades [21]. Planning techniques have been applied in many different systems, including robotics, business process planning, web information gathering, autonomous agents, and space mission control. Behavior planning involves a model representing the actions and the problem world, together with various algorithms enabling efficient search in the space of solutions [1, 25].

Deterministic planning focuses on a model in which every action available to the agent has a single possible outcome, known in advance [1, 25]. A solution to such problems is a sequence of actions that can be executed from the initial state to a goal state, where an optimal solution is one that minimizes the number of actions required to bring the agent to the goal state.

A more general planning setting, which includes stochastic planning, is decision-theoretic planning (DTP). The goal of DTP is to produce action schemes (plans or policies) that achieve a high expectation over the agent's objective function, in contrast to deterministic planning, where the goal is to reach a specific goal state with the smallest possible number of actions. Most sequential decision problems can be described semantically using the model of Markov decision processes (MDPs) [2, 4, 14, 19].

The classical approach to solving MDPs is dynamic programming algorithms such as value iteration [2] and policy iteration [14]. These algorithms compute the optimal decision the agent should take in each and every state in order to achieve the highest possible expectation over the objective function. Many other solutions for the optimal construction of policies were developed in the field of operations research, including in particular the modified policy iteration algorithm [20] and asynchronous versions of value iteration and policy iteration.
Common to all these methods is that they require an explicit enumeration of the state space of the MDP.

The use of MDPs as a model for problem solving has highlighted several combinations of techniques for solving decision problems. These techniques came from the artificial intelligence subfields of planning and decision making under uncertainty, and from operations research. One of the interesting observations arising from all these works is the fact that decision problems are typically characterized by structure, and can therefore be solved with dedicated algorithms built to exploit that structure. One of the best-known such characteristics is the representation of the state space by variables, which is commonly used in planning. While the use of variables defines structure in the problem that can be exploited for computational purposes, it also casts doubt on the applicability of standard MDP solution methods: standard solutions generally assume an explicit state space, whereas a variable-based representation of the state space becomes exponential in the number of variables when made explicit.

In the last two decades, several works have addressed the solution of MDP problems with a variable-based state space, called factored MDPs. For factored MDPs, Boutilier [7] uses a variable-based state space and dynamic Bayesian networks (DBNs) to enable a compact representation of the state model. Other works approximated the solution by approximating the objective function with a linear combination of (not necessarily linear) basis functions; this method was used by Bellman et al. 1963 [3], Sutton 1988 [22], and Tsitsiklis et al. 1997 [24]. Guestrin et al. 2003 [12] used basis functions, each of which takes as parameters a small subset of the variables defining the state space. Dean & Givan 1997 [10] proposed a different solution based on model minimization. This method is similar to compressing a state machine into an equivalent minimal machine. Their algorithm receives as input an MDP problem whose structured state space is given in factored form, and attempts to produce from it a reduced explicit model bounded polynomially in the size of the factored model. Naturally, in general this algorithm cannot guarantee that the compression succeeds in reaching the required polynomial size.

Our work presents a different approach to handling an exponential state space. Instead of approximating the objective function of the original MDP problem, as previous works do, we propose to compute an exact optimal objective function for an approximated MDP problem, which is constructed using all the variables that define the state space of the original MDP problem. The state space of the approximated MDP problem is bounded by a predefined parameter, so as to allow standard algorithms to solve the approximated MDP problem optimally.
This idea was inspired by the progress in the area of admissible search heuristics for deterministic planning. The standard approach to solving deterministic problems is search, in one form or another, where heuristics are the most important general method for improving the efficiency of the search. Heuristics are functions that estimate the distance to the goal state in the search space; such functions help guide the search algorithm. One method for constructing admissible heuristics is over-approximating abstractions [9, 11, 16]. In general, an abstraction is a mapping that reduces the state space by contracting groups of states into single states. Having reduced the state space, one can directly apply reasoning to the entire contracted structure, for example with well-known tools such as depth-first search, breadth-first search, or Dijkstra's algorithm. Such reasoning over abstractions enables the construction of heuristics over complete exponential state spaces.

While abstractions in deterministic planning serve mainly, if not exclusively, for creating informative admissible estimates of the distances from a given state to the goal, our goal is to use abstractions in order to compress the state space while preserving, as far as possible, those properties of the original problem that affect its solution the most. The compression is characterized by the resource limits. At one extreme, the abstraction contracts all the states into a single state; at the other extreme, it contracts no state at all. The intermediate cases are, of course, the most interesting ones: we have a small yet non-negligible amount of memory, and the task is to use it in the best possible way.

On the technical side, we generalize the merge-and-shrink abstraction technique of Drager, Finkbeiner & Podelski [15] from the area of verification of distributed automata systems, which was adopted and extended in the context of deterministic planning by Helmert, Haslum & Hoffmann [13]. The computational appeal of this approach rests on composing the different system features (state variables) with abstraction of the intermediate composites. As Helmert [13] shows, this composition enables the creation of very accurate heuristics obtained within relatively compact abstractions. The technique allows great flexibility in choosing abstractions that are not merely projections. This flexibility also carries a drawback: the hard problem of choosing one abstraction out of an enormous number of abstractions becomes even harder.

The contributions of our work are as follows. First, we provide rigorous semantics for the merge and shrink operators on factored MDPs. Second, we propose efficient and semantically justified strategies for state contraction (shrink) and state-space refinement (merge). We analyze the effectiveness of the proposed strategies on various problems.

Sometimes simplifying the state space is not enough. In deterministic planning, the number of possible transitions
is bounded by S · A, where S is the number of states and A is the number of distinct actions. In DTP, each action may have several possible outcomes, and so the number of transitions is even larger. In addition to adopting the state abstraction method, we extended the merge-and-shrink technique, both for MDPs and for deterministic planning, with a technique of action abstraction. This technique allows us to cope more efficiently with resource limits by merging, or even removing, some of the actions from the abstract model, at times without losing any valuable information whatsoever.

Finally, we report an evaluation of our methods and compare them with the other state-of-the-art methods in the field. This empirical study of (efficient) solving of factored MDPs is the main focus of our work. To evaluate our algorithm, we used the infrastructure and rules of the probabilistic track of the 2008 international planning competition [8]. Our algorithm was tested on six different families of problems from the competition against eight competing algorithms. All the domains are MDPs, but some originate in deterministic planning and are therefore restricted to a structure driven by a single goal state. On three of the six problem families, our approach outperformed the best algorithms from the competition.