• That tireless teacher who gets to class early and stays late and dips into her own pocket to buy supplies because she believes that every child is her charge -- she’s marching. (Applause.) That successful businessman who doesn’t have to, but pays his workers a fair wage and then offers a shot to a man, maybe an ex-con, who’s down on his luck -- he’s marching.
• (Cheers, applause.) The mother who pours her love into her daughter so that she grows up with the confidence to walk through the same doors as anybody’s son -- she’s marching.
• (Cheers, applause.) The father who realizes the most important job he’ll ever have is raising his boy right, even if he didn’t have a father, especially if he didn’t have a father at home -- he’s marching.
• The graduate student who not only posts his comment, but comments on other people’s comments on Piazza -- he’s marching.
8/28/2013
50 years since the man’s dream
A: A Unified Brand-name-Free Introduction to Planning (Subbarao Kambhampati)
[Figure: an agent interacting with its Environment through action and perception, driven by Goals. Annotated dimensions: (Static vs. Dynamic), (Observable vs. Partially Observable), (perfect vs. Imperfect), (Deterministic vs. Stochastic), (Instantaneous vs. Durative), (Full vs. Partial satisfaction). The $$$$$$ Question: What action next?]
Representation Mechanisms: Logic (propositional; first order), Probabilistic logic
Learning the models
Search: Blind, Informed; SAT; Planning
Inference: Logical resolution, Bayesian inference
How the course topics stack up…
Topics Covered in CSE471
• Table of Contents (Full Version)
• Preface (html); chapter map
• Part I Artificial Intelligence: 1 Introduction; 2 Intelligent Agents
• Part II Problem Solving: 3 Solving Problems by Searching; 4 Informed Search and Exploration; 5 Constraint Satisfaction Problems; 6 Adversarial Search
• Part III Knowledge and Reasoning: 7 Logical Agents; 8 First-Order Logic; 9 Inference in First-Order Logic; 10 Knowledge Representation
• Part IV Planning: 11 Planning (pdf); 12 Planning and Acting in the Real World
• Part V Uncertain Knowledge and Reasoning: 13 Uncertainty; 14 Probabilistic Reasoning; 15 Probabilistic Reasoning Over Time; 16 Making Simple Decisions; 17 Making Complex Decisions
• Part VI Learning: 18 Learning from Observations; 19 Knowledge in Learning; 20 Statistical Learning Methods; 21 Reinforcement Learning
• Part VII Communicating, Perceiving, and Acting: 22 Communication; 23 Probabilistic Language Processing; 24 Perception; 25 Robotics
• Part VIII Conclusions: 26 Philosophical Foundations; 27 AI: Present and Future
Agent Classification in Terms of State Representations
Type | State representation | Focus
Atomic | States are indivisible; no internal structure | Search on atomic states
Propositional (aka Factored) | States are made of state variables that take values (Propositional or Multi-valued or Continuous) | Search + inference in logical (prop logic) and probabilistic (Bayes nets) representations
Relational | States describe the objects in the world and their inter-relations | Search + inference in predicate logic (or relational prob. models)
First-order | Relational + functions over objects | Search + inference in first order logic (or first order probabilistic models)
Pendulum Swings in AI
• Top-down vs. Bottom-up
• Ground vs. Lifted representation
  – The longer I live the farther down the Chomsky Hierarchy I seem to fall [Fernando Pereira]
• Pure Inference and Pure Learning vs. Interleaved inference and learning
• Knowledge Engineering vs. Model Learning vs. Data-driven Inference
• Human-aware vs. Stand-Alone vs. Human-driven(!)
Markov Decision Processes
Atomic Model for stochastic environments with generalized rewards
• Based in part on slides by Alan Fern, Craig Boutilier and Daniel Weld
• Some slides from Mausam/Kolobov Tutorial; and a couple from Terran Lane
Atomic Model for Deterministic Environments and Goals of Attainment
Deterministic worlds + goals of attainment
• Atomic model: Graph search
• Propositional models: The PDDL planning that we discussed
• What is missing?
  – Rewards are only at the end (and then you die).
    · What about “the Journey is the reward” philosophy?
  – Dynamics are assumed to be deterministic
    · What about stochastic dynamics?
Atomic Model for stochastic environments with generalized rewards
Stochastic worlds + generalized rewards
• An action can take you to any of a set of states with known probability
• You get rewards for visiting each state
• Objective is to increase your “cumulative” reward…
• What is the solution?
Markov Decision Processes
• An MDP has four components: S, A, R, T:
  – (finite) state set S (|S| = n)
  – (finite) action set A (|A| = m)
  – (Markov) transition function T(s,a,s’) = Pr(s’ | s,a)
    · Probability of going to state s’ after taking action a in state s
    · How many parameters does it take to represent?
  – bounded, real-valued (Markov) reward function R(s)
    · Immediate reward we get for being in state s
    · For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others
    · Can be generalized to include action costs: R(s,a)
    · Can be generalized to be a stochastic function
• Can easily generalize to countable or continuous state and action spaces (but algorithms will be different)
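The four components can be written down as a small tabular data structure. This is only a sketch under assumptions: the class name `MDP`, the nested-list layout `T[s][a][s2]`, and the 3-state, 2-action numbers are made up for illustration and do not come from the slides. A full table for T has n·m·n entries, of which n·m·(n-1) are free because each row must sum to 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MDP:
    """Tabular MDP with states 0..n-1 and actions 0..m-1 (hypothetical layout)."""
    T: List[List[List[float]]]  # T[s][a][s2] = Pr(s2 | s, a)
    R: List[float]              # R[s] = immediate reward for being in state s

    @property
    def n(self) -> int:
        return len(self.R)

    @property
    def m(self) -> int:
        return len(self.T[0])

    def validate(self) -> None:
        """Check that every T[s][a] is a probability distribution over next states."""
        for s in range(self.n):
            assert len(self.T[s]) == self.m
            for a in range(self.m):
                row = self.T[s][a]
                assert len(row) == self.n
                assert all(p >= 0.0 for p in row)
                assert abs(sum(row) - 1.0) < 1e-9


# Made-up example: state 2 is absorbing and is the only rewarding state.
toy = MDP(
    T=[
        [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
        [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    ],
    R=[0.0, 0.0, 1.0],
)
toy.validate()
table_entries = toy.n * toy.m * toy.n          # size of the full table for T
free_parameters = toy.n * toy.m * (toy.n - 1)  # each row sums to 1
```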
Assumptions
• First-Order Markovian dynamics (history independence)
  – $\Pr(S_{t+1} \mid A_t, S_t, A_{t-1}, S_{t-1}, \ldots, S_0) = \Pr(S_{t+1} \mid A_t, S_t)$
  – Next state only depends on current state and current action
• First-Order Markovian reward process
  – $\Pr(R_t \mid A_t, S_t, A_{t-1}, S_{t-1}, \ldots, S_0) = \Pr(R_t \mid A_t, S_t)$
  – Reward only depends on current state and action
  – As described earlier we will assume reward is specified by a deterministic function R(s)
    · i.e. $\Pr(R_t = R(S_t) \mid A_t, S_t) = 1$
• Stationary dynamics and reward
  – $\Pr(S_{t+1} \mid A_t, S_t) = \Pr(S_{k+1} \mid A_k, S_k)$ for all t, k
  – The world dynamics do not depend on the absolute time
• Full observability
  – Though we can’t predict exactly which state we will reach when we execute an action, once it is realized, we know what it is
Policies (“plans” for MDPs)
• Nonstationary policy [Even though we have stationary dynamics and reward??]
  – π: S × T → A, where T is the non-negative integers
  – π(s,t) is the action to do at state s with t stages-to-go
  – What if we want to keep acting indefinitely?
• Stationary policy
  – π: S → A
  – π(s) is the action to do at state s (regardless of time)
  – specifies a continuously reactive controller
• These assume or have these properties:
  – full observability
  – history-independence
  – deterministic action choice
Why not just consider sequences of actions?
Why not just replan?
“If you are 20 and are not a liberal, you are heartless. If you are 40 and not a conservative, you are mindless.” -Churchill
# non-stationary policies: $|A|^{|S| \cdot T}$
# stationary policies: $|A|^{|S|}$
Value of a Policy
• How good is a policy π?
• How do we measure “accumulated” reward?
• Value function V: S → ℝ associates value with each state (or each state and time for non-stationary π)
• $V_\pi(s)$ denotes value of policy at state s
  – Depends on immediate reward, but also what you achieve subsequently by following π
  – An optimal policy is one that is no worse than any other policy at any state
• The goal of MDP planning is to compute an optimal policy (method depends on how we define value)
Finite-Horizon Value Functions
• We first consider maximizing total reward over a finite horizon
• Assumes the agent has n time steps to live
• To act optimally, should the agent use a stationary or non-stationary policy?
• Put another way:
  – If you had only one week to live would you act the same way as if you had fifty years to live?
Finite Horizon Problems
• Value (utility) depends on stage-to-go
  – hence so should policy: nonstationary π(s,k)
• $V^k_\pi(s)$ is the k-stage-to-go value function for π
  – expected total reward after executing π for k time steps (for k=0?)
$$V^k_\pi(s) = E\Big[\sum_{t=0}^{k} R^t \,\Big|\, \pi, s\Big] = E\Big[\sum_{t=0}^{k} R(s^t) \,\Big|\, a^t = \pi(s^t, k-t),\ s^0 = s\Big]$$
• Here $R^t$ and $s^t$ are random variables denoting the reward received and state at stage t respectively
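Computing Finite-Horizon Value
• Can use dynamic programming to compute $V^k_\pi(s)$
  – Markov property is critical for this
(a) $V^0_\pi(s) = R(s), \ \forall s$
(b) $V^k_\pi(s) = R(s) + \sum_{s'} T(s, \pi(s,k), s') \cdot V^{k-1}_\pi(s')$
    (first term: immediate reward; second term: expected future payoff with k-1 stages to go)
[Figure: one backup step from $V^{k-1}$ to $V^k$ at state s under action π(s,k), with example outcome probabilities 0.7 and 0.3.]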
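Recurrences (a) and (b) can be sketched directly in code. The function name, the nested-list layout `T[s][a][s2]`, the example numbers, and the policy `cautious_then_bold` are all assumptions made for illustration, not taken from the slides.

```python
def evaluate_finite_horizon(T, R, policy, horizon):
    """Return [V^0, ..., V^horizon] for a non-stationary policy(s, k)."""
    n = len(R)
    values = [list(R)]                   # (a) V^0(s) = R(s)
    for k in range(1, horizon + 1):      # (b) one more stage-to-go per pass
        prev = values[-1]
        values.append([
            R[s] + sum(T[s][policy(s, k)][s2] * prev[s2] for s2 in range(n))
            for s in range(n)
        ])
    return values


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]


def cautious_then_bold(s, k):
    """Hypothetical policy: action 0 with one stage to go, action 1 otherwise."""
    return 0 if k == 1 else 1


values = evaluate_finite_horizon(T, R, cautious_then_bold, horizon=3)
```

Each pass costs one expectation per state, so evaluating a fixed policy over a horizon of k stages takes O(k n^2) time with this dense representation.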
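Bellman Backup
How can we compute optimal $V^{t+1}(s)$ given optimal $V^t$?
[Figure: backup diagram for state s with actions a1 and a2 over successor states s1, s2, s3, s4 valued by $V^t$. One action reaches s1 with probability 0.7 and s4 with probability 0.3; the other reaches s2 with probability 0.4 and s3 with probability 0.6.]
• Compute expectations: $0.7\,V^t(s_1) + 0.3\,V^t(s_4)$ and $0.4\,V^t(s_2) + 0.6\,V^t(s_3)$
• Compute max:
$$V^{t+1}(s) = R(s) + \max\{\, 0.7\,V^t(s_1) + 0.3\,V^t(s_4),\ \ 0.4\,V^t(s_2) + 0.6\,V^t(s_3) \,\}$$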
Value Iteration: Finite Horizon Case
• Markov property allows exploitation of DP principle for optimal policy construction
  – no need to enumerate $|A|^{Tn}$ possible policies
• Value Iteration
$$V^0(s) = R(s), \ \forall s$$
$$V^k(s) = R(s) + \max_a \sum_{s'} T(s,a,s') \cdot V^{k-1}(s') \qquad \text{(Bellman backup)}$$
$$\pi^*(s,k) = \arg\max_a \sum_{s'} T(s,a,s') \cdot V^{k-1}(s')$$
• $V^k$ is the optimal k-stage-to-go value function
• $\pi^*(s,k)$ is the optimal k-stage-to-go policy
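A minimal sketch of finite-horizon value iteration follows, assuming the same made-up nested-list MDP layout `T[s][a][s2]` (the function name and numbers are illustrative only). Ties between actions are broken in favor of the lowest action index, which is an arbitrary choice.

```python
def finite_horizon_value_iteration(T, R, horizon):
    """Return (V, policy): V[k][s] = V^k(s); policy[k-1][s] = pi*(s, k)."""
    n, m = len(R), len(T[0])
    V = [list(R)]                        # V^0(s) = R(s)
    policy = []
    for k in range(1, horizon + 1):
        prev = V[-1]
        stage_values, stage_policy = [], []
        for s in range(n):
            # Expected value of each action with k-1 stages to go.
            q = [sum(T[s][a][s2] * prev[s2] for s2 in range(n)) for a in range(m)]
            best = max(range(m), key=lambda a: q[a])
            stage_policy.append(best)
            stage_values.append(R[s] + q[best])   # Bellman backup
        V.append(stage_values)
        policy.append(stage_policy)
    return V, policy


# Made-up 3-state, 2-action MDP; state 2 is absorbing with reward 1.
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]

V, policy = finite_horizon_value_iteration(T, R, horizon=3)
```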
Value Iteration
[Figure: the backup diagram unrolled over V0, V1, V2, V3 for states s1 to s4, with the same transition probabilities (0.7/0.3 and 0.4/0.6) at every stage.]
$$V^1(s_4) = R(s_4) + \max\{\, 0.7\,V^0(s_1) + 0.3\,V^0(s_4),\ \ 0.4\,V^0(s_2) + 0.6\,V^0(s_3) \,\}$$
Optimal value depends on stages-to-go (independent in the infinite horizon case)
Value Iteration
[Figure: the same diagram over V0, V1, V2, V3 for states s1 to s4, annotated with the policy at s4: Π*(s4,t) = max { }]
Value Iteration
• Note how DP is used
  – optimal soln to k-1 stage problem can be used without modification as part of optimal soln to k-stage problem
• Because of finite horizon, policy nonstationary
• What is the computational complexity?
  – T iterations
  – At each iteration, each of n states computes expectation for |A| actions
  – Each expectation takes O(n) time
• Total time complexity: $O(T|A|n^2)$
  – Polynomial in number of states. Is this good?
Summary: Finite Horizon
• Resulting policy is optimal
$$V^k_{\pi^*}(s) \ge V^k_\pi(s), \quad \forall \pi, s, k$$
  – convince yourself of this
• Note: optimal value function is unique, but optimal policy is not
  – Many policies can have same value
Discounted Infinite Horizon MDPs
• Defining value as total reward is problematic with infinite horizons
  – many or all policies have infinite expected reward
  – some MDPs are ok (e.g., zero-cost absorbing states)
• “Trick”: introduce discount factor 0 ≤ β < 1
  – future rewards discounted by β per time step
$$V_\pi(s) = E\Big[\sum_{t=0}^{\infty} \beta^t R^t \,\Big|\, \pi, s\Big]$$
• Note:
$$V_\pi(s) \le E\Big[\sum_{t=0}^{\infty} \beta^t R^{\max}\Big] = \frac{1}{1-\beta} R^{\max}$$
• Motivation: economic? failure prob? convenience?
Notes: Discounted Infinite Horizon
• Optimal policy maximizes value at each state
• Optimal policies guaranteed to exist (Howard60)
• Can restrict attention to stationary policies
  – I.e. there is always an optimal stationary policy
  – Why change action at state s at new time t?
• We define $V^*(s) = V_\pi(s)$ for some optimal π
Computing an Optimal Value Function
• Bellman equation for optimal value function
$$V^*(s) = R(s) + \beta \max_a \sum_{s'} T(s,a,s')\, V^*(s')$$
  – Bellman proved this is always true
• How can we compute the optimal value function?
  – The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation
• Notice that the optimal value function is a fixed-point of the Bellman Backup operator B
  – B takes a value function as input and returns a new value function
$$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s,a,s')\, V(s')$$
Value Iteration
• Can compute optimal policy using value iteration, just like finite-horizon problems (just include discount term)
$$V^0(s) = 0$$
$$V^k(s) = R(s) + \beta \max_a \sum_{s'} T(s,a,s')\, V^{k-1}(s')$$
• Will converge to the optimal value function as k gets large. Why?
Convergence
• B[V] is a contraction operator on value functions
  – For any V and V’ we have || B[V] – B[V’] || ≤ β || V – V’ ||
  – Here ||V|| is the max-norm, which returns the largest absolute value among the elements of the vector
  – So applying a Bellman backup to any two value functions causes them to get closer together in the max-norm sense.
• Convergence is assured
  – for any V: || V* – B[V] || = || B[V*] – B[V] || ≤ β || V* – V ||
  – so applying Bellman backup to any value function brings us closer to V* by a factor β
  – thus, the fixed-point theorem for contraction operators (Banach) ensures convergence in the limit
• When to stop value iteration? When $\|V^k - V^{k-1}\| \le \epsilon$
  – this ensures $\|V^k - V^*\| \le \epsilon\beta/(1-\beta)$
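Contraction property proof sketch
• Note that for any functions f and g
$$\big|\max_a f(a) - \max_a g(a)\big| \le \max_a \big|f(a) - g(a)\big|$$
• We can use this to show that |B[V] – B[V’]| ≤ β |V – V’|
$$B[V](s) = R(s) + \beta \max_a \sum_{s'} T(s,a,s')\, V(s')$$
$$B[V'](s) = R(s) + \beta \max_a \sum_{s'} T(s,a,s')\, V'(s')$$
Subtract one from the other:
$$(B[V] - B[V'])(s) = \beta \Big[ \underbrace{\max_a \sum_{s'} T(s,a,s')\, V(s')}_{\max_a f} - \underbrace{\max_a \sum_{s'} T(s,a,s')\, V'(s')}_{\max_a g} \Big]$$
$$\le \beta \max_a \Big[ \sum_{s'} T(s,a,s') \big( V(s') - V'(s') \big) \Big]$$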
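The contraction property can also be checked numerically. The sketch below applies the backup operator B to random pairs of value functions on a made-up MDP and records the worst observed ratio ||B[V] – B[V’]|| / ||V – V’||; the helper names, the seed, and the numbers are assumptions for illustration. This is a sanity check, not a proof.

```python
import random


def bellman_backup(T, R, beta, V):
    """B[V](s) = R(s) + beta * max_a sum_s' T(s,a,s') V(s')."""
    n, m = len(R), len(T[0])
    return [
        R[s] + beta * max(
            sum(T[s][a][s2] * V[s2] for s2 in range(n)) for a in range(m)
        )
        for s in range(n)
    ]


def max_norm_distance(U, V):
    """Max-norm of U - V: largest absolute elementwise difference."""
    return max(abs(u - v) for u, v in zip(U, V))


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]
beta = 0.9

random.seed(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = [random.uniform(-10.0, 10.0) for _ in R]
    V2 = [random.uniform(-10.0, 10.0) for _ in R]
    before = max_norm_distance(V1, V2)
    after = max_norm_distance(
        bellman_backup(T, R, beta, V1), bellman_backup(T, R, beta, V2)
    )
    if before > 0.0:
        worst_ratio = max(worst_ratio, after / before)
```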
How to Act
• Given a $V^k$ from value iteration that closely approximates V*, what should we use as our policy?
• Use greedy policy:
$$\text{greedy}[V^k](s) = \arg\max_a \sum_{s'} T(s,a,s')\, V^k(s')$$
• Note that the value of the greedy policy may not be equal to $V^k$
• Let $V^G$ be the value of the greedy policy. How close is $V^G$ to V*?
  – We can show that greedy is not too far from optimal if $V^k$ is close to V*
• In particular, if $V^k$ is within ε of V*, then $V^G$ is within $2\epsilon\beta/(1-\beta)$ of V* (if ε is 0.001 and β is 0.9, we have 0.018)
• Furthermore, there exists a finite ε s.t. greedy policy is optimal
  – That is, even if value estimate is off, greedy policy is optimal once it is close enough
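Greedy policy extraction is a one-step lookahead over a given value estimate. In the sketch below, the function name, the example MDP, and the value estimate `V_k` are hypothetical; the last lines recompute the bound quoted above for ε = 0.001 and β = 0.9.

```python
def greedy_policy(T, V):
    """Return greedy[V](s) = argmax_a sum_s' T(s,a,s') V(s') for every state s."""
    n, m = len(V), len(T[0])
    return [
        max(range(m), key=lambda a: sum(T[s][a][s2] * V[s2] for s2 in range(n)))
        for s in range(n)
    ]


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]

V_k = [8.0, 8.0, 10.0]           # hypothetical approximate value function
policy = greedy_policy(T, V_k)   # ties go to the lowest action index

# Loss bound from the slide: 2 * eps * beta / (1 - beta).
eps, beta = 0.001, 0.9
greedy_loss_bound = 2 * eps * beta / (1 - beta)
```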
Improvements to Value Iteration
• Initialize with a good approximate value function
  – Instead of R(s), consider something more like h(s)
    · Well defined only for SSPs
• Asynchronous value iteration
  – Can use the already updated values of neighbors to update the current node
• Prioritized sweeping
  – Can decide the order in which to update states
    · As long as each state is updated infinitely often, it doesn’t matter in what order, or how unevenly, you update them
    · What are good heuristics for Value iteration?
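One simple form of asynchronous value iteration updates the value table in place, so states later in a sweep already see the new values of states updated earlier (often called Gauss-Seidel value iteration). The sketch below is an assumption-laden illustration: the function name, the `order` parameter standing in for a priority scheme, and the example MDP are all made up. Real prioritized sweeping would choose the order dynamically, which this sketch does not do.

```python
def gauss_seidel_value_iteration(T, R, beta, eps=1e-6, order=None):
    """In-place (asynchronous) discounted value iteration.

    `order` fixes the sequence in which states are updated in each sweep.
    """
    n, m = len(R), len(T[0])
    order = list(range(n)) if order is None else list(order)
    V = [0.0] * n
    sweeps = 0
    while True:
        delta = 0.0
        for s in order:
            backed_up = R[s] + beta * max(
                sum(T[s][a][s2] * V[s2] for s2 in range(n)) for a in range(m)
            )
            delta = max(delta, abs(backed_up - V[s]))
            V[s] = backed_up   # later states in this sweep see the new value
        sweeps += 1
        if delta <= eps:
            return V, sweeps


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]

V_forward, sweeps_forward = gauss_seidel_value_iteration(T, R, beta=0.9)
V_reverse, sweeps_reverse = gauss_seidel_value_iteration(
    T, R, beta=0.9, order=[2, 1, 0]
)
```

Both orders reach the same fixed point; only the number of sweeps needed may differ.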
9/14 (make-up for 9/12)
• Policy Evaluation for Infinite Horizon MDPs
• Policy Iteration
  – Why it works
  – How it compares to Value Iteration
• Indefinite Horizon MDPs
  – The Stochastic Shortest Path MDPs
    · With initial state
  – Value Iteration works; policy iteration?
• Reinforcement Learning start
Policy Evaluation
• Value equation for fixed policy
$$V_\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s')\, V_\pi(s')$$
  – Notice that this is stage-independent
• How can we compute the value function for a policy?
  – we are given R and Pr
  – simple linear system with n variables (each variable is the value of a state) and n constraints (one value equation for each state)
  – Use linear algebra (e.g. matrix inverse)
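In matrix form the value equation reads (I – β T_π) V = R, where T_π is the n × n matrix of transition probabilities under π. The sketch below solves that system with a small hand-written Gauss-Jordan routine so that it needs no external library; the helper names, the example MDP, and the policy are assumptions for illustration. A numerical library's linear solver would normally replace `solve_linear`.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(T, R, policy, beta):
    """Exact value of a stationary policy: solve (I - beta * T_pi) V = R."""
    n = len(R)
    A = [
        [(1.0 if s == s2 else 0.0) - beta * T[s][policy[s]][s2] for s2 in range(n)]
        for s in range(n)
    ]
    return solve_linear(A, R)


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]
beta = 0.9
policy = [1, 1, 0]   # hypothetical stationary policy: policy[s] = action

V_pi = evaluate_policy(T, R, policy, beta)
```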
Policy Iteration
• Given fixed policy, can compute its value exactly:
$$V_\pi(s) = R(s) + \beta \sum_{s'} T(s, \pi(s), s')\, V_\pi(s')$$
• Policy iteration exploits this: iterates steps of policy evaluation and policy improvement
1. Choose a random policy π
2. Loop:
   (a) Evaluate $V_\pi$
   (b) For each s in S, set $\pi'(s) = \arg\max_a \sum_{s'} T(s,a,s')\, V_\pi(s')$ (policy improvement)
   (c) Replace π with π’
   Until no improving action possible at any state
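The loop above can be sketched as follows. Exact policy evaluation is done by solving the linear system (I – β T_π) V = R with a small hand-written Gauss-Jordan routine. The function names, the fixed initial policy (all zeros rather than a random one), the strict-improvement tolerance, and the example MDP are assumptions for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(T, R, policy, beta):
    """Step (a): solve (I - beta * T_pi) V = R exactly."""
    n = len(R)
    A = [
        [(1.0 if s == s2 else 0.0) - beta * T[s][policy[s]][s2] for s2 in range(n)]
        for s in range(n)
    ]
    return solve_linear(A, R)


def policy_iteration(T, R, beta):
    n, m = len(R), len(T[0])
    policy = [0] * n          # arbitrary starting policy (an assumption)
    rounds = 0
    while True:
        V = evaluate_policy(T, R, policy, beta)
        rounds += 1
        improved = False
        for s in range(n):    # step (b): greedy improvement against V
            q = [sum(T[s][a][s2] * V[s2] for s2 in range(n)) for a in range(m)]
            best = max(range(m), key=lambda a: q[a])
            if q[best] > q[policy[s]] + 1e-12:   # switch only on strict gain
                policy[s] = best
                improved = True
        if not improved:      # no improving action at any state
            return policy, V, rounds


# Made-up 3-state, 2-action MDP: T[s][a][s2] = Pr(s2 | s, a).
T = [
    [[0.7, 0.3, 0.0], [0.0, 0.4, 0.6]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [0.0, 0.0, 1.0]

policy, V, rounds = policy_iteration(T, R, beta=0.9)
```

Switching actions only on a strict gain avoids cycling between actions of equal value.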
Policy Iteration Notes
• Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible
  – [Why? The same contraction property of Bellman Update. Note that when you go from value to policy to value to policy, you are effectively doing a DP update on the value]
• Convergence assured (Howard)
  – intuitively: no local maxima in value space, and each policy must improve value; since finite number of policies, will converge to optimal policy
• Gives exact value of optimal policy
• Complexity:
  – There are at most exp(n) policies, so PI is no worse than exponential time in number of states
  – Empirically O(n) iterations are required
  – Still no polynomial bound on the number of PI iterations (open problem)!
Improvements to Policy Iteration
• Find the value of the policy approximately (by value iteration) instead of exactly solving the linear equations
  – Can run just a few iterations of the value iteration
    · This can be asynchronous, prioritized etc.
Value Iteration vs. Policy Iteration
• Which is faster? VI or PI
  – It depends on the problem
• VI takes more iterations than PI, but PI requires more time on each iteration
  – PI must perform policy evaluation on each step, which involves solving a linear system
    · Can be done approximately
  – Also, VI can be done with asynchronous and prioritized updates
• ***Value Iteration is more robust: its convergence is guaranteed for many more types of MDPs.
Need for Indefinite Horizon MDPs
• We have seen
  – Finite horizon MDPs
  – Infinite horizon MDPs
• In many cases, we have neither a finite nor an infinite horizon, but rather some indefinite horizon
  – Need to model MDPs without discount factor, knowing only that the behavior sequences will be finite
Stochastic Shortest-Path MDPs: Definition
SSP MDP is a tuple <S, A, T, C, G>, where:
• S is a finite state space
• (D is an infinite sequence (1,2, …))
• A is a finite action set
• T: S x A x S → [0, 1] is a stationary transition function
• C: S x A x S → ℝ is a stationary cost function (= -R: S x A x S → ℝ)
• G is a set of absorbing cost-free goal states
Under two conditions:
• There is a proper policy (reaches a goal with P = 1 from all states)
  – No sink states allowed.
• Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1
Bertsekas, 1995
[SSP slides from Mausam/Kolobov Tutorial]
SSP MDP Details
• In SSP, maximizing ELAU = minimizing exp. cost
• Every cost-minimizing policy is proper!
• Thus, an optimal policy = cheapest way to a goal
Not an SSP MDP Example
[Figure: an MDP over states S1, S2, S3 and goal SG, with actions a1 and a2, that violates the SSP conditions. Edge labels: C(s1, a1, s2) = 1; C(s2, a1, s1) = -1; C(s1, a2, s1) = 7.2; C(s2, a2, sG) = 1; C(s2, a2, s2) = -3; T(s2, a2, sG) = 0.3; T(s2, a2, sG) = 0.7; C(sG, a1, sG) = 0; C(sG, a2, sG) = 0; C(s3, a1, s3) = 2.4; C(s3, a2, s3) = 0.8. Callouts: No dead ends allowed! No cost-free “loops” allowed!]
SSP MDPs: Optimality Principle
For an SSP MDP, let:
– $V^\pi(s,t) = E^\pi_{s,t}[C_1 + C_2 + \ldots]$ for all s, t (Exp. Lin. Add. Utility)
  Every policy either takes a finite exp. # of steps to reach a goal, or has an infinite cost. For every s, t, the value of a policy is well-defined!
Then:
– V* exists, π* exists, both are stationary
– For all s:
$$V^*(s) = \min_{a \in A} \Big[ \sum_{s' \in S} T(s,a,s') \big[ C(s,a,s') + V^*(s') \big] \Big]$$
$$\pi^*(s) = \arg\min_{a \in A} \Big[ \sum_{s' \in S} T(s,a,s') \big[ C(s,a,s') + V^*(s') \big] \Big]$$
SSP and Other MDP Classes
• SSP is an “indefinite-horizon” MDP
• Can compile FH to SSP by considering a new state space that is <S,T> (state at epoch)
• Can compile IHDR to SSP by introducing a “goal” node and adding a (1 – β) probability transition from any state to the goal state.
[Figure: SSP contains both IHDR and FH.]
Algorithms for SSP
• Value Iteration works without change for SSPs
  – (as long as they *are* SSPs, i.e., have proper policies and infinite costs for all improper ones)
  – Instead of Max operations, the Bellman update does min operations
• Policy iteration works *iff* we start with a proper policy (otherwise, it diverges)
  – It is not often clear how to pick a proper policy though
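A minimal sketch of value iteration for an SSP follows: the backup takes a min over expected cost-to-go and there is no discount factor. The sparse layout `transitions[s][a] = [(probability, next_state, cost), ...]`, the state and action names, the optional `heuristic` argument, and all numbers are assumptions for illustration. The made-up example has a proper policy, and every cycle that avoids the goal has positive cost.

```python
def ssp_value_iteration(transitions, goals, heuristic=None, eps=1e-9):
    """Return (V, policy) for an SSP given as sparse outcome lists.

    Goal states are absorbing and cost-free, so their value stays 0.
    `heuristic` optionally supplies initial values for non-goal states.
    """
    V = {s: 0.0 for s in transitions}
    V.update({g: 0.0 for g in goals})
    if heuristic is not None:
        V.update({s: heuristic[s] for s in transitions if s not in goals})

    def q_value(s, a, values):
        # Expected immediate cost plus cost-to-go of taking a in s.
        return sum(p * (c + values[s2]) for p, s2, c in transitions[s][a])

    while True:
        new_V = dict(V)
        delta = 0.0
        for s in transitions:
            if s in goals:
                continue
            new_V[s] = min(q_value(s, a, V) for a in transitions[s])
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta <= eps:
            break

    policy = {
        s: min(transitions[s], key=lambda a: q_value(s, a, V))
        for s in transitions
        if s not in goals
    }
    return V, policy


# Made-up SSP: from s0, a0 moves to s1 (cost 1); a1 reaches the goal half
# the time (cost 2 either way). From s1, a0 reaches the goal (cost 1).
transitions = {
    "s0": {"a0": [(1.0, "s1", 1.0)], "a1": [(0.5, "g", 2.0), (0.5, "s0", 2.0)]},
    "s1": {"a0": [(1.0, "g", 1.0)], "a1": [(1.0, "s0", 1.0)]},
}
goals = {"g"}

V, policy = ssp_value_iteration(transitions, goals)
# Starting from a lower bound on V* (hypothetical heuristic values).
V_h, policy_h = ssp_value_iteration(
    transitions, goals, heuristic={"s0": 1.0, "s1": 1.0}
)
```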
SSPs and A* Search
• The SSP model, being based on absorbing goals and action costs, is very close to the A* search model
  – Identical if you have deterministic actions and start the SSP with a specific initial state
  – For this case, the optimal value function of SSP is the perfect heuristic for the corresponding A* search
  – An admissible heuristic for A* will be a “lower bound” on V* for the SSP
    • Start value iteration by initializing with an admissible heuristic!
• ..and since SSP theoretically subsumes Finite Horizon and Infinite Horizon models, you get an effective bridge between MDPs and A* search
  • The bridge also allows us to “solve” MDPs using advances in deterministic planning/A* search..
Summary of MDPs
• There are many MDPs, we looked at those that
  – aim to maximize expected cumulative reward (as against, say, average reward)
  – Finite horizon, Infinite Horizon and SSP MDPs
• We looked at Atomic MDPs: states are atomic
• We looked at exact methods for solving MDPs
  – Value Iteration (and improvements including asynchronous and prioritized sweeping)
  – Policy Iteration (and improvements including modified PI)
• We looked at connections to A* search
Other topics in MDPs (that we will get back to)
• Approximate solutions for MDPs
  – E.g. Online solutions based on determinizations
• Factored representations for MDPs
  – States in terms of state variables
  – Actions in terms of either Probabilistic STRIPS or Dynamic Bayes Net representations
  – Value and Reward functions in terms of decision diagrams
Modeling Softgoal problems as deterministic MDPs
• Consider the net-benefit problem, where actions have costs, and goals have utilities, and we want a plan with the highest net benefit
• How do we model this as MDP?
  – (wrong idea): Make every state in which any subset of goals hold into a sink state with reward equal to the cumulative sum of utilities of the goals.
    • Problem: what if achieving g1 ∧ g2 will necessarily lead you through a state where g1 is already true?
  – (correct version): Make a new fluent called “done” and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent “done”. All “done” states are sink states. Their reward is equal to the sum of the utilities of the individual goals that hold in them.