CS 416 Artificial Intelligence, Lecture 20: Making Complex Decisions (Chapter 17)



Page 1: CS 416 Artificial Intelligence

CS 416 Artificial Intelligence

Lecture 20: Making Complex Decisions

Chapter 17

Page 2: CS 416 Artificial Intelligence

Midterm Results
AVG: 72
MED: 75
STD: 12
Rough dividing lines at: 58 (C), 72 (B), 85 (A)

Page 3: CS 416 Artificial Intelligence

Assignment 1 Results
AVG: 87
MED: 94
STD: 19

How to interpret the grade sheet…

Page 4: CS 416 Artificial Intelligence

Interpreting the grade sheet…
• You see the tests we ran listed in the first column

• The metrics we accumulated are:

– Solution depth, nodes created, nodes accessed, fringe size

– All metrics are normalized by dividing by the value obtained using one of the good solutions from last year

• The first four columns show these normalized metrics averaged across the entire class’s submissions

• The next four columns show these normalized metrics for your submission…

– Ex: A value of “1” for “Solution” means your code found a solution at the same depth as the solution from last year. The class average for “Solution” might be 1.28 because some submissions searched longer and thus increased the average

Page 5: CS 416 Artificial Intelligence

Interpreting the grade sheet
• SLOW = more than 30 seconds to complete

– 66% credit given to reflect partial credit even though we never obtained firm results

• N/A = the test would not even launch correctly… it might have crashed or ended without output

– 33% credit given to reflect that frequently N/A occurs when no attempt was made to create an implementation

If you have an N/A but you think your code reflects partial credit, let us know.

Page 6: CS 416 Artificial Intelligence

Gambler’s Ruin
Consider working out examples of gambler’s ruin for $4 and $8 by hand

Ben created some graphs to show the solution of gambler’s ruin for $8

$0 bets are not permitted!

Page 7: CS 416 Artificial Intelligence

$8-ruin using batch update
Converges after three iterations.

The value vector is only updated after a complete iteration has completed.

Page 8: CS 416 Artificial Intelligence

$8-ruin using in-place updating
Convergence occurs more quickly.

Updates to the value function occur in-place, starting from $1.
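For concreteness, here is a minimal C++ sketch of value iteration for the $8 gambler's-ruin problem, illustrating where the batch and in-place update schemes differ. The win probability p = 0.4 and the legal bet range 1..min(s, N-s) are assumptions made for this sketch (they are consistent with the values on the "Trying it by hand" slide below); adjust both to match the assignment's actual specification.

// Minimal sketch: value iteration for gambler's ruin with target bankroll $N.
// Assumed (not from the slides): win probability p = 0.4, legal bets 1..min(s, N-s).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int N = 8;        // $8-ruin
    const double p = 0.4;   // assumed probability of winning a single bet
    std::vector<double> V(N + 1, 0.0);
    V[N] = 1.0;             // V(s) = probability of eventually reaching $N

    for (int iter = 0; iter < 100; ++iter) {
        std::vector<double> Vnew = V;          // batch update: read old V, write Vnew
        for (int s = 1; s < N; ++s) {
            double best = 0.0;
            for (int bet = 1; bet <= std::min(s, N - s); ++bet)
                best = std::max(best, p * V[s + bet] + (1.0 - p) * V[s - bet]);
            Vnew[s] = best;
            // In-place variant: write to V[s] here instead, so later states in this
            // same sweep already see the updated values (converges in fewer sweeps).
        }
        V = Vnew;                              // batch: commit only after the full sweep
    }
    for (int s = 0; s <= N; ++s) std::printf("V($%d) = %.3f\n", s, V[s]);
    return 0;
}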

Page 9: CS 416 Artificial Intelligence

$100-ruin
A more detailed graph than provided in the assignment.

Page 10: CS 416 Artificial Intelligence

Trying it by hand
Assume the value update is working…

What’s the best action at $5?

State:  $1     $2     $3     $4     $5     $6     $7     $8
Value:  0.064  0.16   0.256  0.4    0.496  0.64   0.784  1

When tied… pick the smallest action
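As a check, assuming the win probability is 0.4 (the value consistent with the table above), the three legal bets at $5 evaluate to:

\begin{aligned}
\text{bet } \$1:\quad & 0.4\,V(\$6) + 0.6\,V(\$4) = 0.4(0.64) + 0.6(0.4) = 0.496 \\
\text{bet } \$2:\quad & 0.4\,V(\$7) + 0.6\,V(\$3) = 0.4(0.784) + 0.6(0.256) \approx 0.467 \\
\text{bet } \$3:\quad & 0.4\,V(\$8) + 0.6\,V(\$2) = 0.4(1) + 0.6(0.16) = 0.496
\end{aligned}

Bets of $1 and $3 tie at 0.496, so the tie-breaking rule selects the $1 bet.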

Page 11: CS 416 Artificial Intelligence

Office hours
Sunday: 4 – 5 in Thornton Stacks

Send email to Ben ([email protected]) by Saturday at midnight to reserve a slot

Also make sure you have stepped through your code (say, for the $8 example) to confirm that it implements your logic

Page 12: CS 416 Artificial Intelligence

Compilation
Just for grins:

Take your Visual Studio code and compile it using g++: g++ foo.cpp -o foo -Wall

Page 13: CS 416 Artificial Intelligence

Partially observable Markov Decision Processes (POMDPs)

Relationship to MDPs
• Value and Policy Iteration assume you know a lot about the world:

– current state, action, next state, reward for state, …

• In the real world, you don’t know exactly what state you’re in

– Is the car in front braking hard or braking lightly?

– Can you successfully kick the ball to your teammate?

Page 14: CS 416 Artificial Intelligence

Partially observable
Consider not knowing what state you’re in…
• Go left, left, left, left, left

• Go up, up, up, up, up

– You’re probably in the upper-left corner

• Go right, right, right, right, right

Page 15: CS 416 Artificial Intelligence

Extending the MDP model
MDPs have an explicit transition function T(s, a, s’)
• We add O(s, o)

– The probability of observing o when in state s

• We add the belief state, b

– The probability distribution over all possible states

– b(s) = belief that you are in state s

Page 16: CS 416 Artificial Intelligence

Two parts to the problem
Figure out what state you’re in
• Use Filtering from Chapter 15

Figure out what to do in that state
• Bellman’s equation is useful again

The optimal action depends only on the agent’s current belief state

Update b(s) and π(s) / U(s) after each iteration

Page 17: CS 416 Artificial Intelligence

Selecting an action

• α is the normalizing constant that makes the belief state sum to 1

• b’ = FORWARD(b, a, o)

• Optimal policy maps belief states to actions

– Note that the n-dimensional belief state is continuous: each belief value is a number between 0 and 1
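The update itself (the equation on the original slide is not reproduced in this transcript) is presumably the standard filtering step from Chapter 15, written in POMDP notation:

b'(s') = \alpha\, O(s', o) \sum_{s} T(s, a, s')\, b(s)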

Page 18: CS 416 Artificial Intelligence

A slight hitch
The previous slide required that you know the outcome o of action a in order to update the belief state.

If the policy is supposed to navigate through belief space, we want to know what belief state we’re moving into before executing action a.

Page 19: CS 416 Artificial Intelligence

Predicting future belief states
Suppose you know action a was performed when in belief state b. What is the probability of receiving observation o?
• b provides a guess about the initial state

• a is known

• Any observation could be realized… any subsequent state could be realized… any new belief state could be realized

Page 20: CS 416 Artificial Intelligence

Predicting future belief states
The probability of perceiving o, given action a and belief state b, is given by summing over all the actual states the agent might reach:
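In symbols (a reconstruction consistent with the definitions above; the slide's equation image is not in the transcript):

P(o \mid a, b) = \sum_{s'} O(s', o) \sum_{s} T(s, a, s')\, b(s)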

Page 21: CS 416 Artificial Intelligence

Predicting future belief states
We just computed the odds of receiving o. We want the new belief state.
• Let τ(b, a, b’) be the belief transition function

Equal to 1 if b′ = FORWARD(b, a, o); equal to 0 otherwise

Page 22: CS 416 Artificial Intelligence

Predicted future belief states
Combining the previous two slides:
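A reconstruction of the combined equation (the slide image is not in the transcript):

\tau(b, a, b') = P(b' \mid b, a) = \sum_{o} P(b' \mid b, a, o)\, P(o \mid a, b)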

This is a transition model through belief states.

Page 23: CS 416 Artificial Intelligence

Relating POMDPs to MDPs
We’ve found a model for transitions through belief states
• Note MDPs had transitions through states (the real things)

We need a model for rewards based on beliefs
• Note MDPs had a reward function based on state
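The natural choice, standard for belief-state MDPs (the slide's equation is not in the transcript), is the expected reward under the current belief:

\rho(b) = \sum_{s} b(s)\, R(s)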

Page 24: CS 416 Artificial Intelligence

Bringing it all together
We’ve constructed a representation of POMDPs that makes them look like MDPs
• Value and Policy Iteration can be used for POMDPs

• The optimal policy π*(b) of the MDP belief-state representation is also optimal for the physical-state POMDP representation
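Written schematically (the belief space is continuous, so the sum over b' should really be an integral), the Bellman equation keeps its familiar form over belief states:

U(b) = \rho(b) + \gamma \max_{a} \sum_{b'} \tau(b, a, b')\, U(b')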

Page 25: CS 416 Artificial Intelligence

Continuous vs. discrete
Our POMDP in MDP form is continuous
• Cluster the continuous space into regions and try to solve for approximations within these regions

Page 26: CS 416 Artificial Intelligence

Final answer to the POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
• It’s deterministic (it already takes into account the absence of observations)

• It has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …)

• It is successful 86.6% of the time

In general, POMDPs with a few dozen states are nearly impossible to optimize