
CS 188 Summer 2019
Introduction to Artificial Intelligence
Written HW 3 Sol.

Self-assessment due: Tuesday 7/23/2019 at 11:59pm (submit via Gradescope)


Q1. MDPs: Dice Bonanza

A casino is considering adding a new game to their collection, but needs to analyze it before releasing it on their floor. They have hired you to execute the analysis. On each round of the game, the player has the option of rolling a fair 6-sided die. That is, the die lands on values 1 through 6 with equal probability. Each roll costs 1 dollar, and the player must roll the very first round. Each time the player rolls the die, the player has two possible actions:

1. Stop: Stop playing by collecting the dollar value that the die lands on, or

2. Roll: Roll again, paying another 1 dollar.

Having taken CS 188, you decide to model this problem using an infinite horizon Markov Decision Process (MDP). The player initially starts in state Start, where the player only has one possible action: Roll. State s_i denotes the state where the die lands on i. Once a player decides to Stop, the game is over, transitioning the player to the End state.

(a) In solving this problem, you consider using policy iteration. Your initial policy π is in the table below. Evaluate the policy at each state, with γ = 1.

 State     s_1    s_2    s_3    s_4    s_5    s_6
 π(s)      Roll   Roll   Stop   Stop   Stop   Stop
 V^π(s)    3      3      3      4      5      6

We have that V^π(s_i) = i for i ∈ {3, 4, 5, 6}, since the player is awarded no further rewards under this policy. From the Bellman equations, we have that

V(s_1) = −1 + (1/6) (V(s_1) + V(s_2) + 3 + 4 + 5 + 6)    and    V(s_2) = −1 + (1/6) (V(s_1) + V(s_2) + 3 + 4 + 5 + 6).

Solving this linear system yields V(s_1) = V(s_2) = 3.
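As a numerical check (a minimal sketch, not part of the original handout; NumPy and the variable names are my own choices), the same two-equation linear system can be solved directly:

```python
import numpy as np

# Policy evaluation for the fixed policy (Roll in s1 and s2, Stop elsewhere), gamma = 1.
# Under this policy V(s_i) = i for i in {3, 4, 5, 6}, so the only unknowns are V(s1), V(s2):
#   V(s1) = -1 + (1/6) * (V(s1) + V(s2) + 3 + 4 + 5 + 6)
#   V(s2) = -1 + (1/6) * (V(s1) + V(s2) + 3 + 4 + 5 + 6)
# Rearranged into A @ x = b with x = [V(s1), V(s2)]:
A = np.array([[1 - 1/6, -1/6],
              [-1/6, 1 - 1/6]])
b = np.array([-1 + 18/6, -1 + 18/6])

print(np.linalg.solve(A, b))  # [3. 3.], matching the table above
```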

(b) Having determined the values, perform a policy update to find the new policy π′. The table below shows the old policy π and has filled in parts of the updated policy π′ for you. If both Roll and Stop are viable new actions for a state, write down both as Roll/Stop. In this part as well, we have γ = 1.

 State     s_1    s_2    s_3         s_4    s_5    s_6
 π(s)      Roll   Roll   Stop        Stop   Stop   Stop
 π′(s)     Roll   Roll   Roll/Stop   Stop   Stop   Stop

For each s_i in part (a), we compare the values obtained via Rolling and Stopping. The value of Rolling from each state s_i is −1 + (1/6)(3 + 3 + 3 + 4 + 5 + 6) = 3. The value of Stopping at state s_i is i. At each state s_i, we take the action that yields the largest value; so, for s_1 and s_2, we Roll, and for s_4, s_5, and s_6, we Stop. For s_3, we write Roll/Stop, since the values from Rolling and Stopping are equal.
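The improvement step can be checked the same way; below is a minimal sketch using the part (a) values (again, the variable names are my own):

```python
# One step of policy improvement (gamma = 1) from the part (a) values.
V = {1: 3, 2: 3, 3: 3, 4: 4, 5: 5, 6: 6}

for i in range(1, 7):
    q_stop = i                          # Stop: collect the face value i
    q_roll = -1 + sum(V.values()) / 6   # Roll: pay 1 dollar, then a uniformly random face
    if q_roll > q_stop:
        choice = "Roll"
    elif q_roll < q_stop:
        choice = "Stop"
    else:
        choice = "Roll/Stop"            # tie, which is what happens at s3
    print(f"s{i}: Roll = {q_roll:.1f}, Stop = {q_stop}, pi'(s{i}) = {choice}")
```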

(c) Is π(s) from part (a) optimal? Explain why or why not.

Yes, the old policy is optimal. Looking at part (b), there is a tie between two equally good policies that policy iteration considers employing. One of these policies is the same as the old policy. This means that both new policies are exactly as good as the old policy, and policy iteration has converged. Since policy iteration converges to the optimal policy, we can be sure that π(s) from part (a) is optimal.


(d) Suppose that we were now working with some γ ∈ [0, 1) and wanted to run value iteration. Select the one statement that would hold true at convergence, or write the correct answer next to Other if none of the options are correct.

○ V*(s_i) = max{ −1 + i/6 , Σ_j γ V*(s_j) }
○ V*(s_i) = max{ i , (1/6) (−1 + Σ_j γ V*(s_j)) }
○ V*(s_i) = max{ −1/6 + i , Σ_j γ V*(s_j) }
○ V*(s_i) = max{ i , −1/6 + Σ_j γ V*(s_j) }
○ V*(s_i) = (1/6) · Σ_j max{ i , −1 + γ V*(s_j) }
○ V*(s_i) = (1/6) · Σ_j max{ −1 + i , Σ_k V*(s_k) }
○ V*(s_i) = Σ_j max{ −1 + i , (1/6) γ V*(s_j) }
○ V*(s_i) = Σ_j max{ i/6 , −1 + γ V*(s_j) }
● V*(s_i) = max{ i , −1 + (γ/6) Σ_j V*(s_j) }
○ V*(s_i) = Σ_j max{ i , −1/6 + γ V*(s_j) }
○ V*(s_i) = Σ_j max{ −i/6 , −1 + γ V*(s_j) }
○ Other

At convergence,

V*(s_i) = max_a Q*(s_i, a)
        = max{ Q*(s_i, stop), Q*(s_i, roll) }
        = max{ R(s_i, stop), R(s_i, roll) + γ Σ_j T(s_i, roll, s_j) V*(s_j) }
        = max{ i , −1 + (γ/6) Σ_j V*(s_j) }
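To get a feel for the fixed point when γ < 1, the selected equation can simply be iterated; here is a hedged sketch with γ = 0.9 chosen arbitrarily for illustration:

```python
# Value iteration on the dice states for gamma in [0, 1):
#   V(s_i) <- max( i, -1 + (gamma / 6) * sum_j V(s_j) )
gamma = 0.9
V = [0.0] * 6  # V[i] stores the value of state s_{i+1}

for _ in range(10_000):
    total = sum(V)
    V_new = [max(i + 1, -1 + gamma * total / 6) for i in range(6)]
    if max(abs(a - b) for a, b in zip(V, V_new)) < 1e-12:
        break
    V = V_new

print([round(v, 3) for v in V])  # the converged V*(s_1), ..., V*(s_6)
```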


Q2. Bellman Equations for the Post-Decision State

Consider an infinite-horizon, discounted MDP (S, A, T, R, γ). Suppose that the transition probabilities and the reward function have the following form:

T(s, a, s′) = P(s′ | f(s, a)),    R(s, a, s′) = R(s, a)

Here, f is some deterministic function mapping S × A → Y, where Y is a set of states called post-decision states. We will use the letter y to denote an element of Y, i.e., a post-decision state. In words, the state transitions consist of two steps: a deterministic step that depends on the action, and a stochastic step that does not depend on the action. The sequence of states (s_t), actions (a_t), post-decision states (y_t), and rewards (r_t) is illustrated below.

[Diagram: (s_0, a_0) →f→ y_0 →P→ (s_1, a_1) →f→ y_1 →P→ (s_2, a_2) →f→ · · · , with reward r_t received at each (s_t, a_t).]

You have learned about V^π(s), which is the expected discounted sum of rewards, starting from state s, when acting according to policy π:

V^π(s_0) = E[ R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + · · · ]    given a_t = π(s_t) for t = 0, 1, 2, . . .

V*(s) is the value function of the optimal policy, V*(s) = max_π V^π(s).

This question will explore the concept of computing value functions on the post-decision states y.¹

W^π(y_0) = E[ R(s_1, a_1) + γ R(s_2, a_2) + γ² R(s_3, a_3) + · · · ]

We define W*(y) = max_π W^π(y).

¹ In some applications, it is easier to learn an approximate W function than V or Q. For example, to use reinforcement learning to play Tetris, a natural approach is to learn the value of the block pile after you've placed your block, rather than the value of the pair (current block, block pile). TD-Gammon, a computer program developed in the early 90s, was trained by reinforcement learning to play backgammon as well as the top human experts. TD-Gammon learned an approximate W function.


(a) Write W* in terms of V*.

W*(y) =

● Σ_{s′} P(s′ | y) V*(s′)
○ Σ_{s′} P(s′ | y) [ V*(s′) + max_a R(s′, a) ]
○ Σ_{s′} P(s′ | y) [ V*(s′) + γ max_a R(s′, a) ]
○ Σ_{s′} P(s′ | y) [ γ V*(s′) + max_a R(s′, a) ]
○ None of the above

Consider the expected rewards under the optimal policy.

W*(y_0) = E[ R(s_1, a_1) + γ R(s_2, a_2) + γ² R(s_3, a_3) + · · · | y_0 ]
        = Σ_{s_1} P(s_1 | y_0) E[ R(s_1, a_1) + γ R(s_2, a_2) + γ² R(s_3, a_3) + · · · | s_1 ]
        = Σ_{s_1} P(s_1 | y_0) V*(s_1)

V* is time-independent, so we can replace y_0 by y and s_1 by s′, giving

W*(y) = Σ_{s′} P(s′ | y) V*(s′)

(b) Write V* in terms of W*.

V*(s) =

○ max_a [ W*(f(s, a)) ]
○ max_a [ R(s, a) + W*(f(s, a)) ]
● max_a [ R(s, a) + γ W*(f(s, a)) ]
○ max_a [ γ R(s, a) + W*(f(s, a)) ]
○ None of the above

V*(s_0) = max_{a_0} Q(s_0, a_0)
        = max_{a_0} E[ R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + · · · | s_0, a_0 ]
        = max_{a_0} ( E[ R(s_0, a_0) | s_0, a_0 ] + E[ γ R(s_1, a_1) + γ² R(s_2, a_2) + · · · | s_0, a_0 ] )
        = max_{a_0} ( R(s_0, a_0) + E[ γ R(s_1, a_1) + γ² R(s_2, a_2) + · · · | f(s_0, a_0) ] )
        = max_{a_0} ( R(s_0, a_0) + γ W*(f(s_0, a_0)) )

Renaming variables, we get

V*(s) = max_a ( R(s, a) + γ W*(f(s, a)) )


(c) Recall that the optimal value function V* satisfies the Bellman equation:

V*(s) = max_a Σ_{s′} T(s, a, s′) ( R(s, a) + γ V*(s′) ),

which can also be used as an update equation to compute V*.

Provide the equivalent of the Bellman equation for W*.

W*(y) = Σ_{s′} P(s′ | y) max_a ( R(s′, a) + γ W*(f(s′, a)) )

The answer follows from combining parts (a) and (b).
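To see this update in action, here is a self-contained sketch on a tiny invented MDP of the required form (the states, probabilities, and rewards below are placeholders for illustration, not part of the question):

```python
# Toy MDP with T(s, a, s') = P(s' | f(s, a)): two states, two actions, gamma = 0.9.
states, actions, gamma = [0, 1], [0, 1], 0.9

def f(s, a):                                      # deterministic post-decision state
    return (s + a) % 2

P = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # P[y][s'] = P(s' | y)
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}    # R[s][a] = R(s, a)

# Iterate the Bellman equation for W*:
#   W*(y) = sum_{s'} P(s' | y) * max_a ( R(s', a) + gamma * W*(f(s', a)) )
W = {y: 0.0 for y in P}
for _ in range(2000):
    W = {y: sum(p * max(R[s2][a] + gamma * W[f(s2, a)] for a in actions)
                for s2, p in P[y].items())
         for y in P}

# Recover V*(s) = max_a ( R(s, a) + gamma * W*(f(s, a)) ), as in part (b), and check
# the part (a) relation W*(y) = sum_{s'} P(s' | y) V*(s') at the fixed point.
V = {s: max(R[s][a] + gamma * W[f(s, a)] for a in actions) for s in states}
print(W, V)
print({y: sum(p * V[s2] for s2, p in P[y].items()) for y in P})  # approximately equals W
```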

(d) Fill in the blanks to give a policy iteration algorithm which is guaranteed to return the optimal policy π*.

• Initialize the policy π^(1) arbitrarily.

• For i = 1, 2, 3, . . .

  – Compute W^{π^(i)}(y) for all y ∈ Y.

  – Compute a new policy π^(i+1), where π^(i+1)(s) = argmax_a ___(1)___ for all s ∈ S.

  – If ___(2)___ for all s ∈ S, return π^(i).

Fill in your answers for blanks (1) and (2) below.

(1) ○ W^{π^(i)}(f(s, a))
    ○ R(s, a) + W^{π^(i)}(f(s, a))
    ● R(s, a) + γ W^{π^(i)}(f(s, a))
    ○ γ R(s, a) + W^{π^(i)}(f(s, a))
    ○ None of the above

(2) π^(i)(s) = π^(i+1)(s)

Policy iteration performs the following update:

π^(i+1)(s) = argmax_a Q^{π^(i)}(s, a)

Next we express Q^π in terms of W^π (similarly to part (b)):

Q^π(s_0, a_0) = E[ R(s_0, a_0) + γ R(s_1, a_1) + γ² R(s_2, a_2) + · · · | s_0, a_0 ]
             = R(s_0, a_0) + γ E[ R(s_1, a_1) + γ R(s_2, a_2) + · · · | f(s_0, a_0) ]
             = R(s_0, a_0) + γ W^π(f(s_0, a_0))
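The full loop of part (d) can be sketched the same way. The toy MDP below is again invented, and evaluating W^π by fixed-point iteration is one reasonable choice rather than something prescribed by the question:

```python
# Policy iteration over post-decision states on an invented toy MDP (gamma = 0.9).
states, actions, gamma = [0, 1], [0, 1], 0.9

def f(s, a):                                      # deterministic post-decision state
    return (s + a) % 2

P = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}    # P[y][s'] = P(s' | y)
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}    # R[s][a] = R(s, a)

def evaluate_W(pi, iters=2000):
    # W^pi(y) = sum_{s'} P(s'|y) * ( R(s', pi(s')) + gamma * W^pi(f(s', pi(s'))) )
    W = {y: 0.0 for y in P}
    for _ in range(iters):
        W = {y: sum(p * (R[s2][pi[s2]] + gamma * W[f(s2, pi[s2])])
                    for s2, p in P[y].items())
             for y in P}
    return W

pi = {s: actions[0] for s in states}              # arbitrary initial policy
while True:
    W = evaluate_W(pi)
    # Blank (1): pi'(s) = argmax_a [ R(s, a) + gamma * W^pi(f(s, a)) ]
    new_pi = {s: max(actions, key=lambda a: R[s][a] + gamma * W[f(s, a)])
              for s in states}
    if new_pi == pi:                              # Blank (2): pi^(i)(s) = pi^(i+1)(s)
        break
    pi = new_pi

print(pi)  # the converged policy for this toy MDP
```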


Q3. Q-learning

Consider the following gridworld (rewards shown on the left, state names shown on the right).

[Figure: two copies of the grid, one labeled "Rewards" and one labeled "State names", showing the states A and B and the numbered exit squares G1 and G2.]

From state A, the possible actions are right (→) and down (↓). From state B, the possible actions are left (←) and down (↓). For a numbered state (G1, G2), the only action is to exit. Upon exiting from a numbered square we collect the reward specified by the number on the square and enter the end-of-game absorbing state X. We also know that the discount factor γ = 1, and in this MDP all actions are deterministic and always succeed.

Consider the following episodes:

Episode 1 (E1)              Episode 2 (E2)
 s    a     s'   r           s    a     s'   r
 A    ↓     G1   0           B    ↓     G2   0
 G1   exit  X    10          G2   exit  X    1

Episode 3 (E3)              Episode 4 (E4)
 s    a     s'   r           s    a     s'   r
 A    →     B    0           B    ←     A    0
 B    ↓     G2   0           A    ↓     G1   0
 G2   exit  X    1           G1   exit  X    10

(a) Consider using temporal-difference learning to learn V(s). When running TD-learning, all values are initialized to zero.

For which sequences of episodes, if repeated infinitely often, does V(s) converge to V*(s) for all states s?

(Assume appropriate learning rates such that all values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

□ E1, E2, E3, E4    □ E1, E2, E1, E2    □ E1, E2, E3, E1    ■ E4, E4, E4, E4
□ E4, E3, E2, E1    □ E3, E4, E3, E4    □ E1, E2, E4, E1

■ Other: see explanation below

TD learning learns the value of the executed policy, which is V^π(s). Therefore, for V^π(s) to converge to V*(s), it is necessary that the executing policy satisfies π(s) = π*(s).

Because there is no discounting (γ = 1), the optimal deterministic policy is π*(A) = ↓ and π*(B) = ← (π*(G1) and π*(G2) are trivially exit, because that is the only available action). Therefore episodes E1 and E4 act according to π*(s), while episodes E2 and E3 are sampled from a suboptimal policy.

From the above, TD learning using episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A, B, and G1. However, then we never visit G2, so V(G2) will never converge. If we add either episode E2 or E3 to ensure that V(G2) converges, then we are executing a suboptimal policy, which will in turn cause V(B) not to converge. Therefore none of the listed sequences will learn a value function V^π(s) that converges to V*(s) for all states s. An example of a correct sequence would be E2, E4, E4, E4, . . .; sampling E2 first with learning rate α = 1 ensures V^π(G2) = V*(G2), and then executing E4 forever after ensures that the values for states A, B, and G1 converge to the optimal values.


We also accepted answers for which the value function V(s) converges to V*(s) for states A and B (ignoring G1 and G2). TD learning using only episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A and B; under that reading, the only correct listed option is E4, E4, E4, E4.

(b) Consider using Q-learning to learn Q(s, a). When running Q-learning, all values are initialized to zero.

For which sequences of episodes, if repeated infinitely often, does Q(s, a) converge to Q*(s, a) for all state-action pairs (s, a)?

(Assume appropriate learning rates such that all Q-values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

■ E1, E2, E3, E4    □ E1, E2, E1, E2    □ E1, E2, E3, E1    □ E4, E4, E4, E4
■ E4, E3, E2, E1    ■ E3, E4, E3, E4    □ E1, E2, E4, E1

□ Other

For Q(s, a) to converge to Q*(s, a), we must visit every state-action pair with non-zero Q*(s, a) infinitely often. Therefore we must take the exit action in states G1 and G2, must take both the down and right actions in state A, and must take both the left and down actions in state B. Therefore the answers must include both E3 and E4, which together cover every state-action pair.
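As a concrete check of that last point, replaying E3 and E4 with Q-learning does converge to Q* for every state-action pair, since together those two episodes cover all of them. A minimal sketch (the episode encoding is my own):

```python
# Q-learning with gamma = 1; alpha = 1 works here because transitions are deterministic.
E3 = [("A", "right", "B", 0), ("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E4 = [("B", "left", "A", 0), ("A", "down", "G1", 0), ("G1", "exit", "X", 10)]

Q = {}  # unseen (state, action) pairs default to 0

def max_q(state):
    return max((q for (s, _), q in Q.items() if s == state), default=0.0)

for _ in range(20):                      # stands in for "repeated infinitely often"
    for s, a, s_next, r in E3 + E4:
        Q[(s, a)] = r + max_q(s_next)    # update with alpha = 1, gamma = 1

print(Q)
# Converges to Q*: Q(A,right) = Q(A,down) = Q(B,left) = Q(G1,exit) = 10,
#                  Q(B,down) = Q(G2,exit) = 1
```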


Q4. Reinforcement Learning

Imagine an unknown game which has only two states {A, B}, and in each state the agent has two actions to choose from: {Up, Down}. Suppose a game agent chooses actions according to some policy π and generates the following sequence of actions and rewards in the unknown game:

 t    s_t   a_t    s_{t+1}   r_t
 0    A     Down   B          2
 1    B     Down   B         -4
 2    B     Up     B          0
 3    B     Up     A          3
 4    A     Up     A         -1

Unless specified otherwise, assume a discount factor γ = 0.5 and a learning rate α = 0.5.

(a) Recall that the update function of Q-learning is:

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α ( r_t + γ max_{a′} Q(s_{t+1}, a′) )

Assume that all Q-values are initialized to 0. What are the following Q-values learned by running Q-learning with the above experience sequence?

Q(A, Down) = 1 ,    Q(B, Up) = 7/4

Perform the Q-learning update 4 times, once for each of the first 4 observations.
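The four updates can also be replayed mechanically; this short sketch (variable names are my own) reproduces both answers:

```python
# Q-learning updates for the experience table above, with gamma = 0.5 and alpha = 0.5.
gamma, alpha = 0.5, 0.5
experience = [("A", "Down", "B", 2),
              ("B", "Down", "B", -4),
              ("B", "Up", "B", 0),
              ("B", "Up", "A", 3)]   # the first four observations
actions = ["Up", "Down"]
Q = {(s, a): 0.0 for s in "AB" for a in actions}

for s, a, s_next, r in experience:
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

print(Q[("A", "Down")], Q[("B", "Up")])  # 1.0 and 1.75 (= 7/4)
```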

(b) In model-based reinforcement learning, we first estimate the transition function T(s, a, s′) and the reward function R(s, a, s′). Fill in the following estimates of T and R, estimated from the experience above. Write "n/a" if not applicable or undefined.

T̂(A, Up, A) = 1 ,    T̂(A, Up, B) = 0 ,    T̂(B, Up, A) = 1/2 ,    T̂(B, Up, B) = 1/2

R̂(A, Up, A) = −1 ,    R̂(A, Up, B) = n/a ,    R̂(B, Up, A) = 3 ,    R̂(B, Up, B) = 0

Count the transitions above and calculate frequencies. The rewards are the observed rewards.
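The counting itself can be sketched as follows, using the same experience table (helper names are my own):

```python
from collections import Counter

# Model-based estimates from the experience table above.
experience = [("A", "Down", "B", 2), ("B", "Down", "B", -4),
              ("B", "Up", "B", 0), ("B", "Up", "A", 3), ("A", "Up", "A", -1)]

counts = Counter((s, a, s2) for s, a, s2, _ in experience)    # visits to (s, a, s')
totals = Counter((s, a) for s, a, _, _ in experience)         # visits to (s, a)
rewards = {(s, a, s2): r for s, a, s2, r in experience}       # observed rewards

T_hat = {k: counts[k] / totals[(k[0], k[1])] for k in counts}
print(T_hat.get(("B", "Up", "A")), T_hat.get(("B", "Up", "B")))   # 0.5 0.5
print(T_hat.get(("A", "Up", "B"), 0))                             # 0 (never observed)
print(rewards.get(("B", "Up", "A")), rewards.get(("A", "Up", "B"), "n/a"))  # 3 n/a
```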

(c) To decouple this question from the previous one, assume we had a different experience and ended up with the following estimates of the transition and reward functions:

 s    a      s′   T̂(s, a, s′)   R̂(s, a, s′)
 A    Up     A    1              10
 A    Down   A    0.5            2
 A    Down   B    0.5            2
 B    Up     A    1              -5
 B    Down   B    1              8

(i) Give the optimal policy π̂*(s) and V̂*(s) for the MDP with transition function T̂ and reward function R̂. Hint: for any x ∈ ℝ with |x| < 1, we have 1 + x + x² + x³ + x⁴ + · · · = 1/(1 − x).

π̂*(A) = Up ,    π̂*(B) = Down ,    V̂*(A) = 20 ,    V̂*(B) = 16 .

Find the optimal policy first, and then use the optimal policy to calculate the value function via a Bellman equation.
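One way to verify V̂*(A) = 20 and V̂*(B) = 16 is to run value iteration on the estimated model; a minimal sketch:

```python
# Value iteration on the estimated MDP from the table above, gamma = 0.5.
gamma = 0.5
T_hat = {("A", "Up"): {"A": 1.0}, ("A", "Down"): {"A": 0.5, "B": 0.5},
         ("B", "Up"): {"A": 1.0}, ("B", "Down"): {"B": 1.0}}
R_hat = {("A", "Up", "A"): 10, ("A", "Down", "A"): 2, ("A", "Down", "B"): 2,
         ("B", "Up", "A"): -5, ("B", "Down", "B"): 8}

V = {"A": 0.0, "B": 0.0}
for _ in range(200):
    V = {s: max(sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T_hat[(s, a)].items())
                for a in ("Up", "Down"))
         for s in ("A", "B")}

print(V)  # {'A': 20.0, 'B': 16.0}
```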

(ii) If we repeatedly feed this new experience sequence through our Q-learning algorithm, what values will it converge to? Assume the learning rate α_t is properly chosen so that convergence is guaranteed.

● the values found above, V̂*


○ the optimal values, V*
○ neither V̂* nor V*
○ not enough information to determine

The Q-learning algorithm will not converge to the optimal values V* for the MDP, because the experience sequence and transition frequencies replayed are not necessarily representative of the underlying MDP. (For example, the true T(A, Down, A) might be equal to 0.75, in which case repeatedly feeding in the above experience would not provide an accurate sampling of the MDP.) However, for the MDP with transition function T̂ and reward function R̂, replaying this experience repeatedly will result in Q-learning converging to its optimal values V̂*.


Q5. Policy Evaluation

In this question, you will be working in an MDP with states S, actions A, discount factor γ, transition function T, and reward function R.

We have some fixed policy π : S → A, which returns an action a = π(s) for each state s ∈ S. We want to learn the Q function Q^π(s, a) for this policy: the expected discounted reward from taking action a in state s and then continuing to act according to π:

Q^π(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ Q^π(s′, π(s′)) ].

The policy π will not change while running any of the algorithms below.

(a) Can we guarantee anything about how the values Q^π compare to the values Q* for an optimal policy π*?

● Q^π(s, a) ≤ Q*(s, a) for all s, a
○ Q^π(s, a) = Q*(s, a) for all s, a
○ Q^π(s, a) ≥ Q*(s, a) for all s, a
○ None of the above are guaranteed

(b) Suppose T and R are unknown. You will develop sample-based methods to estimate Q^π. You obtain a series of samples (s_1, a_1, r_1), (s_2, a_2, r_2), . . . , (s_T, a_T, r_T) from acting according to this policy (where a_t = π(s_t) for all t).

(i) Recall the update equation for the Temporal Difference algorithm, performed on each sample in sequence:

V(s_t) ← (1 − α) V(s_t) + α ( r_t + γ V(s_{t+1}) )

which approximates the expected discounted reward V^π(s) for following policy π from each state s, for a learning rate α.

Fill in the blank below to create a similar update equation which will approximate Q^π using the samples.

You can use any of the terms Q, s_t, s_{t+1}, a_t, a_{t+1}, r_t, r_{t+1}, γ, α, π in your equation, as well as Σ and max with any index variables (i.e., you could write max_a, or Σ_a and then use a somewhere else), but no other terms.

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) ]
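The filled-in blank bootstraps from the action the policy actually takes next, so it is the evaluation analogue of Q-learning (a SARSA-style update). A hedged sketch of applying it to a sample stream, where the stream itself is an invented placeholder:

```python
# Apply Q(s_t, a_t) <- (1 - alpha) Q(s_t, a_t) + alpha [r_t + gamma Q(s_{t+1}, a_{t+1})]
# along a stream of samples (s_t, a_t, r_t) with a_t = pi(s_t).
gamma, alpha = 0.9, 0.5
samples = [("s1", "a", 1.0), ("s2", "b", 0.0), ("s1", "a", 2.0), ("s2", "b", 1.0)]

Q = {}  # Q-values for (state, action) pairs, defaulting to 0
for (s, a, r), (s_next, a_next, _) in zip(samples, samples[1:]):
    q, q_next = Q.get((s, a), 0.0), Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * q + alpha * (r + gamma * q_next)

print(Q)
```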

(ii) Now, we will approximate Q^π using a linear function: Q(s, a) = Σ_{i=1}^{d} w_i f_i(s, a) for weights w_1, . . . , w_d and feature functions f_1(s, a), . . . , f_d(s, a).

To decouple this part from the previous part, use Q_samp for the value in the blank in part (i) (i.e., Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α Q_samp).

Which of the following is the correct sample-based update for each w_i?

○ w_i ← w_i + α [ Q(s_t, a_t) − Q_samp ]
○ w_i ← w_i − α [ Q(s_t, a_t) − Q_samp ]
○ w_i ← w_i + α [ Q(s_t, a_t) − Q_samp ] f_i(s_t, a_t)
● w_i ← w_i − α [ Q(s_t, a_t) − Q_samp ] f_i(s_t, a_t)
○ w_i ← w_i + α [ Q(s_t, a_t) − Q_samp ] w_i
○ w_i ← w_i − α [ Q(s_t, a_t) − Q_samp ] w_i
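The selected update moves each weight against the gradient of the squared error between Q(s_t, a_t) and the sample target, scaled by the feature value. A hedged sketch (the feature functions and the sample transition are invented for illustration):

```python
# Approximate Q^pi with linear features: Q(s, a) = sum_i w_i * f_i(s, a).
alpha, gamma = 0.5, 0.9
features = [lambda s, a: 1.0,               # f_1: a bias feature
            lambda s, a: float(s == a)]     # f_2: an arbitrary illustrative feature
w = [0.0, 0.0]

def q_value(s, a):
    return sum(wi * fi(s, a) for wi, fi in zip(w, features))

# One sample transition (s_t, a_t, r_t, s_{t+1}, a_{t+1}); the numbers are placeholders.
s, a, r, s_next, a_next = 0, 1, 2.0, 1, 1
q_samp = r + gamma * q_value(s_next, a_next)    # the Q_samp target from part (i)

# Selected update: w_i <- w_i - alpha * [Q(s_t, a_t) - Q_samp] * f_i(s_t, a_t)
error = q_value(s, a) - q_samp
w = [wi - alpha * error * fi(s, a) for wi, fi in zip(w, features)]
print(w)
```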

(iii) The algorithms in the previous parts (parts i and ii) are:

□ model-based    ■ model-free


Q6. Proofs: Admissibility, Consistency and Graph Search

Here, we will revisit and prove some of the properties of search mentioned in lecture in a more rigorous manner.

The central idea of consistency is that we enforce not only that a heuristic underestimates the total distance to a goal from any given node, but also the cost/weight of each edge in the graph. For graph search to be optimal, we have to make sure that every time we expand a node, we have already found the optimal way to reach it, since the closed list means we never get another chance to expand it. Hence, it is intuitive that a consistent heuristic, which underestimates the cost of every intermediate step toward the goal just as an admissible heuristic underestimates the total distance to the goal, should be sufficient for graph search to be optimal when run with A* search. We will prove that for a given search problem, if the consistency constraint is satisfied by a heuristic function h, then using A* graph search with h on that search problem will yield an optimal solution.

(a) Show that consistency implies admissibility.

Admissibility: ∀n, 0 ≤ h(n) ≤ h*(n)
Consistency: ∀A, C : h(A) − h(C) ≤ cost(A, C), i.e., ∀A, C : h(A) ≤ cost(A, C) + h(C)

Let v be an arbitrary node in the graph and h(·) be any consistent heuristic.
If there is no path from v to a goal node, admissibility is already trivially satisfied, as h*(v) is infinite.
If there is some path from v to a goal state G, consider the shortest such path (v, v_1, v_2, . . . , v_n, G).
First consider h(v_n). Since h is consistent, h(v_n) ≤ cost(v_n, G) + h(G) = cost(v_n, G), using the standard convention that h(G) = 0 at a goal state.
Similarly, h(v_{n−1}) ≤ cost(v_{n−1}, v_n) + h(v_n) ≤ cost(v_{n−1}, v_n) + cost(v_n, G).
Inducting backwards along the path, we see that h(v_k) ≤ cost(v_k, v_{k+1}) + · · · + cost(v_{n−1}, v_n) + cost(v_n, G) for any k between 1 and n − 1, and likewise h(v) ≤ cost(v, v_1) + · · · + cost(v_n, G).
Since (v, v_1, v_2, . . . , v_n, G) is a shortest path, this sum of edge costs is exactly the optimal cost-to-go, so h(v_k) ≤ h*(v_k) and h(v) ≤ h*(v).

(b) Construct a graph and a heuristic such that running A* tree search finds an optimal path while running A* graph search finds a suboptimal one.
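One possible construction is sketched below. It is an illustrative example of my own, not necessarily the graph drawn in the original solution: the heuristic is admissible everywhere but violates consistency on the edge A → B, which makes graph search close state B via the more expensive path before the cheaper path through A is discovered.

```python
import heapq

# Graph: S -1-> A, A -1-> B, B -5-> G, and a direct edge S -3-> B.
# Optimal path: S -> A -> B -> G with cost 7; the detour S -> B -> G costs 8.
edges = {"S": [("A", 1), ("B", 3)], "A": [("B", 1)], "B": [("G", 5)], "G": []}
h = {"S": 0, "A": 6, "B": 0, "G": 0}   # admissible, but h(A) - h(B) = 6 > cost(A, B) = 1

def astar(start, goal, use_closed_list):
    frontier = [(h[start], 0, start, [start])]   # entries are (f, g, state, path)
    closed = set()
    while frontier:
        f, g, s, path = heapq.heappop(frontier)
        if s == goal:
            return g, path
        if use_closed_list:
            if s in closed:
                continue
            closed.add(s)
        for s2, cost in edges[s]:
            heapq.heappush(frontier, (g + cost + h[s2], g + cost, s2, path + [s2]))
    return None

print(astar("S", "G", use_closed_list=False))  # (7, ['S', 'A', 'B', 'G']): tree search, optimal
print(astar("S", "G", use_closed_list=True))   # (8, ['S', 'B', 'G']): graph search, suboptimal
```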


(c) Recall the following notation:

• g(n) - the function representing the total backwards cost, computed by UCS.
• h(n) - the heuristic value function, or estimated forward cost, used by greedy search.
• f(n) - the function representing the estimated total cost, used by A* search: f(n) = g(n) + h(n).

Show that the f value constructed with a consistent heuristic never decreases along a path. Specifically, considering a path p = (s_1, s_2, . . . , s_{t−1}, s_t), show that f(s_{i+1}) ≥ f(s_i). Also check that this is indeed the case with your example in (b). Hint: use the definition of a consistent heuristic!

We want to show f(s_{i+1}) ≥ f(s_i):

f(s_{i+1}) = h(s_{i+1}) + g(s_{i+1}) = h(s_{i+1}) + g(s_i) + cost(s_i, s_{i+1}) ≥ h(s_{i+1}) + g(s_i) + h(s_i) − h(s_{i+1}) = f(s_i),

where we used the definition of consistency to obtain cost(s_i, s_{i+1}) ≥ h(s_i) − h(s_{i+1}).
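The claim is also easy to check numerically on a concrete example; below is a small sketch with an invented graph and a consistent heuristic:

```python
# Check that f = g + h never decreases along a path when h is consistent.
edges = {"S": [("A", 2), ("B", 1)], "A": [("G", 2)], "B": [("A", 1)], "G": []}
h = {"S": 3, "A": 2, "B": 2, "G": 0}   # consistent: h(x) - h(y) <= cost(x, y) on every edge

def f_values(path):
    g, fs = 0, []
    for i, s in enumerate(path):
        if i > 0:
            g += dict(edges[path[i - 1]])[s]   # add the edge cost from the previous node
        fs.append(g + h[s])
    return fs

for path in (["S", "A", "G"], ["S", "B", "A", "G"]):
    fs = f_values(path)
    print(path, fs, all(a <= b for a, b in zip(fs, fs[1:])))  # True: f is non-decreasing
```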

(d) Consider a scenario where some node n on the path to G* isn't in the queue when we need it, because some worse node n′ for the same state was dequeued and expanded first. Take the highest such n in the tree and let p be the ancestor of n that was on the queue when n′ was popped. Prove that p would have been expanded before n′, so this scenario can never happen with a consistent heuristic.

By part (c), f never decreases along a path, so f(n) ≥ f(p).
Since n and n′ correspond to the same state, h(n′) = h(n), and since n′ reaches that state by a worse path, g(n′) > g(n). Hence
f(n′) = g(n′) + h(n′) > g(n) + h(n) = f(n) ≥ f(p),
so f(n′) > f(p), and p would have been expanded before n′.

(e) Finally, show that an optimal goal G* will always be removed from the queue for expansion and returned before any suboptimal goal G, given a consistent heuristic.

Since h(G) = h(G*) = 0, we have f(G*) = g(G*) < g(G) = f(G), so G* is dequeued and returned before G.
