
Page 1: Life: play and win in 20 trillion moves

1

Life: play and win in 20 trillion moves

Stuart Russell
Computer Science Division, UC Berkeley

Joint work with Devika Subramanian, Ron Parr, David Andre, Carlos Guestrin, Bhaskara Marthi, and Jason Wolfe

Page 2: Life: play and win in 20 trillion moves

White to play and win in 2 moves

2

Page 3: Life: play and win in 20 trillion moves

Life

100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions

(Not to mention choosing brain activities!)
And the world has a very large state space
So, being intelligent is "provably" hard
How on earth do we manage?

3
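As a quick sanity check, the arithmetic on this slide comes out to roughly 2 × 10^13, i.e. about 20 trillion primitive actions in a lifetime:

    # Back-of-the-envelope check of the "20 trillion actions" estimate
    seconds_per_life = 100 * 365 * 24 * 3600      # ~3.15e9 seconds
    actions = seconds_per_life * 640 * 10         # 640 muscles, 10 activations/sec
    print(actions)                                # 20,183,040,000,000 ≈ 2.0e13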

Page 4: Life: play and win in 20 trillion moves

4

Outline

Rationality and Intelligence
Efficient hierarchical planning
Hierarchical RL with partial programs
Next steps towards AI

Page 5: Life: play and win in 20 trillion moves

5

Outline

Rationality and Intelligence
Efficient hierarchical planning
Hierarchical RL with partial programs
Next steps towards AI

Page 6: Life: play and win in 20 trillion moves

Intelligence

Turing’s definition of intelligence as human mimicry is nonconstructive, informal

AI needs a formal, constructive, broad definition of intelligence – call it Int – that supports analysis and synthesis

“Look, my system is Int !!!” Is this claim interesting? Is this claim ever true?

6

Page 7: Life: play and win in 20 trillion moves

Candidate definitions

Int1 : perfect rationality
  Logical AI: always act to achieve the goal
  Economics, modern AI, stochastic optimal control, OR: act to maximize expected utility

7

Page 8: Life: play and win in 20 trillion moves

Candidate definitions

Int1 : perfect rationality
  Logical AI: always act to achieve the goal
  Economics, modern AI, stochastic optimal control, OR: act to maximize expected utility
Very interesting claim
Almost never true – requires instant computation!!

8

Page 9: Life: play and win in 20 trillion moves

Candidate definitions

Int2 : calculative rationality
  Calculate what would have been the perfectly rational action had it been done immediately

9

Page 10: Life: play and win in 20 trillion moves

Candidate definitions

Int2 : calculative rationality
  Calculate what would have been the perfectly rational action had it been done immediately
Algorithms known for many settings – covers much of classical and modern AI
Not useful for the real world! We always end up cutting corners, often in ad hoc ways

10

Page 11: Life: play and win in 20 trillion moves

Candidate definitions

Int3 : metalevel rationality
  View computations as actions; agent chooses the ones providing maximal utility
  ≈ expected decision improvement minus cost of time

11

Page 12: Life: play and win in 20 trillion moves

Candidate definitions

Int3 : metalevel rationality
  View computations as actions; agent chooses the ones providing maximal utility
  ≈ expected decision improvement minus cost of time
Potentially very interesting claim – superefficient anytime approximation algorithms!!
Almost never possible – metalevel problem harder than object-level problem
Myopic metalevel control effective for search and games

12
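Read operationally, myopic metalevel control is a greedy loop: estimate each candidate computation's expected improvement to the decision, subtract the cost of the time it consumes, run the best one, and act as soon as nothing is worth its time. A minimal sketch under those assumptions (the benefit estimator and time-cost function are illustrative placeholders, not part of the original results):

    # Myopic metalevel control: run the computation with the highest estimated
    # net value (expected decision improvement minus cost of time); act when
    # no remaining computation has positive net value.
    def metalevel_control(candidates, est_benefit, time_cost, run, best_action):
        while candidates:
            comp = max(candidates, key=lambda c: est_benefit(c) - time_cost(c))
            if est_benefit(comp) - time_cost(comp) <= 0:
                break                 # further deliberation not worth the time
            run(comp)                 # e.g., expand a search node (may change the estimates)
        return best_action()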

Page 13: Life: play and win in 20 trillion moves

Candidate definitions

Int4 : bounded optimality
  The agent program is the best that can run on its fixed architecture

13

Page 14: Life: play and win in 20 trillion moves

Candidate definitions

Int4 : bounded optimality
  The agent program is the best that can run on its fixed architecture
Very interesting if true
Always exists (by definition), but may be hard to find
Some early results (~1995) on (A)BO construction:
  Optimal sequence of classifiers exhibiting time/accuracy tradeoff for real-time decisions
  Optimal time allocation within arbitrary compositions of anytime algorithms
Can we solve for more interesting architectures?

14
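The "optimal sequence of classifiers" result can be pictured as a search over orderings: among the sequences of available classifiers that fit the architecture's time budget, pick the one with the highest expected decision quality. A toy sketch under those assumptions (the runtime and expected-utility models are stand-ins for the analysis in the original results, not the construction itself):

    from itertools import permutations

    # Toy bounded-optimality search: among all orderings of available classifiers
    # that fit within the time budget, pick the sequence with the highest
    # (user-supplied) expected utility.
    def best_classifier_sequence(classifiers, runtime, expected_utility, budget):
        best, best_u = None, float("-inf")
        for k in range(1, len(classifiers) + 1):
            for seq in permutations(classifiers, k):
                if sum(runtime(c) for c in seq) > budget:
                    continue
                u = expected_utility(seq)
                if u > best_u:
                    best, best_u = seq, u
        return best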

Page 15: Life: play and win in 20 trillion moves

15

Outline

Rationality and Intelligence
Efficient hierarchical planning
Hierarchical RL with partial programs
Next steps towards AI

Page 16: Life: play and win in 20 trillion moves

16

Life: Play and win in 20,000,000,000,000* moves

How do we do it? Hierarchical structure!!
Deeply nested behaviors give modularity
  E.g., tongue controls for [t] independent of everything else given the decision to say "tongue"
High-level decisions give scale
  E.g., "go to IJCAI" ≈ 3,000,000,000 actions
  Look further ahead, reduce computation exponentially

[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10]

Page 17: Life: play and win in 20 trillion moves

17

High-level actions
Classical HTN (Hierarchical Task Network):
  A high-level action (HLA) has a set of possible refinements into sequences of actions, primitive or high-level
Hierarchical optimality = best primitive refinement of Act
(A restricted subset of simple linear agent programs)

Example refinement tree:
  [Act]
    [GoCDG, Act]    [GoOrly, Act]
      [WalkToRER, RERtoCDG, Act]
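Concretely, such a hierarchy can be held as a map from each HLA to its possible refinements; a plan is refined by replacing an HLA with one of its refinements, and hierarchical optimality is optimality over the primitive refinements of [Act]. A minimal sketch along those lines (the dictionary encoding and the depth bound are illustrative assumptions; GoCDG, GoOrly, WalkToRER, RERtoCDG are the names from the slide):

    # Map each high-level action (HLA) to its possible refinements, each a
    # sequence of primitive or high-level actions; primitives have no entry.
    refinements = {
        "Act":   [["GoCDG", "Act"], ["GoOrly", "Act"], []],   # [] = finish acting
        "GoCDG": [["WalkToRER", "RERtoCDG"]],
    }

    def primitive_refinements(plan, depth=4):
        """Enumerate primitive refinements of a high-level plan, up to a
        refinement-depth bound (the hierarchy above is recursive)."""
        hla_positions = [i for i, a in enumerate(plan) if a in refinements]
        if not hla_positions:
            yield list(plan)                     # fully primitive plan
            return
        if depth == 0:
            return
        i = hla_positions[0]
        for r in refinements[plan[i]]:
            yield from primitive_refinements(plan[:i] + list(r) + plan[i+1:], depth - 1)

    # e.g. list(primitive_refinements(["Act"])) includes ["WalkToRER", "RERtoCDG"]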

Page 18: Life: play and win in 20 trillion moves

18

Semantics of HLAs
To do offline or online planning with HLAs, need a model describing the outcome of each HLA

Page 19: Life: play and win in 20 trillion moves

19

Semantics of HLAs
To do offline or online planning with HLAs, need a model describing the outcome of each HLA
Drew McDermott (AI Magazine, 2000): "The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions"

Page 20: Life: play and win in 20 trillion moves

20

Downward refinement

Downward refinement property (DRP)
  Every apparently successful high-level plan has a successful primitive refinement

“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

Page 21: Life: play and win in 20 trillion moves

21

Downward refinement

Downward refinement property (DRP)
  Every apparently successful high-level plan has a successful primitive refinement

“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general

Page 22: Life: play and win in 20 trillion moves

22

Downward refinement

Downward refinement property (DRP)
  Every apparently successful high-level plan has a successful primitive refinement

“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general

MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds

Page 23: Life: play and win in 20 trillion moves

23

Downward refinement

Downward refinement property (DRP)
  Every apparently successful high-level plan has a successful primitive refinement

“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general

MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds

Problem: how to say true things about effects of HLAs?

Page 24: Life: play and win in 20 trillion moves

24

[Figure: three panels of a state space S, each with start state s0 and goal g, showing the reachable sets of high-level plans built from HLAs h1 and h2; the sequence h1h2 is a solution]

Angelic semantics for HLAs
Start with atomic state-space view: start in s0, goal state g
Central idea is the reachable set of an HLA from each state
  When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal
May seem related to nondeterminism
  But the nondeterminism is angelic: the "uncertainty" will be resolved by the agent, not an adversary or nature

[Figure: state space with primitive actions a1–a4]

Page 25: Life: play and win in 20 trillion moves

25

Technical development
NCSTRIPS to describe reachable sets
  STRIPS add/delete plus "possibly add", "possibly delete", etc.
  E.g., GoCDG adds AtCDG, possibly adds CarAtCDG
Reachable sets may be too hard to describe exactly
  Upper and lower descriptions bound the exact reachable set (and its cost) above and below
  Still support proofs of plan success/failure
  Possibly-successful plans must be refined
Sound, complete, optimal offline planning algorithms
Complete, eventually-optimal online (real-time) search
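One way to picture the upper and lower descriptions is as two set-valued successor maps per HLA: an optimistic one that over-approximates the reachable set and a pessimistic one that under-approximates it. Composing them along a high-level plan lets the planner prove success (the pessimistic set meets the goal), prove failure (the optimistic set misses it), or conclude that the plan must be refined. A small sketch over an explicitly enumerated state space (the explicit sets are an assumption for illustration; the actual system represents these sets compactly with NCSTRIPS descriptions):

    # Angelic evaluation of a high-level plan using optimistic ("upper") and
    # pessimistic ("lower") reachable-set descriptions of each HLA.
    def reachable(start_states, plan, describe):
        states = set(start_states)
        for hla in plan:
            states = set().union(*(describe(hla, s) for s in states)) if states else set()
        return states

    def classify(s0, plan, goal_states, upper, lower):
        if reachable({s0}, plan, lower) & goal_states:
            return "provably succeeds"     # some primitive refinement reaches the goal
        if not (reachable({s0}, plan, upper) & goal_states):
            return "provably fails"        # no primitive refinement can reach the goal
        return "refine"                    # possibly successful; refine the plan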

Page 26: Life: play and win in 20 trillion moves

26

Online Search: Preliminary Results

[Figure: learning curves; y-axis: steps to reach goal]

Page 27: Life: play and win in 20 trillion moves

27

Outline

Rationality and Intelligence
Efficient hierarchical planning
Hierarchical RL with partial programs
Next steps towards AI

Page 28: Life: play and win in 20 trillion moves

28

Temporal abstraction in RL
Basic theorem (Forestier & Varaiya 78; Parr & Russell 98):
  Given an underlying Markov decision process with primitive actions
  Define temporally extended choice-free actions
  Agent is in a choice state whenever an extended action terminates
  Choice states + extended actions form a semi-Markov decision process
Hierarchical structures with unspecified choices = know-how
  Hierarchical Abstract Machines [Parr & Russell 98] (= recursive NDFAs)
  Options [Sutton & Precup 98] (= extended choice-free actions)
  MAXQ [Dietterich 98] (= HTN hierarchy)
General partial programs: agent program falls in a designated restricted subset of arbitrary programs
  ALisp [Andre & Russell 02]
  Concurrent ALisp [Marthi et al 05]
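The semi-MDP view leads directly to a learning rule: updates happen only at choice states, and the backup between consecutive choice points uses the reward accumulated in between, discounted by the number of elapsed primitive steps. A minimal tabular sketch of that update (variable names are mine, not from the cited papers):

    from collections import defaultdict

    # Tabular SMDP Q-learning at choice states: omega is the joint state at a
    # choice point, u the choice taken, r_acc the discounted reward accumulated
    # over the k primitive steps until the next choice state omega_next.
    Q = defaultdict(float)

    def smdp_backup(omega, u, r_acc, k, omega_next, next_choices,
                    alpha=0.1, gamma=0.99):
        best_next = max((Q[(omega_next, u2)] for u2 in next_choices), default=0.0)
        Q[(omega, u)] += alpha * (r_acc + gamma ** k * best_next - Q[(omega, u)])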

Page 29: Life: play and win in 20 trillion moves

29

Running example

Peasants can move, pick up, and drop off resources
Penalty for collision
Cost of living each step
Reward for dropping off resources
Goal: gather 10 gold + 10 wood
(3L)^n states s; 7^n primitive actions a

Page 30: Life: play and win in 20 trillion moves

30

RL and partial programs

[Figure: a learning algorithm observes states and rewards (s, r) and chooses actions a, filling in the completion of the partial program]

Page 31: Life: play and win in 20 trillion moves

31

RL and partial programs

[Figure: as on the previous slide, the learning algorithm fills in the completion of the partial program from experience (s, r → a)]
Hierarchically optimal for all terminating programs

Page 32: Life: play and win in 20 trillion moves

32

An example single-threaded ALisp program:

    (defun top ()
      (loop do
        (choose 'top-choice
          (gather-gold)
          (gather-wood))))

    (defun gather-wood ()
      (with-choice 'forest-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-gold ()
      (with-choice 'mine-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (my-pos) dest)
        (with-choice 'nav-choice (move '(N S E W NOOP))
          (action move))))

An example Concurrent ALisp program (gather-wood, gather-gold, and nav as above, but run in per-peasant threads):

    (defun top ()
      (loop do
        ;; wait until a peasant (effector) is available, then assign it a task
        (until (my-effectors) (choose 'dummy))
        (setf peas (first (my-effectors)))
        (choose 'top-choice
          (spawn gather-wood peas)
          (spawn gather-gold peas))))

Page 33: Life: play and win in 20 trillion moves

33

Technical development

Page 34: Life: play and win in 20 trillion moves

34

Technical development

Decisions based on internal state
  Joint state ω = [s,θ] = environment state + program state (cf. [Russell & Wefald 1989])

Page 35: Life: play and win in 20 trillion moves

35

Technical development

Decisions based on internal state
  Joint state ω = [s,θ] = environment state + program state (cf. [Russell & Wefald 1989])
MDP + partial program = SMDP over {ω}; learn Qπ(ω,u)

Page 36: Life: play and win in 20 trillion moves

36

Technical development

Decisions based on internal state
  Joint state ω = [s,θ] = environment state + program state (cf. [Russell & Wefald 1989])
MDP + partial program = SMDP over {ω}; learn Qπ(ω,u)
Additive decomposition of value functions

Page 37: Life: play and win in 20 trillion moves

37

Technical development

Decisions based on internal state
  Joint state ω = [s,θ] = environment state + program state (cf. [Russell & Wefald 1989])
MDP + partial program = SMDP over {ω}; learn Qπ(ω,u)
Additive decomposition of value functions
  by subroutine structure [Dietterich 00, Andre & Russell 02]
    Q is a sum of sub-Q functions per subroutine

Page 38: Life: play and win in 20 trillion moves

38

Technical development

Decisions based on internal state
  Joint state ω = [s,θ] = environment state + program state (cf. [Russell & Wefald 1989])
MDP + partial program = SMDP over {ω}; learn Qπ(ω,u)
Additive decomposition of value functions
  by subroutine structure [Dietterich 00, Andre & Russell 02]
    Q is a sum of sub-Q functions per subroutine
  across concurrent threads [Russell & Zimdars 03]
    Q is a sum of sub-Q functions per thread, with decomposed reward signal
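In implementation terms, the decomposed Q can be a sum of small components, one per subroutine on the call stack (and, with concurrency, one per thread), each looking only at the features relevant to it; this is what enables the state abstraction on the following slides. A sketch of evaluating such a decomposition (the component structure and feature selectors are illustrative assumptions):

    # Q(omega, u) represented as a sum of local components, each with its own
    # table keyed by a small, abstracted view of the joint state and choice.
    class DecomposedQ:
        def __init__(self, components):
            # components: list of (name, feature_fn); feature_fn(omega, u) returns
            # the abstracted key this component depends on (e.g., it can ignore
            # gold reserves inside a navigation subroutine)
            self.components = components
            self.tables = {name: {} for name, _ in components}

        def value(self, omega, u):
            return sum(self.tables[name].get(feat(omega, u), 0.0)
                       for name, feat in self.components)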

Page 39: Life: play and win in 20 trillion moves

39

Internal state

Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies

E.g., while navigating to location (x,y), moving towards (x,y) is a good idea

Natural heuristic (distance from destination) impossible to express in external terms

Page 40: Life: play and win in 20 trillion moves

40

Temporal decomposition and state abstraction

• Temporal decomposition of Q-function: local components capture sum-of-rewards per subroutine [Dietterich 00, Andre & Russell 02]

• => State abstraction - e.g., when navigating, local Q independent of gold reserves

Small local Q-components => fast learning


Page 41: Life: play and win in 20 trillion moves

41

Temporal decomposition and state abstraction

• Temporal decomposition of Q-function: local components capture sum-of-rewards per subroutine [Dietterich 00, Andre & Russell 02]

• => State abstraction - e.g., when navigating, local Q independent of gold reserves

Small local Q-components => fast learning


Page 42: Life: play and win in 20 trillion moves

42

Q-decomposition w/ concurrency?

Temporal decomposition of Q-function lost!!
No credit assignment among threads
  Peasant 1 brings back gold, Peasant 2 twiddles thumbs
  Peasant 2 thinks he's done very well!!
  => learning is hopelessly slow with many peasants

Page 43: Life: play and win in 20 trillion moves

43

Threadwise decomposition

Idea: decompose reward among threads [Russell & Zimdars 03]
  E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
Qjπ(ω,u) = "expected total reward received by thread j if we make joint choice u and then do π"
Threadwise Q-decomposition: Q = Q1 + … + Qn
Recursively distributed SARSA => global optimality
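With the reward split across threads, each thread maintains its own Qj and performs an ordinary on-policy (SARSA-style) backup on its share of the reward, while the joint choice maximizes the sum of the per-thread values. A minimal tabular sketch under those assumptions (variable names are illustrative, not from the paper):

    from collections import defaultdict

    # Threadwise Q-decomposition: Q(omega,u) = sum_j Q_j(omega,u), with each
    # thread j updated from its own decomposed reward r_j.
    Qs = defaultdict(lambda: defaultdict(float))      # Qs[j][(omega, u)]

    def joint_choice(omega, choices, threads):
        return max(choices, key=lambda u: sum(Qs[j][(omega, u)] for j in threads))

    def threadwise_sarsa(j, omega, u, r_j, k, omega_next, u_next,
                         alpha=0.1, gamma=0.99):
        # r_j: thread j's share of the reward accumulated over the k primitive
        # steps between the two choice points
        target = r_j + gamma ** k * Qs[j][(omega_next, u_next)]
        Qs[j][(omega, u)] += alpha * (target - Qs[j][(omega, u)])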

Page 44: Life: play and win in 20 trillion moves

44

Page 45: Life: play and win in 20 trillion moves

45

Outline

Rationality and Intelligence
Efficient hierarchical planning
Hierarchical RL with partial programs
Next steps towards AI

Page 46: Life: play and win in 20 trillion moves

46

Bounded optimality (again)

Alisp partial program defines a space of agent programs (the “fixed architecture”); RL converges to the bounded-optimal solution

This gets interesting when the agent program is allowed to do some deliberating:

    ...
    (defun select-by-tree-search (nodes n)
      (for i from 1 to n do
        ...
        (setq leaf-to-expand (choose nodes))
        ...

Page 47: Life: play and win in 20 trillion moves

Bounded optimality contd.

Previous work on metareasoning has considered only homogeneous computation; RL with partial programming offers the ability to control complex, heterogeneous computations

E.g., computations in a hierarchical online-lookahead planning agent
  Hybrid reactive/deliberative decision making
  Detailed planning/replanning for the near future
  Increasingly high-level, abstract, long-duration planning further into the future

47

Page 48: Life: play and win in 20 trillion moves