Learning in networks (and other asides): A preliminary investigation & some comments. Yu-Han Chang, joint work with Tracey Ho and Leslie Kaelbling. AI Lab, MIT. NIPS Multi-agent Learning Workshop, Whistler, BC, 2002.


Page 1:

Learning in networks (and other asides)

A preliminary investigation & some comments

Yu-Han Chang
Joint work with Tracey Ho and Leslie Kaelbling

AI Lab, MIT

NIPS Multi-agent Learning Workshop, Whistler, BC 2002

Page 2:

Networks: a multi-agent system

• Graphical games [Kearns, Ortiz, Guestrin, …]
• Real networks, e.g. a LAN [Boyan, Littman, …]
• "Mobile ad-hoc networks" [Johnson, Maltz, …]

Page 3:

Mobilized ad-hoc networks

• Mobile sensors, tracking agents, …
• Generally a distributed system that wants to optimize some global reward function

Page 4:

Learning

Nash equilibrium is the phrase of the day, but is it a good solution?

Other equilibria, e.g. refinements of NE

1. Can we do better than Nash equilibrium? (Game-playing approach)

2. Perhaps we want to just learn some good policy in a distributed manner. Then what? (Distributed problem solving)

Page 5:

What are we studying?

                   Single agent              Multiple agents

  Known world      Decision Theory,          Game Theory
                   Planning

  Learning         RL, NDP                   Stochastic games,
                                             Learning in games, …

Page 6:

Part I: Learning

[Diagram: the standard learning loop. A learning algorithm updates a policy; the policy sends actions to the world/state; the world returns observations/sensations and rewards.]

Page 7:

Learning to act in the world

[Diagram: the same learning loop, but the environment now also contains other agents (possibly learning themselves), so the effective "world" the learner faces is uncertain.]

Page 8:

A simple example

• The problem: Prisoner's Dilemma
• Possible solutions: the space of policies
• The solution metric: Nash equilibrium

Player 1's actions are the rows, Player 2's actions are the columns; each cell gives Player 1's reward, Player 2's reward.

              Cooperate    Defect
  Cooperate   1, 1         -2, 2
  Defect      2, -2        -1, -1

Page 9:

That Folk Theorem

For discount factors close to 1, any individually rational payoffs are feasible (and are Nash) in the infinitely repeated game.

              Coop.     Defect
  Coop.       1, 1      -2, 2
  Defect      2, -2     -1, -1

[Figure: the feasible payoff region in the (R1, R2) plane, the convex hull of (1,1), (-1,-1), (2,-2), and (-2,2), with the safety value marked at (-1,-1).]
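Not from the slides, but as a quick sanity check on where that safety value sits, here is a small Python sketch that brute-forces the row player's maximin value over a grid of mixed strategies for the matrix above:

```python
import numpy as np

# Row player's payoffs for the Prisoner's Dilemma above:
# rows = our action (Cooperate, Defect), cols = opponent's action.
R1 = np.array([[1.0, -2.0],
               [2.0, -1.0]])

best, best_p = -np.inf, None
for p in np.linspace(0.0, 1.0, 1001):      # p = probability we Cooperate
    mix = np.array([p, 1.0 - p])
    # Worst-case expected payoff over the opponent's pure responses.
    worst_case = min(mix @ R1[:, 0], mix @ R1[:, 1])
    if worst_case > best:
        best, best_p = worst_case, p

print(f"safety (maximin) value ~ {best:.2f} at P(Cooperate) = {best_p:.2f}")
# Prints roughly -1.00 at P(Cooperate) = 0.00, i.e. always Defect,
# matching the (-1,-1) safety point in the figure.
```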

Page 10:

Better policies: Tit-for-Tat

Expand our notion of policies to include maps from past history to actions

Our choice of action now depends on previous choices (i.e. non-stationary)

Tit-for-Tat policy (history = last period's play):

  ( · , Defect )    → Defect
  ( · , Cooperate ) → Cooperate
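A minimal sketch (mine, not the authors') of Tit-for-Tat as a map from last-period history to an action, played here against an always-defect opponent:

```python
COOPERATE, DEFECT = "C", "D"

def tit_for_tat(last_joint_action):
    """Map last period's (my_action, their_action) to my next action."""
    if last_joint_action is None:          # first round: cooperate
        return COOPERATE
    _, their_last = last_joint_action
    return their_last                      # copy whatever they did last time

def always_defect(last_joint_action):
    return DEFECT

# Payoffs for the row player from the Prisoner's Dilemma matrix above.
PAYOFF = {("C", "C"): 1, ("C", "D"): -2, ("D", "C"): 2, ("D", "D"): -1}

history, total = None, 0
for t in range(10):
    a1 = tit_for_tat(history)
    a2 = always_defect(history[::-1] if history else None)
    total += PAYOFF[(a1, a2)]
    history = (a1, a2)

print("Tit-for-Tat vs. always-defect, 10 rounds:", total)   # -2 once, then -1 per round
```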

Page 11:

Types of policies & consequences

• Stationary: 1 → At
  At best, leads to the same outcome as the single-shot Nash equilibrium against rational opponents

• Reactionary: { ( ht-1 ) } → At
  Tit-for-Tat achieves the "best" outcome in the Prisoner's Dilemma

• Finite memory: { ( ht-n , … , ht-2 , ht-1 ) } → At
  May be useful against more complex opponents or in more complex games

• "Algorithmic": { ( h1 , h2 , … , ht-2 , ht-1 ) } → At
  Makes use of the entire history of actions as it learns over time

Page 12:

Classifying our policy space

We can classify our learning algorithm’s potential power by observing the amount of history its policies can use

• Stationary: H0
    1 → At

• Reactionary: H1
    { ( ht-1 ) } → At

• Behavioral / finite memory: Hn
    { ( ht-n , … , ht-2 , ht-1 ) } → At

• Algorithmic / infinite memory: H∞
    { ( h1 , h2 , … , ht-2 , ht-1 ) } → At

Page 13:

Classifying our belief space

It's also important to quantify our belief space, i.e. our assumptions about what types of policies the opponent is capable of playing:

• Stationary: B0
• Reactionary: B1
• Behavioral / finite memory: Bn
• Infinite memory / arbitrary: B∞

Page 14:

A Simple Classification

        B0                               B1            Bn              B∞
  H0    Minimax-Q, Nash-Q, Corr-Q                                      Bully
  H1                                                                   Godfather
  Hn
  H∞    (WoLF) PHC, Fictitious Play,     Q1-learning   Qt-learning?    ???
        Q-learning (JAL)

Page 15:

A Classification

(The same classification table as on the previous slide, repeated here; the next slide examines the H∞ × B0 cell.)

Page 16:

H∞ × B0: Stationary opponent

Since the opponent is stationary, this case reduces the world to an MDP. Hence we can apply any traditional reinforcement learning methods

• Policy hill climber (PHC) [Bowling & Veloso, 02]
  Estimates the gradient in the action space and follows it towards the local optimum

• Fictitious play [Robinson, 51] [Fudenberg & Levine, 95]
  Plays a stationary best response to the statistical frequency of the opponent's play

• Q-learning (JAL) [Watkins, 89] [Claus & Boutilier, 98]
  Learns Q-values of states and possibly joint actions
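To make the B0 assumption concrete, here is a small fictitious-play sketch (my own illustration, not code from the talk): keep empirical counts of the opponent's actions and best-respond to the resulting frequency estimate.

```python
import numpy as np

# Row player's payoff matrix for a 2x2 game (here the Prisoner's Dilemma rows from earlier).
R = np.array([[1.0, -2.0],
              [2.0, -1.0]])

counts = np.ones(2)                 # opponent action counts (smoothed)

def fictitious_play_action(counts):
    freq = counts / counts.sum()    # estimated stationary opponent policy
    expected = R @ freq             # expected payoff of each of our actions
    return int(np.argmax(expected)) # stationary best response

def observe(opponent_action):
    counts[opponent_action] += 1

# Example: the opponent mostly plays column 0.
for opp_a in [0, 0, 1, 0, 0]:
    my_a = fictitious_play_action(counts)
    observe(opp_a)
print("best response to estimated frequencies:", fictitious_play_action(counts))
```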

Page 17:

A Classification

(The same classification table, repeated; the next slide examines the H0 × B∞ cell.)

Page 18:

H0 × B∞: My enemy's pretty smart

"Bully" [Littman & Stone, 01]

Tries to force the opponent to conform to the preferred outcome by choosing to play only some part of the game matrix.

The "Chicken" game (Hawk-Dove); rows are us, columns are them:

                          Cooperate ("Swerve")    Defect ("Drive")
  Cooperate ("Swerve")    1, 1                    -2, 2
  Defect ("Drive")        2, -2                   -5, -5

Undesirable Nash Eq.

Page 19:

Achieving “perfection”

Can we design a learning algorithm that will perform well in all circumstances?
• Prediction
• Optimization

But this is not possible!* [Nachbar, 95] [Binmore, 89]

* Universal consistency (Exp3 [Auer et al, 02], smoothed fictitious play [Fudenberg & Levine, 95]) does provide a way out, but it merely guarantees that we'll do almost as well as any stationary policy that we could have used.
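For reference, a compact sketch of the Exp3 exponential-weights update (my own illustrative code following the standard presentation, not code from the talk); rewards are assumed to lie in [0, 1]:

```python
import numpy as np

def exp3(num_actions, gamma, reward_fn, T):
    """Exp3: exponential weights with importance-weighted reward estimates."""
    rng = np.random.default_rng(0)
    weights = np.ones(num_actions)
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / num_actions
        action = rng.choice(num_actions, p=probs)
        reward = reward_fn(action, t)              # observed reward in [0, 1]
        estimate = reward / probs[action]          # unbiased estimate for the chosen arm
        weights[action] *= np.exp(gamma * estimate / num_actions)
    return weights / weights.sum()

# Toy usage: action 1 pays 0.9, action 0 pays 0.1.
final = exp3(2, gamma=0.1, reward_fn=lambda a, t: 0.9 if a == 1 else 0.1, T=2000)
print("final weight distribution:", final)         # should concentrate on action 1
```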

Page 20:

A reasonable goal?

Can we design an algorithm in H∞ × Bn, or in a subclass of H∞ × B∞, that will do well?

• Should always try to play a best response to any given opponent strategy
• Against a fully rational opponent, should thus learn to play a Nash equilibrium strategy
• Should try to guarantee that we'll never do too badly

One possible approach: given knowledge about the opponent, model its behavior and exploit its weaknesses (play a best response).

Let's start by constructing a player that plays well against PHC players in 2x2 games.

Page 21:

2x2 Repeated Matrix Games

            Left         Right
  Up        r11, c11     r12, c12
  Down      r21, c21     r22, c22

• We choose row i to play
• The opponent chooses column j to play
• We receive reward rij; they receive cij
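A tiny helper (an illustration of this setup, with payoff values of my choosing) for representing and stepping a 2x2 repeated matrix game:

```python
import numpy as np

class MatrixGame2x2:
    """A 2x2 repeated matrix game: R holds our rewards r_ij, C the opponent's c_ij."""
    def __init__(self, R, C):
        self.R = np.asarray(R, dtype=float)
        self.C = np.asarray(C, dtype=float)
        self.history = []                     # list of (our row i, their column j)

    def step(self, i, j):
        self.history.append((i, j))
        return self.R[i, j], self.C[i, j]     # (our reward, their reward)

# Example instantiation: matching pennies (used later in the talk).
pennies = MatrixGame2x2(R=[[-1, 1], [1, -1]], C=[[1, -1], [-1, 1]])
print(pennies.step(0, 0))                     # we play Heads, they play Heads -> (-1.0, 1.0)
```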

Page 22:

Iterated gradient ascent

[Singh Kearns Mansour, 00]

System dynamics for 2x2 matrix games take one of two forms:

[Two phase-portrait panels: Player 2's probability for Action 1 plotted against Player 1's probability for Action 1, one panel for each of the two qualitative forms of the dynamics.]
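To make "the two forms" concrete, the unconstrained gradient dynamics analyzed by Singh et al. can be written as follows (my transcription of the standard derivation, using the r_ij, c_ij notation from the previous slide, with α and β the two players' probabilities of playing their first action):

```latex
V_r(\alpha,\beta) = \alpha\beta\, r_{11} + \alpha(1-\beta)\, r_{12}
                  + (1-\alpha)\beta\, r_{21} + (1-\alpha)(1-\beta)\, r_{22}

\dot{\alpha} = \frac{\partial V_r}{\partial \alpha} = \beta\, u_r + (r_{12} - r_{22}),
\qquad u_r = r_{11} + r_{22} - r_{12} - r_{21}

\dot{\beta} = \frac{\partial V_c}{\partial \beta} = \alpha\, u_c + (c_{21} - c_{22}),
\qquad u_c = c_{11} + c_{22} - c_{12} - c_{21}
```

The two qualitative pictures correspond to the eigenvalues of this linear system, ±√(u_r u_c): real (saddle-like trajectories) when u_r u_c > 0, and purely imaginary (cycling trajectories) when u_r u_c < 0.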

Page 23:

Can we do better and actually win?

Singh et al show that we can achieve Nash payoffs

But is this a best response? We can do better…

• Exploit while winning
• Deceive and bait while losing

Matching pennies; rows are us, columns are them:

            Heads     Tails
  Heads     -1, 1      1, -1
  Tails      1, -1    -1, 1

Page 24:

A winning strategy against PHC

• If winning: play probability 1 for the current preferred action, in order to maximize rewards while winning
• If losing: play a deceiving policy until we are ready to take advantage of them again

[Figure: the joint policy space, probability we play Heads (0 to 1) versus probability the opponent plays Heads (0 to 1), with the mixed equilibrium at (0.5, 0.5).]

Page 25:

Formally, PHC does:

Keeps and updates Q-values:

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\big(R + \gamma \max_{a'} Q(s',a')\big)$

Updates the policy by hill-climbing towards the greedy action:

$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\delta/(|A_i|-1) & \text{otherwise} \end{cases}$
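A compact Python sketch of these PHC updates for a single-state 2x2 game (my own rendering of the update rules above; the re-normalization of the policy at the end is an implementation detail the slide leaves implicit):

```python
import numpy as np

class PHC:
    """Policy hill-climbing for a single-state game with num_actions actions."""
    def __init__(self, num_actions, alpha=0.1, delta=0.01, gamma=0.0):
        self.Q = np.zeros(num_actions)
        self.pi = np.full(num_actions, 1.0 / num_actions)
        self.alpha, self.delta, self.gamma = alpha, delta, gamma

    def act(self, rng):
        return rng.choice(len(self.pi), p=self.pi)

    def update(self, action, reward):
        # Q update (single state, so the max_{a'} Q(s',a') term reuses self.Q).
        self.Q[action] = (1 - self.alpha) * self.Q[action] + \
                         self.alpha * (reward + self.gamma * self.Q.max())
        # Hill-climb the policy towards the greedy action.
        greedy = int(np.argmax(self.Q))
        step = np.where(np.arange(len(self.pi)) == greedy,
                        self.delta, -self.delta / (len(self.pi) - 1))
        self.pi = np.clip(self.pi + step, 0.0, 1.0)
        self.pi /= self.pi.sum()        # keep pi a valid distribution

# Toy usage: two PHC learners playing matching pennies drift around the mixed equilibrium.
rng = np.random.default_rng(0)
p1, p2 = PHC(2), PHC(2)
R = np.array([[-1.0, 1.0], [1.0, -1.0]])   # row player's matching-pennies payoffs
for _ in range(5000):
    a1, a2 = p1.act(rng), p2.act(rng)
    p1.update(a1, R[a1, a2])
    p2.update(a2, -R[a1, a2])
print(p1.pi, p2.pi)
```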

Page 26:

PHC-Exploiter

Updates the policy differently depending on whether we are winning or losing.

If we are winning, play the current best action deterministically:

$\pi_1(s,a) \leftarrow \begin{cases} 1 & \text{if } a \text{ is the best action} \\ 0 & \text{otherwise} \end{cases}$

Otherwise, we are losing, and we hill-climb as in PHC, with a step size $\delta_1$ derived from the opponent's (estimated) PHC learning rate $\delta_2$:

$\pi_1(s,a) \leftarrow \pi_1(s,a) + \begin{cases} \delta_1 & \text{if } a \text{ is the best action} \\ -\delta_1/(|A_1|-1) & \text{otherwise} \end{cases}$

Page 27:

PHC-Exploiter

The winning test: we are winning when the expected reward of our current policy against the estimated opponent policy $\hat{\pi}_2$ exceeds what the equilibrium strategy $\pi_1^*$ would earn against it,

$\sum_{a'} \pi_1(s,a')\, Q_1(s,a') \;>\; R_1\big(\pi_1^*(s), \hat{\pi}_2(s)\big).$

If so, we play the deterministic best-action update above; otherwise, we are losing and use the PHC-style hill-climbing update.


Page 29:

But we don’t have complete information

• Estimate the opponent's policy $\hat{\pi}_2$ at each time period
• Estimate the opponent's learning rate $\hat{\delta}_2$

[Timeline figure: the observed history is divided into windows of length w, marked at times t − 2w, t − w, and t, from which these estimates are computed.]
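One way to realize these estimates (an illustrative sketch; the window length w and the particular differencing scheme are my assumptions, not taken from the slides) is to compare the opponent's empirical action frequencies over consecutive windows of play:

```python
import numpy as np

def estimate_opponent(history, w, num_actions=2):
    """Estimate the opponent's current policy and per-step learning rate
    from the last two windows of observed opponent actions."""
    recent = history[-w:]                      # window (t - w, t]
    previous = history[-2 * w:-w]              # window (t - 2w, t - w]

    def freq(window):
        counts = np.bincount(window, minlength=num_actions)
        return counts / max(len(window), 1)

    pi2_hat = freq(recent)                     # estimated current policy
    # Estimated change in policy per time step, i.e. the opponent's learning rate.
    delta2_hat = np.abs(freq(recent) - freq(previous)).max() / w
    return pi2_hat, delta2_hat

# Toy usage: an opponent drifting from action 0 towards action 1.
history = [0] * 80 + [1] * 20 + [0] * 60 + [1] * 40
print(estimate_opponent(history, w=100))
```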

Page 30:

Ideally we’d like to see this:

[Plot: the idealized joint-policy trajectory, cycling between a winning region and a losing region.]

Page 31:

With our approximations:

Page 32:

And indeed we’re doing well.

[Plot: performance over time, with winning and losing phases marked.]

Page 33:

Knowledge (beliefs) are useful

Using our knowledge about the opponent, we’ve demonstrated one case in which we can achieve better than Nash rewards

In general, we’d like algorithms that can guarantee Nash payoffs against fully rational players but can exploit bounded players (such as a PHC)

Page 34:

So what do we want from learning?

• Best response / adaptive: exploit the opponent's weaknesses; essentially, always try to play a best response

• Regret minimization: we'd like to be able to look back and not regret our actions; we wouldn't say to ourselves: "Gosh, why didn't I choose to do that instead…"

Page 35:

A next step

• Expand the comparison class in universally consistent (regret-minimizing) algorithms to include richer spaces of possible strategies
• For example, the comparison class could include a best-response player to a PHC
• It could also include all t-period strategies

Page 36:

Part II

What if we’re cooperating?

Page 37:

What if we’re cooperating?

• Nash equilibrium is not the most useful concept in cooperative scenarios
• We simply want to find the global (perhaps approximate) optimum in a distributed way. This happens to be a Nash equilibrium, but it's not really the point of NE to address this scenario
• Distributed problem solving rather than game playing
• May also deal with modeling emergent behaviors

Page 38:

Mobilized ad-hoc networks

Ad-hoc networks are limited in connectivity

Mobilized nodes can significantly improve connectivity

Page 39:

Network simulator

Page 40:

Connectivity bounds

Static ad-hoc networks have loose bounds of the following form:

Given n nodes distributed uniformly i.i.d. in a disk of area A, each with transmission range

$r_n = \sqrt{\dfrac{A(\log n + c_n)}{\pi n}},$

the graph is connected almost surely as $n \to \infty$ iff $c_n \to \infty$.
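A quick Monte Carlo sketch (my own illustration, not part of the talk) of this kind of threshold: drop n uniform points in the unit disk and check whether the range-r graph is connected.

```python
import numpy as np
from math import pi, log, sqrt

def connected(points, r):
    """BFS over the geometric graph where nodes within distance r are neighbors."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    adj = d2 <= r * r
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

def sample_disk(n, rng):
    """Uniform points in the unit disk (area A = pi)."""
    radius = np.sqrt(rng.random(n))
    theta = 2 * pi * rng.random(n)
    return np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

rng = np.random.default_rng(0)
n, c_n, A = 200, 4.0, pi
r_n = sqrt(A * (log(n) + c_n) / (pi * n))       # range from the bound above
trials = [connected(sample_disk(n, rng), r_n) for _ in range(20)]
print(f"connected in {sum(trials)}/20 trials with r_n = {r_n:.3f}")
```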

Page 41:

Connectivity bounds

Allowing mobility can improve our loose bounds to:

Can we achieve this or even do significantly better than this?

  Fraction mobile    Required range    # nodes
  1/2                r_n / 2           n / 2
  2/3                r_n / 3           n / 3
  k/(k+1)            r_n / (k+1)       n / (k+1)

(Here $r_n \propto \sqrt{\log n / n}$ is the range required in the static case above.)

Page 42:

Many challenges

Routing
• Dynamic environment: neighbor nodes moving in and out of range; the source and receivers may also be moving
• Limited bandwidth: channel allocation, limited buffer sizes

Moving
• What is the globally optimal configuration?
• What is the globally optimal trajectory of configurations?
• Can we learn a good policy using only local knowledge?

Page 43:

Routing

Q-routing [Boyan & Littman, 93]
• Applied simple Q-learning to the static network routing problem under congestion
• Actions: forward the packet to a particular neighbor node
• States: the current packet's intended receiver
• Reward: estimated time to arrival at the receiver
• Performed well by learning to route packets around congested areas

Here we consider:
• Direct application of Q-routing to the mobile ad-hoc network case
• Adaptations to the highly dynamic nature of mobilized ad-hoc networks
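For reference, the Q-routing idea from Boyan & Littman has node x, after forwarding a packet bound for destination d to neighbor y, move its estimate toward y's reported remaining delivery time; a hedged sketch with my own variable names and a made-up toy table:

```python
def q_routing_update(Q, x, y, d, queue_delay, transit_delay, eta=0.5):
    """Node x revises its delivery-time estimate Q[x][d][y] after forwarding
    a packet for destination d to neighbor y."""
    # y reports its best remaining estimate to d over its own neighbors.
    remaining = min(Q[y][d].values()) if Q[y][d] else 0.0
    target = queue_delay + transit_delay + remaining
    Q[x][d][y] += eta * (target - Q[x][d][y])
    return Q[x][d][y]

# Toy usage with a hypothetical table: Q[node][destination][neighbor] -> est. delivery time.
Q = {
    "x": {"d": {"y": 5.0, "z": 7.0}},
    "y": {"d": {"d": 1.0}},
}
print(q_routing_update(Q, "x", "y", "d", queue_delay=0.5, transit_delay=1.0))   # 3.75
```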

Page 44:

Movement: An RL approach

What should our actions be?
• North, South, East, West, stay put
• Explore, maintain connection, terminate connection, etc.

What should our states be?
• Local information about nodes, locations, and paths
• Summarized local information
• Globally shared statistics

Policy search? Mixture of experts?

Page 45:

Macros, options, complex actions

• Allow the nodes (agents) to utilize complex actions rather than simple N, S, E, W type movements
• Actions might take varying amounts of time
• Agents can re-evaluate at each time step whether to continue the action or not; if the state hasn't really changed, then naturally the same action will be chosen again

Page 46:

Example action: “plug”

1. Sniff packets in the neighborhood
2. Identify the path (source, receiver pair) with the longest average hops
3. Move to that path
4. Move along this path until a long hop is encountered
5. Insert yourself into the path at this point, thereby decreasing the average hop distance
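As a self-contained toy (my own simplification, not the authors' simulator) that captures the decision logic of steps 1, 2, 4 and 5, with the flow summary and the long-hop threshold as hypothetical inputs and the actual movement omitted:

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """A source-receiver flow, summarized by the lengths of its hops (hypothetical summary)."""
    name: str
    hop_lengths: list

def choose_plug_point(flows, long_hop_threshold):
    """Pick the flow with the longest average hop, then the first overly long hop to split."""
    if not flows:
        return None
    target = max(flows, key=lambda f: sum(f.hop_lengths) / len(f.hop_lengths))
    for i, length in enumerate(target.hop_lengths):
        if length > long_hop_threshold:
            return target.name, i          # insert ourselves into hop i of this flow
    return target.name, None               # no long hop found; just shadow the path

flows = [Flow("A->B", [1.0, 1.2, 0.9]), Flow("C->D", [1.1, 3.4, 1.0])]
print(choose_plug_point(flows, long_hop_threshold=2.0))    # ('C->D', 1)
```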

Page 47:

Some notion of state

The state space could be huge, so we choose certain features to parameterize it: connectivity, average hop distance, …

Actions should change the world state: exploring will hopefully lead to connectivity, plugging will lead to smaller average hops, …
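A small self-contained sketch (my own toy representation, not the authors' simulator state) of the kind of features mentioned here, computed from an adjacency list by breadth-first search:

```python
from collections import deque

def state_features(adj):
    """Compute (fraction of reachable node pairs, average hop distance between
    reachable pairs) for a network given as an adjacency dict."""
    nodes = list(adj)
    reachable_pairs, total_hops = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # BFS from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst in nodes:
            if dst != src and dst in dist:
                reachable_pairs += 1
                total_hops += dist[dst]
    possible = len(nodes) * (len(nodes) - 1)
    connectivity = reachable_pairs / possible if possible else 0.0
    avg_hops = total_hops / reachable_pairs if reachable_pairs else float("inf")
    return connectivity, avg_hops

# Toy network: a chain 0-1-2 plus an isolated node 3.
adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
print(state_features(adj))                 # (0.5, 1.333...)
```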

Page 48:

Experimental results

  Number of nodes    Range        Theoretical fraction mobile    Empirical fraction mobile required
  25                 2 r_n
  25                 r_n          1/2                            0.21
  50                 1.7 r_n
  50                 0.85 r_n     1/2                            0.25
  100                1.7 r_n
  100                0.85 r_n     1/2                            0.19
  200                1.6 r_n
  200                0.8 r_n      1/2                            0.17
  400                1.6 r_n
  400                0.8 r_n      1/2                            0.14

Page 49:

Seems to work well

Page 50:

Pretty pictures

Page 51:

Pretty pictures

Page 52:

Pretty pictures

Page 53:

Pretty pictures

Page 54:

Many things to play with

• Lossy transmissions
• Transmission interference
• Existence of opponents, jamming signals
• Self-interested nodes
• More realistic simulations – ns2
• Learning different agent roles, or optimizing the individual complex actions
• Interaction between route learning and movement learning

Page 55:

Three yardsticks

1. Non-cooperative case: we want to play our best response to the observed play of the world – we want to learn about the opponent
   • Minimize regret
   • Play our best response

Page 56:

Three yardsticks

1. Non-cooperative case: We want to play our best response to the observed play of the world

2. Cooperative case: approximate the global optimum using only local information or less computation

Page 57:

Three yardsticks

1. Non-cooperative case: We want to play our best response to the observed play of the world

2. Cooperative case: approximate the global optimum in a distributed manner

3. Skiing case: 17 cm of fresh powder last night and it's still snowing. More snow is better. Who can argue with that?

Page 58:

The End