
Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games

Richard Mealing and Jonathan L. Shapiro
{mealingr,jls}@cs.man.ac.uk

Machine Learning and Optimisation Group
School of Computer Science

University of Manchester, UK


The Problem

You play against an opponent

The opponent’s actions are based on previous actions

How can you maximise your reward?

Applications

Heads-up poker
Auctions
P2P networking
Path finding
etc.


Possible Approaches

You could use reinforcement learning to learn to take actions with high expected discounted rewards

However, we propose to:

Model the opponent using sequence prediction methods
Lookahead and take actions which probabilistically, according to the opponent model, lead to the highest reward

Which approach gives us the highest rewards?


Opponent Modelling using Sequence Prediction

Observe the opponent's action and the player's action $(a_{\mathrm{opp}}, a)$

Form a sequence over time $t$ (memory size $n$):

$(a^t_{\mathrm{opp}}, a^t),\, (a^{t-1}_{\mathrm{opp}}, a^{t-1}),\, \ldots,\, (a^{t-n+1}_{\mathrm{opp}}, a^{t-n+1})$

Predict the opponent's next action based on this sequence:

$\Pr\!\big(a^{t+1}_{\mathrm{opp}} \mid (a^t_{\mathrm{opp}}, a^t),\, (a^{t-1}_{\mathrm{opp}}, a^{t-1}),\, \ldots,\, (a^{t-n+1}_{\mathrm{opp}}, a^{t-n+1})\big)$
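As a concrete (if minimal) sketch of such a model, a frequency table over the last $n$ joint actions could be maintained as below; the class name, uniform fallback and update scheme are our own illustrative choices, not one of the predictors compared later:

```python
from collections import Counter, defaultdict, deque

class JointActionPredictor:
    """Sketch of a memory-n opponent model: count which opponent action follows
    each context of the last n (opponent action, player action) pairs."""

    def __init__(self, memory_size, opp_actions):
        self.memory = deque(maxlen=memory_size)   # last n joint actions (a_opp, a)
        self.counts = defaultdict(Counter)        # context -> counts of the next opponent action
        self.opp_actions = opp_actions

    def observe(self, a_opp, a):
        """Record that `a_opp` followed the current context, then slide the window."""
        if len(self.memory) == self.memory.maxlen:
            self.counts[tuple(self.memory)][a_opp] += 1
        self.memory.append((a_opp, a))

    def predict(self, context=None):
        """Return Pr(next opponent action | context); uniform if the context is unseen."""
        context = tuple(self.memory) if context is None else tuple(context)
        counter = self.counts.get(context)
        if not counter:
            return {b: 1.0 / len(self.opp_actions) for b in self.opp_actions}
        total = sum(counter.values())
        return {b: counter[b] / total for b in self.opp_actions}

# e.g. iterated Prisoner's Dilemma with memory size 1
model = JointActionPredictor(memory_size=1, opp_actions=("C", "D"))
for a_opp, a in [("C", "C"), ("C", "D"), ("D", "D"), ("D", "C"), ("C", "C")]:
    model.observe(a_opp, a)
print(model.predict())   # distribution over the opponent's next action given the last pair
```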


Sequence Prediction Methods

We tested a variety of sequence prediction methods...

Lempel-Ziv-1978 (LZ78) [1]

Knuth-Morris-Pratt (KMP) [2] (unbounded contexts)

Prediction by Partial Matching C (PPMC) [3]

ActiveLeZi [4] (context blending)

Transition Directed Acyclic Graph (TDAG) [5]

Entropy Learned Pruned Hypothesis Space (ELPH) [6] (context pruning)

N-Gram [7]

Hierarchical N-Gram (H. N-Gram) [7] (collection of 1- to N-Grams)

Long Short Term Memory (LSTM) [8] (implicit blending & pruning)


Sequence Prediction Method Lookahead

Predict k steps ahead given a hypothesised context, i.e.

$\Pr\!\big(a^{t+k}_{\mathrm{opp}} \mid (a^{t+k-1}_{\mathrm{opp}}, a^{t+k-1}),\, (a^{t+k-2}_{\mathrm{opp}}, a^{t+k-2}),\, \ldots,\, (a^{t+k-n}_{\mathrm{opp}}, a^{t+k-n})\big)$

A hypothesised context may contain unobserved (predicted) symbols
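A sketch of how such a k-step prediction could be rolled out is given below; committing to the most probable opponent action at each intermediate step, the fixed player plan, and the toy tit-for-tat-style probabilities are all illustrative assumptions rather than the paper's method:

```python
def predict_k_ahead(predict, context, player_plan):
    """Sketch: predict the opponent's action k = len(player_plan) steps ahead.

    predict(context) -> dict mapping opponent actions to probabilities.
    context is a tuple of (a_opp, a) pairs; intermediate opponent actions are
    themselves predictions, so the hypothesised context contains unobserved symbols.
    """
    for a in player_plan[:-1]:
        dist = predict(context)
        a_opp_hat = max(dist, key=dist.get)          # commit to the most likely opponent action
        context = context[1:] + ((a_opp_hat, a),)    # slide the window with the hypothesised pair
    return predict(context)                          # distribution over a_opp at step t+k

# Tiny hand-built model (hypothetical numbers) for a tit-for-tat-like opponent, memory size 1
toy_model = {
    (("C", "C"),): {"C": 0.9, "D": 0.1},
    (("C", "D"),): {"C": 0.1, "D": 0.9},
    (("D", "C"),): {"C": 0.9, "D": 0.1},
    (("D", "D"),): {"C": 0.1, "D": 0.9},
}
predict = lambda ctx: toy_model.get(ctx, {"C": 0.5, "D": 0.5})

# Two-step lookahead: if we plan to play D then C, what will the opponent do at t+2?
print(predict_k_ahead(predict, (("C", "C"),), player_plan=("D", "C")))
```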


Reinforcement Learning: Q-Learning

Learns an action-value function that, given a state-action pair $(s, a)$, outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]

$Q(\overbrace{s^t}^{\text{State}}, \overbrace{a^t}^{\text{Action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{Learning rate}})\, Q(s^t, a^t)}_{\text{fraction of old value}} + \underbrace{\alpha\big[\overbrace{r^t}^{\text{Reward}} + \overbrace{\gamma}^{\text{Discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\big]}_{\text{fraction of reward \& next max-valued action}}$

Select actions with high Q-values, with some exploration
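Written as code, the tabular update and an ε-greedy action choice might look like this (a sketch; the parameter values are placeholders):

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount, exploration rate
Q = defaultdict(float)                      # Q[(state, action)], initialised to 0

def q_update(s, a, r, s_next, actions):
    """One tabular step: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def select_action(s, actions):
    """Epsilon-greedy: mostly the highest-valued action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```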


Need for Lookahead (Prisoner’s Dilemma Example)

Payoff matrix (row player's payoff listed first):

      D     C
D   1,1   4,0
C   0,4   3,3


Need for Lookahead (Prisoner’s Dilemma Example)

Defect is the dominant action

Cooperate-Cooperate is socially optimal (highest sum of rewards)

Tit-for-tat (copy opponent’s last move) is good for iterated play

Can we learn tit-for-tat?


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: depth-1 lookahead tree. If the opponent is predicted to play C, playing D yields 4 and playing C yields 3; if the opponent is predicted to play D, playing D yields 1 and playing C yields 0.]

Payoff matrix (row player's payoff listed first):

      D     C
D   1,1   4,0
C   0,4   3,3


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: the same depth-1 lookahead tree and payoff matrix as on the previous slide.]

Lookahead of 1 shows D has highest reward

With a lookahead of 2, the sequence (D,C,D,C) has the highest total reward (though it is unlikely to occur)

Assume the opponent copies the player’s last move (i.e. tit-for-tat)


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: depth-2 lookahead tree with the same payoff matrix, assuming a tit-for-tat opponent. From a predicted C: playing D then D totals 5, D then C totals 4, C then D totals 7, C then C totals 6. From a predicted D: D then D totals 2, D then C totals 1, C then D totals 4, C then C totals 3.]


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: the same depth-2 lookahead tree against tit-for-tat as on the previous slide.]

Lookahead of 2 against tit-for-tat shows C has highest reward
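Spelling out the totals from the tree (starting from a predicted C, with the opponent copying our move):

$D \text{ first: } 4 + \max(1, 0) = 5, \qquad C \text{ first: } 3 + \max(4, 3) = 7,$

so cooperating now and defecting on the next step gives the larger two-step total.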


Q-Learning’s Implicit Lookahead

$Q(\overbrace{s^t}^{\text{State}}, \overbrace{a^t}^{\text{Action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{Learning rate}})\, Q(s^t, a^t)}_{\text{fraction of old value}} + \underbrace{\alpha\big[\overbrace{r^t}^{\text{Reward}} + \overbrace{\gamma}^{\text{Discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\big]}_{\text{fraction of reward \& next max-valued action}}$

Assume each state is an opponent action, i.e. $s = a_{\mathrm{opp}}$

Learns (player action, opponent action) values as:

$\gamma = 0$: payoff matrix ($\arg\max_a Q(a^{t+1}_{\mathrm{opp}}, a)$ is the same as a lookahead of 1)
$0 < \gamma < 1$: payoff matrix + future rewards with exponential decay
$\gamma = 1$: payoff matrix + future rewards

Increasing γ increases lookahead
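For example (our own rewriting of the update above), with $s^t = a^t_{\mathrm{opp}}$ and $\gamma = 0$ the update reduces to a running average of the immediate payoff,

$Q(a_{\mathrm{opp}}, a) \leftarrow (1-\alpha)\, Q(a_{\mathrm{opp}}, a) + \alpha\, r^t \approx \mathbb{E}[\, r \mid a, a_{\mathrm{opp}} \,],$

so $\arg\max_a Q(a^{t+1}_{\mathrm{opp}}, a)$ is just the greedy one-step response to the predicted opponent action.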


Exhaustive Explicit Lookahead

We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to a limited depth) maximising total reward

[Figure: exhaustive depth-2 lookahead tree for the Prisoner's Dilemma, expanding the player's and the opponent's actions at each level, with cumulative rewards at the leaves and the same payoff matrix as before.]
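A minimal sketch of this search is shown below; `predict` (the opponent model) and `payoff` are assumed interfaces, and here the lookahead takes an expectation over the opponent model rather than committing to a single predicted action:

```python
def lookahead_value(context, depth, predict, payoff, actions):
    """Sketch: best expected total reward achievable within `depth` steps,
    weighting the opponent's replies by the opponent model."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:                                   # our candidate action
        dist = predict(context)                         # Pr(a_opp | context)
        value = 0.0
        for a_opp, p in dist.items():                   # expectation over opponent replies
            new_context = context[1:] + ((a_opp, a),)   # hypothesised next context
            value += p * (payoff[(a, a_opp)]
                          + lookahead_value(new_context, depth - 1, predict, payoff, actions))
        best = max(best, value)
    return best

def best_action(context, depth, predict, payoff, actions):
    """Greedy root choice: the action whose subtree has the highest expected total reward."""
    def root_value(a):
        dist = predict(context)
        return sum(p * (payoff[(a, a_opp)]
                        + lookahead_value(context[1:] + ((a_opp, a),), depth - 1,
                                          predict, payoff, actions))
                   for a_opp, p in dist.items())
    return max(actions, key=root_value)

# Prisoner's Dilemma payoffs from the slides and a toy tit-for-tat-like model (memory size 1)
payoff = {("D", "D"): 1, ("D", "C"): 4, ("C", "D"): 0, ("C", "C"): 3}
toy_model = {ctx: ({"C": 0.9, "D": 0.1} if ctx[0][1] == "C" else {"C": 0.1, "D": 0.9})
             for ctx in [((o, a),) for o in "CD" for a in "CD"]}
predict = lambda ctx: toy_model.get(ctx, {"C": 0.5, "D": 0.5})
print(best_action((("C", "C"),), depth=2, predict=predict, payoff=payoff, actions=("C", "D")))
```

With these payoffs and the toy tit-for-tat-like model, a depth of 2 selects C, matching the earlier worked example.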


Experiments

Iterated Rock-Paper-Scissors

Opponent's actions depend on its previous actions

Iterated Prisoner’s Dilemma

Opponent's actions depend on both players' previous actions

Littman’s Soccer [10]

Direct competition

Which approach has better performance?

Rock-Paper-Scissors payoffs:

       R      P      S
R    0,0   -1,1   1,-1
P   1,-1    0,0   -1,1
S   -1,1   1,-1    0,0

Prisoner's Dilemma payoffs:

      D     C
D   1,1   4,0
C   0,4   3,3


Iterated Rock Paper Scissors

Entries are Avg Payoff (Avg Time); within each cell, agents are listed from best to worst.

Opponent order 1 (repeating R, P, S):
Memory size 1: ELPH 1 ± 0 (14.7 ± 0.6); WoLF-PHC 1 ± 0 (27 ± 2); PGA-APP 0.973 ± 0.009 (24 ± 2); ɛ Q-Learner 0.97 ± 0.01 (29 ± 2); WPL 0.87 ± 0.01 (74 ± 6)
Memory size 2: ELPH 1 ± 0 (10 ± 0); WoLF-PHC 0.98 ± 0.008 (91 ± 3); ɛ Q-Learner 0.97 ± 0.01 (28 ± 2); PGA-APP 0.92 ± 0.01 (52 ± 3); WPL 0.65 ± 0.02 (105 ± 7)
Memory size 3: ELPH 1 ± 0 (10 ± 0); WoLF-PHC 0.95 ± 0.01 (181 ± 6); ɛ Q-Learner 0.94 ± 0.01 (37 ± 4); PGA-APP 0.9 ± 0.02 (144 ± 6); WPL 0.63 ± 0.01 (98 ± 6)

Opponent order 2 (repeating R, R, P, P, S, S):
Memory size 1: WoLF-PHC 0.645 ± 0.006 (89 ± 5); PGA-APP 0.644 ± 0.008 (59 ± 5); ɛ Q-Learner 0.635 ± 0.008 (22 ± 3); ELPH 0.617 ± 0.002 (210 ± 0); WPL 0.374 ± 0.007 (143 ± 7)
Memory size 2: ELPH 1 ± 0 (10 ± 0); ɛ Q-Learner 0.92 ± 0.01 (45 ± 4); WoLF-PHC 0.91 ± 0.01 (147 ± 8); PGA-APP 0.86 ± 0.01 (109 ± 6); WPL 0.54 ± 0.01 (71 ± 6)
Memory size 3: ELPH 1 ± 0 (17.2 ± 0.7); WoLF-PHC 0.89 ± 0.01 (205 ± 3); ɛ Q-Learner 0.87 ± 0.01 (71 ± 5); PGA-APP 0.87 ± 0.01 (179 ± 6); WPL 0.69 ± 0.01 (208 ± 2)

Opponent order 3 (repeating R, R, R, P, P, P, S, S, S):
Memory size 1: ELPH 0.666 ± 0.0003 (56 ± 4); PGA-APP 0.652 ± 0.005 (62 ± 4); WoLF-PHC 0.646 ± 0.004 (71 ± 4); ɛ Q-Learner 0.582 ± 0.008 (48 ± 6); WPL 0.393 ± 0.008 (139 ± 7)
Memory size 2: WoLF-PHC 0.68 ± 0.01 (173 ± 6); ɛ Q-Learner 0.64 ± 0.01 (56 ± 5); PGA-APP 0.61 ± 0.01 (120 ± 7); ELPH 0.6 ± 0.002 (58 ± 4); WPL 0.375 ± 0.009 (139 ± 7)
Memory size 3: ELPH 1 ± 0 (16.3 ± 0.7); WoLF-PHC 0.85 ± 0.01 (210 ± 0); ɛ Q-Learner 0.84 ± 0.01 (84 ± 6); PGA-APP 0.77 ± 0.01 (198 ± 3); WPL 0.76 ± 0.01 (210 ± 0)

Agents cannot learn best response with memory size < model order

Our approach gains the highest payoffs at generally the fastest rates


Iterated Prisoner’s Dilemma

Entries are Avg Payoff, Avg Time, Position; within each cell, agents are listed from best to worst.

Discount = 0 and Depth = 1:
Memory size 1: PGA-APP 2.03 ± 0.01, 30 ± 3, 13; ɛ Q-Learner 1.94 ± 0.01, 30 ± 4, 16; WPL 1.932 ± 0.007, 20 ± 1, 17; TDAG 1.93 ± 0.01, 30 ± 2, 16; WoLF-PHC 1.89 ± 0.01, 20 ± 2, 18
Memory size 2: PGA-APP 2.01 ± 0.01, 30 ± 4, 14; WPL 1.949 ± 0.008, 20 ± 1, 17; WoLF-PHC 1.92 ± 0.01, 30 ± 4, 17; TDAG 1.902 ± 0.008, 20 ± 2, 16; ɛ Q-Learner 1.822 ± 0.007, 20 ± 2, 18
Memory size 3: ɛ Q-Learner 2.02 ± 0.01, 30 ± 3, 14; TDAG 1.958 ± 0.008, 20 ± 3, 17; WPL 1.945 ± 0.009, 20 ± 3, 17; PGA-APP 1.92 ± 0.009, 20 ± 2, 16; WoLF-PHC 1.773 ± 0.007, 20 ± 1, 18

Discount = 0.99 and Depth = 2:
Memory size 1: ɛ Q-Learner 2.68 ± 0.01, 180 ± 5, 1; TDAG + Q-Learner 2.63 ± 0.01, 60 ± 4, 1; TDAG 2.607 ± 0.008, 20 ± 1, 1; WPL 2.31 ± 0.01, 30 ± 4, 12; PGA-APP 2.17 ± 0.02, 30 ± 3, 13; WoLF-PHC 2.1 ± 0.02, 40 ± 5, 13
Memory size 2: TDAG + Q-Learner 2.828 ± 0.009, 120 ± 6, 1; ɛ Q-Learner 2.74 ± 0.01, 180 ± 5, 1; TDAG 2.72 ± 0.01, 20 ± 1, 1; WPL 2.34 ± 0.01, 40 ± 4, 12; PGA-APP 2.18 ± 0.02, 40 ± 5, 13; WoLF-PHC 2.14 ± 0.01, 30 ± 3, 13
Memory size 3: TDAG + Q-Learner 2.847 ± 0.009, 130 ± 5, 1; TDAG 2.74 ± 0.01, 30 ± 3, 1; ɛ Q-Learner 2.65 ± 0.01, 170 ± 5, 1; WPL 2.32 ± 0.01, 30 ± 4, 12; PGA-APP 2.18 ± 0.02, 40 ± 4, 12; WoLF-PHC 2.14 ± 0.02, 40 ± 4, 13

Increasing lookahead (discounting, search depth) increases rewards

Our approach + Q-Learning increases rewards but also increases time

Our approach gains the highest payoffs at generally the fastest rates


Soccer

Avg Payoff of our approach with each predictor against each opponent; predictors are listed from best to worst.

Against ɛ Q-Learner: PPMC 0.687 ± 0.006; LSTM 0.635 ± 0.004; TDAG 0.63 ± 0.004; H. N-Gram 0.628 ± 0.003; LZ78 0.621 ± 0.004; N-Gram 0.62 ± 0.003; ActiveLeZi 0.618 ± 0.003; ELPH 0.601 ± 0.004; FP 0.536 ± 0.003; KMP 0.524 ± 0.002

Against WoLF-PHC: PPMC 0.701 ± 0.006; LSTM 0.638 ± 0.005; FP 0.637 ± 0.004; N-Gram 0.614 ± 0.003; H. N-Gram 0.612 ± 0.003; ActiveLeZi 0.606 ± 0.004; TDAG 0.606 ± 0.004; LZ78 0.602 ± 0.004; ELPH 0.576 ± 0.003; KMP 0.564 ± 0.003

Against WPL: PPMC 0.717 ± 0.004; H. N-Gram 0.674 ± 0.002; N-Gram 0.665 ± 0.001; LSTM 0.659 ± 0.003; TDAG 0.659 ± 0.002; FP 0.655 ± 0.003; LZ78 0.653 ± 0.002; ActiveLeZi 0.651 ± 0.002; ELPH 0.637 ± 0.002; KMP 0.62 ± 0.002

Against PGA-APP: PPMC 0.648 ± 0.006; H. N-Gram 0.608 ± 0.003; ActiveLeZi 0.599 ± 0.004; FP 0.593 ± 0.003; TDAG 0.589 ± 0.004; LSTM 0.585 ± 0.004; N-Gram 0.582 ± 0.003; LZ78 0.574 ± 0.003; ELPH 0.565 ± 0.003; KMP 0.553 ± 0.003

Our approach wins above 50% of the games using any predictor

PPMC has the highest performance


Conclusions

We proposed sequence prediction and lookahead to accurately model and effectively respond to opponents with memory

Empirical results show given sufficient memory and lookahead ourapproach outperforms reinforcement learning algorithms


Future Work

Will apply our approach to domains with:

Larger state spaces
Hidden information

Where the challenges are:

Deeper lookahead (e.g. sampling techniques)
Sequence predictor configuration (e.g. 1 predictor per state)


References

[1] Lempel and Ziv. "Compression of Individual Sequences via Variable-Rate Coding". 1978.

[2] Byron Knoll. "Text Prediction and Classification Using String Matching". 2009.

[3] Alistair Moffat. "Implementing the PPM Data Compression Scheme". In: IEEE Transactions on Communications 38 (1990), pp. 1917–1921.

[4] Karthik Gopalratnam and Diane J. Cook. "ActiveLeZi: An incremental parsing algorithm for sequential prediction". In: 16th Int. FLAIRS Conf. 2003, pp. 38–42.

[5] Philip Laird and Ronald Saul. "Discrete Sequence Prediction and Its Applications". In: Machine Learning 15 (1994), pp. 43–68.

[6] Jensen et al. "Non-stationary policy learning in 2-player zero sum games". In: Proc. of 20th Int. Conf. on AI. 2005, pp. 789–794.

[7] Ian Millington. "Artificial Intelligence for Games". Ed. by David H. Eberly. Morgan Kaufmann, 2006. Chap. Learning, pp. 583–590.

[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning Precise Timing with LSTM Recurrent Networks". In: JMLR 3 (2002), pp. 115–143.

[9] C. J. C. H. Watkins. "Learning from delayed rewards". PhD thesis. Cambridge, 1989.

[10] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning". In: Proc. of 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.
