
Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games

Richard Mealing and Jonathan L. Shapiro
{mealingr,jls}@cs.man.ac.uk

Machine Learning and Optimisation Group
School of Computer Science

University of Manchester, UK


The Problem

You play against an opponent

The opponent’s actions are based on previous actions

How can you maximise your reward?

Applications

Heads-up poker
Auctions
P2P networking
Path finding
etc.


Possible Approaches

You could use reinforcement learning to learn to take actions with high expected discounted rewards

However, we propose to:

Model the opponent using sequence prediction methods
Lookahead and take actions which probabilistically, according to the opponent model, lead to the highest reward

Which approach gives us the highest rewards?


Opponent Modelling using Sequence Prediction

Observe the opponent's action and the player's action $(a_{\mathrm{opp}}, a)$

Form a sequence over time $t$ (memory size $n$):

$(a^t_{\mathrm{opp}}, a^t),\, (a^{t-1}_{\mathrm{opp}}, a^{t-1}),\, \ldots,\, (a^{t-n+1}_{\mathrm{opp}}, a^{t-n+1})$

Predict the opponent's next action based on this sequence:

$\Pr\!\big(a^{t+1}_{\mathrm{opp}} \mid (a^t_{\mathrm{opp}}, a^t),\, (a^{t-1}_{\mathrm{opp}}, a^{t-1}),\, \ldots,\, (a^{t-n+1}_{\mathrm{opp}}, a^{t-n+1})\big)$
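As a concrete (if minimal) sketch of such a model, a frequency table over the last $n$ joint actions could be maintained as below; the class name, uniform fallback and update scheme are our own illustrative choices, not one of the predictors compared later:

```python
from collections import Counter, defaultdict, deque

class JointActionPredictor:
    """Sketch of a memory-n opponent model: count which opponent action follows
    each context of the last n (opponent action, player action) pairs."""

    def __init__(self, memory_size, opp_actions):
        self.memory = deque(maxlen=memory_size)   # last n joint actions (a_opp, a)
        self.counts = defaultdict(Counter)        # context -> counts of the next opponent action
        self.opp_actions = opp_actions

    def observe(self, a_opp, a):
        """Record that `a_opp` followed the current context, then slide the window."""
        if len(self.memory) == self.memory.maxlen:
            self.counts[tuple(self.memory)][a_opp] += 1
        self.memory.append((a_opp, a))

    def predict(self, context=None):
        """Return Pr(next opponent action | context); uniform if the context is unseen."""
        context = tuple(self.memory) if context is None else tuple(context)
        counter = self.counts.get(context)
        if not counter:
            return {b: 1.0 / len(self.opp_actions) for b in self.opp_actions}
        total = sum(counter.values())
        return {b: counter[b] / total for b in self.opp_actions}

# e.g. iterated Prisoner's Dilemma with memory size 1
model = JointActionPredictor(memory_size=1, opp_actions=("C", "D"))
for a_opp, a in [("C", "C"), ("C", "D"), ("D", "D"), ("D", "C"), ("C", "C")]:
    model.observe(a_opp, a)
print(model.predict())   # distribution over the opponent's next action given the last pair
```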


Sequence Prediction Methods

We tested a variety of sequence prediction methods...

Lempel-Ziv-1978 (LZ78) [1]

Knuth-Morris-Pratt (KMP) [2] (unbounded contexts)

Prediction by Partial Matching C (PPMC) [3]

ActiveLeZi [4] (context blending)

Transition Directed Acyclic Graph (TDAG) [5]

Entropy Learned Pruned Hypothesis Space (ELPH) [6] (context pruning)

N-Gram [7]

Hierarchical N-Gram (H. N-Gram) [7] (collection of 1- to N-Grams)

Long Short Term Memory (LSTM) [8] (implicit blending & pruning)


Sequence Prediction Method Lookahead

Predict k steps ahead given a hypothesised context, i.e.

$\Pr\!\big(a^{t+k}_{\mathrm{opp}} \mid (a^{t+k-1}_{\mathrm{opp}}, a^{t+k-1}),\, (a^{t+k-2}_{\mathrm{opp}}, a^{t+k-2}),\, \ldots,\, (a^{t+k-n}_{\mathrm{opp}}, a^{t+k-n})\big)$

A hypothesised context may contain unobserved (predicted) symbols
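A sketch of how such a k-step prediction could be rolled out is given below; committing to the most probable opponent action at each intermediate step, the fixed player plan, and the toy tit-for-tat-style probabilities are all illustrative assumptions rather than the paper's method:

```python
def predict_k_ahead(predict, context, player_plan):
    """Sketch: predict the opponent's action k = len(player_plan) steps ahead.

    predict(context) -> dict mapping opponent actions to probabilities.
    context is a tuple of (a_opp, a) pairs; intermediate opponent actions are
    themselves predictions, so the hypothesised context contains unobserved symbols.
    """
    for a in player_plan[:-1]:
        dist = predict(context)
        a_opp_hat = max(dist, key=dist.get)          # commit to the most likely opponent action
        context = context[1:] + ((a_opp_hat, a),)    # slide the window with the hypothesised pair
    return predict(context)                          # distribution over a_opp at step t+k

# Tiny hand-built model (hypothetical numbers) for a tit-for-tat-like opponent, memory size 1
toy_model = {
    (("C", "C"),): {"C": 0.9, "D": 0.1},
    (("C", "D"),): {"C": 0.1, "D": 0.9},
    (("D", "C"),): {"C": 0.9, "D": 0.1},
    (("D", "D"),): {"C": 0.1, "D": 0.9},
}
predict = lambda ctx: toy_model.get(ctx, {"C": 0.5, "D": 0.5})

# Two-step lookahead: if we plan to play D then C, what will the opponent do at t+2?
print(predict_k_ahead(predict, (("C", "C"),), player_plan=("D", "C")))
```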


Reinforcement Learning: Q-Learning

Learns an action-value function that, given a state-action pair $(s, a)$, outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]

$Q(\overbrace{s^t}^{\text{State}}, \overbrace{a^t}^{\text{Action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{Learning rate}})\, Q(s^t, a^t)}_{\text{fraction of old value}} + \underbrace{\alpha\big[\overbrace{r^t}^{\text{Reward}} + \overbrace{\gamma}^{\text{Discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\big]}_{\text{fraction of reward \& next max-valued action}}$

Select actions with high Q-values, with some exploration
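Written as code, the tabular update and an ε-greedy action choice might look like this (a sketch; the parameter values are placeholders):

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount, exploration rate
Q = defaultdict(float)                      # Q[(state, action)], initialised to 0

def q_update(s, a, r, s_next, actions):
    """One tabular step: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def select_action(s, actions):
    """Epsilon-greedy: mostly the highest-valued action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```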


Need for Lookahead (Prisoner’s Dilemma Example)

Payoff matrix (row player's payoff listed first):

      D     C
D   1,1   4,0
C   0,4   3,3


Need for Lookahead (Prisoner’s Dilemma Example)

Defect is the dominant action

Cooperate-Cooperate is socially optimal (highest sum of rewards)

Tit-for-tat (copy opponent’s last move) is good for iterated play

Can we learn tit-for-tat?


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: depth-1 lookahead tree. If the opponent is predicted to play C, playing D yields 4 and playing C yields 3; if the opponent is predicted to play D, playing D yields 1 and playing C yields 0.]

Payoff matrix (row player's payoff listed first):

      D     C
D   1,1   4,0
C   0,4   3,3


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: the same depth-1 lookahead tree and payoff matrix as on the previous slide.]

Lookahead of 1 shows D has highest reward

With a lookahead of 2, the sequence (D,C,D,C) has the highest total reward (though it is unlikely to occur)

Assume the opponent copies the player’s last move (i.e. tit-for-tat)


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: depth-2 lookahead tree with the same payoff matrix, assuming a tit-for-tat opponent. From a predicted C: playing D then D totals 5, D then C totals 4, C then D totals 7, C then C totals 6. From a predicted D: D then D totals 2, D then C totals 1, C then D totals 4, C then C totals 3.]


Need for Lookahead (Prisoner’s Dilemma Example)

[Figure: the same depth-2 lookahead tree against tit-for-tat as on the previous slide.]

Lookahead of 2 against tit-for-tat shows C has highest reward
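Spelling out the totals from the tree (starting from a predicted C, with the opponent copying our move):

$D \text{ first: } 4 + \max(1, 0) = 5, \qquad C \text{ first: } 3 + \max(4, 3) = 7,$

so cooperating now and defecting on the next step gives the larger two-step total.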


Q-Learning’s Implicit Lookahead

$Q(\overbrace{s^t}^{\text{State}}, \overbrace{a^t}^{\text{Action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{Learning rate}})\, Q(s^t, a^t)}_{\text{fraction of old value}} + \underbrace{\alpha\big[\overbrace{r^t}^{\text{Reward}} + \overbrace{\gamma}^{\text{Discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\big]}_{\text{fraction of reward \& next max-valued action}}$

Assume each state is an opponent action, i.e. $s = a_{\mathrm{opp}}$

Learns (player action, opponent action) values as:

$\gamma = 0$: payoff matrix ($\arg\max_a Q(a^{t+1}_{\mathrm{opp}}, a)$ is the same as a lookahead of 1)
$0 < \gamma < 1$: payoff matrix + future rewards with exponential decay
$\gamma = 1$: payoff matrix + future rewards

Increasing γ increases lookahead
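For example (our own rewriting of the update above), with $s^t = a^t_{\mathrm{opp}}$ and $\gamma = 0$ the update reduces to a running average of the immediate payoff,

$Q(a_{\mathrm{opp}}, a) \leftarrow (1-\alpha)\, Q(a_{\mathrm{opp}}, a) + \alpha\, r^t \approx \mathbb{E}[\, r \mid a, a_{\mathrm{opp}} \,],$

so $\arg\max_a Q(a^{t+1}_{\mathrm{opp}}, a)$ is just the greedy one-step response to the predicted opponent action.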


Exhaustive Explicit Lookahead

We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to a limited depth) maximising total reward

[Figure: exhaustive depth-2 lookahead tree for the Prisoner's Dilemma, expanding the player's and the opponent's actions at each level, with cumulative rewards at the leaves and the same payoff matrix as before.]
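A minimal sketch of this search is shown below; `predict` (the opponent model) and `payoff` are assumed interfaces, and here the lookahead takes an expectation over the opponent model rather than committing to a single predicted action:

```python
def lookahead_value(context, depth, predict, payoff, actions):
    """Sketch: best expected total reward achievable within `depth` steps,
    weighting the opponent's replies by the opponent model."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:                                   # our candidate action
        dist = predict(context)                         # Pr(a_opp | context)
        value = 0.0
        for a_opp, p in dist.items():                   # expectation over opponent replies
            new_context = context[1:] + ((a_opp, a),)   # hypothesised next context
            value += p * (payoff[(a, a_opp)]
                          + lookahead_value(new_context, depth - 1, predict, payoff, actions))
        best = max(best, value)
    return best

def best_action(context, depth, predict, payoff, actions):
    """Greedy root choice: the action whose subtree has the highest expected total reward."""
    def root_value(a):
        dist = predict(context)
        return sum(p * (payoff[(a, a_opp)]
                        + lookahead_value(context[1:] + ((a_opp, a),), depth - 1,
                                          predict, payoff, actions))
                   for a_opp, p in dist.items())
    return max(actions, key=root_value)

# Prisoner's Dilemma payoffs from the slides and a toy tit-for-tat-like model (memory size 1)
payoff = {("D", "D"): 1, ("D", "C"): 4, ("C", "D"): 0, ("C", "C"): 3}
toy_model = {ctx: ({"C": 0.9, "D": 0.1} if ctx[0][1] == "C" else {"C": 0.1, "D": 0.9})
             for ctx in [((o, a),) for o in "CD" for a in "CD"]}
predict = lambda ctx: toy_model.get(ctx, {"C": 0.5, "D": 0.5})
print(best_action((("C", "C"),), depth=2, predict=predict, payoff=payoff, actions=("C", "D")))
```

With these payoffs and the toy tit-for-tat-like model, a depth of 2 selects C, matching the earlier worked example.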


Experiments

Iterated Rock-Paper-Scissors

Opponent's actions depend on its previous actions

Iterated Prisoner’s Dilemma

Opponent's actions depend on both players' previous actions

Littman’s Soccer [10]

Direct competition

Which approach has better performance?

Rock-Paper-Scissors payoffs:

       R      P      S
R    0,0   -1,1   1,-1
P   1,-1    0,0   -1,1
S   -1,1   1,-1    0,0

Prisoner's Dilemma payoffs:

      D     C
D   1,1   4,0
C   0,4   3,3


Iterated Rock Paper Scissors

Entries are Avg Payoff (Avg Time); within each cell, agents are listed from best to worst.

Opponent order 1 (repeating R, P, S):
Memory size 1: ELPH 1 ± 0 (14.7 ± 0.6); WoLF-PHC 1 ± 0 (27 ± 2); PGA-APP 0.973 ± 0.009 (24 ± 2); ɛ Q-Learner 0.97 ± 0.01 (29 ± 2); WPL 0.87 ± 0.01 (74 ± 6)
Memory size 2: ELPH 1 ± 0 (10 ± 0); WoLF-PHC 0.98 ± 0.008 (91 ± 3); ɛ Q-Learner 0.97 ± 0.01 (28 ± 2); PGA-APP 0.92 ± 0.01 (52 ± 3); WPL 0.65 ± 0.02 (105 ± 7)
Memory size 3: ELPH 1 ± 0 (10 ± 0); WoLF-PHC 0.95 ± 0.01 (181 ± 6); ɛ Q-Learner 0.94 ± 0.01 (37 ± 4); PGA-APP 0.9 ± 0.02 (144 ± 6); WPL 0.63 ± 0.01 (98 ± 6)

Opponent order 2 (repeating R, R, P, P, S, S):
Memory size 1: WoLF-PHC 0.645 ± 0.006 (89 ± 5); PGA-APP 0.644 ± 0.008 (59 ± 5); ɛ Q-Learner 0.635 ± 0.008 (22 ± 3); ELPH 0.617 ± 0.002 (210 ± 0); WPL 0.374 ± 0.007 (143 ± 7)
Memory size 2: ELPH 1 ± 0 (10 ± 0); ɛ Q-Learner 0.92 ± 0.01 (45 ± 4); WoLF-PHC 0.91 ± 0.01 (147 ± 8); PGA-APP 0.86 ± 0.01 (109 ± 6); WPL 0.54 ± 0.01 (71 ± 6)
Memory size 3: ELPH 1 ± 0 (17.2 ± 0.7); WoLF-PHC 0.89 ± 0.01 (205 ± 3); ɛ Q-Learner 0.87 ± 0.01 (71 ± 5); PGA-APP 0.87 ± 0.01 (179 ± 6); WPL 0.69 ± 0.01 (208 ± 2)

Opponent order 3 (repeating R, R, R, P, P, P, S, S, S):
Memory size 1: ELPH 0.666 ± 0.0003 (56 ± 4); PGA-APP 0.652 ± 0.005 (62 ± 4); WoLF-PHC 0.646 ± 0.004 (71 ± 4); ɛ Q-Learner 0.582 ± 0.008 (48 ± 6); WPL 0.393 ± 0.008 (139 ± 7)
Memory size 2: WoLF-PHC 0.68 ± 0.01 (173 ± 6); ɛ Q-Learner 0.64 ± 0.01 (56 ± 5); PGA-APP 0.61 ± 0.01 (120 ± 7); ELPH 0.6 ± 0.002 (58 ± 4); WPL 0.375 ± 0.009 (139 ± 7)
Memory size 3: ELPH 1 ± 0 (16.3 ± 0.7); WoLF-PHC 0.85 ± 0.01 (210 ± 0); ɛ Q-Learner 0.84 ± 0.01 (84 ± 6); PGA-APP 0.77 ± 0.01 (198 ± 3); WPL 0.76 ± 0.01 (210 ± 0)

Agents cannot learn best response with memory size < model order

Our approach gains the highest payoffs at generally the fastest rates


Iterated Prisoner’s Dilemma

Entries are Avg Payoff, Avg Time, Position; within each cell, agents are listed from best to worst.

Discount = 0 and Depth = 1:
Memory size 1: PGA-APP 2.03 ± 0.01, 30 ± 3, 13; ɛ Q-Learner 1.94 ± 0.01, 30 ± 4, 16; WPL 1.932 ± 0.007, 20 ± 1, 17; TDAG 1.93 ± 0.01, 30 ± 2, 16; WoLF-PHC 1.89 ± 0.01, 20 ± 2, 18
Memory size 2: PGA-APP 2.01 ± 0.01, 30 ± 4, 14; WPL 1.949 ± 0.008, 20 ± 1, 17; WoLF-PHC 1.92 ± 0.01, 30 ± 4, 17; TDAG 1.902 ± 0.008, 20 ± 2, 16; ɛ Q-Learner 1.822 ± 0.007, 20 ± 2, 18
Memory size 3: ɛ Q-Learner 2.02 ± 0.01, 30 ± 3, 14; TDAG 1.958 ± 0.008, 20 ± 3, 17; WPL 1.945 ± 0.009, 20 ± 3, 17; PGA-APP 1.92 ± 0.009, 20 ± 2, 16; WoLF-PHC 1.773 ± 0.007, 20 ± 1, 18

Discount = 0.99 and Depth = 2:
Memory size 1: ɛ Q-Learner 2.68 ± 0.01, 180 ± 5, 1; TDAG + Q-Learner 2.63 ± 0.01, 60 ± 4, 1; TDAG 2.607 ± 0.008, 20 ± 1, 1; WPL 2.31 ± 0.01, 30 ± 4, 12; PGA-APP 2.17 ± 0.02, 30 ± 3, 13; WoLF-PHC 2.1 ± 0.02, 40 ± 5, 13
Memory size 2: TDAG + Q-Learner 2.828 ± 0.009, 120 ± 6, 1; ɛ Q-Learner 2.74 ± 0.01, 180 ± 5, 1; TDAG 2.72 ± 0.01, 20 ± 1, 1; WPL 2.34 ± 0.01, 40 ± 4, 12; PGA-APP 2.18 ± 0.02, 40 ± 5, 13; WoLF-PHC 2.14 ± 0.01, 30 ± 3, 13
Memory size 3: TDAG + Q-Learner 2.847 ± 0.009, 130 ± 5, 1; TDAG 2.74 ± 0.01, 30 ± 3, 1; ɛ Q-Learner 2.65 ± 0.01, 170 ± 5, 1; WPL 2.32 ± 0.01, 30 ± 4, 12; PGA-APP 2.18 ± 0.02, 40 ± 4, 12; WoLF-PHC 2.14 ± 0.02, 40 ± 4, 13

Increasing lookahead (discounting, search depth) increases rewards

Our approach + Q-Learning increases rewards but also increases time

Our approach gains the highest payoffs at generally the fastest rates


Soccer

Avg Payoff of our approach with each predictor against each opponent; predictors are listed from best to worst.

Against ɛ Q-Learner: PPMC 0.687 ± 0.006; LSTM 0.635 ± 0.004; TDAG 0.63 ± 0.004; H. N-Gram 0.628 ± 0.003; LZ78 0.621 ± 0.004; N-Gram 0.62 ± 0.003; ActiveLeZi 0.618 ± 0.003; ELPH 0.601 ± 0.004; FP 0.536 ± 0.003; KMP 0.524 ± 0.002

Against WoLF-PHC: PPMC 0.701 ± 0.006; LSTM 0.638 ± 0.005; FP 0.637 ± 0.004; N-Gram 0.614 ± 0.003; H. N-Gram 0.612 ± 0.003; ActiveLeZi 0.606 ± 0.004; TDAG 0.606 ± 0.004; LZ78 0.602 ± 0.004; ELPH 0.576 ± 0.003; KMP 0.564 ± 0.003

Against WPL: PPMC 0.717 ± 0.004; H. N-Gram 0.674 ± 0.002; N-Gram 0.665 ± 0.001; LSTM 0.659 ± 0.003; TDAG 0.659 ± 0.002; FP 0.655 ± 0.003; LZ78 0.653 ± 0.002; ActiveLeZi 0.651 ± 0.002; ELPH 0.637 ± 0.002; KMP 0.62 ± 0.002

Against PGA-APP: PPMC 0.648 ± 0.006; H. N-Gram 0.608 ± 0.003; ActiveLeZi 0.599 ± 0.004; FP 0.593 ± 0.003; TDAG 0.589 ± 0.004; LSTM 0.585 ± 0.004; N-Gram 0.582 ± 0.003; LZ78 0.574 ± 0.003; ELPH 0.565 ± 0.003; KMP 0.553 ± 0.003

Our approach wins above 50% of the games using any predictor

PPMC has the highest performance


Conclusions

We proposed sequence prediction and lookahead to accurately model and effectively respond to opponents with memory

Empirical results show given sufficient memory and lookahead ourapproach outperforms reinforcement learning algorithms


Future Work

Will apply our approach to domains with:

Larger state spaces
Hidden information

Where the challenges are:

Deeper lookahead (e.g. sampling techniques)
Sequence predictor configuration (e.g. 1 predictor per state)


References

[1] Lempel and Ziv. "Compression of Individual Sequences via Variable-Rate Coding". 1978.

[2] Byron Knoll. "Text Prediction and Classification Using String Matching". 2009.

[3] Alistair Moffat. "Implementing the PPM Data Compression Scheme". In: IEEE Transactions on Communications 38 (1990), pp. 1917–1921.

[4] Karthik Gopalratnam and Diane J. Cook. "ActiveLeZi: An incremental parsing algorithm for sequential prediction". In: 16th Int. FLAIRS Conf. 2003, pp. 38–42.

[5] Philip Laird and Ronald Saul. "Discrete Sequence Prediction and Its Applications". In: Machine Learning 15 (1994), pp. 43–68.

[6] Jensen et al. "Non-stationary policy learning in 2-player zero sum games". In: Proc. of 20th Int. Conf. on AI. 2005, pp. 789–794.

[7] Ian Millington. "Artificial Intelligence for Games". Ed. by David H. Eberly. Morgan Kaufmann, 2006. Chap. Learning, pp. 583–590.

[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning Precise Timing with LSTM Recurrent Networks". In: JMLR 3 (2002), pp. 115–143.

[9] C. J. C. H. Watkins. "Learning from delayed rewards". PhD thesis. Cambridge, 1989.

[10] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning". In: Proc. of 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.
