“May the Best Man Win!” Simulation optimization for match-making in e-sports

Ilya O. Ryzhov¹, Awais Tariq², Warren B. Powell²

¹Robert H. Smith School of Business, University of Maryland, College Park, MD 20742
²Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544

INFORMS Annual Meeting, November 15, 2011
Outline
1 Introduction
2 TrueSkill™ model for learning skill levels
   Learning with moment-matching
   The DrawChance policy
3 Match-making with knowledge gradients
4 Moving on: Targeting and selection
5 Conclusions
Motivation: e-sports
The term “e-sports” refers to competitive multi-player online gaming
Thousands of players simultaneously log on to networks such as Xbox Live or Battle.net
Motivation: e-sports
Revenues of South Korean game company NCSoft, 2000-2004 (Huhh 2008).
E-sports have become culturally significant and very profitable
Top players from around the world compete professionally
In 2005, Xbox Live had over 2 million subscribers; Battle.net has over 3 million registered players for a single game
Ranking and competition in e-sports
Game services and outside organizations create rankings of players
Casual players are matched up automatically according to their skill level
Simulation optimization for match-making
We would like to create fair and challenging games by matching players of similar skill level
The TrueSkill™ system used by Xbox Live views this as a Bayesian learning problem in which we sequentially learn players’ skills
Unlike e.g. multi-armed bandit problems (Gittins 1989), the goal is to match a target rather than find the most skilled player
We compare a value-of-information procedure to the greedy policy used by Microsoft
Mathematical model
Player $i = 0, 1, \ldots, M$ has an underlying skill level $s_i$, unknown to the game master
Our uncertainty about $s_i$ is expressed as
$$s_i \sim \mathcal{N}\left(\mu_i^0, (\sigma_i^0)^2\right)$$
The performance of player $i$ in a game is expressed as
$$p_i \sim \mathcal{N}\left(s_i, \sigma_\varepsilon^2\right)$$
We assume that performances and skill levels are independent
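The generative model above can be sketched in a few lines of Python (a minimal illustration; the skill values and noise level are our own placeholders, not from the slides):

```python
import random

random.seed(42)  # reproducible illustration

def simulate_game(s_i, s_j, sigma_eps=1.0):
    """Draw noisy performances around the (hidden) true skills; higher performance wins."""
    p_i = random.gauss(s_i, sigma_eps)
    p_j = random.gauss(s_j, sigma_eps)
    return p_i > p_j  # True if player i wins

# A player whose true skill is 2 units higher wins most, but not all, games
wins = sum(simulate_game(2.0, 0.0) for _ in range(10_000))
```

The game master only ever observes the boolean outcome, never $p_i$ or $p_j$ themselves, which is what makes the inference problem below non-trivial.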
Non-conjugacy of Bayesian model
We say that player $i$ beats player $j$ if $p_i > p_j$ in a game between these players
Unfortunately, we never observe the exact values of $p_i$ or $p_j$, only which player won
Thus, the posterior belief
$$P(s_i \in ds \mid p_i > p_j) = \frac{P(p_i > p_j \mid s_i = s)\, P(s_i \in ds)}{P(p_i > p_j)}$$
is non-normal
Conjugacy is forced using moment-matching (Minka 2001): plug the mean and variance of the posterior into a normal distribution
Moment-matching for approximate conjugacy
Given the beliefs at time $n$, and the outcome of game $n+1$ between $i$ and $j$, update (Dangauthier et al. 2007)
$$\mu_i^{n+1} = \begin{cases} \mu_i^n + \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex] \mu_i^n - \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$

$$\left(\sigma_i^{n+1}\right)^2 = \begin{cases} (\sigma_i^n)^2 \left(1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right)\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex] (\sigma_i^n)^2 \left(1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right)\right) & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$

with $v(x) = \phi(x)/\Phi(x)$, $w(x) = v(x)\left(v(x) + x\right)$, and
$$\left(\bar\sigma_{ij}^n\right)^2 = (\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2.$$
Intuitively: increase our skill estimate for the winning player.
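The moment-matched update can be sketched directly from these formulas (a minimal illustration; the function name `update` and the default noise variance are our own choices, and the initial $(\mu, \sigma^2) = (25, 64)$ values are illustrative):

```python
from math import sqrt, erf, exp, pi

def phi(x):
    """Standard normal pdf."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def v(x):
    return phi(x) / Phi(x)

def w(x):
    return v(x) * (v(x) + x)

def update(mu_i, var_i, mu_j, var_j, i_won, var_eps=1.0):
    """One moment-matched update of player i's belief after a game against j."""
    c = sqrt(var_i + var_j + 2 * var_eps)          # sigma-bar
    d = (mu_i - mu_j) if i_won else (mu_j - mu_i)  # signed estimated skill gap
    sign = 1.0 if i_won else -1.0
    mu_new = mu_i + sign * (var_i / c) * v(d / c)
    var_new = var_i * (1 - (var_i / c ** 2) * w(d / c))
    return mu_new, var_new
```

For two equally-rated players, `update(25.0, 64.0, 25.0, 64.0, i_won=True)` raises the winner's mean and shrinks the variance, exactly the intuition stated on the slide.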
Choosing an opponent
In Dangauthier et al. (2007), a game between $i$ and $j$ ends in a draw if
$$|p_i - p_j| < \delta$$
for some small $\delta > 0$.
The draw probability
After $n$ games, our prediction that the $(n+1)$st game will end in a draw is
$$P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta\right) = \mathbb{E}^n\, P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right).$$
For very small $\delta$,
$$P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right) \approx \frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\, \delta.$$
We take $\delta \to 0$ and define the draw probability as
$$q_{ij}^n = \mathbb{E}^n\left[\frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\right].$$
Choosing an opponent
Thus, the “probability” of a draw between players $i$ and $j$ is
$$q_{ij}^n = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2\right)}\right).$$
We expect the game to be more competitive when this quantity is higher.
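The closed form above translates directly into code (a sketch; the function name and default `var_eps` are our own):

```python
from math import sqrt, exp, pi

def draw_probability(mu_i, var_i, mu_j, var_j, var_eps=1.0):
    """q_ij^n: predicted-draw density under the current beliefs about i and j."""
    total = var_i + var_j + 2 * var_eps  # total predictive variance of p_i - p_j
    return (1 / sqrt(2 * pi * total)) * exp(-(mu_i - mu_j) ** 2 / (2 * total))
```

As expected, the quantity is symmetric in the two players and largest when the belief means coincide.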
Connection to DrawChance
The DrawChance formula given in Herbrich et al. (2006) is
$$\tilde q_{ij}^n = \sqrt{\frac{2\sigma_\varepsilon^2}{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2\right)}\right),$$
which is identical to $q_{ij}^n$ up to a constant scale factor
The DrawChance policy used by Xbox Live greedily selects the match-up with the highest draw probability
Simulation optimization for match-making
We interpret the match-making problem as online simulation optimization with the objective
$$\sup_\pi \sum_{n=0}^{N} q^n_{0,\, X^\pi(\mu^n, \sigma^n)},$$
where $\pi$ is a policy for choosing opponents for a fixed player 0
The concept of value of information (Chick 2006) looks ahead to the outcome of the next decision
This approach can be adapted to many types of objective functions (Frazier et al. 2008; Ryzhov & Powell 2011a)
Prediction of game outcome
Our Bayesian beliefs provide us with an (approximate) estimate of the outcome of a hypothetical game between $i$ and $j$:

Proposition
Under the normality assumption, the probability that player $i$ beats player $j$ in game $n+1$ is given by
$$P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \Phi\left(\frac{\mu_i^n - \mu_j^n}{\sqrt{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\right).$$
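The proposition gives a one-line predictive win probability (a sketch; naming and default `var_eps` are our own):

```python
from math import sqrt, erf

def win_probability(mu_i, var_i, mu_j, var_j, var_eps=1.0):
    """P^n(player i beats player j), per the proposition above."""
    z = (mu_i - mu_j) / sqrt(var_i + var_j + 2 * var_eps)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf Phi(z)
```

Sanity checks: equally-rated players win with probability 0.5, and the probabilities for the two players always sum to one.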
Proof.
We compute
$$P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \mathbb{E}^n\, \Phi\left(\frac{s_i - s_j}{\sqrt{2\sigma_\varepsilon^2}}\right)$$
$$= \int_{-\infty}^{\infty} \Phi\left(\frac{x}{\sqrt{2\sigma_\varepsilon^2}}\right) \frac{1}{\sqrt{2\pi\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)}}\, e^{-\frac{\left(x - \left(\mu_i^n - \mu_j^n\right)\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)}}\, dx$$
and recast the last line as $P(X \le Y)$ where $X \sim \mathcal{N}\left(0, 2\sigma_\varepsilon^2\right)$ and $Y \sim \mathcal{N}\left(\mu_i^n - \mu_j^n, (\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)$ are independent.
Value of information in match-making
Let
$$k^{w,n+1} = \left(\mu^{w,n+1}, \sigma^{w,n+1}\right), \qquad k^{l,n+1} = \left(\mu^{l,n+1}, \sigma^{l,n+1}\right)$$
be the beliefs that we would have at time $n+1$ if player 0 wins (or loses) against $j$
Similarly, let $q_{0i}^{w,n+1}$ (or $q_{0i}^{l,n+1}$) be the draw probabilities if player 0 wins (or loses)
The greedy policy would arrange the next game by computing
$$F^{w,n+1} = \max_i q_{0i}^{w,n+1}, \qquad F^{l,n+1} = \max_i q_{0i}^{l,n+1},$$
depending on what happens now
Value of information in match-making
If we stop learning after the next game, the optimal match-up is
$$X^n = \arg\max_j\; q_{0j}^n + (N - n)\, F_j^n,$$
where
$$F_j^n = P^n\left(0 \text{ beats } j\right) F^{w,n+1} + P^n\left(j \text{ beats } 0\right) F^{l,n+1}$$
is the expected value (pre-game) of the highest draw probability (post-game)
If the total number $N$ of games is unknown, use
$$X^n = \arg\max_j\; q_{0j}^n + \frac{\gamma}{1 - \gamma}\, F_j^n,$$
where $\gamma$ is tunable
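The finite-horizon rule above can be sketched as follows (a simplified illustration: for brevity only player 0's belief is updated in the look-ahead, whereas the full model would also update opponent $j$'s belief, and all parameter values are our own placeholders):

```python
from math import sqrt, erf, exp, pi

def Phi(x): return 0.5 * (1 + erf(x / sqrt(2)))
def phi(x): return exp(-x * x / 2) / sqrt(2 * pi)
def v(x): return phi(x) / Phi(x)
def w(x): return v(x) * (v(x) + x)

def q(mu_i, var_i, mu_j, var_j, var_eps):
    """Draw probability q_ij under the current beliefs."""
    total = var_i + var_j + 2 * var_eps
    return exp(-(mu_i - mu_j) ** 2 / (2 * total)) / sqrt(2 * pi * total)

def update(mu_i, var_i, mu_j, var_j, i_won, var_eps):
    """Moment-matched update of player i's belief (as on the earlier slide)."""
    c = sqrt(var_i + var_j + 2 * var_eps)
    d = (mu_i - mu_j) if i_won else (mu_j - mu_i)
    sign = 1.0 if i_won else -1.0
    return (mu_i + sign * (var_i / c) * v(d / c),
            var_i * (1 - (var_i / c ** 2) * w(d / c)))

def choose_opponent(mu, var, n, N, var_eps=1.0):
    """X^n = argmax_j q_0j^n + (N - n) * F_j^n for fixed player 0."""
    best_j, best_val = None, -float("inf")
    for j in range(1, len(mu)):
        # Player 0's post-game belief in the win and loss scenarios
        mu_w, var_w = update(mu[0], var[0], mu[j], var[j], True, var_eps)
        mu_l, var_l = update(mu[0], var[0], mu[j], var[j], False, var_eps)
        # Best post-game draw probability in each scenario
        F_w = max(q(mu_w, var_w, mu[i], var[i], var_eps) for i in range(1, len(mu)))
        F_l = max(q(mu_l, var_l, mu[i], var[i], var_eps) for i in range(1, len(mu)))
        p_win = Phi((mu[0] - mu[j]) / sqrt(var[0] + var[j] + 2 * var_eps))
        F_j = p_win * F_w + (1 - p_win) * F_l
        val = q(mu[0], var[0], mu[j], var[j], var_eps) + (N - n) * F_j
        if val > best_val:
            best_j, best_val = j, val
    return best_j
```

With one closely-matched opponent and one much stronger one, the rule selects the closely-matched player, as the immediate draw-probability term dominates.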
Experimental results: draw probabilities
In simulations, our method behaved more aggressively than DrawChance...
Experimental results: difference in true skills
...pursued tougher opponents early on, but found better matches later...
Experimental results: errors of estimates
...produced better estimates of player 0’s true skill...
Experimental results: win/loss ratios
...and came closer to a 0.5 win/loss ratio.
...but that’s not the end!
In simulation optimization, we might tune a simulator to see how performance of a system could be improved
But before the simulator can be optimized, we need to make sure that it is a good model of reality
Targeting and selection: which simulation model most closely matches data from the field?
Targeting and selection
Let $c$ be a deterministic target (e.g. average historical performance) and consider $M$ simulation models
The mean output $s_i$ of model $i$ matches the target if
$$|s_i - c| < \delta$$
We can simulate system $i$ to obtain a noisy observation
$$p_i \sim \mathcal{N}\left(s_i, \sigma_\varepsilon^2\right),$$
and apply Bayesian updating with no moment-matching required
The “draw probability” in this context is given by
$$q_i^n = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{(\sigma_i^n)^2 + \sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - c\right)^2}{2\left((\sigma_i^n)^2 + \sigma_\varepsilon^2\right)}\right).$$
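The single-target analogue of the draw probability is even simpler in code (a sketch; the function name is our own):

```python
from math import sqrt, exp, pi

def target_match_density(mu_i, var_i, c, var_eps=1.0):
    """q_i^n: how closely model i's mean output is believed to match target c."""
    total = var_i + var_eps
    return exp(-(mu_i - c) ** 2 / (2 * total)) / sqrt(2 * pi * total)
```

The quantity is largest when the belief mean sits on the target and the belief is precise, mirroring the two-player case with one player replaced by the deterministic target.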
The value of information
Our goal is to maximize
$$\sup_\pi\; \mathbb{E}\max_i q_i^N,$$
or its online equivalent, if we are refining an existing simulator
Bayesian analysis tells us that, conditional on our beliefs at time $n$,
$$\mu_i^{n+1} \sim \mathcal{N}\left(\mu_i^n, (\tilde\sigma_i^n)^2\right), \qquad \text{where } (\tilde\sigma_i^n)^2 = (\sigma_i^n)^2 - \left(\sigma_i^{n+1}\right)^2$$
The knowledge gradient approach simulates the system
$$X^n = \arg\max_i\; \mathbb{E}_i^n \max_j q_j^{n+1},$$
which is expected to yield the best result after the simulation
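The expectation $\mathbb{E}_i^n \max_j q_j^{n+1}$ can also be approximated by Monte Carlo sampling from the predictive distribution above, which is a useful sanity check on any closed-form implementation (a sketch with illustrative values; `kg_value` is our own helper name):

```python
import random
from math import sqrt, exp, pi

def q(mu, var, c, var_eps):
    """Target-match density q_i under a belief (mu, var)."""
    total = var + var_eps
    return exp(-(mu - c) ** 2 / (2 * total)) / sqrt(2 * pi * total)

def kg_value(i, mu, var, var_next, c, var_eps=1.0, samples=5000, seed=0):
    """Monte Carlo estimate of E_i^n max_j q_j^{n+1} when simulating model i."""
    rng = random.Random(seed)
    sigma_tilde = sqrt(var[i] - var_next[i])  # predictive std of mu_i^{n+1}
    total = 0.0
    for _ in range(samples):
        mu_i_next = rng.gauss(mu[i], sigma_tilde)
        # Only model i's belief changes; the others keep their time-n beliefs
        best = max(
            q(mu_i_next if j == i else mu[j],
              var_next[i] if j == i else var[j], c, var_eps)
            for j in range(len(mu))
        )
        total += best
    return total / samples
```

The knowledge gradient policy would then simulate the model $i$ with the largest `kg_value`.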
Issues for further work
The quantity $\mathbb{E}_i^n \max_j q_j^{n+1}$ can be computed in closed form
If $i$ is believed to be suboptimal with high precision, one simulation may yield no information (see also Ryzhov & Powell 2011b)
Conclusions
We have studied online match-making through the framework of online optimal learning
In simulations, a look-ahead policy offers some improvement over a greedy policy
The formulation of the problem has interesting implications for future work in simulation optimization (Ryzhov 2011)
References
Chick, S.E. (2006) “Subjective probability and Bayesian methodology.” In Handbooks of Operations Research and Management Science 13, 225–258.
Dangauthier, P., Herbrich, R., Minka, T. & Graepel, T. (2007) “TrueSkill through time: revisiting the history of chess.” In Advances in Neural Information Processing Systems 20, 337–344.
Frazier, P.I., Powell, W. & Dayanik, S. (2008) “A knowledge-gradient policy for sequential information collection.” SIAM J. on Control and Optimization 47:5, 2410–2439.
Gittins, J. (1989) Multi-armed bandit allocation indices. John Wiley and Sons.
Herbrich, R., Minka, T. & Graepel, T. (2006) “TrueSkill™: a Bayesian skill rating system.” In Advances in Neural Information Processing Systems 19, 569–576.
Huhh, J. (2008) “Culture and business of PC bangs in Korea.” Games and Culture 3:1, 26–37.
Minka, T. (2001) “A family of algorithms for approximate Bayesian inference.” Ph.D. thesis, MIT.
Ryzhov, I.O. (2011) “Targeting and selection: a new approach to simulation validation.” In preparation.
Ryzhov, I.O. & Powell, W.B. (2011a) “Information collection on a graph.” Operations Research 59:1, 188–201.
Ryzhov, I.O. & Powell, W.B. (2011b) “The value of information in multi-armed bandits with exponentially distributed rewards.” Proceedings of the 2011 International Conference on Computational Science, 1363–1372.