“May the Best Man Win!” Simulation optimization for match-making in e-sports

Ilya O. Ryzhov¹, Awais Tariq², Warren B. Powell²

¹Robert H. Smith School of Business, University of Maryland, College Park, MD 20742
²Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544

INFORMS Annual Meeting, November 15, 2011
Outline
1 Introduction
2 TrueSkill™ model for learning skill levels
   Learning with moment-matching
   The DrawChance policy
3 Match-making with knowledge gradients
4 Moving on: Targeting and selection
5 Conclusions
Motivation: e-sports
The term “e-sports” refers to competitive multi-player online gaming
Thousands of players simultaneously log on to networks such as Xbox Live or Battle.net
Motivation: e-sports
Revenues of South Korean game company NCSoft, 2000-2004 (Huhh 2008).
E-sports have become culturally significant and very profitable
Top players from around the world compete professionally
In 2005, Xbox Live had over 2 million subscribers; Battle.net has over 3 million registered players for a single game
Ranking and competition in e-sports
Game services and outside organizations create rankings of players
Casual players are matched up automatically according to their skill level
Simulation optimization for match-making
We would like to create fair and challenging games by matching players of similar skill level
The TrueSkill™ system used by Xbox Live views this as a Bayesian learning problem in which we sequentially learn players’ skills
Unlike e.g. multi-armed bandit problems (Gittins 1989), the goal is to match a target rather than find the most skilled player
We compare a value-of-information procedure to the greedy policy used by Microsoft
Mathematical model
Player $i = 0, 1, \ldots, M$ has an underlying skill level $s_i$, unknown to the game master
Our uncertainty about $s_i$ is expressed as
$$s_i \sim \mathcal{N}\left(\mu_i^0, (\sigma_i^0)^2\right)$$
The performance of player $i$ in a game is expressed as
$$p_i \sim \mathcal{N}\left(s_i, \sigma_\varepsilon^2\right)$$
We assume that performances and skill levels are independent
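The generative model above can be sketched in a few lines of Python (a minimal illustration; the skill values and noise level are our own placeholders, not from the slides):

```python
import random

random.seed(42)  # reproducible illustration

def simulate_game(s_i, s_j, sigma_eps=1.0):
    """Draw noisy performances around the (hidden) true skills; higher performance wins."""
    p_i = random.gauss(s_i, sigma_eps)
    p_j = random.gauss(s_j, sigma_eps)
    return p_i > p_j  # True if player i wins

# A player whose true skill is 2 units higher wins most, but not all, games
wins = sum(simulate_game(2.0, 0.0) for _ in range(10_000))
```

The game master only ever observes the boolean outcome, never $p_i$ or $p_j$ themselves, which is what makes the inference problem below non-trivial.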
Non-conjugacy of Bayesian model
We say that player $i$ beats player $j$ if $p_i > p_j$ in a game between these players
Unfortunately, we never observe the exact values of $p_i$ or $p_j$, only which player won
Thus, the posterior belief
$$P(s_i \in ds \mid p_i > p_j) = \frac{P(p_i > p_j \mid s_i = s)\, P(s_i \in ds)}{P(p_i > p_j)}$$
is non-normal
Conjugacy is forced using moment-matching (Minka 2001): plug the mean and variance of the posterior into a normal distribution
Moment-matching for approximate conjugacy
Given the beliefs at time $n$, and the outcome of game $n+1$ between $i$ and $j$, update (Dangauthier et al. 2007)
$$\mu_i^{n+1} = \begin{cases} \mu_i^n + \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex] \mu_i^n - \dfrac{(\sigma_i^n)^2}{\bar\sigma_{ij}^n}\, v\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right) & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$

$$\left(\sigma_i^{n+1}\right)^2 = \begin{cases} (\sigma_i^n)^2 \left(1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_i^n - \mu_j^n}{\bar\sigma_{ij}^n}\right)\right) & \text{if } p_i^{n+1} > p_j^{n+1}, \\[2ex] (\sigma_i^n)^2 \left(1 - \dfrac{(\sigma_i^n)^2}{(\bar\sigma_{ij}^n)^2}\, w\!\left(\dfrac{\mu_j^n - \mu_i^n}{\bar\sigma_{ij}^n}\right)\right) & \text{if } p_i^{n+1} < p_j^{n+1}, \end{cases}$$

with $v(x) = \phi(x)/\Phi(x)$, $w(x) = v(x)\left(v(x) + x\right)$, and
$$\left(\bar\sigma_{ij}^n\right)^2 = (\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2.$$
Intuitively: increase our skill estimate for the winning player.
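The moment-matched update can be sketched directly from these formulas (a minimal illustration; the function name `update` and the default noise variance are our own choices, and the initial $(\mu, \sigma^2) = (25, 64)$ values are illustrative):

```python
from math import sqrt, erf, exp, pi

def phi(x):
    """Standard normal pdf."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def v(x):
    return phi(x) / Phi(x)

def w(x):
    return v(x) * (v(x) + x)

def update(mu_i, var_i, mu_j, var_j, i_won, var_eps=1.0):
    """One moment-matched update of player i's belief after a game against j."""
    c = sqrt(var_i + var_j + 2 * var_eps)          # sigma-bar
    d = (mu_i - mu_j) if i_won else (mu_j - mu_i)  # signed estimated skill gap
    sign = 1.0 if i_won else -1.0
    mu_new = mu_i + sign * (var_i / c) * v(d / c)
    var_new = var_i * (1 - (var_i / c ** 2) * w(d / c))
    return mu_new, var_new
```

For two equally-rated players, `update(25.0, 64.0, 25.0, 64.0, i_won=True)` raises the winner's mean and shrinks the variance, exactly the intuition stated on the slide.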
Choosing an opponent
In Dangauthier et al. (2007), a game between $i$ and $j$ ends in a draw if
$$|p_i - p_j| < \delta$$
for some small $\delta > 0$.
The draw probability
After $n$ games, our prediction that the $(n+1)$st game will end in a draw is
$$P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta\right) = \mathbb{E}^n\, P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right).$$
For very small $\delta$,
$$P^n\left(\left|p_i^{n+1} - p_j^{n+1}\right| < \delta \,\middle|\, s_i, s_j\right) \approx \frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\, \delta.$$
We take $\delta \to 0$ and define the draw probability as
$$q_{ij}^n = \mathbb{E}^n\left[\frac{1}{\sqrt{2\pi\left(2\sigma_\varepsilon^2\right)}}\, e^{-\frac{(s_i - s_j)^2}{4\sigma_\varepsilon^2}}\right].$$
Choosing an opponent
Thus, the “probability” of a draw between players $i$ and $j$ is
$$q_{ij}^n = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2\right)}\right).$$
We expect the game to be more competitive when this quantity is higher.
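The closed form above translates directly into code (a sketch; the function name and default `var_eps` are our own):

```python
from math import sqrt, exp, pi

def draw_probability(mu_i, var_i, mu_j, var_j, var_eps=1.0):
    """q_ij^n: predicted-draw density under the current beliefs about i and j."""
    total = var_i + var_j + 2 * var_eps  # total predictive variance of p_i - p_j
    return (1 / sqrt(2 * pi * total)) * exp(-(mu_i - mu_j) ** 2 / (2 * total))
```

As expected, the quantity is symmetric in the two players and largest when the belief means coincide.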
Connection to DrawChance
The DrawChance formula given in Herbrich et al. (2006) is
$$\tilde q_{ij}^n = \sqrt{\frac{2\sigma_\varepsilon^2}{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - \mu_j^n\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2\right)}\right),$$
which is identical to $q_{ij}^n$ up to a constant scale factor
The DrawChance policy used by Xbox Live greedily selects the match-up with the highest draw probability
Simulation optimization for match-making
We interpret the match-making problem as online simulation optimization with the objective
$$\sup_\pi \sum_{n=0}^{N} q^n_{0,\, X^\pi(\mu^n, \sigma^n)},$$
where $\pi$ is a policy for choosing opponents for a fixed player 0
The concept of value of information (Chick 2006) looks ahead to the outcome of the next decision
This approach can be adapted to many types of objective functions (Frazier et al. 2008; Ryzhov & Powell 2011a)
Prediction of game outcome
Our Bayesian beliefs provide us with an (approximate) estimate of the outcome of a hypothetical game between $i$ and $j$:

Proposition
Under the normality assumption, the probability that player $i$ beats player $j$ in game $n+1$ is given by
$$P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \Phi\left(\frac{\mu_i^n - \mu_j^n}{\sqrt{(\sigma_i^n)^2 + \left(\sigma_j^n\right)^2 + 2\sigma_\varepsilon^2}}\right).$$
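The proposition gives a one-line predictive win probability (a sketch; naming and default `var_eps` are our own):

```python
from math import sqrt, erf

def win_probability(mu_i, var_i, mu_j, var_j, var_eps=1.0):
    """P^n(player i beats player j), per the proposition above."""
    z = (mu_i - mu_j) / sqrt(var_i + var_j + 2 * var_eps)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf Phi(z)
```

Sanity checks: equally-rated players win with probability 0.5, and the probabilities for the two players always sum to one.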
Proof.
We compute
$$P^n\left(p_i^{n+1} > p_j^{n+1}\right) = \mathbb{E}^n\, \Phi\left(\frac{s_i - s_j}{\sqrt{2\sigma_\varepsilon^2}}\right)$$
$$= \int_{-\infty}^{\infty} \Phi\left(\frac{x}{\sqrt{2\sigma_\varepsilon^2}}\right) \frac{1}{\sqrt{2\pi\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)}}\, e^{-\frac{\left(x - \left(\mu_i^n - \mu_j^n\right)\right)^2}{2\left((\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)}}\, dx$$
and recast the last line as $P(X \le Y)$ where $X \sim \mathcal{N}\left(0, 2\sigma_\varepsilon^2\right)$ and $Y \sim \mathcal{N}\left(\mu_i^n - \mu_j^n, (\sigma_i^n)^2 + \left(\sigma_j^n\right)^2\right)$ are independent.
Value of information in match-making
Let
$$k^{w,n+1} = \left(\mu^{w,n+1}, \sigma^{w,n+1}\right), \qquad k^{l,n+1} = \left(\mu^{l,n+1}, \sigma^{l,n+1}\right)$$
be the beliefs that we would have at time $n+1$ if player 0 wins (or loses) against $j$
Similarly, let $q_{0i}^{w,n+1}$ (or $q_{0i}^{l,n+1}$) be the draw probabilities if player 0 wins (or loses)
The greedy policy would arrange the next game by computing
$$F^{w,n+1} = \max_i q_{0i}^{w,n+1}, \qquad F^{l,n+1} = \max_i q_{0i}^{l,n+1},$$
depending on what happens now
Value of information in match-making
If we stop learning after the next game, the optimal match-up is
$$X^n = \arg\max_j\; q_{0j}^n + (N - n)\, F_j^n,$$
where
$$F_j^n = P^n\left(0 \text{ beats } j\right) F^{w,n+1} + P^n\left(j \text{ beats } 0\right) F^{l,n+1}$$
is the expected value (pre-game) of the highest draw probability (post-game)
If the total number $N$ of games is unknown, use
$$X^n = \arg\max_j\; q_{0j}^n + \frac{\gamma}{1 - \gamma}\, F_j^n,$$
where $\gamma$ is tunable
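The finite-horizon rule above can be sketched as follows (a simplified illustration: for brevity only player 0's belief is updated in the look-ahead, whereas the full model would also update opponent $j$'s belief, and all parameter values are our own placeholders):

```python
from math import sqrt, erf, exp, pi

def Phi(x): return 0.5 * (1 + erf(x / sqrt(2)))
def phi(x): return exp(-x * x / 2) / sqrt(2 * pi)
def v(x): return phi(x) / Phi(x)
def w(x): return v(x) * (v(x) + x)

def q(mu_i, var_i, mu_j, var_j, var_eps):
    """Draw probability q_ij under the current beliefs."""
    total = var_i + var_j + 2 * var_eps
    return exp(-(mu_i - mu_j) ** 2 / (2 * total)) / sqrt(2 * pi * total)

def update(mu_i, var_i, mu_j, var_j, i_won, var_eps):
    """Moment-matched update of player i's belief (as on the earlier slide)."""
    c = sqrt(var_i + var_j + 2 * var_eps)
    d = (mu_i - mu_j) if i_won else (mu_j - mu_i)
    sign = 1.0 if i_won else -1.0
    return (mu_i + sign * (var_i / c) * v(d / c),
            var_i * (1 - (var_i / c ** 2) * w(d / c)))

def choose_opponent(mu, var, n, N, var_eps=1.0):
    """X^n = argmax_j q_0j^n + (N - n) * F_j^n for fixed player 0."""
    best_j, best_val = None, -float("inf")
    for j in range(1, len(mu)):
        # Player 0's post-game belief in the win and loss scenarios
        mu_w, var_w = update(mu[0], var[0], mu[j], var[j], True, var_eps)
        mu_l, var_l = update(mu[0], var[0], mu[j], var[j], False, var_eps)
        # Best post-game draw probability in each scenario
        F_w = max(q(mu_w, var_w, mu[i], var[i], var_eps) for i in range(1, len(mu)))
        F_l = max(q(mu_l, var_l, mu[i], var[i], var_eps) for i in range(1, len(mu)))
        p_win = Phi((mu[0] - mu[j]) / sqrt(var[0] + var[j] + 2 * var_eps))
        F_j = p_win * F_w + (1 - p_win) * F_l
        val = q(mu[0], var[0], mu[j], var[j], var_eps) + (N - n) * F_j
        if val > best_val:
            best_j, best_val = j, val
    return best_j
```

With one closely-matched opponent and one much stronger one, the rule selects the closely-matched player, as the immediate draw-probability term dominates.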
Experimental results: draw probabilities
In simulations, our method behaved more aggressively than DrawChance...
Experimental results: difference in true skills
...pursued tougher opponents early on, but found better matches later...
Experimental results: errors of estimates
...produced better estimates of player 0’s true skill...
Experimental results: win/loss ratios
...and came closer to a 0.5 win/loss ratio.
...but that’s not the end!
In simulation optimization, we might tune a simulator to see how performance of a system could be improved
But before the simulator can be optimized, we need to make sure that it is a good model of reality
Targeting and selection: which simulation model most closely matches data from the field?
Targeting and selection
Let $c$ be a deterministic target (e.g. average historical performance) and consider $M$ simulation models
The mean output $s_i$ of model $i$ matches the target if
$$|s_i - c| < \delta$$
We can simulate system $i$ to obtain a noisy observation
$$p_i \sim \mathcal{N}\left(s_i, \sigma_\varepsilon^2\right),$$
and apply Bayesian updating with no moment-matching required
The “draw probability” in this context is given by
$$q_i^n = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{(\sigma_i^n)^2 + \sigma_\varepsilon^2}}\, \exp\left(-\frac{\left(\mu_i^n - c\right)^2}{2\left((\sigma_i^n)^2 + \sigma_\varepsilon^2\right)}\right).$$
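The single-target analogue of the draw probability is even simpler in code (a sketch; the function name is our own):

```python
from math import sqrt, exp, pi

def target_match_density(mu_i, var_i, c, var_eps=1.0):
    """q_i^n: how closely model i's mean output is believed to match target c."""
    total = var_i + var_eps
    return exp(-(mu_i - c) ** 2 / (2 * total)) / sqrt(2 * pi * total)
```

The quantity is largest when the belief mean sits on the target and the belief is precise, mirroring the two-player case with one player replaced by the deterministic target.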
The value of information
Our goal is to maximize
$$\sup_\pi\; \mathbb{E}\max_i q_i^N,$$
or its online equivalent, if we are refining an existing simulator
Bayesian analysis tells us that, conditional on our beliefs at time $n$,
$$\mu_i^{n+1} \sim \mathcal{N}\left(\mu_i^n, (\tilde\sigma_i^n)^2\right), \qquad \text{where } (\tilde\sigma_i^n)^2 = (\sigma_i^n)^2 - \left(\sigma_i^{n+1}\right)^2$$
The knowledge gradient approach simulates the system
$$X^n = \arg\max_i\; \mathbb{E}_i^n \max_j q_j^{n+1},$$
which is expected to yield the best result after the simulation
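The expectation $\mathbb{E}_i^n \max_j q_j^{n+1}$ can also be approximated by Monte Carlo sampling from the predictive distribution above, which is a useful sanity check on any closed-form implementation (a sketch with illustrative values; `kg_value` is our own helper name):

```python
import random
from math import sqrt, exp, pi

def q(mu, var, c, var_eps):
    """Target-match density q_i under a belief (mu, var)."""
    total = var + var_eps
    return exp(-(mu - c) ** 2 / (2 * total)) / sqrt(2 * pi * total)

def kg_value(i, mu, var, var_next, c, var_eps=1.0, samples=5000, seed=0):
    """Monte Carlo estimate of E_i^n max_j q_j^{n+1} when simulating model i."""
    rng = random.Random(seed)
    sigma_tilde = sqrt(var[i] - var_next[i])  # predictive std of mu_i^{n+1}
    total = 0.0
    for _ in range(samples):
        mu_i_next = rng.gauss(mu[i], sigma_tilde)
        # Only model i's belief changes; the others keep their time-n beliefs
        best = max(
            q(mu_i_next if j == i else mu[j],
              var_next[i] if j == i else var[j], c, var_eps)
            for j in range(len(mu))
        )
        total += best
    return total / samples
```

The knowledge gradient policy would then simulate the model $i$ with the largest `kg_value`.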
Issues for further work
The quantity $\mathbb{E}_i^n \max_j q_j^{n+1}$ can be computed in closed form
If $i$ is believed to be suboptimal with high precision, one simulation may yield no information (see also Ryzhov & Powell 2011b)
Conclusions
We have studied online match-making through the framework of online optimal learning
In simulations, a look-ahead policy offers some improvement over a greedy policy
The formulation of the problem has interesting implications for future work in simulation optimization (Ryzhov 2011)
References
Chick, S.E. (2006) “Subjective probability and Bayesian methodology.” In Handbooks of Operations Research and Management Science 13, 225–258.
Dangauthier, P., Herbrich, R., Minka, T. & Graepel, T. (2007) “TrueSkill through time: revisiting the history of chess.” In Advances in Neural Information Processing Systems 20, 337–344.
Frazier, P.I., Powell, W. & Dayanik, S. (2008) “A knowledge-gradient policy for sequential information collection.” SIAM J. on Control and Optimization 47:5, 2410–2439.
Gittins, J. (1989) Multi-armed bandit allocation indices. John Wiley and Sons.
Herbrich, R., Minka, T. & Graepel, T. (2006) “TrueSkill™: a Bayesian skill rating system.” In Advances in Neural Information Processing Systems 19, 569–576.
Huhh, J. (2008) “Culture and business of PC bangs in Korea.” Games and Culture 3:1, 26–37.
Minka, T. (2001) “A family of algorithms for approximate Bayesian inference.” Ph.D. thesis, MIT.
Ryzhov, I.O. (2011) “Targeting and selection: a new approach to simulation validation.” In preparation.
Ryzhov, I.O. & Powell, W.B. (2011a) “Information collection on a graph.” Operations Research 59:1, 188–201.
Ryzhov, I.O. & Powell, W.B. (2011b) “The value of information in multi-armed bandits with exponentially distributed rewards.” Proceedings of the 2011 International Conference on Computational Science, 1363–1372.