
An Asymptotic Minimax Theorem for Gaussian Two-Armed Bandit

A.V. Kolnogorov

Yaroslav-the-Wise Novgorod State University, [email protected]

ACMPT

Moscow

2017, October 26


Bernoulli Two-Armed Bandit

It is a slot machine with two arms. If the ℓ-th arm is chosen, then the gambler gets unit income (+1) with probability p_ℓ and nothing (0) with probability q_ℓ (p_ℓ + q_ℓ = 1).

The gambler can choose arms N times in total. His goal is to maximize (in some sense) the total expected income. The probabilities p_1, p_2 are fixed during the control process but unknown to the gambler.

A Dilemma “Information vs Control”

For the gambler it would be better always to choose the arm corresponding to the larger of the probabilities p_1, p_2. However, if he wants to determine this arm he should try them both, and this diminishes his total expected income.


Formal Setup

Formally, let us consider a Bernoulli controlled random process ξ_n, n = 1, ..., N, such that

$$\Pr(\xi_n = 1 \mid y_n = \ell) = p_\ell, \qquad \Pr(\xi_n = 0 \mid y_n = \ell) = q_\ell, \qquad \ell = 1, 2.$$

A strategy σ can use all currently available information about the process: n_1, n_2 – the total numbers of choices of both arms, X_1, X_2 – the corresponding total incomes. The loss function is as follows:

$$L_N(\sigma, \theta) = N(p_1 \vee p_2) - \mathbf{E}_{\sigma,\theta}\Bigl(\sum_{n=1}^{N} \xi_n\Bigr),$$

where θ = (p_1, p_2) is the parameter of the process.
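As a small illustration of this loss function (added here, not part of the talk), the following Monte Carlo sketch estimates L_N(σ, θ) for a naive "follow the empirical leader" strategy; the strategy, the function name and all parameters are illustrative assumptions.

```python
import numpy as np

def estimate_loss(p1, p2, N, runs=5000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of L_N(sigma, theta) = N*(p1 v p2) - E[sum of incomes]
    for a naive strategy: try each arm once, then always play the arm with the
    larger empirical mean (illustrative only, not the minimax/Bayesian strategy)."""
    total_income = 0.0
    for _ in range(runs):
        n = np.zeros(2)   # n_1, n_2: numbers of choices of each arm
        X = np.zeros(2)   # X_1, X_2: total incomes of each arm
        for step in range(N):
            arm = step if step < 2 else int(np.argmax(X / n))
            income = rng.binomial(1, (p1, p2)[arm])
            n[arm] += 1
            X[arm] += income
            total_income += income
    return N * max(p1, p2) - total_income / runs

print(estimate_loss(0.55, 0.45, N=100))  # expected loss of this naive strategy
```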


Bayesian Approach

Let λ(θ) be a prior distribution density on Θ = {(p_1, p_2) : 0 ≤ p_ℓ ≤ 1, ℓ = 1, 2}. The Bayesian risk is equal to

$$R_N^B(\lambda) = \inf_{\{\sigma\}} \int_\Theta L_N(\sigma, \theta)\,\lambda(\theta)\,d\theta;$$

the corresponding optimal strategy σ^B is called the Bayesian strategy.

A Simple Recursive Algorithm for Its Determination

As Berry and Fristedt write: “. . . it is not that researchers in bandit problems tend to be ‘Bayesians’; rather, Bayes’s theorem provides a convenient mathematical formalism that allows for adaptive learning and so is an ideal tool in sequential decision problems”.
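The recursive computation mentioned above can be sketched by backward induction (not from the talk; the discrete prior, the horizon and all names below are illustrative assumptions). For a discrete prior the sufficient statistics (n_1, X_1, n_2, X_2) determine the posterior, so the Bayesian risk for a modest horizon N is computable numerically even though the closed-form algebra is prohibitive already for N = 5.

```python
from functools import lru_cache

# Illustrative two-point symmetric prior: list of ((p1, p2), weight).
PRIOR = [((0.6, 0.4), 0.5), ((0.4, 0.6), 0.5)]
N = 15  # control horizon

def posterior(n1, X1, n2, X2):
    """Posterior weights of the prior points given the sufficient statistics."""
    w = [lam * p1**X1 * (1 - p1)**(n1 - X1) * p2**X2 * (1 - p2)**(n2 - X2)
         for (p1, p2), lam in PRIOR]
    s = sum(w)
    return [x / s for x in w]

@lru_cache(maxsize=None)
def V(n1, X1, n2, X2):
    """Maximal expected remaining income (Bellman backward induction)."""
    if n1 + n2 == N:
        return 0.0
    post = posterior(n1, X1, n2, X2)
    q1 = sum(w * pp[0] for w, (pp, _) in zip(post, PRIOR))  # predictive P(success | arm 1)
    q2 = sum(w * pp[1] for w, (pp, _) in zip(post, PRIOR))  # predictive P(success | arm 2)
    v1 = q1 * (1 + V(n1 + 1, X1 + 1, n2, X2)) + (1 - q1) * V(n1 + 1, X1, n2, X2)
    v2 = q2 * (1 + V(n1, X1, n2 + 1, X2 + 1)) + (1 - q2) * V(n1, X1, n2, X2 + 1)
    return max(v1, v2)

best_possible = sum(lam * N * max(pp) for pp, lam in PRIOR)   # N * E_lambda[p1 v p2]
print("Bayesian risk R_N^B(lambda) =", best_possible - V(0, 0, 0, 0))
```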


Minimax Approach

The minimax risk is equal to

$$R_N^M(\Theta) = \inf_{\{\sigma\}} \sup_{\Theta} L_N(\sigma, \theta);$$

the corresponding optimal strategy σ^M is called the minimax strategy.

Robustness of Minimax Approach

If σM is applied then the following inequality holds

$$L_N(\sigma^M, \theta) \le R_N^M(\Theta), \qquad \forall\, \theta \in \Theta.$$

Impossibility of Direct Determination

As Fabius and van Zwet write about the Bernoulli two-armed bandit: “the algebra involved becomes progressively more complicated with increasing N and seems to remain prohibitive already for N as small as 5”.


An Asymptotic Minimax Theorem

Theorem. The following inequality holds as N → ∞ for the Bernoulli two-armed bandit:

$$0.612 \le (DN)^{-1/2} R_N^M(\Theta) \le 0.752.$$

The proof of the theorem uses some indirect estimates and techniques.

Vogel, W. An asymptotic minimax theorem for the two-armed banditproblem. Ann. Math. Stat., 1961, V. 31, P. 444–451


Specification of the Asymptotic Minimax Theorem

We propose the following specification of the AMT for the Gaussian two-armed bandit.

Theorem. The following equality holds:

$$\lim_{N\to\infty} (DN)^{-1/2} R_N^M(\Theta) \approx 0.637.$$

The following issues are to be discussed:

1. What is a Gaussian two-armed bandit?
2. Why a Gaussian two-armed bandit?
3. How are Gaussian and Bernoulli two-armed bandits related?
4. How is the theorem proved?


What is a Gaussian Two-Armed Bandit?

It is a slot machine with two arms. If the ℓ-th arm is chosen, then the gambler gets a random income. This income is normally distributed with unit variance and mathematical expectation m_ℓ.

The gambler can choose arms N times in total. His goal is to maximize (in some sense) the total expected income. The expectations m_1, m_2 are fixed during the control process but unknown to the gambler.

A Dilemma “Information vs Control”

For the gambler it would be better always to choose the arm corresponding to the larger of the expectations m_1, m_2. However, if he wants to determine this arm he should try them both, and this diminishes his total expected income.


Why Gaussian Two-Armed Bandit? Parallel Processing

Assume that a large number T = NK of data items is given, which can be processed by two alternative methods. Processing may be successful (ξ_t = 1) or unsuccessful (ξ_t = 0).

Probabilities of successful and unsuccessful processing depend on the chosen methods (arms) and are equal to p_ℓ and q_ℓ, respectively (ℓ = 1, 2). Assume that p_1, p_2 are close to p. Let's define the process

$$\xi'_n = (DK)^{-1/2} \sum_{t=(n-1)K+1}^{nK} \xi_t, \qquad n = 1, \dots, N, \qquad D = p(1-p).$$

Distributions of ξ'_n are close to normal and their variances are close to 1.
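A quick numerical check of this normalization (an illustration added here, not part of the slides): batch Bernoulli outcomes in groups of size K and verify that the normalized batch sums ξ'_n indeed have variance close to 1. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, N_stages = 0.5, 1000, 30             # baseline p, batch size K, number of stages
p1 = p + 1.0 / np.sqrt(K * N_stages)       # an arm probability close to p
D = p * (1 - p)

# many independent batches of size K, all processed by the same arm
xi = rng.binomial(1, p1, size=(10_000, K))
xi_prime = xi.sum(axis=1) / np.sqrt(D * K)  # normalized batch sums xi'_n

print("variance of xi'_n:", xi_prime.var())           # close to p1*(1-p1)/D ~ 1
print("mean of xi'_n    :", xi_prime.mean(), "vs", K * p1 / np.sqrt(D * K))
```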


Relation between Gaussian and Bernoulli Two-Armed Bandits

1. In application to data processing, the Gaussian two-armed bandit is a particular case of the Bernoulli two-armed bandit which allows the data to be processed in parallel. The data should be partitioned into a number of groups; data in the same group are then processed in parallel by the same method.

2. For example, given 30 000 items of data, the data can be partitioned into 30 groups each containing 1000 items. Then they can be processed in 30 stages, 1000 items of data at each stage.

3. If the number of stages is large enough (e.g. 30 stages or more), then the maximal losses for parallel data processing are almost the same as if the data were processed optimally one-by-one!

4. So, Bernoulli and Gaussian two-armed bandits are equivalent as N → ∞.


Formal Setup of the Problem

Formally, incomes are considered as a controlled random process ξ_1, ξ_2, ..., ξ_N whose values depend only on the currently chosen arms (in the sequel called actions) y_1, y_2, ..., y_N and are normally distributed with unit variance and mathematical expectation m_ℓ if the ℓ-th action is chosen:

$$f(x \mid m_\ell) = (2\pi)^{-1/2} \exp\bigl\{-(x - m_\ell)^2/2\bigr\}.$$

This process is completely described by a vector parameter θ = (m_1, m_2). A control strategy σ prescribes the choice of actions y_n, n = 1, ..., N, and depends on the complete prehistory of the process. In fact, it is sufficient to know four current values: n_1, n_2 – the numbers of choices of both actions, and X_1, X_2 – the total incomes for both actions. The loss function is defined as follows:

$$L_N(\sigma, \theta) = N(m_1 \vee m_2) - \mathbf{E}_{\sigma,\theta}\Bigl(\sum_{n=1}^{N} \xi_n\Bigr).$$


Minimax and Bayesian Settings

Assume that the set of parameters is the following: Θ = {(m_1, m_2) : |m_1 − m_2| ≤ 2c_1, |m_1 + m_2| ≤ 2c_2}, with 0 < c_1 < ∞, 0 < c_2 < ∞ and c_2 large enough. The minimax risk is defined as

$$R_N^M(\Theta) = \inf_{\{\sigma\}} \sup_{\Theta} L_N(\sigma, \theta);$$

the corresponding strategy σ^M is called the minimax strategy.

Consider a prior distribution λ(m_1, m_2) on Θ. The Bayesian risk is defined as

$$R_N^B(\lambda) = \inf_{\{\sigma\}} \int_\Theta L_N(\sigma, \theta)\,\lambda(\theta)\,d\theta;$$

the corresponding strategy σ^B is called the Bayesian strategy.


Some Properties of Approaches

Robustness of Minimax Approach

If σM is applied then the following inequality holds

$$L_N(\sigma^M, \theta) \le R_N^M(\Theta), \qquad \forall\, \theta \in \Theta.$$

A Simple Recursive Algorithm for Bayesian Approach

There is a well-known recursive algorithm for the numerical determination of the Bayesian risk and the Bayesian strategy for any prior distribution.

Main Theorem of the Theory of Games

Under mild conditions the minimax risk is equal to the Bayesian one corresponding to the worst-case prior distribution, i.e.

$$R_N^M(\Theta) = \sup_{\{\lambda\}} R_N^B(\lambda) = R_N^B(\lambda_0),$$

and the minimax strategy is equal to the corresponding Bayesian strategy as well. In the sequel, the minimax risk is sought as the Bayesian one calculated with respect to the worst-case prior.


Asymptotically the Worst-Case Prior Distribution

Asymptotically, the worst-case prior is uniform along and symmetric across the main diagonal.

Calculations lead us to expect that it concentrates on two parallel lines; the distance between them is the only unknown parameter.


Change of Variables

Recall that n_1, n_2 denote the total numbers of choices of both actions, X_1, X_2 are the corresponding total incomes, and m_1, m_2 are the mathematical expectations. We assume that actions are applied to groups of size M; this allows parallel processing. Let us denote

$$u = \frac{X_1 n_2 - X_2 n_1}{n N^{1/2}}, \qquad t_1 = \frac{n_1}{N}, \quad t_2 = \frac{n_2}{N}, \quad \varepsilon = \frac{M}{N}, \qquad w = (m_1 - m_2) N^{1/2}$$

(here n = n_1 + n_2), and let ϱ(w) characterize a symmetric uniform prior distribution.
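For concreteness (my illustration, not from the slides), this change of variables is plain arithmetic on the four sufficient statistics; the function name is hypothetical.

```python
import numpy as np

def change_of_variables(n1, n2, X1, X2, m1, m2, N, M):
    """Map the sufficient statistics and model data to the invariant variables."""
    n = n1 + n2
    u = (X1 * n2 - X2 * n1) / (n * np.sqrt(N))
    t1, t2 = n1 / N, n2 / N
    eps = M / N                       # relative size of one processing group
    w = (m1 - m2) * np.sqrt(N)        # scaled difference of the expectations
    return u, t1, t2, eps, w
```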


Invariant Recursive Equation with Unit Time Horizon

To determine the Bayesian risk one should solve the Bellman-type recursive equation

$$r_\varepsilon(u, t_1, t_2) = \min_{\ell=1,2} r_\varepsilon^{(\ell)}(u, t_1, t_2),$$

where $r_\varepsilon^{(1)}(u, t_1, t_2) = r_\varepsilon^{(2)}(u, t_1, t_2) = 0$ for $t_1 + t_2 = 1$, and

$$r_\varepsilon^{(1)}(u, t_1, t_2) = \varepsilon g^{(1)}(u, t_1, t_2) + r_\varepsilon(u, t_1 + \varepsilon, t_2) * f_{\varepsilon t_2^2 t^{-1}(t+\varepsilon)^{-1}}(u),$$

$$r_\varepsilon^{(2)}(u, t_1, t_2) = \varepsilon g^{(2)}(u, t_1, t_2) + r_\varepsilon(u, t_1, t_2 + \varepsilon) * f_{\varepsilon t_1^2 t^{-1}(t+\varepsilon)^{-1}}(u),$$

for $t_1 + t_2 < 1$, $t_1 \ge \varepsilon$ and $t_2 \ge \varepsilon$. Here $t = t_1 + t_2$, $*$ denotes convolution in $u$, and

$$g^{(\ell)}(u, t_1, t_2) = \int_0^\infty 2w\, g(u, (-1)^{\ell+1} w, t_1, t_2)\, \varrho(w)\, dw, \qquad \ell = 1, 2,$$

$$g(u, w, t_1, t_2) = \exp\bigl(-2uw - 2w^2 t_1 t_2 t^{-1}\bigr),$$

$$f_\varepsilon(x) = (2\pi\varepsilon)^{-1/2} \exp\bigl(-x^2/(2\varepsilon)\bigr).$$
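Below is a structural sketch (my code, not the author's) of how this recursion can be iterated numerically on a grid in u, anticipating the two-point prior ϱ(w) concentrated at w = ±d used in the numerical experiments later. The Gaussian convolution is done with scipy's gaussian_filter1d; start-up and boundary handling are simplified, so the returned number is only a rough proxy for r(ϱ; 0, 0, 0) and is not expected to reproduce 0.637 exactly.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def limiting_risk(d, eps=0.01, u_max=6.0, du=0.02):
    """Backward iteration of r_eps(u, t1, t2) for the prior concentrated at w = +/-d."""
    K = int(round(1.0 / eps))                  # number of eps-steps in t1 + t2
    u = np.arange(-u_max, u_max + du, du)      # grid in the statistic u

    def g(ell, t1, t2):
        # g^(l)(u,t1,t2) = d * exp(-/+ 2*u*d - 2*d^2*t1*t2/t) for l = 1 / l = 2
        sign = 1.0 if ell == 1 else -1.0
        return d * np.exp(-2.0 * sign * u * d - 2.0 * d * d * t1 * t2 / (t1 + t2))

    # r[i] holds r_eps(u, i*eps, (k-i)*eps) on the current level k (t1 + t2 = k*eps)
    r = [np.zeros_like(u) for _ in range(K + 1)]       # level k = K: r = 0
    for k in range(K - 1, 1, -1):                      # backward over t = k*eps
        t = k * eps
        new_r = []
        for i in range(k + 1):
            t1, t2 = i * eps, (k - i) * eps
            # action 1: t1 -> t1 + eps, kernel variance eps*t2^2/(t*(t + eps))
            s1 = np.sqrt(eps * t2 * t2 / (t * (t + eps))) / du
            r1 = eps * g(1, t1, t2) + gaussian_filter1d(r[i + 1], max(s1, 1e-9), mode="constant")
            # action 2: t2 -> t2 + eps, kernel variance eps*t1^2/(t*(t + eps))
            s2 = np.sqrt(eps * t1 * t1 / (t * (t + eps))) / du
            r2 = eps * g(2, t1, t2) + gaussian_filter1d(r[i], max(s2, 1e-9), mode="constant")
            new_r.append(np.minimum(r1, r2))
        r = new_r
    return r[1][len(u) // 2]        # crude proxy: value at t1 = t2 = eps, u = 0

for d in (1.0, 1.5, 1.57, 2.0):
    print(d, limiting_risk(d))
```

The pointwise minimum over the two one-step values also records which action is currently preferable at each grid point, which is the discrete counterpart of the rule given on the "Risk and Strategy" slide below.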


Passage to the Limit

Let ε → +0. Then for all u and for all t_1, t_2 for which the solution of the equation is well defined there exists a limit

$$r(u, t_1, t_2) = \lim_{\varepsilon \to +0} r_\varepsilon(u, t_1, t_2).$$

Under some additional conditions r(u, t_1, t_2) satisfies the second-order partial differential equation

$$\min_{\ell=1,2}\left(\frac{\partial r}{\partial t_\ell} + \frac{t_{\bar\ell}^{\,2}}{2t^2}\,\frac{\partial^2 r}{\partial u^2} + g^{(\ell)}(u, t_1, t_2)\right) = 0,$$

with $\bar\ell = 3 - \ell$, $\ell = 1, 2$, with initial conditions

$$\lim_{t_1 + t_2 \to 1} r(u, t_1, t_2) = 0 \quad \text{if } t_1 > 0,\ t_2 > 0,$$

and boundary conditions

$$\lim_{u \to +\infty} r(u, t_1, t_2) = \lim_{u \to -\infty} r(u, t_1, t_2) = 0.$$


Risk and Strategy

Minimax Risk

The limiting minimax risk is equal to the limiting Bayesian risk corresponding to the worst-case prior distribution and is calculated as

$$\lim_{N\to\infty} N^{-1/2} R_N^M(\Theta) = \sup_{\varrho}\, r(\varrho; u, t_1, t_2)\Big|_{u=0,\ t_1=0,\ t_2=0}.$$

Optimal Strategy

The ℓ-th action is currently optimal if

$$\frac{\partial r}{\partial t_\ell} + \frac{t_{\bar\ell}^{\,2}}{2t^2}\,\frac{\partial^2 r}{\partial u^2} + g^{(\ell)}(u, t_1, t_2)$$

has the smaller value (ℓ = 1, 2).


Numerical Experiments

Calculations were done under the assumption that the worst-case prior distribution ϱ(w) is concentrated at two points w = ±d with 0.5 ≤ d ≤ 2.5. The maximal value of r(ϱ; 0, 0, 0) was determined as

$$\max_{d} r(\varrho; 0, 0, 0) \approx 0.637,$$

corresponding to d ≈ 1.57.
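With a solver along the lines of the sketch given after the invariant recursive equation (its limiting_risk function is my illustration, not the author's code), this search over d is a one-dimensional maximization:

```python
# assumes limiting_risk(d) from the sketch after the invariant recursive equation
ds = [0.5 + 0.05 * i for i in range(41)]      # grid over 0.5 <= d <= 2.5
vals = [limiting_risk(d) for d in ds]
i_best = max(range(len(ds)), key=vals.__getitem__)
print("worst-case d ~", ds[i_best], " max r ~", vals[i_best])
# the talk reports the maximum ~ 0.637 attained at d ~ 1.57
```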


Russian References

1. Tsetlin, M.L., Issledovaniya po teorii avtomatov i modelirovaniyu biologicheskikh sistem (Studies in Automata Theory and Modeling Biological Systems), Moscow: Nauka, 1969.

2. Varshavskii, V.I., Kollektivnoe povedenie avtomatov (Collective Behavior of Automata), Moscow: Nauka, 1973.

3. Sragovich, V.G., Adaptivnoe upravlenie (Adaptive Control), Moscow: Nauka, 1981.

4. Nazin, A.V. and Poznyak, A.S., Adaptivnyi vybor variantov (Adaptive Choice between Alternatives), Moscow: Nauka, 1986.

5. Presman, E.L. and Sonin, I.M., Posledovatel'noe upravlenie po nepolnym dannym (Sequential Control with Incomplete Data), Moscow: Nauka, 1982.


English References

1. Berry, D.A. and Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, New York, 1985.

2. Lai, T.L. and Robbins, H. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 1985, V. 6, P. 4–22.

3. Robbins, H. Some aspects of the sequential design of experiments. Bulletin of Amer. Math. Soc., 1952, V. 58, P. 527–535.

4. Vogel, W. An asymptotic minimax theorem for the two-armed bandit problem. Ann. Math. Stat., 1961, V. 31, P. 444–451.

5. Fabius, J. and van Zwet, W.R. Some remarks on the two-armed bandit. Ann. Math. Stat., 1970, V. 41, P. 1906–1916.


Previous Publications

1. Kolnogorov, A.V. Finding Minimax Strategy and Minimax Risk in a Random Environment (the Two-Armed Bandit Problem) // Automation and Remote Control, 2011, Vol. 72, No. 5, pp. 1017–1027.

2. Kolnogorov, A.V. On a Limiting Description of Robust Parallel Control in a Random Environment // Automation and Remote Control, 2015, Vol. 76, No. 7, pp. 1229–1241.


Thank you for your attention!
