
SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits

Etienne Boursier†, Vianney Perchet*†

†ENS Paris-Saclay, CMLA    *Criteo AI Lab
[email protected], [email protected]

Summary

• Synchronisation =⇒ communication through collisions in MMAB
• Contradicts previous lower bounds
• Synchronisation is unrealistic and a loophole in the model
• More realistic dynamic model: first logarithmic regret algorithm

Multiplayer bandits model

Motivation: cognitive radio networks
Framework: $K$ arms, $M \le K$ players. At each $t \le T$, player $j$ pulls arm $\pi_j(t)$.

Reward: $r_j(t) := X_{\pi_j(t)}(t)\,\bigl(1 - \eta_{\pi_j(t)}(t)\bigr)$, where

• $\eta_k(t) := \begin{cases} 1 & \text{if several players pull arm } k, \\ 0 & \text{otherwise,} \end{cases}$
• $X_k(t) \in [0, 1]$ i.i.d. with $\mathbb{E}[X_k] = \mu_k$.

Decentralized: no information exchange between players.
Observations:
• Collision sensing: player $j$ observes $r_j(t)$ and $\eta_{\pi_j(t)}(t)$
• No sensing: player $j$ observes $r_j(t)$ only
Synchronisation models:
• Static: every player plays for $t = 1, \dots, T$
• Dynamic: player $j$ plays for $t = 1 + \tau_j, \dots, T$
Goal: with $\mathcal{M}(t)$ the set of active players at time $t$, minimize the regret

$$R_T := \sum_{t=1}^{T} \sum_{k=1}^{\#\mathcal{M}(t)} \mu_{(k)} \;-\; \mathbb{E}_\mu\Biggl[\,\sum_{t=1}^{T} \sum_{j \in \mathcal{M}(t)} r_j(t)\Biggr]$$

Notations: $\mu_{(1)} \ge \dots \ge \mu_{(K)}$ and $\bar\Delta_{(M)} := \min_{i=1,\dots,M} \bigl(\mu_{(i)} - \mu_{(i+1)}\bigr)$.
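To fix ideas, here is a minimal sketch of this reward model in Python (the class name `MMABEnv` is ours, and Bernoulli draws are just one instance of the $X_k(t) \in [0, 1]$ assumption):

```python
import numpy as np

class MMABEnv:
    """Minimal multiplayer-bandit environment: r_j = X_{pi_j} * (1 - eta_{pi_j})."""

    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu, dtype=float)   # arm means mu_1, ..., mu_K
        self.rng = np.random.default_rng(seed)

    def step(self, pulls):
        """pulls[j] = arm pulled by player j this round.

        Returns (rewards, collision indicators), one entry per player."""
        pulls = np.asarray(pulls)
        # Bernoulli draws X_k(t), one per arm (an instance of X_k in [0, 1])
        X = (self.rng.random(self.mu.size) < self.mu).astype(float)
        counts = np.bincount(pulls, minlength=self.mu.size)
        eta = (counts >= 2).astype(float)        # eta_k = 1 iff arm k is collided
        return X[pulls] * (1.0 - eta[pulls]), eta[pulls]

# e.g. 3 players on 4 arms; players 0 and 1 collide on arm 0 and get reward 0:
# env = MMABEnv([0.9, 0.8, 0.7, 0.5]); rewards, etas = env.step([0, 0, 2])
```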

State of the art bounds

| Setting | Prior knowledge | Asymptotic upper bound |
|---|---|---|
| Centralized multiplayer | $M$ | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)}-\mu_{(k)}}$ [3] |
| Decentralized, col. sensing | $\mu_{(M)}-\mu_{(M+1)}$ | $\frac{MK\log(T)}{(\mu_{(M)}-\mu_{(M+1)})^2}$ [4] |
| Decentralized, col. sensing (ours) | - | $\sum_{k>M} \frac{\log(T)}{\mu_{(M)}-\mu_{(k)}} + MK\log(T)$ |
| Dec., col. sensing, dynamic | $\bar\Delta_{(M)}$ | $\frac{M\sqrt{K\log(T)\,T}}{\bar\Delta_{(M)}^2}$ [4] |
| Dec., no sensing, dynamic (ours) | - | $\frac{MK\log(T)}{\bar\Delta_{(M)}^2} + \frac{M^2K\log(T)}{\mu_{(M)}}$ |

Our results (shown in red on the poster) are the rows marked "(ours)".

SIC-MMAB: static with collision sensing

Observation: the collision indicator $\eta_k(t) \in \{0, 1\}$ is a bit sent from one player to another =⇒ communication between players is possible.

Algorithm 1: SIC-MMAB
  Initialization phase: estimate $M$ and the player's rank $j$ in time $O(K \log(T))$
  for $p = 1, 2, \dots$ do
    Exploration phase: explore each active arm $2^p$ times without collision
    Communication phase: players exchange arm statistics using collisions
    Eliminate suboptimal arms
    Assign optimal arms to players (who then enter the exploitation phase)
  end
  Exploitation phase: pull the assigned arm until $T$
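The elimination step can be read as a confidence-bound test on the statistics pooled during the communication phase. A minimal sketch, assuming Hoeffding-style intervals (the exact bounds and bookkeeping in the paper may differ):

```python
import math

def eliminate_arms(stats, n_players, T):
    """stats: {arm: (total pulls, total reward)} pooled over all active players.

    Keep an arm only if its upper confidence bound still exceeds the lower
    confidence bound of the n_players-th best arm; the rest are suboptimal."""
    ucb, lcb = {}, {}
    for k, (n, s) in stats.items():
        mean = s / n
        radius = math.sqrt(2 * math.log(T) / n)   # Hoeffding-style radius
        ucb[k], lcb[k] = mean + radius, mean - radius
    threshold = sorted(lcb.values(), reverse=True)[n_players - 1]
    return {k for k in stats if ucb[k] >= threshold}
```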

Communication protocol

When player $i$ sends the statistics of arm $k$ to player $j$:

• pull arm $j$ to send bit 1 (collision)
• pull arm $i$ to send bit 0 (no collision)
• the quantized empirical mean is sent in binary

After $2^p$ exploration rounds, a statistic is communicated with only $p$ bits =⇒ sublogarithmic communication regret.
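A minimal sketch of this bit protocol (function names and the fixed-length quantization are ours; the receiver pulls its own arm during these rounds and reads off its collision indicators):

```python
def send_schedule(mean, bits, sender_arm, receiver_arm):
    """Arms the sender pulls to transmit a quantized empirical mean.

    Bit 1: pull the receiver's arm (forcing a collision the receiver senses).
    Bit 0: pull the sender's own arm (no collision)."""
    q = min(int(mean * (1 << bits)), (1 << bits) - 1)   # quantize to `bits` bits
    return [receiver_arm if (q >> b) & 1 else sender_arm
            for b in reversed(range(bits))]

def decode(collisions, bits):
    """Rebuild the quantized mean from the receiver's collision indicators."""
    q = 0
    for eta in collisions:
        q = (q << 1) | int(eta)
    return q / (1 << bits)
```

For example, `send_schedule(0.625, 3, sender_arm=5, receiver_arm=2)` yields the pulls `[2, 5, 2]`; the receiver observes the collision pattern `[1, 0, 1]` and `decode([1, 0, 1], 3)` recovers 0.625.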

Regret of SIC-MMAB

$$\mathbb{E}[R_T] \lesssim \underbrace{\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}}_{\text{Exploration phases}} \;+\; \underbrace{MK\log(T)}_{\text{Initialization phase}} \;+\; \underbrace{M^3 K \log^2\!\left(\frac{\log(T)}{(\mu_{(M)} - \mu_{(M+1)})^2}\right)}_{\text{Communication phases}}$$

Contradiction with the lower bound

Centralized lower bound [1]:
$$\sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}}$$

Conjectured decentralized lower bound [2]:
$$M \sum_{k>M} \frac{\log(T)}{\mu_{(M)} - \mu_{(k)}} \;?$$

NO! SIC-MMAB beats the conjectured decentralized bound: Decentralized ∼ Centralized.

Toward a realistic model

Unrealistic communication protocols reveal a loophole in the current model. Which assumption went wrong?

• Collision sensing? No: a bit can still be sent in $\frac{\log(T)}{\mu_{(K)}}$ rounds without it
• Synchronisation? Yes! Communication is possible because players know when to talk to each other

Synchronisation is an unrealistic assumption that leads to hacking algorithms. Let's remove it =⇒ dynamic model.

DYN-MMAB: dynamic no sensing

Algorithm 2: DYN-MMAB
  Exploration phase: pull $k \sim \mathcal{U}([K])$ (uniformly random arm)
    Update the occupied arms and the queue of arms to exploit
    if $r_k(t) > 0$ and $k$ = arm to exploit then
      Enter exploitation phase
    end
  Exploitation phase: pull the exploited arm until $T$

How does it work?
• Do not estimate $\mu_k$ but $\gamma\mu_k$, where $\gamma \simeq \mathbb{P}(\text{no collision})$
• This still distinguishes optimal from suboptimal arms
• What if another player exploits arm $k$? Either $\hat r_k \to 0$ quickly and arm $k$ becomes suboptimal, or too many 0s in a row =⇒ $k$ is occupied (see the sketch below)
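A sketch of the per-arm bookkeeping this rule suggests (the field names and the exact zero-run threshold are illustrative; a threshold growing like $\log(T)$ would be the natural choice to make the occupied test reliable):

```python
def update_arm(state, reward, zero_threshold):
    """state: {'pulls': int, 'sum': float, 'zero_run': int, 'occupied': bool}.

    Tracks an estimate of gamma * mu_k and flags the arm as occupied after
    too many consecutive zero rewards (an exploiting player blocks the arm)."""
    state['pulls'] += 1
    state['sum'] += reward
    if reward > 0:
        state['zero_run'] = 0                 # a positive reward: nobody exploits this arm
    else:
        state['zero_run'] += 1
        if state['zero_run'] >= zero_threshold:
            state['occupied'] = True          # too many 0s in a row
    state['mean'] = state['sum'] / state['pulls']   # estimates gamma * mu_k
    return state
```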

Regret of DYN-MMAB

$$\mathbb{E}[R_T] \lesssim \frac{M^2 K \log(T)}{\mu_{(M)}} \;+\; \frac{MK\log(T)}{\bar\Delta_{(M)}^2}$$

References

[1] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: I.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976.

[2] Besson, L. and Kaufmann, E. (2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory, Lanzarote, Spain.

[3] Komiyama, J., Honda, J., and Nakagawa, H. (2015). Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152–1161.

[4] Rosenski, J., Shamir, O., and Szlak, L. (2016). Multi-player bandits – a musical chairs approach. In International Conference on Machine Learning, pages 155–163.