
Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays


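Multi-armed bandit problem
Number of arms: $K$
At each round $t = 1, 2, \dots, T$, the player selects an arm $I(t) \in \{1, \dots, K\}$ and receives the reward $X_{I(t)}(t)$.
Goal: maximize the cumulative reward $\sum_{t=1}^{T} X_{I(t)}(t)$.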

[Figure: slot machines, each one an "arm" (image from http://www.directgamesroom.com)]

Rewards are Bernoulli (1 = win, 0 = loss).

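Arm $i$ has reward distribution $\nu_i$, and the reward of the selected arm is drawn as $X_{I(t)}(t) \sim \nu_{I(t)}$.
In the Bernoulli case, $\nu_i = \mathrm{Bernoulli}(\mu_i)$, and the means $\{\mu_i\}$ are unknown to the player.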
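A minimal sketch of the Bernoulli bandit environment just described, for illustration only. The class name `BernoulliBandit` and method `pull` are hypothetical; arms are 0-indexed in code, so arm 1 on the slides is index 0. The true means are stored privately to reflect that the player does not observe them.

```python
import random


class BernoulliBandit:
    """Hypothetical K-armed bandit: arm i pays 1 with probability mu[i], otherwise 0."""

    def __init__(self, mu, seed=None):
        self._mu = list(mu)              # true means; a policy should not read these
        self._rng = random.Random(seed)  # private RNG so runs are reproducible

    @property
    def n_arms(self):
        return len(self._mu)

    def pull(self, arm):
        # X_{I(t)}(t) ~ Bernoulli(mu_{I(t)})
        return 1 if self._rng.random() < self._mu[arm] else 0


if __name__ == "__main__":
    bandit = BernoulliBandit([0.7, 0.5, 0.2], seed=0)
    # Round-robin play over 300 rounds, just to show the interface.
    total = sum(bandit.pull(t % bandit.n_arms) for t in range(300))
    print("cumulative reward:", total)
```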


Without loss of generality, $\mu_1 > \mu_2 > \mu_3 > \cdots > \mu_K$.
If $\{\mu_i\}_{i \in [K]}$ were known, always playing arm 1 would give an expected cumulative reward of $\mu_1 T$.

In reality $\mu_1, \dots, \mu_K$ are unknown, so the player must estimate them from observed rewards as $\hat{\mu}_i$ and hope that $\arg\max_i \hat{\mu}_i = \arg\max_i \mu_i$, i.e., that the estimated best arm is the true best arm (arm 1, with mean $\mu_1$).

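Regret: the gap between the ideal reward $\mu_1 T$ and the expected reward actually obtained,
$\mathrm{Regret}(T) = \mu_1 T - \sum_{i=1}^{K} \mu_i N_T(i)$,
where $N_T(i)$ is the number of times arm $i$ is played in the first $T$ rounds.
Since $\sum_{i=1}^{K} N_T(i) = T$,
$\mathbb{E}[\mathrm{Regret}(T)] = \sum_{i=1}^{K} (\mu_1 - \mu_i)\, \mathbb{E}[N_T(i)]$,
so keeping the regret small means keeping $\mathbb{E}[N_T(i)]$ small for every suboptimal arm $i$.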
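A minimal sketch, assuming the pull counts $N_T(i)$ are already known, that evaluates the regret both directly and through the gap decomposition $\sum_i (\mu_1 - \mu_i) N_T(i)$. The function names are hypothetical; the point is only that the two forms agree because the counts sum to $T$.

```python
def pseudo_regret(mu, counts):
    """Regret(T) = mu_1 * T - sum_i mu_i * N_T(i), where T = sum_i N_T(i)."""
    T = sum(counts)
    best = max(mu)  # mu_1, the largest mean
    return best * T - sum(m * n for m, n in zip(mu, counts))


def regret_by_gaps(mu, counts):
    """The same quantity written as sum_i (mu_1 - mu_i) * N_T(i)."""
    best = max(mu)
    return sum((best - m) * n for m, n in zip(mu, counts))


if __name__ == "__main__":
    mu = [0.7, 0.5, 0.2]
    counts = [90, 7, 3]  # hypothetical pull counts over T = 100 rounds
    print(pseudo_regret(mu, counts), regret_by_gaps(mu, counts))
```

With these illustrative numbers both functions evaluate to about 2.9 (up to floating-point rounding); only the 10 pulls of the two suboptimal arms contribute.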


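Multiple-play multi-armed bandit problem
Number of arms: $K$; number of arms selected per round: $L$ ($< K$); number of rounds: $T$.
At each round $t$, the player selects a set $I(t)$ of $L$ arms and observes the rewards $\{X_i(t)\}$ for $i \in I(t)$, where $X_i(t) \sim \mathrm{Bernoulli}(\mu_i)$.
$\mathrm{Regret}(T) = \sum_{t=1}^{T} \Big( \sum_{i \in [L]} \mu_i - \sum_{i \in I(t)} \mu_i \Big)$
The optimal choice is $I(t) = \{1, \dots, L\}$; the arms $\{L+1, L+2, \dots, K\}$ are suboptimal.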
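A minimal sketch of the multiple-play regret formula, assuming the chosen sets $I(t)$ are given as collections of 0-indexed arms. The function name is hypothetical. Each round contributes the gap between the sum of the top-$L$ means and the sum of the chosen means.

```python
def multiple_play_regret(mu, L, chosen_sets):
    """Regret(T) = sum_t ( sum_{i in [L]} mu_i - sum_{i in I(t)} mu_i )."""
    best_sum = sum(sorted(mu, reverse=True)[:L])  # sum of the L largest means
    total = 0.0
    for I_t in chosen_sets:
        if len(I_t) != L or len(set(I_t)) != L:
            raise ValueError("each I(t) must contain exactly L distinct arms")
        total += best_sum - sum(mu[i] for i in I_t)
    return total


if __name__ == "__main__":
    mu = [0.9, 0.8, 0.6, 0.3]
    rounds = [{0, 1}, {0, 2}, {1, 3}]  # hypothetical choices with L = 2
    print(multiple_play_regret(mu, 2, rounds))
```

Here the three rounds contribute 0, 0.2 and 0.6, so the total is about 0.8. When a suboptimal arm $i$ displaces an optimal arm $j$, the round's regret is $\mu_j - \mu_i$.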

[Figure: comparison of algorithms by whether they are "optimal for single play" and "optimal for multiple plays", with "this work" marked.]

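Regret lower bound: for any consistent policy,
$\mathbb{E}[\mathrm{Regret}(T)] \ge \sum_{i \in \{L+1, \dots, K\}} \frac{(\mu_L - \mu_i) \log T}{D_{\mathrm{KL}}(\mu_i, \mu_L)} - o(\log T)$,
where $D_{\mathrm{KL}}(\mu_i, \mu_L)$ is the KL divergence between $\mathrm{Bernoulli}(\mu_i)$ and $\mathrm{Bernoulli}(\mu_L)$.
Each suboptimal arm $i$ is compared with arm $L$, the weakest arm in the optimal set, rather than with arm 1.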

[Figure: the selected set $I(t)$ and arms $2, 3, \dots, L-2, L-1, L$, with suboptimal arms $i > L$ and $j > L$.]
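A minimal sketch that evaluates the constant in front of $\log T$ in the lower bound, assuming Bernoulli arms with distinct means strictly between 0 and 1 (if some $\mu_i = \mu_L$ with $i > L$, the divergence is zero and the formula is undefined). The function names are hypothetical; indices are 0-based, so the $L$-th best mean is `mu_sorted[L - 1]`.

```python
import math


def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) for 0 < q < 1, with the convention 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)


def lower_bound_constant(mu, L):
    """sum_{i > L} (mu_L - mu_i) / D_KL(mu_i, mu_L); the bound is this times log T, minus o(log T)."""
    mu_sorted = sorted(mu, reverse=True)
    mu_L = mu_sorted[L - 1]  # the L-th largest mean
    return sum((mu_L - m) / bernoulli_kl(m, mu_L) for m in mu_sorted[L:])


if __name__ == "__main__":
    mu = [0.9, 0.8, 0.6, 0.3]
    c = lower_bound_constant(mu, 2)
    print("constant:", c, "bound at T = 10**5:", c * math.log(10**5))
```

Only the gaps to $\mu_L$ enter, so raising $\mu_1$ while keeping $\mu_L$ fixed leaves the constant unchanged.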


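Thompson sampling (single play)
For each arm $i$, initialize $\alpha_i(1) = 1$, $\beta_i(1) = 1$ (a uniform prior).
At each round $t$, draw $\theta_i(t) \sim \mathrm{Beta}(\alpha_i(t), \beta_i(t))$ for every arm and select $I(t) = \arg\max_i \theta_i(t)$.
Observe $X_{I(t)}(t)$. If $X_{I(t)}(t) = 1$, set $\alpha_{I(t)}(t+1) = \alpha_{I(t)}(t) + 1$; otherwise set $\beta_{I(t)}(t+1) = \beta_{I(t)}(t) + 1$. All other parameters stay unchanged.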
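A minimal sketch of the single-play Beta-Bernoulli Thompson sampling loop above, written against a hypothetical `pull(arm)` callback that returns 0 or 1. It is an illustration under these assumptions, not the authors' code; arms are 0-indexed.

```python
import random


def thompson_sampling(pull, n_arms, T, seed=None):
    """Beta-Bernoulli Thompson sampling; returns (N_T(i) counts, alpha, beta)."""
    rng = random.Random(seed)
    alpha = [1] * n_arms   # alpha_i(1) = 1
    beta = [1] * n_arms    # beta_i(1) = 1, i.e. a uniform Beta(1, 1) prior
    counts = [0] * n_arms  # N_T(i)
    for _ in range(T):
        # theta_i(t) ~ Beta(alpha_i(t), beta_i(t)) for every arm
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: theta[i])  # I(t) = argmax_i theta_i(t)
        if pull(arm) == 1:
            alpha[arm] += 1  # success: increment alpha of the played arm only
        else:
            beta[arm] += 1   # failure: increment beta of the played arm only
        counts[arm] += 1
    return counts, alpha, beta


if __name__ == "__main__":
    env = random.Random(1)
    mu = [0.7, 0.5, 0.2]  # hypothetical means, hidden inside the callback
    counts, _, _ = thompson_sampling(lambda a: int(env.random() < mu[a]), len(mu), 2000, seed=0)
    print("pull counts:", counts)
```

Sampling from the posterior, rather than using its mean, is what drives exploration: an arm with few observations has a wide Beta posterior and occasionally produces the largest $\theta_i(t)$.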


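Thompson sampling with multiple plays
At each round $t$, draw $\theta_i(t) \sim \mathrm{Beta}(\alpha_i(t), \beta_i(t))$ for every arm and let $I(t)$ be the $L$ arms with the largest $\theta_i(t)$.
For each $i \in I(t)$, observe $X_i(t)$: if $X_i(t) = 1$, increment $\alpha_i$; otherwise increment $\beta_i$.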
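A minimal sketch of the multiple-play variant, again assuming a hypothetical `pull(arm)` callback returning 0 or 1 and 0-indexed arms. The only changes from the single-play loop are the top-$L$ selection (here via `heapq.nlargest`) and updating every arm in $I(t)$.

```python
import heapq
import random


def mp_thompson_sampling(pull, n_arms, L, T, seed=None):
    """Play the L arms with the largest posterior samples each round; return N_T(i)."""
    if not 0 < L < n_arms:
        raise ValueError("need 0 < L < K")
    rng = random.Random(seed)
    alpha = [1] * n_arms  # uniform Beta(1, 1) prior for every arm
    beta = [1] * n_arms
    counts = [0] * n_arms
    for _ in range(T):
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        chosen = heapq.nlargest(L, range(n_arms), key=lambda i: theta[i])  # I(t)
        for i in chosen:  # every arm in I(t) is observed and updated
            if pull(i) == 1:
                alpha[i] += 1
            else:
                beta[i] += 1
            counts[i] += 1
    return counts


if __name__ == "__main__":
    env = random.Random(1)
    mu = [0.9, 0.8, 0.6, 0.3]  # hypothetical means
    print(mp_thompson_sampling(lambda a: int(env.random() < mu[a]), len(mu), 2, 2000, seed=0))
```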


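The analysis contrasts per-round terms of order $O(\log t / t)$ with terms of order $O(\log t / t^2)$; summed over $t = 1, \dots, T$, the $O(\log t / t^2)$ terms contribute only $O(1)$.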
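A short worked check of why the faster rate matters, using only the standard integral comparison for decreasing functions (an illustration, not the paper's proof). Here $\log$ is the natural logarithm; $\log x / x^2$ is decreasing for $x > \sqrt{e}$ and $\log x / x$ for $x > e$.

```latex
% Summable: g(x) = \log x / x^2, with g'(x) = (1 - 2\log x)/x^3 < 0 for x \ge 2
\sum_{t=1}^{T} \frac{\log t}{t^2}
  \le \frac{\log 2}{4} + \int_{2}^{\infty} \frac{\log x}{x^2}\,dx
  = \frac{\log 2}{4} + \Big[-\frac{\log x + 1}{x}\Big]_{2}^{\infty}
  = \frac{\log 2}{4} + \frac{1 + \log 2}{2} < 1.03

% Not summable: \log x / x is decreasing for x \ge 3
\sum_{t=1}^{T} \frac{\log t}{t}
  \ge \int_{3}^{T+1} \frac{\log x}{x}\,dx
  = \frac{(\log (T+1))^2 - (\log 3)^2}{2}
```

The first sum stays below a constant for every $T$, while the second grows like $(\log T)^2 / 2$.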
