
Introduction of Multi-arm Bandit Algorithms

Xi Chen

Stern School of Business, New York University


A useful algorithmic tool for combining learning and decisions: Multi-Armed Bandit


Multi-Armed Bandit Problem (MAB)

• Different machines generate different random rewards

• Gambler decides which slot machine to play with each token

• Maximize reward ($$)


Online decision-making: learning while doing

• Online decision-making involves a fundamental choice:

Exploration: Gather more information

Exploitation: Make the best decision given current information

• The best long-term strategy may involve short-term sacrifices

Example: Insufficient Exploration

[Figure: payouts of two slot machines over rounds 1 to 8. The first machine pays $0, $0, $0 on its first three plays; the second machine pays $5 every round. It turns out that the second machine always pays $5/round, while the first machine pays $100 a quarter of the time ($25/round on average), e.g. $100, $0, $0, $0, $0, $100, $0, $100.]
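The arithmetic behind this example can be checked with a short sketch. The constant and function names are hypothetical, and the calculation assumes that pulls are independent, which the slides do not state explicitly.

```python
# Illustrative check of the example's numbers; all names are hypothetical.
P_WIN = 0.25          # the $100 machine pays out a quarter of the time
WIN_PAYOUT = 100.0    # payout of that machine when it wins
SAFE_PAYOUT = 5.0     # the other machine always pays $5


def expected_payout_risky() -> float:
    """Average reward per round of the $100 machine."""
    return P_WIN * WIN_PAYOUT


def prob_all_zero(plays: int) -> float:
    """Chance that `plays` independent pulls of the $100 machine all pay $0."""
    return (1.0 - P_WIN) ** plays


if __name__ == "__main__":
    print(expected_payout_risky())   # average per round, versus SAFE_PAYOUT
    print(prob_all_zero(3))          # chance of seeing $0 three times in a row
```

Under these assumptions, three $0 payouts in a row happen a little over 40% of the time, so a gambler who gives up after three plays will often abandon the machine with the higher average reward.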

A/B Testing


Exploration: Gather more information about which design is better

Exploitation: Show the best design to the customer

Learning-while-doing in revenue management


• Retailers are interested in finding an optimal policy (pricing) to maximize their revenue

• Unknown relationship between price and customers' purchasing decisions (demand distribution)

Exploration: Gather more information about customers' behavior using different prices

Exploitation: Choose the best price based on the current information

Wang, Deng & Ye. Close the Gaps: A Learning-While-Doing Algorithm for Single-Product Revenue Management Problems. Operations Research, 2014

Crowdsourcing

• Crowdsourcing: a problem-solving process where a large number of tasks are outsourced to a distributed group of workers with varying expertise

• The estimated valuation for Amazon Mechanical Turk is about $250M (2012)

• Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. Journal of Machine Learning Research, 2015

• Xi Chen, Kevin Jiao, and Qihang Lin. Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing. Journal of Machine Learning Research, 2016

Crowdsourcing


Elliptical: +1 Spiral: -1


Crowdsourcing

[Figure: labels (+1 or −1) given by different workers to the images]

Online decision: how much budget should be spent on a difficult image?

Exploration: give an image to a new worker to test his/her ability

Exploitation: give an image to the current best worker

Applications of MAB


Many applications have been studied:

• Clinical trials

• Recommender systems

• Advertising: what ad to put on a web-page?

• Auctions

• Financial portfolio design


Many algorithms for balancing the exploration-exploitation tradeoff

• ε-greedy algorithm

• Thompson sampling
  • Bayesian setup with a prior distribution over reward parameters
  • Choose the action that maximizes the expected reward under parameters sampled from the posterior

• Upper confidence bound (UCB)
  • Add confidence bonus to the estimated mean
  • If the estimator is reliable, add less; if not, add more
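A minimal sketch of two of these rules on Bernoulli arms follows. The slides do not fix a particular UCB formula; the bonus used here is the standard UCB1 form for rewards in [0, 1], chosen as an assumption. The function names, the 0.1 exploration rate, and the `run` simulator are hypothetical.

```python
import math
import random


def epsilon_greedy(means, epsilon, rng):
    """Pick a random arm with probability epsilon, else the best empirical arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])


def ucb1(means, counts, t):
    """Pick the arm with the largest estimated mean plus confidence bonus."""
    for i, n in enumerate(counts):
        if n == 0:          # play every arm once before using the bonus
            return i
    # The bonus shrinks as an arm is played more (its estimate is more reliable).
    return max(
        range(len(means)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )


def run(policy, probs, horizon, seed=0):
    """Simulate Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    for t in range(1, horizon + 1):
        if policy == "ucb1":
            arm = ucb1(means, counts, t)
        else:
            arm = epsilon_greedy(means, 0.1, rng)   # assumed epsilon = 0.1
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running average
    return counts
```

For instance, `run("ucb1", [0.2, 0.8], 2000)` returns the pull counts per arm; the better arm should receive most of the pulls.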

Online Network Revenue Management Using Thompson Sampling

Ferreira, Simchi-Levi, and Wang, 2016

• ~$300B industry with ~10% annual growth over the last 5 years

• IBISWorld US Industry Report; excludes online sales of traditionally brick & mortar stores

• Online retailers have additional information as compared to brick & mortar retailers, e.g. real-time customer purchase decisions (buy / no buy)

• How can we use this information to develop a more effective revenue management strategy?


Setting

• Finite selling horizon of T periods

• One customer arrives per period

• Sequentially observe customer purchase decisions

• Finite set of prices; ith price denoted pi

• Unknown mean demand per price (“purchase probability”) di

• Given unlimited inventory and known demand, select price with highest revenue = pi*di

• Challenges: Unknown demand
  • Exploration vs. Exploitation Tradeoff

Multi-Armed Bandit Problem

• Retailer decides…
  • Which price to offer a customer
  • How many times to offer each price
  • In what order to offer prices to customers

• Learns demand at each price to maximize revenue

$24.90 $29.90 $34.90 $39.90


Thompson Sampling: Two Price Example

Estimate of d1 ~ Beta(1, 1); true (unknown) d1 = 0.6


1. Customer arrives

2. Retailer samples θ1 and θ2 from the current distributional estimates of d1 and d2

3. Retailer offers price that maximizes piθi

4. Customer makes purchase decision (according to di)

5. Retailer observes purchase decision and updates demand estimation
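The five steps above can be sketched as a short simulation. The prices and true demand values are the ones used in this example; the function name, the seed handling, and the use of Python's standard `random.betavariate` sampler are illustrative choices rather than the authors' implementation.

```python
import random

PRICES = [29.90, 39.90]      # p1, p2 from the example
TRUE_DEMAND = [0.6, 0.3]     # d1, d2; unknown to the retailer


def thompson_pricing(horizon, seed=0):
    """Run the five steps for `horizon` customers.

    Returns (buys, no_buys, revenue), where buys[i] and no_buys[i] count the
    purchase decisions observed at price i.
    """
    rng = random.Random(seed)
    buys = [0, 0]
    no_buys = [0, 0]
    revenue = 0.0
    for _ in range(horizon):                       # 1. customer arrives
        # 2. sample theta_i from the current Beta(1 + buys, 1 + no_buys) estimate
        theta = [rng.betavariate(1 + buys[i], 1 + no_buys[i]) for i in range(2)]
        # 3. offer the price that maximizes p_i * theta_i
        i = max(range(2), key=lambda k: PRICES[k] * theta[k])
        # 4. customer buys with the true probability d_i
        bought = rng.random() < TRUE_DEMAND[i]
        # 5. update the demand estimate for the offered price only
        if bought:
            buys[i] += 1
            revenue += PRICES[i]
        else:
            no_buys[i] += 1
    return buys, no_buys, revenue
```

Since p1·d1 = 17.94 exceeds p2·d2 = 11.97, a long run such as `thompson_pricing(2000)` should offer p1 far more often than p2.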

[Figure: Beta pdfs of the demand estimates for p1 = $29.90 and p2 = $39.90]

Thompson Sampling: Two Price Example

• Retailer samples θ1 = 0.41, θ2 = 0.83

• p2θ2 > p1θ1, so the retailer offers p2 = $39.90

• Customer does not buy item 2

• Estimate of d2 ~ Beta(1, 1+1); true (unknown) d2 = 0.3

[Figure: Beta pdfs of the demand estimates for p1 = $29.90 and p2 = $39.90]

Thompson Sampling: Two Price Example

• Retailer samples θ1 = 0.93, θ2 = 0.12

• p1θ1 > p2θ2, so the retailer offers p1 = $29.90

• Customer buys item 1

• Estimate of d1 ~ Beta(1+1, 1); true (unknown) d1 = 0.6

• Estimate of d2 ~ Beta(1, 1+1); true (unknown) d2 = 0.3

[Figure: Beta pdfs of the demand estimates for p1 = $29.90 and p2 = $39.90]

Thompson Sampling: Two Price Example

• As each price is offered more times…
  • Beta pdf converges to reflect true mean demand
  • Will choose optimal price with high probability

• Estimate of d1 ~ Beta(1 + # “buy”, 1 + # “no buy”); true (unknown) d1 = 0.6

• Estimate of d2 ~ Beta(1 + # “buy”, 1 + # “no buy”); true (unknown) d2 = 0.3

[Figure: Beta pdfs of the demand estimates for p1 = $29.90 and p2 = $39.90]
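The Beta(1 + # "buy", 1 + # "no buy") estimate has a closed-form mean and standard deviation, which makes the convergence claim easy to inspect. The helper names below are hypothetical; the formulas are the standard ones for a Beta distribution.

```python
def beta_mean(buys, no_buys):
    """Mean of Beta(1 + buys, 1 + no_buys)."""
    a, b = 1 + buys, 1 + no_buys
    return a / (a + b)


def beta_std(buys, no_buys):
    """Standard deviation of Beta(1 + buys, 1 + no_buys)."""
    a, b = 1 + buys, 1 + no_buys
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5


if __name__ == "__main__":
    # Hypothetical purchase counts in which 60% of offers end in a purchase.
    for offers in (10, 100, 1000):
        buys = int(0.6 * offers)
        print(offers, beta_mean(buys, offers - buys), beta_std(buys, offers - buys))
```

As the number of offers grows, the mean approaches the observed purchase rate and the standard deviation shrinks, so samples of θ land closer and closer to the true demand.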

Advantages of Thompson Sampling


• Empirical and theoretical results show it’s a highly competitive algorithm for unlimited inventory

• Easy to implement and understand

• Non-parametric

• Continuous exploration & exploitation

How do we incorporate inventory constraints?

• Key Tradeoffs:
  • Exploration vs. Exploitation
  • Explore at the cost of running out of inventory

Thompson Sampling with Inventory: Two Price Example

1. Customer arrives

2. Retailer samples θ1 and θ2

3. Retailer solves a deterministic LP to identify the optimal fraction of remaining customers to offer p1 and p2, using
   • θ1 and θ2
   • Remaining unsold inventory & customers

4. Retailer offers price pi with probability based on the fraction found in Step 3

5. Customer makes purchase decision

6. Retailer observes decision and updates the estimate of di

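Step 3

$$
\begin{aligned}
\max_{x_1, x_2} \quad & (T - t)\,(p_1 \theta_1 x_1 + p_2 \theta_2 x_2) \\
\text{s.t.} \quad & x_1 + x_2 \le 1 \\
& (T - t)\,(\theta_1 x_1 + \theta_2 x_2) \le \mathrm{Inv}(t) \\
& x_1, x_2 \ge 0
\end{aligned}
$$

• x_i = fraction of remaining customers (T − t) to offer price p_i

• Objective: maximize revenue over remaining customers

• First constraint: fraction of remaining customers ≤ 1

• Second constraint: expected inventory sold is upper-bounded by remaining inventory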
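Because the Step 3 LP has only two variables, it can be solved by checking the corner points of the feasible region. The sketch below does this in plain Python; the function name and argument names are hypothetical, and a general LP solver would be the usual choice for more than two prices.

```python
from itertools import combinations


def solve_step3_lp(prices, theta, remaining_customers, inventory):
    """Maximize (T - t) * (p1*theta1*x1 + p2*theta2*x2) by enumerating vertices.

    Returns ((x1, x2), objective_value).
    """
    n = remaining_customers
    # Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
    cons = [
        (1.0, 1.0, 1.0),                                 # x1 + x2 <= 1
        (n * theta[0], n * theta[1], float(inventory)),  # expected sales <= inventory
        (-1.0, 0.0, 0.0),                                # x1 >= 0
        (0.0, -1.0, 0.0),                                # x2 >= 0
    ]
    best_x, best_val = (0.0, 0.0), 0.0                   # offering nothing is feasible
    for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < 1e-12:                             # parallel lines, no vertex
            continue
        x1 = (b * c2 - a2 * d) / det                     # Cramer's rule
        x2 = (a1 * d - b * c1) / det
        if all(e1 * x1 + e2 * x2 <= f + 1e-9 for e1, e2, f in cons):
            val = n * (prices[0] * theta[0] * x1 + prices[1] * theta[1] * x2)
            if val > best_val:
                best_x, best_val = (x1, x2), val
    return best_x, best_val
```

As a hypothetical instance, take prices (29.90, 39.90), θ = (0.6, 0.3), 100 remaining customers and 40 units of inventory. Offering only p1 would be expected to sell 60 units, more than is available, so the solution mixes the two prices: x ≈ (1/3, 2/3), with expected revenue of about 1396.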

Agenda

• Motivation

• Thompson sampling
  - Unlimited inventory
  - Our contribution: Limited inventory

• Theoretical results

• Simulation results

• Summary of Main Contributions

Measuring Algorithm Performance

Regret = E[Revenue of Optimal Policy with Known Demand] − E[Revenue of Algorithm]

≤ Upper Bound on Optimal Policy − E[Revenue of Algorithm]
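For the unlimited-inventory setting, where the optimal policy with known demand always offers the price with the highest p_i·d_i, the regret of a given sequence of offers can be computed directly. This is a minimal sketch under that assumption; the function name is hypothetical, and it does not cover the inventory-constrained benchmark.

```python
def expected_regret(prices, demand, offers):
    """Expected regret of an offer sequence when inventory is unlimited.

    prices[i] * demand[i] is the expected revenue per customer at price i;
    `offers` is the list of price indices the algorithm chose, one per customer.
    """
    per_offer = [p * d for p, d in zip(prices, demand)]
    best = max(per_offer)                       # revenue rate of the optimal price
    return sum(best - per_offer[i] for i in offers)


if __name__ == "__main__":
    # Two-price example: three offers of p1 and one of p2.
    print(expected_regret([29.90, 39.90], [0.6, 0.3], [0, 0, 1, 0]))
```

Each offer of the suboptimal price p2 costs 17.94 − 11.97 = 5.97 in expected revenue, so regret grows only with the number of times the algorithm offers the wrong price.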

Theoretical Results

Theorem. Suppose the LP of the underlying true demand (i.e. benchmark) is nondegenerate. Then, for the modified Thompson Sampling with Inventory Algorithm,

$$\mathrm{Regret}(T) \le O\left(\sqrt{T}\,\log T\,\log\log T\right) = \tilde{O}\left(\sqrt{T}\right)$$

My Research in Bandit Learning

• Identifying the top-K arms (arms with the largest means)

• Yuan Zhou, Xi Chen, and Jian Li. Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing. In Proceedings of International Conference on Machine Learning (ICML), 2014

• Jiecao Chen, Xi Chen, Qin Zhang, and Yuan Zhou. Adaptive Multiple-Arm Identification. In Proceedings of International Conference on Machine Learning (ICML), 2017.

• Xi Chen, Yuanzhi Li, Jieming Mao. An Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.

• Bandit for dynamic pricing (Alibaba supermarket)
  • Sentao Miao, Xi Chen, Xiuli Chao, Jiaxi Liu, and Yidong Zhang. Context-Based Dynamic Pricing with Online Clustering, 2019

My Research in Bandit Learning

• Bandit for assortment optimization (online product recommendation)

• Yining Wang, Xi Chen, and Yuan Zhou. Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.

• Xi Chen and Yining Wang. A Note on Tight Lower Bound for MNL-Bandit Assortment Selection Models. Operations Research Letters, 2018.

• Xi Chen, Yining Wang, and Yuan Zhou. Dynamic Assortment Optimization with Changing Contextual Information, 2019

• Xi Chen, Yining Wang, and Yuan Zhou. Dynamic Assortment Selection under the Nested Logit Models, 2019

• Papers can be found at my homepage http://people.stern.nyu.edu/xchen3/publication.html

https://arxiv.org/abs/1805.04785


My Research in Reinforcement Learning

• Foundations:
  • Yasin Abbasi-Yadkori, Peter L. Bartlett, Xi Chen, Alan Malek. Large-Scale Markov Decision Problems via the Linear Programming Dual (https://arxiv.org/abs/1901.01992)

• High-frequency trading:
  • Qihang Lin, Xi Chen, and Javier Peña. A Trade Execution Model under a Composite Dynamic Coherent Risk Measure. Operations Research Letters, 2015

• Crowdsourcing:
  • Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. Journal of Machine Learning Research, 2015
  • Xi Chen, Kevin Jiao, and Qihang Lin. Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing. Journal of Machine Learning Research, 2016
