Introduction of Multi-arm Bandit Algorithms
Xi Chen
Stern School of Business, New York University

Introduction of Multi-arm Bandit Algorithmspeople.stern.nyu.edu/.../Lecture_2_Intro_Bandit.pdf · Symposium on Discrete Algorithms (SODA), 2018 . • Bandit for dynamic pricing (Alibaba


A useful algorithmic tool for combining learning and decisions: Multi-Armed Bandit


Multi-Armed Bandit Problem (MAB)

• Different machines generate different random rewards

• Gambler decides which slot machine to play with each token

• Maximize reward ($$)


Online decision-making: learning while doing

• Online decision-making involves a fundamental choice:
    Exploration: Gather more information
    Exploitation: Make the best decision given current information

• The best long-term strategy may involve short-term sacrifices


Example: Insufficient Exploration

[Figure: two slot machines played over rounds 1 to 8; the slides reveal the payouts step by step]

• First machine: $0 $0 $0
• Second machine: $5 $5 $5 $5 $5 $5 $5 $5 …
• It turns out the second machine always pays $5/round
• The first machine pays $100 a quarter of the time ($25/round on average)
• Sample payouts from the first machine: $100 $0 $0 $0 $0 $100 $0 $100
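The arithmetic behind this example can be checked with a short Python sketch. It assumes one machine that always pays $5 and another that pays $100 with probability 1/4, as the slide states; the function names and the "give up after three losing pulls" rule are illustrative assumptions, not part of the lecture.

```python
def expected_payout(outcomes):
    """Expected payout per round for a list of (payout, probability) pairs."""
    return sum(payout * prob for payout, prob in outcomes)

SAFE = [(5.0, 1.0)]                    # always pays $5
RISKY = [(100.0, 0.25), (0.0, 0.75)]   # pays $100 a quarter of the time

def prob_all_zero(n_tries, p_win=0.25):
    """Chance that n_tries pulls of the risky machine all pay $0."""
    return (1.0 - p_win) ** n_tries

print(expected_payout(SAFE))    # 5.0
print(expected_payout(RISKY))   # 25.0
print(prob_all_zero(3))         # 0.421875
```

Three losing pulls in a row happen about 42% of the time on the better machine, so a gambler who stops exploring after three pulls abandons it quite often.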


A/B Testing


Exploration: Gather more information about which design is better

Exploitation: Show the best design to the customer


Learning-while-doing in revenue management

• Retailers are interested in finding an optimal policy (pricing) to maximize their revenue

• Unknown relationship between price and customers’ purchasing decisions (demand distribution)
    Exploration: Gather more information about customer behavior using different prices
    Exploitation: Set the best price based on the current information

Wang, Deng & Ye. Close the Gaps: A Learning-While-Doing Algorithm for Single-Product Revenue Management Problems. Operations Research, 2014.


Crowdsourcing

• Crowdsourcing: a problem-solving process where a large number of tasks are outsourced to a distributed group of workers with varying expertise

• The estimated valuation for Amazon Mechanical Turk is about $250M (2012)

• Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. Journal of Machine Learning Research, 2015

• Xi Chen, Kevin Jiao, and Qihang Lin. Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing. Journal of Machine Learning Research, 2016


Crowdsourcing


Elliptical: +1 Spiral: -1


Crowdsourcing

[Figure: worker labels (+1 / −1) on example images]

Online decision: how much budget should be spent on a difficult image?

Exploration: give an image to a new worker to test his/her ability

Exploitation: give an image to the current best worker


Applications of MAB


Many applications have been studied:

• Clinical trials

• Recommender systems

• Advertising: what ad to put on a web-page?

• Auctions

• Financial portfolio design


Many algorithms for balancing the exploration-exploitation tradeoff

• ε-greedy algorithm

• Thompson sampling
    • Bayesian setup with a prior distribution over reward parameters
    • Sample the parameters from the posterior and choose the action that maximizes the expected reward under the sample

• Upper confidence bound (UCB)
    • Add a confidence bonus to the estimated mean
    • If the estimator is reliable, add less; if not, add more
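The ε-greedy and UCB rules listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: rewards are Bernoulli, the bonus uses the common UCB1 form sqrt(2 ln t / n_i), and the names `epsilon_greedy`, `ucb1`, and `run` are hypothetical helpers rather than anything defined in the lecture.

```python
import math
import random

def epsilon_greedy(means, epsilon, rng):
    """With probability epsilon explore a random arm, else exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def ucb1(counts, means, t):
    """Pick the arm with the largest estimated mean plus confidence bonus."""
    for i, n in enumerate(counts):
        if n == 0:              # play every arm once before using the bonus
            return i
    # The bonus shrinks as an arm is pulled more often (a more reliable estimate).
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))

def run(policy, probs, horizon, seed=0):
    """Simulate Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    for t in range(1, horizon + 1):
        arm = policy(counts, means, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running average
    return counts

eps_policy = lambda counts, means, t, rng: epsilon_greedy(means, 0.1, rng)
ucb_policy = lambda counts, means, t, rng: ucb1(counts, means, t)

print(run(eps_policy, [0.2, 0.5, 0.7], 5000))
print(run(ucb_policy, [0.2, 0.5, 0.7], 5000))
```

Both policies keep a running mean per arm; they differ only in how they decide when to deviate from the arm that currently looks best.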


Online Network Revenue Management Using Thompson Sampling
Ferreira, Simchi-Levi, and Wang, 2016

• Online retail: ~$300B industry with ~10% annual growth over the last 5 years
    • Source: IBISWorld US Industry Report; excludes online sales of traditionally brick & mortar stores

• Online retailers have additional information as compared to brick & mortar retailers, e.g. real-time customer purchase decisions (buy / no buy)

• How can we use this information to develop a more effective revenue management strategy?


Setting

• Finite selling horizon of T periods

• One customer arrives per period
    • Sequentially observe customer purchase decisions

• Finite set of prices; the i-th price is denoted p_i

• Unknown mean demand per price (“purchase probability”) d_i

• Given unlimited inventory and known demand, select the price with the highest expected revenue p_i * d_i

• Challenges: Unknown demand
    • Exploration vs. Exploitation Tradeoff
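With known demand and unlimited inventory, the pricing rule in this setting reduces to one comparison. The Python sketch below illustrates it; the purchase probabilities are hypothetical numbers chosen for illustration, and `best_price` is an assumed helper name.

```python
def best_price(prices, demands):
    """Return (index, price, expected revenue) of the price maximizing p_i * d_i."""
    revenues = [p * d for p, d in zip(prices, demands)]
    i = max(range(len(prices)), key=lambda k: revenues[k])
    return i, prices[i], revenues[i]

prices = [24.90, 29.90, 34.90, 39.90]
demands = [0.70, 0.60, 0.40, 0.30]   # hypothetical purchase probabilities

index, price, revenue = best_price(prices, demands)
print(index, price, round(revenue, 2))
```

With these numbers the rule selects $29.90, with an expected revenue of about $17.94 per customer. The bandit problem arises because the retailer does not know `demands` and has to learn it from purchase decisions.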


Multi-Armed Bandit Problem

• Retailer decides…
    • Which price to offer a customer
    • How many times to offer each price
    • In what order to offer prices to customers

• Learns demand at each price to maximize revenue

$24.90 $29.90 $34.90 $39.90


Thompson Sampling: Two Price Example

• p_1 = $29.90: d̂_1 ~ Beta(1, 1), true (unknown) d_1 = 0.6
• p_2 = $39.90: d̂_2 ~ Beta(1, 1), true (unknown) d_2 = 0.3

1. Customer arrives
2. Retailer samples θ_1 and θ_2 from the current distributional estimates of d_1 and d_2
3. Retailer offers the price that maximizes p_i θ_i
4. Customer makes a purchase decision (according to d_i)
5. Retailer observes the purchase decision and updates the demand estimate

[Figure: Beta pdf of each demand estimate]


Thompson Sampling: Two Price Example

• Current estimates: d̂_1 ~ Beta(1, 1) for p_1 = $29.90 (true, unknown d_1 = 0.6); d̂_2 ~ Beta(1, 1) for p_2 = $39.90 (true, unknown d_2 = 0.3)
• Samples: θ_1 = 0.41, θ_2 = 0.83
• p_2 θ_2 > p_1 θ_1, so the retailer offers p_2 = $39.90
• Customer does not buy item 2, so the estimate is updated to d̂_2 ~ Beta(1, 1+1)


Thompson Sampling: Two Price Example

• Current estimates: d̂_1 ~ Beta(1, 1) for p_1 = $29.90 (true, unknown d_1 = 0.6); d̂_2 ~ Beta(1, 1+1) for p_2 = $39.90 (true, unknown d_2 = 0.3)
• Samples: θ_1 = 0.93, θ_2 = 0.12
• p_1 θ_1 > p_2 θ_2, so the retailer offers p_1 = $29.90
• Customer buys item 1, so the estimate is updated to d̂_1 ~ Beta(1+1, 1)


Thompson Sampling: Two Price Example

• As each price is offered more times…
    • Beta pdf converges to reflect true mean demand
    • Will choose optimal price with high probability

• p_1 = $29.90: d̂_1 ~ Beta(1 + # “buy”, 1 + # “no buy”), true (unknown) d_1 = 0.6
• p_2 = $39.90: d̂_2 ~ Beta(1 + # “buy”, 1 + # “no buy”), true (unknown) d_2 = 0.3
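The two-price walkthrough can be simulated with a short Python sketch. It follows the five steps of the example, with Beta(1, 1) priors and the Beta(1 + # "buy", 1 + # "no buy") update. The function name `thompson_pricing`, the horizon of 2000 customers, and the fixed seed are assumptions made for illustration.

```python
import random

def thompson_pricing(prices, true_demand, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling over a finite price set (unlimited inventory)."""
    rng = random.Random(seed)
    buys = [0] * len(prices)      # number of "buy" outcomes per price
    no_buys = [0] * len(prices)   # number of "no buy" outcomes per price
    revenue = 0.0
    for _ in range(horizon):
        # Step 2: sample theta_i from the posterior Beta(1 + buys, 1 + no buys)
        theta = [rng.betavariate(1 + b, 1 + n) for b, n in zip(buys, no_buys)]
        # Step 3: offer the price maximizing p_i * theta_i
        i = max(range(len(prices)), key=lambda k: prices[k] * theta[k])
        # Step 4: the customer buys with the true (unknown) probability d_i
        if rng.random() < true_demand[i]:
            buys[i] += 1          # Step 5: update the estimate for price i
            revenue += prices[i]
        else:
            no_buys[i] += 1
    return buys, no_buys, revenue

buys, no_buys, revenue = thompson_pricing([29.90, 39.90], [0.6, 0.3], horizon=2000)
offers = [b + n for b, n in zip(buys, no_buys)]
print("times each price was offered:", offers)
print("posterior means:", [(1 + b) / (2 + b + n) for b, n in zip(buys, no_buys)])
```

The posterior mean (1 + # buy) / (2 + # buy + # no buy) of a frequently offered price should approach its true purchase probability as the number of offers grows.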


Advantages of Thompson Sampling

• Empirical and theoretical results show it’s a highly competitive algorithm for unlimited inventory

• Easy to implement and understand
• Non-parametric
• Continuous exploration & exploitation

How do we incorporate inventory constraints?

• Key Tradeoffs:
    • Exploration vs. Exploitation
    • Explore at the cost of running out of inventory


Thompson Sampling with Inventory: Two Price Example

1. Customer arrives
2. Retailer samples θ_1 and θ_2
3. Retailer solves a deterministic LP to identify the optimal fraction of remaining customers to offer p_1 and p_2, using
    • θ_1 and θ_2
    • Remaining unsold inventory & customers
4. Retailer offers price p_i with probability based on the fraction found in Step 3
5. Customer makes purchase decision
6. Retailer observes decision and updates d̂_i


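Step 3

$$
\begin{aligned}
\max_{x_1, x_2} \quad & (T - t)\,(p_1 \theta_1 x_1 + p_2 \theta_2 x_2) \\
\text{s.t.} \quad & x_1 + x_2 \le 1 \\
& (T - t)\,(\theta_1 x_1 + \theta_2 x_2) \le \mathrm{Inv}(t) \\
& x_1, x_2 \ge 0
\end{aligned}
$$

• x_i = fraction of remaining customers (T − t) to offer price p_i
• Objective: maximize revenue over remaining customers
• First constraint: fraction of remaining customers ≤ 1
• Second constraint: expected inventory sold is upper-bounded by the remaining inventory Inv(t)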
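Because the Step 3 problem has only two variables, it can be solved without an LP library by checking the corner points of the feasible region. The Python sketch below does this. The helper name `solve_pricing_lp`, the 100 remaining customers, the 45 remaining units, and the use of θ = (0.6, 0.3) as the sampled values are all assumptions made for illustration.

```python
from itertools import combinations

def solve_pricing_lp(p, theta, remaining_customers, remaining_inventory):
    """Solve the two-price LP from Step 3 by enumerating polygon vertices."""
    n = remaining_customers
    # Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
    cons = [
        (1.0, 1.0, 1.0),                                    # x1 + x2 <= 1
        (n * theta[0], n * theta[1], remaining_inventory),  # expected sales <= inventory
        (-1.0, 0.0, 0.0),                                   # x1 >= 0
        (0.0, -1.0, 0.0),                                   # x2 >= 0
    ]

    def objective(x):
        return n * (p[0] * theta[0] * x[0] + p[1] * theta[1] * x[1])

    best_x, best_val = (0.0, 0.0), 0.0   # offering nothing is always feasible
    for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < 1e-12:
            continue  # parallel boundaries do not meet in a single vertex
        # Intersection of the two boundary lines (Cramer's rule)
        x = ((b * c2 - a2 * d) / det, (a1 * d - b * c1) / det)
        feasible = all(u * x[0] + v * x[1] <= w + 1e-9 for u, v, w in cons)
        if feasible and objective(x) > best_val:
            best_x, best_val = x, objective(x)
    return best_x, best_val

x, value = solve_pricing_lp((29.90, 39.90), (0.6, 0.3), 100, 45)
print(x, value)
```

With 100 customers left and 45 units in stock, offering $29.90 to everyone would sell 60 units in expectation, which exceeds the inventory. The optimum here mixes the two prices half and half, x = (0.5, 0.5), for an expected revenue of 1495.5. In Step 4 the retailer would then offer p_i with probability x_i.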


Agenda

• Motivation
• Thompson sampling
    - Unlimited inventory
    - Our contribution: Limited inventory
• Theoretical results
• Simulation results
• Summary of Main Contributions


Measuring Algorithm Performance

Regret = E[Revenue of Optimal Policy with Known Demand] − E[Revenue of Algorithm]
       ≤ Upper Bound on Optimal Policy − E[Revenue of Algorithm]
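For the unlimited-inventory case only, the optimal policy with known demand offers the price maximizing p_i d_i in every period, so the regret definition above reduces to a sum over prices. The Python sketch below computes it from the number of times each price was offered. The helper name `expected_regret` and the offer counts are illustrative assumptions, and the sketch does not cover the inventory-constrained benchmark.

```python
def expected_regret(prices, demand, offer_counts):
    """Regret vs. always offering the best price, given how often each price was offered."""
    revenues = [p * d for p, d in zip(prices, demand)]   # expected revenue per offer
    best = max(revenues)
    return sum(n * (best - r) for n, r in zip(offer_counts, revenues))

# Hypothetical run: the better price was offered 1900 times, the other 100 times.
regret = expected_regret([29.90, 39.90], [0.6, 0.3], [1900, 100])
print(round(regret, 2))
```

Each offer of the $39.90 price costs 17.94 − 11.97 = 5.97 in expected revenue, so 100 such offers give a regret of about 597.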


Theoretical Results

Theorem. Suppose the LP of the underlying true demand (i.e. benchmark) is nondegenerate. Then, for the modified Thompson Sampling with Inventory Algorithm,

$$
\mathrm{Regret}(T) \le O\!\left(\sqrt{T}\,\log T \,\log\log T\right) = \tilde{O}\!\left(\sqrt{T}\right)
$$


My Research in Bandit Learning

• Identifying the top-K arms (arms with the largest means)

• Yuan Zhou, Xi Chen, and Jian Li. Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing. In Proceedings of International Conference on Machine Learning (ICML), 2014

• Jiecao Chen, Xi Chen, Qin Zhang, and Yuan Zhou. Adaptive Multiple-Arm Identification. In Proceedings of International Conference on Machine Learning (ICML), 2017.

• Xi Chen, Yuanzhi Li, Jieming Mao. An Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.

• Bandit for dynamic pricing (Alibaba supermarket)
    • Sentao Miao, Xi Chen, Xiuli Chao, Jiaxi Liu, and Yidong Zhang. Context-Based Dynamic Pricing with Online Clustering, 2019


My Research in Bandit Learning

• Bandit for assortment optimization (online product recommendation)

• Yining Wang, Xi Chen, and Yuan Zhou. Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.

• Xi Chen and Yining Wang. A Note on Tight Lower Bound for MNL-Bandit Assortment Selection Models. Operations Research Letters, 2018.

• Xi Chen, Yining Wang, and Yuan Zhou. Dynamic Assortment Optimization with Changing Contextual Information, 2019

• Xi Chen, Yining Wang, and Yuan Zhou. Dynamic Assortment Selection under the Nested Logit Models, 2019

• Papers can be found at my homepage http://people.stern.nyu.edu/xchen3/publication.html


My Research in Reinforcement Learning

• Foundations:
    • Yasin Abbasi-Yadkori, Peter L. Bartlett, Xi Chen, Alan Malek. Large-Scale Markov Decision Problems via the Linear Programming Dual (https://arxiv.org/abs/1901.01992)

• High-frequency trading:
    • Qihang Lin, Xi Chen, and Javier Peña. A Trade Execution Model under a Composite Dynamic Coherent Risk Measure. Operations Research Letters, 2015

• Crowdsourcing:
    • Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. Journal of Machine Learning Research, 2015
    • Xi Chen, Kevin Jiao, and Qihang Lin. Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing. Journal of Machine Learning Research, 2016