
Learning to Price Airline Seats Under Competition

7th Annual INFORMS Revenue Management and Pricing Conference

Barcelona, Spain
Thursday, 28th June 2007

Presenter: Andrew Collins a.j.collins@soton.ac.uk

Supervisor: Prof Lyn Thomas


Overview

• Motivation

• Reinforcement Learning

• Methodology

• Model

• Results

• Conclusions


Motivation

Game Theory project manager:
• 2001-2004
• Defence Science and Technology Laboratory (Dstl), UK

Research at University of Southampton:

“To demonstrate Game Theory as a practical analytical modelling technique for usage within the OR community”

Frustrations with Game Theory:
• Difficulty in deriving feasible solutions
• Difficulty in validating results (due to simplifications)
• Dependency on input variables
• Speed and memory issues when running models

Applications of Game Theory in Defence Project- A. Collins, F. Pullum, L. Kenyon (2003) [Dstl - Unclassified]


Learning in Games
• Brown's Fictitious Play (1951)
• Fudenberg and Levine (1998)

Evolutionary
• Weibull (1995)
• Replicator Dynamics

Neural Networks
• Just a statistical process in the limit
  – Neal (1996)

Reinforcement Learning
• Association with Psychology
  – Pavlov (1927)
  – Rescorla and Wagner (1972)
• Convergence
  – Collins and Leslie (2005)

Theory of Learning in Games - Drew Fudenberg and David Levine (1998)

Reinforcement Learning

Introduction



Reinforcement Learning (RL)

A.K.A. ‘Neuro-Dynamic Programming’ or ‘Approximate Dynamic Programming’.

Agents/players reinforce their world-view from interaction with the environment.

[Diagram: the agent takes an action “a” in the environment; the environment returns a reward “r” and a new state “s”.]

Neuro-Dynamic Programming- Dimitri Bertsekas and John Tsitsiklis (1996)



Types

Type               Update ‘U’
Monte Carlo (MC)   R(next s)
Q-Learning (QL)    max_a Q(a, next s)
SARSA (SA)         Q(next a, next s)

• Players store information about each state-action pair (called the Q-value)
• They use this information to select an action when at that state
• They update this information according to a rule:

‘U’ depends on the RL type used. It usually involves:
• the observed return ‘R’ after the current state
• current Q-value estimates of subsequent states (called bootstrapping)

Reinforcement Learning- Richard Sutton and Andrew Barto (1998)

Q(a, s) ← (1 − α)·Q(a, s) + α·(reward(a, s) + U(next s)), where α is the learning rate.
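A minimal sketch of this tabular update in Python, covering the three ‘U’ variants from the Types table. The dict-of-dicts Q representation, the learning rate ALPHA, and the names make_q_table, bootstrap and update are illustrative assumptions, not the author's implementation.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)

def make_q_table():
    """Q[s][a] defaults to 0 for unseen state-action pairs."""
    return defaultdict(lambda: defaultdict(float))

def bootstrap(Q, next_s, next_a, observed_return, kind):
    """The 'U' term from the Types table for each RL variant."""
    if next_s is None:                                # terminal state: nothing to bootstrap
        return 0.0
    if kind == "MC":                                  # Monte Carlo: R(next s), the observed return
        return observed_return
    if kind == "QL":                                  # Q-Learning: max_a Q(a, next s)
        return max(Q[next_s].values(), default=0.0)
    return Q[next_s][next_a]                          # SARSA: Q(next a, next s)

def update(Q, s, a, reward, next_s, next_a, observed_return=None, kind="SARSA"):
    """Q(a, s) <- (1 - alpha) * Q(a, s) + alpha * (reward(a, s) + U(next s))."""
    u = bootstrap(Q, next_s, next_a, observed_return, kind)
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (reward + u)
```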


Issues

How do players select their actions?
• Exploration vs. exploitation
• Boltzmann Action Selection (a.k.a. Softmax)

– Similar to Logit Models

Stochastic Uncoupled Dynamics and Nash Equilibrium- Sergiu Hart and Andreu Mas-Colell (2006)

P(action) = e^(Q(action)/τ) / Σ_{b ∈ Actions} e^(Q(b)/τ)

Leads to Nash distribution:
– Unique Nash Equilibrium as τ → 0

Uncoupled Games
• Hart and Mas-Colell (2003, 2006)
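A minimal sketch of Boltzmann (softmax) action selection over the same assumed Q table as the previous sketch; boltzmann_select and the temperature argument tau are illustrative names.

```python
import math
import random

def boltzmann_select(Q, s, actions, tau):
    """Choose an action with P(action) proportional to exp(Q(action, s) / tau)."""
    q_max = max(Q[s][a] for a in actions)
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    weights = [math.exp((Q[s][a] - q_max) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```

As τ shrinks, the selection concentrates on the highest Q-value (exploitation); as τ grows, it approaches uniform random play (exploration).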

Methodology

Introduction


Methodology

Construct a simple airline pricing model
– Dynamic Pricing of Airline Tickets with Competition
  • Currie, Cheng and Smith (2005)
– Reinforcement Learning Approach to Airline Seat Allocation
  • Gosavi, Bandla and Das (2002)

Run various Reinforcement Learning (RL) models
– Compare to ‘optimal’ solutions
– Prove RL converges using stochastic approximation

Analyse the optimal solution for the model
– Find the optimal solution using Dynamic Programming
– Deduce generalisations from these results

Tools for Thinking: Modelling in Management Science- Mike Pidd (1996)


Airline Pricing Model: Flow Diagram

[Flow diagram. Reinforcement Learning path: Start → episode generation → players learn about the environment → policy updated → repeat. Dynamic Programming path: Start → backward induction → optimal policy. The two resulting policies are then compared; a sketch of the RL loop follows.]
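A minimal sketch of the RL path in the flow diagram, reusing make_q_table(), update() and boltzmann_select() from the earlier sketches. The env interface (reset() → state, actions(state) → available prices, step(action) → (next_state, reward, done)) is an assumption for illustration; the Monte Carlo variant, which needs the full episode return, is omitted for brevity.

```python
def train(env, n_episodes, kind="SARSA", tau=0.002):
    Q = make_q_table()
    for _ in range(n_episodes):                            # episode generation
        s = env.reset()
        a = boltzmann_select(Q, s, env.actions(s), tau)
        done = False
        while not done:                                    # players interact with the environment
            next_s, reward, done = env.step(a)
            next_a = None if done else boltzmann_select(Q, next_s, env.actions(next_s), tau)
            update(Q, s, a, reward,                        # Q-values (policy) updated
                   None if done else next_s, next_a,
                   observed_return=None, kind=kind)
            s, a = next_s, next_a
    return Q
```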

Airline Pricing Model

Introduction to an Airline Pricing Model


Airline Pricing Model

The game consists of two competing airline firms. The firms are ‘P1’ and ‘P2’.

• Each firm is selling seats on a single-leg flight
• Both flights are identical
• Firms attract customers with their prices

A separate model is used for customer demand.

The Theory and Practice of Revenue Management- K. Talluri and G. van Ryzin (2004)


Simple Airline Pricing Model

[Flow diagram: P1 sets price → P2 sets price → customer buys at the lowest price → end of round 1 → P1 price change → P2 price change → end of round 2 → flights leave. A sketch of the lowest-price customer rule follows.]
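A minimal sketch of the “Customer - Lowest Price” rule in the diagram above: the arriving customer buys from whichever firm posts the lowest price and still has seats. The seat-capacity handling and random tie-breaking are assumptions for illustration, not part of the slides.

```python
import random

def customer_buys(prices, seats_left):
    """prices: {firm: price}; seats_left: {firm: remaining seats}.
    Returns (firm, revenue), or (None, 0) if no seats are available."""
    available = {firm: p for firm, p in prices.items() if seats_left[firm] > 0}
    if not available:
        return None, 0
    lowest = min(available.values())
    firm = random.choice([f for f, p in available.items() if p == lowest])
    seats_left[firm] -= 1                   # the chosen firm sells one seat at its price
    return firm, lowest
```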

Airline Pricing Model

Solution Example to Simple Airline Model


Solution Example

[Price/revenue table (P1 and P2 prices, revenues R1 and R2, over rounds 1 and 2), filled in step by step over this and the next few slides; the completed table appears on the final slide of the sequence.]


Solution Example

[Price/revenue table continued.]

Player ‘1’ can now attempt to attract one or both of the remaining customers. However, player ‘2’ still has a chance to undercut to gain the last customer.


Solution Example

[Price/revenue table continued.]


Solution Example

[Completed price/revenue table from the worked example:]
P1 prices: 5, 5, 9, 9, 9
P2 prices: 10, 10, 10, 8, 8
R1: 5, 9, 14
R2: 8, 8

Comparison

Using metrics to compare policies


Comparisons

Once we have learned a policy, how do we compare policies?
• Q-values or action probabilities
  – Difficulty in weighting states

What I really care about is return, so:
• Compare return from each path
  – Curse of dimensionality
• Produce the Return Probability Distribution (RPD) of the different policies played against some standard policies:
  – Nash distribution, Nash equilibrium, myopic play, random play, etc.
  – Would need to compare ALL possibilities to be sure of convergence


Nash Equilibrium: derived from play of (5, 10, 9, 8). The blue bars are for P2 and the red bars for P1.

[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]


Nash Distribution: τ = 0.0020. The differences are so small that you will not notice them here.

[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0050. We can now see a slight change in the distribution.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0100. Notice there is more variation for P2 than P1.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0200. Notice that P1 is observing some very bad results.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.2000. Almost get random play (see next slide).


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Random Play: Notice that the expected rewards are even, since the order in which the players move does not matter.


Metrics

If a policy is very similar to another policy, we would expect to see similar RPDs from both policies when played against the standard policies.

How do we compare RPDs?
• L1-metric is meaningless…
• Hellinger, Kolmogorov-Smirnov, Gini, Information Value, Separation, Total Variation, Mean, Chi-squared…

On Choosing and Bounding Probability Metrics-Alison Gibbs and Francis Su (2002)
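A minimal sketch of one of the listed metrics, the Kolmogorov-Smirnov distance, applied to two RPDs. Representing each RPD as a {payoff: probability} dict is an assumption for illustration, not the author's data format.

```python
def ks_distance(rpd_a, rpd_b):
    """Largest absolute difference between the two cumulative distributions."""
    payoffs = sorted(set(rpd_a) | set(rpd_b))
    cdf_a = cdf_b = 0.0
    worst = 0.0
    for x in payoffs:
        cdf_a += rpd_a.get(x, 0.0)
        cdf_b += rpd_b.get(x, 0.0)
        worst = max(worst, abs(cdf_a - cdf_b))
    return worst
```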


Example Metric Results

Metric comparison of the RPDs of:

1) Nash Equilibrium policy vs. SARSA learnt policy

2) Nash Equilibrium policy vs. Nash Equilibrium policy

Greedy action selection used for calculating RPD.

The x-axis is a log-scale of episodes.

10M episodes run in total.

[Chart: metric values against episodes (log scale) for the metrics IV, KS, KS_P1, KS_P2, ROC, ROC_P1, ROC_P2, SD1, SD2, CHI1, CHI2, TV, E1, E2 and H.]

Reinforcement Learning Model

Results



Tau variation

[Chart: Kolmogorov-Smirnov distance (0–1) against exploration τ (0–0.03) for QL, MC and SARSA.]

Results compare each learning policy's RPD to the corresponding Nash Distribution policy's RPD.

MC seems to improve as exploration increases. Why not increase exploration?


Other Issues

1) Stability
• Excess exploration implies instability
• Higher dependency on the most recent observation implies instability

2) Computing
• Batch runs: 100 x 10M episodes
• 2.2 GHz, 4 GB RAM
• Time considerations
  – 23 hrs
• Memory requirements
  – 300 MB

3) Curse of Dimensionality
• Wish to increase the number of rounds

4) Customer Behaviour
• Wish to change customer behaviour (e.g. multiple customers, Logit models)

Simulation-Based Optimization-Abhijit Gosavi (2003)


Conclusions

A simple airline pricing model can lead to some interesting results. Understanding the meaning of these results might give insight into real-world pricing policy.

Trying to use the RL algorithm to solve this model reveals some interesting behaviour:
– Curse of Dimensionality
– Stability

The SARSA RL method outperforms the other methods for certain exploration levels.

Questions?

Andrew Collins

a.j.collins@soton.ac.uk
