
Learning to Price Airline Seats Under Competition

7th Annual INFORMS Revenue Management and Pricing Conference

Barcelona, Spain
Thursday, 28th June 2007

Presenter: Andrew Collins a.j.collins@soton.ac.uk

Supervisor: Prof Lyn Thomas


Overview

• Motivation

• Reinforcement Learning

• Methodology

• Model

• Results

• Conclusions


Motivation

Game Theory project manager:
• 2001-2004
• Defence Science and Technology Laboratory (Dstl), UK

Research at University of Southampton:

“To demonstrate Game Theory as a practical analytical modelling technique for usage within the OR community”

Frustrations with Game Theory:
• Difficulty in deriving feasible solutions
• Difficulty in validating results (due to simplifications)
• Dependency on input variables
• Speed and memory issues when running models

Applications of Game Theory in Defence Project- A. Collins, F. Pullum, L. Kenyon (2003) [Dstl - Unclassified]


Learning in Games
• Brown's Fictitious Play (1951)
• Fudenberg and Levine (1998)

Evolutionary
• Weibull (1995)
• Replicator Dynamics

Neural Networks
• Just a statistical process in the limit
  – Neal (1996)

Reinforcement Learning
• Association with Psychology
  – Pavlov (1927)
  – Rescorla and Wagner (1972)
• Convergence
  – Collins and Leslie (2005)

Theory of Learning in Games - Drew Fudenberg and David Levine (1998)

Reinforcement Learning

Introduction



Reinforcement Learning (RL)

A.K.A. ‘Neuro-Dynamic Programming’ or ‘Approximate Dynamic Programming’.

Agents/players reinforce their world-view from interaction with the environment.

[Diagram: the agent takes an action “a” in the environment; the environment returns a reward “r” and a new state “s”.]

Neuro-Dynamic Programming- Dimitri Bertsekas and John Tsitsiklis (1996)



Types

Type               Update ‘U’
Monte Carlo (MC)   R(next s)
Q-Learning (QL)    max_a Q(a, next s)
SARSA (SA)         Q(next a, next s)

• Players store information about each state-action pair (called the Q-value)
• They use this information to select an action when at that state
• They update this information according to a rule:

‘U’ depends on the RL type used. It usually involves:
• the observed return ‘R’ after the current state
• current Q-value estimates of subsequent states (called bootstrapping)

Reinforcement Learning- Richard Sutton and Andrew Barto (1998)

Q(a, s) ← (1 − α)·Q(a, s) + α·(reward(a, s) + U(next s)), where α is the learning rate.
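A minimal sketch of this tabular update in Python, covering the three ‘U’ variants from the Types table. The dict-of-dicts Q representation, the learning rate ALPHA, and the names make_q_table, bootstrap and update are illustrative assumptions, not the author's implementation.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)

def make_q_table():
    """Q[s][a] defaults to 0 for unseen state-action pairs."""
    return defaultdict(lambda: defaultdict(float))

def bootstrap(Q, next_s, next_a, observed_return, kind):
    """The 'U' term from the Types table for each RL variant."""
    if next_s is None:                                # terminal state: nothing to bootstrap
        return 0.0
    if kind == "MC":                                  # Monte Carlo: R(next s), the observed return
        return observed_return
    if kind == "QL":                                  # Q-Learning: max_a Q(a, next s)
        return max(Q[next_s].values(), default=0.0)
    return Q[next_s][next_a]                          # SARSA: Q(next a, next s)

def update(Q, s, a, reward, next_s, next_a, observed_return=None, kind="SARSA"):
    """Q(a, s) <- (1 - alpha) * Q(a, s) + alpha * (reward(a, s) + U(next s))."""
    u = bootstrap(Q, next_s, next_a, observed_return, kind)
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (reward + u)
```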


Issues

How do players select their actions?
• Exploration vs. exploitation
• Boltzmann Action Selection (a.k.a. Softmax)

– Similar to Logit Models

Stochastic Uncoupled Dynamics and Nash Equilibrium- Sergiu Hart and Andreu Mas-Colell (2006)

P(action) = e^(Q(action)/τ) / Σ_{b ∈ Actions} e^(Q(b)/τ)

Leads to Nash distribution:
– Unique Nash Equilibrium as τ → 0

Uncoupled Games
• Hart and Mas-Colell (2003, 2006)
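A minimal sketch of Boltzmann (softmax) action selection over the same assumed Q table as the previous sketch; boltzmann_select and the temperature argument tau are illustrative names.

```python
import math
import random

def boltzmann_select(Q, s, actions, tau):
    """Choose an action with P(action) proportional to exp(Q(action, s) / tau)."""
    q_max = max(Q[s][a] for a in actions)
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    weights = [math.exp((Q[s][a] - q_max) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```

As τ shrinks, the selection concentrates on the highest Q-value (exploitation); as τ grows, it approaches uniform random play (exploration).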

Methodology

Introduction


Methodology

Construct a simple airline pricing model
– Dynamic Pricing of Airline Tickets with Competition
  • Currie, Cheng and Smith (2005)
– Reinforcement Learning Approach to Airline Seat Allocation
  • Gosavi, Bandla and Das (2002)

Run various Reinforcement Learning (RL) models
– Compare to ‘optimal’ solutions
– Prove RL converges using stochastic approximation

Analyse the optimal solution for the model
– Find the optimal solution using Dynamic Programming
– Deduce generalisations from these results

Tools for Thinking: Modelling in Management Science- Mike Pidd (1996)


Airline Pricing Model: Flow Diagram

[Flow diagram. Reinforcement Learning path: Start → episode generation → players learn about the environment → policy updated → repeat. Dynamic Programming path: Start → backward induction → optimal policy. The two resulting policies are then compared; a sketch of the RL loop follows.]
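A minimal sketch of the RL path in the flow diagram, reusing make_q_table(), update() and boltzmann_select() from the earlier sketches. The env interface (reset() → state, actions(state) → available prices, step(action) → (next_state, reward, done)) is an assumption for illustration; the Monte Carlo variant, which needs the full episode return, is omitted for brevity.

```python
def train(env, n_episodes, kind="SARSA", tau=0.002):
    Q = make_q_table()
    for _ in range(n_episodes):                            # episode generation
        s = env.reset()
        a = boltzmann_select(Q, s, env.actions(s), tau)
        done = False
        while not done:                                    # players interact with the environment
            next_s, reward, done = env.step(a)
            next_a = None if done else boltzmann_select(Q, next_s, env.actions(next_s), tau)
            update(Q, s, a, reward,                        # Q-values (policy) updated
                   None if done else next_s, next_a,
                   observed_return=None, kind=kind)
            s, a = next_s, next_a
    return Q
```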

Airline Pricing Model

Introduction to an Airline Pricing Model


Airline Pricing Model

The game consists of two competing airline firms. The firms are ‘P1’ and ‘P2’.

• Each firm is selling seats on a single-leg flight
• Both flights are identical
• Firms attract customers with their prices

A separate model is used for customer demand.

The Theory and Practice of Revenue Management- K. Talluri and G. van Ryzin (2004)


Simple Airline Pricing Model

[Flow diagram: P1 sets price → P2 sets price → customer buys at the lowest price → end of round 1 → P1 price change → P2 price change → end of round 2 → flights leave. A sketch of the lowest-price customer rule follows.]
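A minimal sketch of the “Customer - Lowest Price” rule in the diagram above: the arriving customer buys from whichever firm posts the lowest price and still has seats. The seat-capacity handling and random tie-breaking are assumptions for illustration, not part of the slides.

```python
import random

def customer_buys(prices, seats_left):
    """prices: {firm: price}; seats_left: {firm: remaining seats}.
    Returns (firm, revenue), or (None, 0) if no seats are available."""
    available = {firm: p for firm, p in prices.items() if seats_left[firm] > 0}
    if not available:
        return None, 0
    lowest = min(available.values())
    firm = random.choice([f for f, p in available.items() if p == lowest])
    seats_left[firm] -= 1                   # the chosen firm sells one seat at its price
    return firm, lowest
```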

Airline Pricing Model

Solution Example to Simple Airline Model


Solution Example

[Price/revenue table (P1 and P2 prices, revenues R1 and R2, over rounds 1 and 2), filled in step by step over this and the next few slides; the completed table appears on the final slide of the sequence.]


Solution Example

[Price/revenue table continued.]

Player ‘1’ can now attempt to attract one or both of the remaining customers. However, player ‘2’ still has a chance to undercut to gain the last customer.


Solution Example

[Price/revenue table continued.]


Solution Example

[Completed price/revenue table from the worked example:]
P1 prices: 5, 5, 9, 9, 9
P2 prices: 10, 10, 10, 8, 8
R1: 5, 9, 14
R2: 8, 8

Comparison

Using metrics to compare policies


Comparisons

Once we have learned a policy, how do we compare policies?
• Q-values or action probabilities
  – Difficulty in weighting states

What I really care about is return, so:
• Compare return from each path
  – Curse of dimensionality
• Produce the Return Probability Distribution (RPD) of the different policies played against some standard policies:
  – Nash distribution, Nash equilibrium, myopic play, random play, etc.
  – Would need to compare ALL possibilities to be sure of convergence


Nash Equilibrium: derived from play of (5, 10, 9, 8). The blue bars are for P2 and the red bars for P1.

[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]


Nash Distribution: τ = 0.0020. The differences are so small that you will not notice them here.

[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0050. We can now see a slight change in the distribution.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0100. Notice there is more variation for P2 than P1.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.0200. Notice that P1 is observing some very bad results.


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Nash Distribution: τ = 0.2000. Almost get random play (see next slide).


[Chart: probability (0–1) against payoff (0–30) for P1 and P2.]

Random Play: Notice that the expected rewards are even, since the order in which the players move does not matter.


Metrics

If a policy is very similar to another policy, we would expect to see similar RPDs from both policies when played against the standard policies.

How do we compare RPDs?
• L1-metric is meaningless…
• Hellinger, Kolmogorov-Smirnov, Gini, Information Value, Separation, Total Variation, Mean, Chi-squared…

On Choosing and Bounding Probability Metrics-Alison Gibbs and Francis Su (2002)
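A minimal sketch of one of the listed metrics, the Kolmogorov-Smirnov distance, applied to two RPDs. Representing each RPD as a {payoff: probability} dict is an assumption for illustration, not the author's data format.

```python
def ks_distance(rpd_a, rpd_b):
    """Largest absolute difference between the two cumulative distributions."""
    payoffs = sorted(set(rpd_a) | set(rpd_b))
    cdf_a = cdf_b = 0.0
    worst = 0.0
    for x in payoffs:
        cdf_a += rpd_a.get(x, 0.0)
        cdf_b += rpd_b.get(x, 0.0)
        worst = max(worst, abs(cdf_a - cdf_b))
    return worst
```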


Example Metric Results

Metric comparison of the RPDs of:

1) Nash Equilibrium policy vs. SARSA learnt policy

2) Nash Equilibrium policy vs. Nash Equilibrium policy

Greedy action selection used for calculating RPD.

The x-axis is a log-scale of episodes.

10M episodes run in total.

[Chart: metric values against episodes (log scale) for the metrics IV, KS, KS_P1, KS_P2, ROC, ROC_P1, ROC_P2, SD1, SD2, CHI1, CHI2, TV, E1, E2 and H.]

Reinforcement Learning Model

Results



Tau variation

[Chart: Kolmogorov-Smirnov distance (0–1) against exploration τ (0–0.03) for QL, MC and SARSA.]

Results compare each learning policy's RPD to the corresponding Nash Distribution policy's RPD.

MC seems to improve as exploration increases. Why not increase exploration?


Other Issues

1) Stability
• Excess exploration implies instability
• Higher dependency on the most recent observation implies instability

2) Computing
• Batch runs: 100 x 10M episodes
• 2.2 GHz, 4 GB RAM
• Time considerations
  – 23 hrs
• Memory requirements
  – 300 MB

3) Curse of Dimensionality
• Wish to increase the number of rounds

4) Customer Behaviour
• Wish to change customer behaviour (e.g. multiple customers, Logit models)

Simulation-Based Optimization-Abhijit Gosavi (2003)


Conclusions

A simple airline pricing model can lead to some interesting results. Understanding the meaning of these results might give insight into real-world pricing policy.

Trying to use the RL algorithm to solve this model reveals some interesting behaviour:
– Curse of Dimensionality
– Stability

The SARSA RL method outperforms the other methods for certain exploration levels.

Questions?

Andrew Collins

a.j.collins@soton.ac.uk
