Advanced Topics in Machine Learning: Part I

John Shawe-Taylor and Steffen Grünewalder
UCL

Second semester 2010


Outline for today

1 The Bandit Problem
    Definition of the Bandit Problem
    LINREL

2 Gaussian Process Bandits
    Gaussian posterior distribution
    Gaussian Process Bandits
    Randomised strategy
    Infinitely many arms

Multi-armed Bandit (MAB) problem

Modelling of a casino with a finite collection of K slot machines (aka one-armed bandits):
- Learning proceeds in iterations: discrete time slots t = 1, 2, ...
- At each time the player must decide which machine to play
- After playing machine i the player receives a (randomised) reward R_i, e.g. a unit of reward or nothing
- Each machine i has a fixed (but not known to the player) reward distribution with mean µ_i, e.g. probability p_i of giving a reward
- The goal of the player is to maximise his reward:
    - either over a fixed (known) number T of plays, the horizon,
    - or the accumulated reward at any time: "any time" performance
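As an illustration of this interaction protocol (not part of the slides), the sketch below simulates a Bernoulli K-armed bandit and plays it with a simple ε-greedy rule; the names BernoulliBandit and epsilon_greedy_play and the parameter eps are mine, and ε-greedy is just one hypothetical strategy.

```python
import numpy as np

class BernoulliBandit:
    """K slot machines; machine i pays a unit reward with hidden probability p_i."""
    def __init__(self, p, rng=None):
        self.p = np.asarray(p)                        # true means mu_i = p_i, unknown to the player
        self.rng = np.random.default_rng() if rng is None else rng

    def play(self, i):
        return float(self.rng.random() < self.p[i])   # reward R_i in {0, 1}

def epsilon_greedy_play(bandit, K, T, eps=0.1, rng=None):
    """Play T rounds: explore uniformly with probability eps, otherwise
    exploit the machine with the best empirical mean so far."""
    rng = np.random.default_rng() if rng is None else rng
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if rng.random() < eps or counts.min() == 0:
            i = rng.integers(K)                       # explore (also while some machine is unplayed)
        else:
            i = int(np.argmax(sums / counts))         # exploit the empirically best machine
        r = bandit.play(i)
        counts[i] += 1; sums[i] += r; total += r
    return total

# e.g. accumulated reward over a horizon of T = 1000 plays on three machines
print(epsilon_greedy_play(BernoulliBandit([0.2, 0.5, 0.7]), K=3, T=1000))
```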


LINREL

At stage t solve
    (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t
where Z(t) is the matrix whose columns are the feature vectors of the arms selected so far.

Define
    width_i(t) = ‖α_i‖ √(ln(2TK/δ))
    ucb_i(t) = ⟨r_t, α_i⟩ + width_i(t)

Play the arm with the largest ucb_i(t).
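A minimal sketch of this selection rule (variable names and the ridge solve are my own choices; the slides do not fix an implementation):

```python
import numpy as np

def linrel_choose_arm(Z, r, X, T, K, delta, lam=1.0):
    """LinRel arm selection at stage t.

    Z     : d x t matrix whose columns are feature vectors of the arms pulled so far
    r     : length-t vector of rewards observed for those pulls (r_t)
    X     : d x K matrix whose columns are the current feature vectors x_i^t
    T, K, delta : horizon, number of arms, confidence parameter
    lam   : the regularisation parameter lambda
    """
    t = Z.shape[1]
    A = Z.T @ Z + lam * np.eye(t)                      # Z(t)'Z(t) + lambda I
    ucb = np.empty(K)
    for i in range(K):
        alpha_i = np.linalg.solve(A, Z.T @ X[:, i])    # solve (Z'Z + lam I) alpha_i = Z' x_i^t
        width_i = np.linalg.norm(alpha_i) * np.sqrt(np.log(2 * T * K / delta))
        ucb[i] = r @ alpha_i + width_i                 # <r_t, alpha_i> + width_i(t)
    return int(np.argmax(ucb))                         # play the arm with largest ucb_i(t)
```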


Kernel LinRel

Of course we can use the kernel trick: the LinRel formula
    (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t
can be expressed in a kernel-defined feature space as
    (K(t) + λI) α_i = k_i^t
where K(t) is the kernel matrix of the feature vectors of the selected arms and k_i^t is the vector of kernel evaluations between the feature vector for the i-th arm at time t and each of the arms selected up to time t.
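The same rule in the kernel-defined feature space, assuming the kernel matrix and the kernel evaluations are supplied directly (again only a sketch; argument names are mine):

```python
import numpy as np

def kernel_linrel_choose_arm(K_t, k_arms, r, T, delta, lam=1.0):
    """Kernelised LinRel: solve (K(t) + lambda I) alpha_i = k_i^t for each arm.

    K_t    : t x t kernel matrix of the arms selected so far
    k_arms : num_arms x t array; row i holds the kernel evaluations between
             arm i at time t and each previously selected arm (k_i^t)
    r      : length-t vector of observed rewards
    """
    t = K_t.shape[0]
    num_arms = k_arms.shape[0]
    A = K_t + lam * np.eye(t)
    ucb = np.empty(num_arms)
    for i in range(num_arms):
        alpha_i = np.linalg.solve(A, k_arms[i])
        width_i = np.linalg.norm(alpha_i) * np.sqrt(np.log(2 * T * num_arms / delta))
        ucb[i] = r @ alpha_i + width_i
    return int(np.argmax(ucb))
```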


Drawbacks of LINREL

- There does not appear to be any explicit modelling of mean and variance
- Not immediately clear how to optimise over infinitely many arms
- Not clear what role the kernel plays in kernel LinRel
- We would like to build in an explicit prior over options – cf. the Beta prior for the simple MAB, which could encode the likely response to click-through


Gaussian Posterior Distribution

Recall (from supervised learning) that for Gaussian Process regression we could compute the posterior distribution exactly, and hence output a mean and variance for test points:
    w_mean = X′(K + σ²I)⁻¹ y,   so that   m(x) = k′(K + σ²I)⁻¹ y
where K is the covariance (kernel) matrix, K_ij = κ(x_i, x_j), k is the vector of kernel evaluations with the training data, k_i = κ(x, x_i), y is the vector of output values and σ is the noise level in the observations.


Gaussian Posterior Distribution

Similarly for the variance:
    σ_y² = E[(y − w_mean′ φ(x))²] = κ(x, x) − k′(K + σ²I)⁻¹ k

- Recall that the posterior mean and covariance are computed from the prior and the evidence given by the training data
- The probabilities involved rely only on the noise model and not on the i.i.d. assumption
- Hence, we can apply the same analysis even if we have a strategy for choosing the inputs
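A direct transcription of the two formulas above into code, assuming a kernel function kappa(x, x′) is available (names are mine; this is a sketch rather than an efficient implementation):

```python
import numpy as np

def gp_posterior(X_train, y, x_test, kappa, sigma):
    """Posterior mean m(x) and variance sigma_y^2 at a test point x_test.

    X_train : sequence of training inputs
    y       : vector of observed outputs
    kappa   : kernel (covariance) function
    sigma   : observation noise level
    """
    n = len(X_train)
    K = np.array([[kappa(xi, xj) for xj in X_train] for xi in X_train])  # K_ij = kappa(x_i, x_j)
    k = np.array([kappa(x_test, xi) for xi in X_train])                  # k_i = kappa(x, x_i)
    C = K + sigma**2 * np.eye(n)                                         # K + sigma^2 I
    mean = k @ np.linalg.solve(C, y)                                     # k'(K + sigma^2 I)^{-1} y
    var = kappa(x_test, x_test) - k @ np.linalg.solve(C, k)              # kappa(x,x) - k'(...)^{-1} k
    return mean, var
```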


Gaussian Process Bandits

- The aim is to model the response rates of the arms as a Gaussian Process, i.e. to define a prior distribution over likely reward functions through the use of a kernel (covariance) function
- This applies to the case where we can model the noise in the observations as Gaussian with known noise level σ, that is, the reward distribution is given by

    p(r | x, w) = (1 / (σ√(2π))) exp( −(⟨w, φ(x)⟩ − r)² / (2σ²) )


Gaussian Process Bandits

Using the mean and variance we wish to select the arm that maximises
    argmax_{x ∈ Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )
where B(t) is again the slowly increasing function of the number t of pulls, e.g. √(log t).

- The set of arms Q could be a finite set given to the user (as in LinRel) or some region of the input space
- For a finite set of arms this gives a UCB-style algorithm for GP Bandits
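For a finite set of arms the rule above reduces to a loop over the candidates; the sketch below reuses the gp_posterior helper from the earlier sketch and takes B(t) = √(log t) as suggested on the slide (the helper and all names are my own):

```python
import numpy as np

def gp_ucb_choose_arm(Q, X_train, y, kappa, sigma, t):
    """UCB-style GP bandit selection over a finite arm set Q."""
    B_t = np.sqrt(np.log(max(t, 2)))                   # slowly increasing B(t), e.g. sqrt(log t); guard small t
    scores = []
    for x in Q:
        m, v = gp_posterior(X_train, y, x, kappa, sigma)
        scores.append(m + B_t * np.sqrt(max(v, 0.0)))  # posterior mean + B(t) * posterior std
    return int(np.argmax(scores))
```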


Randomised strategy

- As with the Beta distribution strategy for MABs we can also play a randomised strategy: sample from the posterior distribution and choose the maximum
- As with the Beta strategy, this corresponds to selecting an arm with probability equal to its probability of being the best arm under the posterior distribution
- We can sample iteratively by computing the mean m(x) and variance σ²(x) for a test arm, sampling

      y ∼ N(m(x), σ²(x)),

  and then adding (x, y) as an observation with no noise
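A sketch of this randomised strategy for a finite set of arms; the helper with per-observation noise lets the within-round samples be added with (essentially) zero observation noise, as described above (all names are mine, and a tiny jitter is added purely for numerical stability):

```python
import numpy as np

def gp_posterior_mixed_noise(X_obs, y_obs, noise_vars, x_test, kappa):
    """GP posterior when each observation carries its own noise variance."""
    K = np.array([[kappa(xi, xj) for xj in X_obs] for xi in X_obs])
    C = K + np.diag(noise_vars)
    k = np.array([kappa(x_test, xi) for xi in X_obs])
    mean = k @ np.linalg.solve(C, y_obs)
    var = kappa(x_test, x_test) - k @ np.linalg.solve(C, k)
    return mean, var

def randomised_choose_arm(Q, X_train, y, kappa, sigma, rng=None):
    """Visit the arms in turn, draw y ~ N(m(x), sigma^2(x)), condition on (x, y)
    as a noiseless observation, and finally play the arm with the largest draw."""
    rng = np.random.default_rng() if rng is None else rng
    X_obs, y_obs = list(X_train), list(y)
    noise = [sigma**2] * len(X_obs)          # real observations carry noise sigma^2
    draws = []
    for x in Q:
        m, v = gp_posterior_mixed_noise(X_obs, np.array(y_obs), noise, x, kappa)
        y_draw = rng.normal(m, np.sqrt(max(v, 0.0)))
        draws.append(y_draw)
        X_obs.append(x); y_obs.append(y_draw); noise.append(1e-10)  # add sample with (numerically) no noise
    return int(np.argmax(draws))
```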


Randomised strategy

- There are iterative algorithms for updating the inverse of a matrix when a new row and column are added, based on the Sherman-Morrison formula:

      (A + uv′)⁻¹ = A⁻¹ − (A⁻¹ u v′ A⁻¹) / (1 + v′ A⁻¹ u)

- To add row and column m, take A to be the previous matrix augmented with the new diagonal element; with v = e_m and u the additional column one update adds the column, and repeating with the roles of u and v reversed adds the row (a sketch follows this list)
- For a finite set of arms we can simply choose the arm with the largest output after running through the options
- Can be implemented efficiently over trees, giving competitive results when compared to UCT
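A sketch of this bordering procedure for a symmetric matrix such as K + σ²I, using two Sherman-Morrison updates (function names are mine):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Rank-one update of a known inverse: returns (A + u v')^{-1}."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

def grow_inverse(A_inv, col, diag):
    """Inverse of the (n+1) x (n+1) symmetric matrix obtained by appending
    the column `col` (and matching row) and the diagonal element `diag`."""
    n = A_inv.shape[0]
    B_inv = np.zeros((n + 1, n + 1))
    B_inv[:n, :n] = A_inv                      # previous matrix augmented with the diagonal element
    B_inv[n, n] = 1.0 / diag
    e_m = np.zeros(n + 1); e_m[n] = 1.0
    u = np.append(col, 0.0)
    B_inv = sherman_morrison(B_inv, u, e_m)    # add the new column (rank-one term u e_m')
    B_inv = sherman_morrison(B_inv, e_m, u)    # roles reversed: add the new row (e_m u')
    return B_inv

# sanity check against a direct inverse
A = np.array([[2.0, 0.3], [0.3, 1.5]])
col, diag = np.array([0.4, 0.1]), 1.2
M = np.block([[A, col[:, None]], [col[None, :], np.array([[diag]])]])
print(np.allclose(grow_inverse(np.linalg.inv(A), col, diag), np.linalg.inv(M)))
```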


Infinitely many arms

For infinitely many arms we need to maximise over a continuous space:
    argmax_{x ∈ Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )
This can be difficult, as we expect the function to become quite smooth over time: we are always picking arms that maximise the function, and doing so tends to reduce its value in that region.
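One crude but common way to approximate the maximisation over a continuous Q is to score a finite batch of random candidate points; the sketch below does this on an interval, reusing the earlier gp_posterior helper (the candidate-sampling scheme is my own illustration, not something prescribed by the slides):

```python
import numpy as np

def gp_ucb_continuous(bounds, X_train, y, kappa, sigma, t, n_candidates=1000, rng=None):
    """Approximately maximise the UCB expression over an interval Q = [lo, hi]
    by evaluating it at randomly drawn candidate points."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = bounds
    candidates = rng.uniform(lo, hi, size=n_candidates)
    B_t = np.sqrt(np.log(max(t, 2)))
    best_x, best_score = None, -np.inf
    for x in candidates:
        m, v = gp_posterior(X_train, y, x, kappa, sigma)
        score = m + B_t * np.sqrt(max(v, 0.0))
        if score > best_score:
            best_x, best_score = x, score
    return best_x
```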


Infinitely many arms

- An alternative is to take the randomised strategy, gradually sampling new points obtained by approximately optimising this expression and updating with the new information from each sample (again with no noise) – again play the sample that has the highest reward
- This typically leads to a relatively small number of points being sampled
- Regret bounds relative to the maximum observed reward have been derived
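In the same spirit, the randomised strategy can be approximated over a continuous domain by drawing a batch of candidate points and running the posterior-sampling rule from the earlier sketch over them (again my own illustration of "approximately optimising this expression"):

```python
import numpy as np

def randomised_continuous(bounds, X_train, y, kappa, sigma, n_candidates=50, rng=None):
    """Approximate the randomised strategy over an interval by sampling candidate
    points and applying the finite-arm posterior-sampling rule to them."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = bounds
    Q = rng.uniform(lo, hi, size=n_candidates)
    i = randomised_choose_arm(Q, X_train, y, kappa, sigma, rng=rng)  # from the earlier sketch
    return Q[i]
```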


Conclusions

- Putting a Gaussian process prior over the reward function leads to a natural Bayesian update for the mean and variance when the reward noise model is Gaussian
- Leads to a natural way of encoding prior knowledge and a quite efficient implementation for finite sets of arms
- Can be implemented efficiently for kernels over trees – cf. UCT
- For infinitely many arms a randomised strategy seems to be the most promising approach
- Some recent results for bounding the regret
