Advanced Topics in Machine Learning: Part I
John Shawe-Taylor and Steffen Grünewalder, UCL
Second semester, 2010
Outline for today

1 The Bandit Problem
  - Definition of the Bandit Problem
  - LINREL

2 Gaussian Process Bandits
  - Gaussian posterior distribution
  - Gaussian Process Bandits
  - Randomised strategy
  - Infinitely many arms
Multi-armed Bandit (MAB) problem

Modelling of a casino with a finite collection of K slot machines (aka one-armed bandits):
- Learning proceeds in iterations: discrete time slots t = 1, 2, ...
- At each time the player must decide which machine to play
- After playing machine i the player receives a (randomised) reward R_i, e.g. a unit of reward or nothing
- Each machine i has a fixed reward distribution (not known to the player) with mean µ_i, e.g. probability p_i of giving a reward
- The goal of the player is to maximise his reward:
  - this could be over a fixed (known) number T of plays, the horizon,
  - or alternatively maximising the accumulated reward at any time: any-time performance
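The setup above can be simulated directly. A minimal sketch follows; the reward probabilities p_i are hypothetical, and the index used to drive play is a standard UCB1-style choice (not part of these slides), used here only to make the simulation do something sensible:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.8])    # hypothetical reward probabilities p_i
K, T = len(p), 1000

counts = np.zeros(K, dtype=int)  # times each machine was played
sums = np.zeros(K)               # accumulated reward per machine

for t in range(T):
    if t < K:
        i = t  # play each machine once to initialise
    else:
        # empirical mean plus an exploration bonus (UCB1-style index)
        i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
    r = float(rng.random() < p[i])  # unit reward with probability p_i, else nothing
    counts[i] += 1
    sums[i] += r
```

After enough plays the machine with the largest mean dominates the play counts, which is exactly the behaviour a good strategy should exhibit.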
LINREL
At stage t solve

  (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t

where Z(t) is the matrix whose columns are the feature vectors of the selected arms. Define

  width_i(t) = ‖α_i‖ √(ln(2TK/δ))
  ucb_i(t) = ⟨r_t, α_i⟩ + width_i(t)

Play the arm with the largest ucb_i(t).
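This step can be sketched directly; variable names are ours, and r is the vector of rewards observed so far:

```python
import numpy as np

def linrel_choose(Z, r, X, T, delta=0.1, lam=1.0):
    """One LinRel selection step (sketch).
    Z: d x t matrix whose columns are feature vectors of the selected arms,
    r: the t rewards observed so far,
    X: K x d array of candidate arm feature vectors."""
    t = Z.shape[1]
    K = X.shape[0]
    G = Z.T @ Z + lam * np.eye(t)  # Z(t)'Z(t) + lambda I
    ucbs = []
    for x in X:
        alpha = np.linalg.solve(G, Z.T @ x)  # solve (Z'Z + lam I) alpha = Z' x
        width = np.linalg.norm(alpha) * np.sqrt(np.log(2 * T * K / delta))
        ucbs.append(r @ alpha + width)       # <r_t, alpha_i> + width_i(t)
    return int(np.argmax(ucbs))
```

With two orthogonal arms and identical widths, the arm with the better observed rewards wins, as expected.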
Kernel LinRel
Of course we can use the kernel trick: if we take the formula

  (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t

for LinRel, this can be given in a kernel-defined feature space as

  (K(t) + λI) α_i = k_i^t

where K(t) is the kernel matrix of the feature vectors of the selected arms and k_i^t is the vector of kernel evaluations between the feature vector for the i-th arm at time t and each of the selected arms up to time t.
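The same step in kernel form, as a sketch (kappa stands for any kernel function; the names are ours):

```python
import numpy as np

def kernel_linrel_choose(S, r, arms, kappa, T, delta=0.1, lam=1.0):
    """Kernel LinRel selection step (sketch).
    S: feature vectors of the arms selected so far,
    r: the observed rewards, arms: candidate arms, kappa: kernel function."""
    t, K = len(S), len(arms)
    Kt = np.array([[kappa(a, b) for b in S] for a in S])   # kernel matrix K(t)
    ucbs = []
    for x in arms:
        k = np.array([kappa(x, a) for a in S])             # k_i^t
        alpha = np.linalg.solve(Kt + lam * np.eye(t), k)   # (K(t) + lam I) alpha = k
        width = np.linalg.norm(alpha) * np.sqrt(np.log(2 * T * K / delta))
        ucbs.append(r @ alpha + width)
    return int(np.argmax(ucbs))
```

With the linear kernel κ(a, b) = ⟨a, b⟩ this reproduces the primal LinRel choice, which is a quick sanity check on the dual formulation.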
Drawbacks of LINREL
- There does not appear to be explicit modelling of mean and variance
- Not immediately clear how to optimise over infinitely many arms
- Not clear what role the kernel plays in kernel LinRel
- Would like to build in an explicit prior over options – cf. the Beta prior for the simple MAB, which could encode the likely response to click-through
Gaussian Posterior Distribution
Recall (from supervised learning) that for Gaussian Process regression we could compute the posterior distribution exactly and hence output a mean and variance for test points:

  w_mean = X′(K + σ²I)⁻¹ y,  so that  m(x) = k′(K + σ²I)⁻¹ y

where K is the covariance (kernel) matrix with K_ij = κ(x_i, x_j), k is the vector of kernel evaluations with the training data, k_i = κ(x, x_i), y is the vector of output values and σ is the noise level in the observations.
Similarly for the variance:

  σ_y² = E[(y − w_mean′ φ(x))²] = κ(x, x) − k′(K + σ²I)⁻¹ k.

- Recall that the posterior mean and covariance are computed from the prior and the evidence given by the training data
- The probabilities involved rely only on the noise model and not on the i.i.d. assumption
- Hence, we can apply the same analysis even if we have a strategy for choosing the inputs
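The mean and variance formulas together amount to a few lines of linear algebra. A sketch, using scalar inputs and an RBF kernel chosen purely for illustration:

```python
import numpy as np

def gp_posterior(X, y, x, kappa, sigma=0.1):
    """Posterior mean m(x) and variance at test point x, per the formulas above."""
    n = len(X)
    K = np.array([[kappa(a, b) for b in X] for a in X])  # K_ij = kappa(x_i, x_j)
    k = np.array([kappa(x, a) for a in X])               # k_i = kappa(x, x_i)
    A = K + sigma**2 * np.eye(n)
    m = k @ np.linalg.solve(A, y)                        # k'(K + sigma^2 I)^{-1} y
    var = kappa(x, x) - k @ np.linalg.solve(A, k)        # kappa(x,x) - k'(A)^{-1} k
    return m, var

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
m, var = gp_posterior([0.0, 1.0, 2.0], np.array([1.0, 2.0, 3.0]), 1.0, rbf, sigma=0.01)
```

With a small noise level the posterior mean nearly interpolates the data at a training point, and the variance there shrinks well below the prior value κ(x, x).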
Gaussian Process Bandits
- The aim is to model the response rates of the arms as a Gaussian Process, i.e. to define a prior distribution over likely reward functions through the use of a kernel (covariance) function
- This applies to the case where we can model the noise in the observations as Gaussian with a known variance σ², that is, the reward distribution is given by

  p(r | x, w) = (1 / (σ√(2π))) exp( −(⟨w, φ(x)⟩ − r)² / (2σ²) )
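Under this noise model, pulling an arm simply means observing the latent value ⟨w, φ(x)⟩ plus N(0, σ²) noise. A sketch in which w and φ are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
w = np.array([0.3, -0.2])              # hypothetical latent weight vector
phi = lambda x: np.array([x, x ** 2])  # hypothetical feature map

def pull(x):
    """Draw a reward r ~ N(<w, phi(x)>, sigma^2)."""
    return w @ phi(x) + sigma * rng.normal()

rewards = np.array([pull(1.0) for _ in range(5000)])
# the expected reward at x = 1.0 is <w, phi(1.0)> = 0.3 - 0.2 = 0.1
```

The empirical mean of many pulls concentrates around ⟨w, φ(x)⟩ while the spread reflects σ, which is exactly the structure the GP posterior update exploits.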
Using the mean and variance we wish to select the arm that maximises

  argmax_{x∈Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )

where B(t) is again a slowly increasing function of the number t of pulls, e.g. √(log t).

- The set of arms Q could be a finite set given to the user (as in LinRel) or some region of the input space
- For a finite set of arms this gives a UCB-style algorithm for GP Bandits
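For a finite Q this selection rule is a short loop. A sketch, with a hypothetical kernel and data:

```python
import numpy as np

def gp_ucb_choose(X, y, arms, kappa, t, sigma=0.1):
    """Pick the arm maximising posterior mean + B(t) * posterior std,
    with B(t) = sqrt(log t) as suggested above (sketch)."""
    n = len(X)
    A = np.array([[kappa(a, b) for b in X] for a in X]) + sigma**2 * np.eye(n)
    B = np.sqrt(np.log(t))
    scores = []
    for x in arms:
        k = np.array([kappa(x, a) for a in X])
        m = k @ np.linalg.solve(A, y)                            # posterior mean
        var = max(kappa(x, x) - k @ np.linalg.solve(A, k), 0.0)  # posterior variance
        scores.append(m + B * np.sqrt(var))
    return int(np.argmax(scores))

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
# with data near 0 and 1, the distant arm at 5.0 wins on its uncertainty bonus
choice = gp_ucb_choose([0.0, 1.0], np.array([0.0, 1.0]), [0.0, 1.0, 5.0], rbf, t=8)
```

The distant arm is chosen despite its low posterior mean: its posterior standard deviation is near the prior value, so the exploration bonus dominates, which is the intended UCB behaviour.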
Randomised strategy
- As with the Beta distribution strategy for MABs we can also play a randomised strategy: sample from the posterior distribution and choose the maximum
- As with the Beta strategy this corresponds to selecting an arm with probability equal to its probability of being the best arm under the posterior distribution
- We can sample iteratively by computing the mean m(x) and standard deviation σ(x) for a test arm, sampling according to

  y ∼ N(m(x), σ²(x))

  and then adding (x, y) as an observation with no noise
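A sketch of this iterative sampling scheme (Thompson-style); the data and kernel below are hypothetical:

```python
import numpy as np

def randomised_choose(X, y, arms, kappa, sigma=0.1, seed=0):
    """Sample y ~ N(m(x), sigma^2(x)) at each arm in turn, feeding every
    draw back in as a noiseless observation, then play the argmax (sketch)."""
    rng = np.random.default_rng(seed)
    Xs, ys = list(X), list(y)
    noise = [sigma ** 2] * len(X)  # per-observation noise variances
    samples = []
    for x in arms:
        K = np.array([[kappa(a, b) for b in Xs] for a in Xs]) + np.diag(noise)
        k = np.array([kappa(x, a) for a in Xs])
        m = k @ np.linalg.solve(K, ys)
        var = max(kappa(x, x) - k @ np.linalg.solve(K, k), 0.0)
        s = rng.normal(m, np.sqrt(var))
        samples.append(s)
        Xs.append(x); ys.append(s); noise.append(0.0)  # add (x, y) with no noise
    return int(np.argmax(samples))

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
choice = randomised_choose([0.0, 3.0], [0.0, 2.0], [0.0, 3.0], rbf)
```

With strong evidence that the arm near 3.0 pays more, the sampled draws concentrate around the posterior means and the better arm is selected with high probability.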
- There are iterative algorithms for updating the inverse of a matrix when a new row and column are added, based on the Sherman–Morrison formula:

  (A + uv′)⁻¹ = A⁻¹ − (A⁻¹ u v′ A⁻¹) / (1 + v′ A⁻¹ u).

- To add row m, take A to be the previous matrix augmented with the new diagonal element, and set v = e_m with u the additional column; then repeat with the roles reversed to add the additional row
- For a finite set of arms we can simply choose the arm with the largest output after running through the options
- This can be implemented efficiently over trees, giving competitive results when compared to UCT
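The formula itself is easy to check numerically; a sketch with a fixed small example:

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Return (A + u v')^{-1} given A^{-1}, via the Sherman-Morrison formula."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

A = np.diag([2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])
updated = sherman_morrison(np.linalg.inv(A), u, v)
```

Each rank-one update costs O(n²) rather than the O(n³) of a fresh inversion, which is what makes the incremental row-and-column scheme above attractive when arms are added one at a time.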
Infinitely many arms
For infinitely many arms we need to maximise over a continuous space:

  argmax_{x∈Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )

This can be difficult, since we expect the function to become quite smooth over time: we are always picking arms that maximise the function, and this tends to reduce the value of the function in that region.
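One crude but serviceable approximation over an interval Q = [lo, hi] is dense grid evaluation. A sketch; the kernel, data and grid size are all our choices:

```python
import numpy as np

def ucb_value(x, X, y, kappa, t, sigma=0.1):
    """The quantity being maximised: posterior mean + sqrt(log t) * posterior std."""
    n = len(X)
    A = np.array([[kappa(a, b) for b in X] for a in X]) + sigma**2 * np.eye(n)
    k = np.array([kappa(x, a) for a in X])
    m = k @ np.linalg.solve(A, y)
    var = max(kappa(x, x) - k @ np.linalg.solve(A, k), 0.0)
    return m + np.sqrt(np.log(t)) * np.sqrt(var)

def maximise_ucb(X, y, kappa, t, lo, hi, grid=201):
    """Approximate argmax over the continuous interval [lo, hi] by a grid."""
    xs = np.linspace(lo, hi, grid)
    vals = [ucb_value(x, X, y, kappa, t) for x in xs]
    return float(xs[int(np.argmax(vals))])

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
best = maximise_ucb([0.0, 1.0], np.array([0.0, 1.0]), rbf, t=8, lo=0.0, hi=2.0)
```

Away from the observed points the variance term grows toward the prior value, so the maximiser is pulled into the unexplored part of the interval; grid search does not scale to higher dimensions, which is part of why the randomised strategy below is attractive.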
- An alternative is to take the randomised strategy, gradually sampling new points obtained by approximately optimising this expression, updating with the new information from each sample (again with no noise) – and again playing the sample that has the highest reward
- This typically leads to a relatively small number of points being sampled
- Regret bounds relative to the maximum reward observed have been derived
Conclusions
- Putting a Gaussian process prior on the reward function leads to a natural Bayesian update for the mean and variance when the reward noise model is Gaussian
- This gives a natural way of encoding prior knowledge and a quite efficient implementation for finite sets of arms
- It can be implemented efficiently for kernels over trees – cf. UCT
- For infinitely many arms a randomised strategy seems to be the most promising
- There are some recent results bounding the regret