Advanced Topics in Machine Learning: Part I
John Shawe-Taylor and Steffen Grünewalder, UCL
Second semester, 2010
Outline for today

1 The Bandit Problem
  - Definition of the Bandit Problem
  - LINREL

2 Gaussian Process Bandits
  - Gaussian posterior distribution
  - Gaussian Process Bandits
  - Randomised strategy
  - Infinitely many arms
Multi-armed Bandit (MAB) problem

Modelling of a casino with a finite collection of K slot machines (aka one-armed bandits):
- Learning proceeds in iterations: discrete time slots t = 1, 2, ...
- At each time the player must decide which machine to play
- After playing machine i the player receives a (randomised) reward R_i, e.g. a unit of reward or nothing
- Each machine i has a fixed reward distribution (not known to the player) with mean µ_i, e.g. probability p_i of giving a reward
- The goal of the player is to maximise his reward:
  - this could be over a fixed (known) number T of plays, the horizon,
  - or alternatively maximising the accumulated reward at any time: any-time performance
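The setup above can be simulated directly. A minimal sketch follows; the reward probabilities p_i are hypothetical, and the index used to drive play is a standard UCB1-style choice (not part of these slides), used here only to make the simulation do something sensible:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.8])    # hypothetical reward probabilities p_i
K, T = len(p), 1000

counts = np.zeros(K, dtype=int)  # times each machine was played
sums = np.zeros(K)               # accumulated reward per machine

for t in range(T):
    if t < K:
        i = t  # play each machine once to initialise
    else:
        # empirical mean plus an exploration bonus (UCB1-style index)
        i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
    r = float(rng.random() < p[i])  # unit reward with probability p_i, else nothing
    counts[i] += 1
    sums[i] += r
```

After enough plays the machine with the largest mean dominates the play counts, which is exactly the behaviour a good strategy should exhibit.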
LINREL
At stage t solve

  (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t

where Z(t) is the matrix whose columns are the feature vectors of the selected arms. Define

  width_i(t) = ‖α_i‖ √(ln(2TK/δ))
  ucb_i(t) = ⟨r_t, α_i⟩ + width_i(t)

Play the arm with the largest ucb_i(t).
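This step can be sketched directly; variable names are ours, and r is the vector of rewards observed so far:

```python
import numpy as np

def linrel_choose(Z, r, X, T, delta=0.1, lam=1.0):
    """One LinRel selection step (sketch).
    Z: d x t matrix whose columns are feature vectors of the selected arms,
    r: the t rewards observed so far,
    X: K x d array of candidate arm feature vectors."""
    t = Z.shape[1]
    K = X.shape[0]
    G = Z.T @ Z + lam * np.eye(t)  # Z(t)'Z(t) + lambda I
    ucbs = []
    for x in X:
        alpha = np.linalg.solve(G, Z.T @ x)  # solve (Z'Z + lam I) alpha = Z' x
        width = np.linalg.norm(alpha) * np.sqrt(np.log(2 * T * K / delta))
        ucbs.append(r @ alpha + width)       # <r_t, alpha_i> + width_i(t)
    return int(np.argmax(ucbs))
```

With two orthogonal arms and identical widths, the arm with the better observed rewards wins, as expected.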
Kernel LinRel
Of course we can use the kernel trick: if we take the formula

  (Z(t)′Z(t) + λI) α_i = Z(t)′ x_i^t

for LinRel, this can be given in a kernel-defined feature space as

  (K(t) + λI) α_i = k_i^t

where K(t) is the kernel matrix of the feature vectors of the selected arms and k_i^t is the vector of kernel evaluations between the feature vector for the i-th arm at time t and each of the selected arms up to time t.
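The same step in kernel form, as a sketch (kappa stands for any kernel function; the names are ours):

```python
import numpy as np

def kernel_linrel_choose(S, r, arms, kappa, T, delta=0.1, lam=1.0):
    """Kernel LinRel selection step (sketch).
    S: feature vectors of the arms selected so far,
    r: the observed rewards, arms: candidate arms, kappa: kernel function."""
    t, K = len(S), len(arms)
    Kt = np.array([[kappa(a, b) for b in S] for a in S])   # kernel matrix K(t)
    ucbs = []
    for x in arms:
        k = np.array([kappa(x, a) for a in S])             # k_i^t
        alpha = np.linalg.solve(Kt + lam * np.eye(t), k)   # (K(t) + lam I) alpha = k
        width = np.linalg.norm(alpha) * np.sqrt(np.log(2 * T * K / delta))
        ucbs.append(r @ alpha + width)
    return int(np.argmax(ucbs))
```

With the linear kernel κ(a, b) = ⟨a, b⟩ this reproduces the primal LinRel choice, which is a quick sanity check on the dual formulation.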
Drawbacks of LINREL
- There does not appear to be explicit modelling of mean and variance
- Not immediately clear how to optimise over infinitely many arms
- Not clear what role the kernel plays in kernel LinRel
- Would like to build in an explicit prior over options – cf. the Beta prior for the simple MAB, which could encode the likely response to click-through
Gaussian Posterior Distribution
Recall (from supervised learning) that for Gaussian Process regression we could compute the posterior distribution exactly and hence output a mean and variance for test points:

  w_mean = X′(K + σ²I)⁻¹ y,  so that  m(x) = k′(K + σ²I)⁻¹ y

where K is the covariance (kernel) matrix with K_ij = κ(x_i, x_j), k is the vector of kernel evaluations with the training data, k_i = κ(x, x_i), y is the vector of output values and σ is the noise level in the observations.
Similarly for the variance:

  σ_y² = E[(y − w_mean′ φ(x))²] = κ(x, x) − k′(K + σ²I)⁻¹ k.

- Recall that the posterior mean and covariance are computed from the prior and the evidence given by the training data
- The probabilities involved rely only on the noise model and not on the i.i.d. assumption
- Hence, we can apply the same analysis even if we have a strategy for choosing the inputs
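The mean and variance formulas together amount to a few lines of linear algebra. A sketch, using scalar inputs and an RBF kernel chosen purely for illustration:

```python
import numpy as np

def gp_posterior(X, y, x, kappa, sigma=0.1):
    """Posterior mean m(x) and variance at test point x, per the formulas above."""
    n = len(X)
    K = np.array([[kappa(a, b) for b in X] for a in X])  # K_ij = kappa(x_i, x_j)
    k = np.array([kappa(x, a) for a in X])               # k_i = kappa(x, x_i)
    A = K + sigma**2 * np.eye(n)
    m = k @ np.linalg.solve(A, y)                        # k'(K + sigma^2 I)^{-1} y
    var = kappa(x, x) - k @ np.linalg.solve(A, k)        # kappa(x,x) - k'(A)^{-1} k
    return m, var

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
m, var = gp_posterior([0.0, 1.0, 2.0], np.array([1.0, 2.0, 3.0]), 1.0, rbf, sigma=0.01)
```

With a small noise level the posterior mean nearly interpolates the data at a training point, and the variance there shrinks well below the prior value κ(x, x).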
Gaussian Process Bandits
- The aim is to model the response rates of the arms as a Gaussian Process, i.e. to define a prior distribution over likely reward functions through the use of a kernel (covariance) function
- This applies to the case where we can model the noise in the observations as Gaussian with a known variance σ², that is, the reward distribution is given by

  p(r | x, w) = (1 / (σ√(2π))) exp( −(⟨w, φ(x)⟩ − r)² / (2σ²) )
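Under this noise model, pulling an arm simply means observing the latent value ⟨w, φ(x)⟩ plus N(0, σ²) noise. A sketch in which w and φ are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
w = np.array([0.3, -0.2])              # hypothetical latent weight vector
phi = lambda x: np.array([x, x ** 2])  # hypothetical feature map

def pull(x):
    """Draw a reward r ~ N(<w, phi(x)>, sigma^2)."""
    return w @ phi(x) + sigma * rng.normal()

rewards = np.array([pull(1.0) for _ in range(5000)])
# the expected reward at x = 1.0 is <w, phi(1.0)> = 0.3 - 0.2 = 0.1
```

The empirical mean of many pulls concentrates around ⟨w, φ(x)⟩ while the spread reflects σ, which is exactly the structure the GP posterior update exploits.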
Using the mean and variance we wish to select the arm that maximises

  argmax_{x∈Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )

where B(t) is again a slowly increasing function of the number t of pulls, e.g. √(log t).

- The set of arms Q could be a finite set given to the user (as in LinRel) or some region of the input space
- For a finite set of arms this gives a UCB-style algorithm for GP Bandits
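For a finite Q this selection rule is a short loop. A sketch, with a hypothetical kernel and data:

```python
import numpy as np

def gp_ucb_choose(X, y, arms, kappa, t, sigma=0.1):
    """Pick the arm maximising posterior mean + B(t) * posterior std,
    with B(t) = sqrt(log t) as suggested above (sketch)."""
    n = len(X)
    A = np.array([[kappa(a, b) for b in X] for a in X]) + sigma**2 * np.eye(n)
    B = np.sqrt(np.log(t))
    scores = []
    for x in arms:
        k = np.array([kappa(x, a) for a in X])
        m = k @ np.linalg.solve(A, y)                            # posterior mean
        var = max(kappa(x, x) - k @ np.linalg.solve(A, k), 0.0)  # posterior variance
        scores.append(m + B * np.sqrt(var))
    return int(np.argmax(scores))

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
# with data near 0 and 1, the distant arm at 5.0 wins on its uncertainty bonus
choice = gp_ucb_choose([0.0, 1.0], np.array([0.0, 1.0]), [0.0, 1.0, 5.0], rbf, t=8)
```

The distant arm is chosen despite its low posterior mean: its posterior standard deviation is near the prior value, so the exploration bonus dominates, which is the intended UCB behaviour.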
Randomised strategy
- As with the Beta distribution strategy for MABs we can also play a randomised strategy: sample from the posterior distribution and choose the maximum
- As with the Beta strategy this corresponds to selecting an arm with probability equal to its probability of being the best arm under the posterior distribution
- We can sample iteratively by computing the mean m(x) and standard deviation σ(x) for a test arm, sampling according to

  y ∼ N(m(x), σ²(x))

  and then adding (x, y) as an observation with no noise
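A sketch of this iterative sampling scheme (Thompson-style); the data and kernel below are hypothetical:

```python
import numpy as np

def randomised_choose(X, y, arms, kappa, sigma=0.1, seed=0):
    """Sample y ~ N(m(x), sigma^2(x)) at each arm in turn, feeding every
    draw back in as a noiseless observation, then play the argmax (sketch)."""
    rng = np.random.default_rng(seed)
    Xs, ys = list(X), list(y)
    noise = [sigma ** 2] * len(X)  # per-observation noise variances
    samples = []
    for x in arms:
        K = np.array([[kappa(a, b) for b in Xs] for a in Xs]) + np.diag(noise)
        k = np.array([kappa(x, a) for a in Xs])
        m = k @ np.linalg.solve(K, ys)
        var = max(kappa(x, x) - k @ np.linalg.solve(K, k), 0.0)
        s = rng.normal(m, np.sqrt(var))
        samples.append(s)
        Xs.append(x); ys.append(s); noise.append(0.0)  # add (x, y) with no noise
    return int(np.argmax(samples))

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
choice = randomised_choose([0.0, 3.0], [0.0, 2.0], [0.0, 3.0], rbf)
```

With strong evidence that the arm near 3.0 pays more, the sampled draws concentrate around the posterior means and the better arm is selected with high probability.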
- There are iterative algorithms for updating the inverse of a matrix when a new row and column are added, based on the Sherman–Morrison formula:

  (A + uv′)⁻¹ = A⁻¹ − (A⁻¹ u v′ A⁻¹) / (1 + v′ A⁻¹ u).

- To add row m, take A to be the previous matrix augmented with the new diagonal element, and set v = e_m with u the additional column; then repeat with the roles reversed to add the additional row
- For a finite set of arms we can simply choose the arm with the largest output after running through the options
- This can be implemented efficiently over trees, giving competitive results when compared to UCT
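The formula itself is easy to check numerically; a sketch with a fixed small example:

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Return (A + u v')^{-1} given A^{-1}, via the Sherman-Morrison formula."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

A = np.diag([2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])
updated = sherman_morrison(np.linalg.inv(A), u, v)
```

Each rank-one update costs O(n²) rather than the O(n³) of a fresh inversion, which is what makes the incremental row-and-column scheme above attractive when arms are added one at a time.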
Infinitely many arms
For infinitely many arms we need to maximise over a continuous space:

  argmax_{x∈Q}  k′(K + σ²I)⁻¹ y + B(t) √( κ(x, x) − k′(K + σ²I)⁻¹ k )

This can be difficult, since we expect the function to become quite smooth over time: we are always picking arms that maximise the function, and this tends to reduce the value of the function in that region.
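One crude but serviceable approximation over an interval Q = [lo, hi] is dense grid evaluation. A sketch; the kernel, data and grid size are all our choices:

```python
import numpy as np

def ucb_value(x, X, y, kappa, t, sigma=0.1):
    """The quantity being maximised: posterior mean + sqrt(log t) * posterior std."""
    n = len(X)
    A = np.array([[kappa(a, b) for b in X] for a in X]) + sigma**2 * np.eye(n)
    k = np.array([kappa(x, a) for a in X])
    m = k @ np.linalg.solve(A, y)
    var = max(kappa(x, x) - k @ np.linalg.solve(A, k), 0.0)
    return m + np.sqrt(np.log(t)) * np.sqrt(var)

def maximise_ucb(X, y, kappa, t, lo, hi, grid=201):
    """Approximate argmax over the continuous interval [lo, hi] by a grid."""
    xs = np.linspace(lo, hi, grid)
    vals = [ucb_value(x, X, y, kappa, t) for x in xs]
    return float(xs[int(np.argmax(vals))])

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
best = maximise_ucb([0.0, 1.0], np.array([0.0, 1.0]), rbf, t=8, lo=0.0, hi=2.0)
```

Away from the observed points the variance term grows toward the prior value, so the maximiser is pulled into the unexplored part of the interval; grid search does not scale to higher dimensions, which is part of why the randomised strategy below is attractive.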
- An alternative is to take the randomised strategy, gradually sampling new points obtained by approximately optimising this expression, updating with the new information from each sample (again with no noise) – and again playing the sample that has the highest reward
- This typically leads to a relatively small number of points being sampled
- Regret bounds relative to the maximum reward observed have been derived
Conclusions
- Putting a Gaussian process prior on the reward function leads to a natural Bayesian update for the mean and variance when the reward noise model is Gaussian
- This gives a natural way of encoding prior knowledge and a quite efficient implementation for finite sets of arms
- It can be implemented efficiently for kernels over trees – cf. UCT
- For infinitely many arms a randomised strategy seems to be the most promising
- There are some recent results bounding the regret