

Thompson Sampling for Contextual Combinatorial Bandits

Yingfei Wang
Department of Computer Science, Princeton University
Princeton, NJ 08540, USA
[email protected]

Hua Ouyang
Yahoo Labs
701 First Avenue, Sunnyvale, CA 94089, USA
[email protected]

Chu Wang
Program in Applied and Computational Mathematics, Princeton University
Princeton, NJ 08540, USA
[email protected]

Jianhui Chen
Yahoo Labs
701 First Avenue, Sunnyvale, CA 94089, USA
[email protected]

Tsvetan Asamov
Department of Operations Research and Financial Engineering, Princeton University
Princeton, NJ 08540, USA
[email protected]

Yi Chang
Yahoo Labs
701 First Avenue, Sunnyvale, CA 94089, USA
[email protected]

ABSTRACT
The Multi-Armed Bandit (MAB) framework has been successfully applied in many web applications, where the exploration-exploitation trade-off can be naturally taken care of. However, many complex real-world applications that involve multiple content recommendations cannot fit into the traditional MAB setting. To address this issue, we consider a combinatorial bandit problem where the learner selects S actions from a base set of K actions, and displays the results in S (out of M) different positions. The aim is to maximize the cumulative reward or, equivalently, minimize the regret with respect to the best possible subset and positions in hindsight. By the adaptation of network flow formulations, a practical algorithm based on Thompson sampling is derived for the (contextual) combinatorial problem, thus resolving the problem of computational intractability. The network flow formulations can also be incorporated in ε-greedy and pure exploitation algorithms to provide an exact single-step solution without enumerating a gigantic number of possible combinatorial subsets. Since our method has the potential to work with any probabilistic model, we illustrate its effectiveness on context-free Gaussian process optimization and on a contextual setting where the click-through rate is predicted by logistic regression. We demonstrate the algorithms' performance on synthetic Gaussian process problems and on large-scale news article recommendation datasets from Yahoo! Front Page Today Module.

CCS Concepts
•Information systems → Content ranking; Personalization; Sponsored search advertising; •Computing methodologies → Online learning settings;



Keywords
Thompson sampling; contextual combinatorial bandits; Gaussian process optimization; recommender systems

1. INTRODUCTION
The Multi-Armed Bandit (MAB) problem is a classic and natural framework for many machine learning applications. In this setting, the learner takes an action and only observes partial feedback from the environment. MAB naturally addresses the fundamental trade-off between exploration and exploitation [4, 17]. The learner must balance the exploitation of actions that performed well in the past and the exploration of less played actions that could potentially yield higher payoffs in the future. Traditional MAB is a sequential decision making setting defined over a set of K actions. At each time step t, the learner selects an action I_t and observes some payoff X_{t,I_t}. In stochastic MAB, the reward of each arm is assumed to be drawn from some unknown probability distribution. The learner then updates its knowledge based on this payoff. The goal is to maximize the cumulative payoff obtained in a sequence of n allocations over time, or equivalently to minimize the regret:

$$R_n \equiv \max_{i=1,\ldots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{t,i} - \sum_{t=1}^{n} X_{t,I_t}\right].$$

MAB has been extensively studied, and many algorithms have been proposed and have found applications in various domains. Some successful web applications are news and movie recommendation, web advertising, vertical search and query autocompletion [18, 7, 9, 14]. Despite these advances, many real-world applications cannot fit into the traditional multi-armed bandit framework.

We use news article recommendation as a motivating example. Fig. 1 is a snapshot of a portal website. In this view, there are 10 slots in total, where 9 news articles and 1 sponsored ad can be displayed. These slots differ in positions, container sizes, image shapes and font colors. Selecting 9 articles from a large pool and placing them in 9 of the 10


slots in the webpage is a combinatorial problem. Moreover, a user can also click more than one article during a view session, and the goal is to find an optimal layout configuration which maximizes the expected total click-through rate (CTR). There is some work that addresses bandits with multi-plays, for example, subset regret problems [15, 13, 25], batch-mode bandit optimization with delayed feedback [12] and ranked bandits [20]. However, the complex combinatorial setting in our example is beyond the capacity of existing methods. In this paper we consider a novel and general contextual combinatorial bandit formulation that accounts for any layout information or possible position bias and can potentially be combined with any probabilistic user click model.

Figure 1: An example of news article recommendation.

To model this complex scenario, we consider the following combinatorial bandit problem. Given the context information, instead of selecting one arm, the learner selects a subset of S actions and displays them on S different positions out of M possible positions. If we set S = 1, the problem is equivalent to the traditional contextual K-armed bandit. If we set K = 1 as a dummy variable and treat the M positions as actions, our combinatorial bandit problem degenerates to the unordered subset regret problems [15, 13]. The bandit ordered slate problem [15] and ranked bandits [20] are also special cases of our setting with S = M. Yet our setting is not restricted to learning-to-rank and is general enough to encode any layout information. In our news recommendation example, the slot positions, container shapes, image and font sizes in Fig. 1 can be co-optimized simultaneously.

Various algorithms have been proposed to solve MAB problems [4, 16, 22]. In this work we focus on the Thompson sampling [23] method. The idea of Thompson sampling is to draw each arm according to its probability of being optimal. One advantage of Thompson sampling is that no matter how complex the stochastic reward function (of the action, context and the unknown parameter θ), it is computationally easy and algorithmically straightforward to sample from the posterior distribution and select the action with the highest reward under the sampled parameter. Thus it has the potential to work with any probabilistic user click model. In contrast, the popular UCB policies cannot extend easily beyond (generalized) linear models. Another particular advantage of Thompson sampling is that it requires no tuning parameters.

The primary contribution of this paper is the definition of the (contextual) combinatorial bandit problem. With the potential to work with any probabilistic (user click) model, we mainly focus on context-free Gaussian process optimization in the bandit setting and on a contextual setting where the features of the content, the user information and the positions are used to predict CTR by logistic regression. Due to Thompson sampling's flexibility with respect to the stochastic reward function, and by adaptation of network flow formulations, a practical algorithm is derived that provides an exact solution for the (contextual) combinatorial problem without enumerating a gigantic number of possible combinatorial subsets. The network flow formulations can also be incorporated in ε-greedy and pure exploitation algorithms to provide an exact single-step solution in the (contextual) combinatorial bandit problem. We demonstrate the algorithms' performance on synthetic Gaussian process problems and on a large-scale news article recommendation dataset from Yahoo! Front Page Today Module.

2. PROBLEM SETTINGS
Due to position and layout bias, it is reasonable to assume that for every article and every position it can be shown in, there is a click-through rate associated with the (content, position) pair, which specifies the probability that a user will click on the vertical if it is displayed in that position. In a sequence of rounds t = 1, 2, ..., T, the learner is required to choose S actions from a base set A of K actions to display in S (out of M) different positions, and receives a reward that is the sum of the rewards of the (action, position) pairs in the chosen subset. The payoff of each displayed (content, position) pair is revealed.

(Contextual) Combinatorial Bandits. In each round t, the learner is presented with an (optional) context vector x_t. In order to take layout information into consideration, a feature vector a_{k,m} is constructed for each (action, position) pair (k, m), such that a_k ∈ A and m ∈ {1, 2, ..., M}. The learner chooses S actions from A to display in S (out of M) different positions. Thus a valid combinatorial subset is a mapping from S different actions to S different positions; or more concisely, it is a one-to-one mapping π_t : {1, 2, ..., S} → (A, {1, 2, ..., M}). The learner then receives a reward r_{π_t(s)}(t) for each chosen (action, position) pair. The total reward of round t is the sum of the rewards of each position, $\sum_{s=1}^{S} r_{\pi_t(s)}(t)$. The goal is to maximize the expected cumulative reward over time, $\mathbb{E}\left[\sum_{t=1}^{T}\sum_{s=1}^{S} r_{\pi_t(s)}(t)\right]$.

An important special case of the contextual combinatorial bandits is the context-free setting, in which the context x_t remains constant for all t. By setting S and K to special values, many existing settings, including (contextual) K-armed bandits, bandit slate problems [15] and subset regret problems [13], can be seen as special cases of our combinatorial bandit setting.


In the example of sponsored search, we may view the ads in the pool as actions. For the t-th user visit, the context vector x_t summarizes information including the user and the page view. The action feature vector a_{k,m} includes features of the candidate ad and identifiers of the positions. S out of the K ads are chosen to be displayed on S out of the M positions. When a displayed ad is clicked on, a reward of 1 is acquired; otherwise, the reward is 0. The goal is to maximize the expected number of clicks from users.

3. THOMPSON SAMPLING
In (contextual) K-armed bandit problems, at each round an (optional) context information x is provided to the learner. The learner then chooses an action a ∈ A and observes a reward r. The goal is to find a policy that maximizes the expected cumulative reward over the context sequence.

Thompson sampling [23, 7] for contextual bandit problems is best understood in a Bayesian setting as follows. Each past observation consists of a triplet (x_i, a_i, r_i) and the likelihood function of the reward is modeled in the parametric form Pr(r | a, x, θ) over some parameter set Θ. Given some known prior distribution over Θ, the posterior distribution of these parameters is given by Bayes' rule based on the past observations. At each time step t, the learner draws θ_t from the posterior and chooses the action that has the largest expected reward under the sampled θ_t, as described in Algorithm 1.

Algorithm 1: Thompson sampling
input: Prior distribution P(θ) and observation history D_0 = ∅
for t = 1 to T do
    Receive context x_t
    1. Sample from posterior:
        Draw θ_t from P(θ | D_{t-1})
    2. Select action:
        Draw arm a_t = arg max_a E[r | a, x_t, θ_t]
        Observe reward r_t
    3. Update posterior:
        D_t = D_{t-1} ∪ {x_t, a_t, r_t}
        Update the posterior distribution P(θ | D_t) according to Bayes' rule
end

It can be seen from the algorithm that in Step 1, the posterior sampling has nothing to do with the form of the stochastic reward function r, and the function is only evaluated at the sampled θ_t in Step 2. This makes Thompson sampling computationally applicable to complex reward functions as long as a probabilistic model of the unknown rewards can be obtained.
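As a concrete illustration of Algorithm 1, the following minimal sketch shows a hypothetical Beta-Bernoulli instance (not the contextual models used later in the paper); it makes explicit how posterior sampling is separated from action selection and posterior updating:

```python
import numpy as np

def thompson_sampling_bernoulli(true_ctr, T=1000, seed=0):
    """Minimal Thompson sampling sketch for a K-armed Bernoulli bandit.

    Each arm keeps a Beta(alpha, beta) posterior; at every round we sample
    theta_t from the posterior, play the arm with the largest sample, and
    update the posterior of the played arm with the observed 0/1 reward.
    """
    rng = np.random.default_rng(seed)
    K = len(true_ctr)
    alpha = np.ones(K)   # Beta posterior parameters (prior successes + 1)
    beta = np.ones(K)    # Beta posterior parameters (prior failures + 1)
    total_reward = 0.0
    for _ in range(T):
        theta = rng.beta(alpha, beta)           # Step 1: sample from posterior
        arm = int(np.argmax(theta))             # Step 2: act greedily on the sample
        reward = float(rng.random() < true_ctr[arm])
        alpha[arm] += reward                    # Step 3: Bayes update
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example usage with hypothetical click-through rates.
print(thompson_sampling_bernoulli([0.02, 0.05, 0.03]))
```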

4. ALGORITHMS FOR THE CONTEXTUAL COMBINATORIAL BANDIT PROBLEMS

Our main algorithmic idea is to use the Thompson sampling algorithm for contextual bandits due to its flexibility for complex reward functions.

On each round t, the contextual combinatorial bandit problem involves choosing S actions from a set A of K actions to display in S (out of M) different positions, and receiving a reward that is the sum of the rewards of the chosen subset. A naive approach is to treat each complex combination as a super arm and apply a traditional bandit algorithm, which involves enumerating the values of all the super arms at each time step. This approach encounters practical and computational issues since the number of super arms is $\binom{M}{S}\binom{K}{S}S!$, which is gigantic (e.g., approximately 10^8 even for K = 100, M = 10 and S = 3).

Suppose the likelihood function of the reward of each context x and (action, position) pair a_{k,m} is modeled in the parametric form Pr(r | x, a, θ) over some parameter set Θ. Therefore we need a special variant of Thompson sampling which efficiently finds the optimal mapping π*_t : {1, 2, ..., S} → (A, {1, 2, ..., M}) so that

$$\pi_t^* \in \arg\max_{\pi_t} \sum_{s=1}^{S} \mathbb{E}[r \mid a_{\pi_t(s)}, x_t, \theta_t]. \qquad (1)$$
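To make the scale of the naive enumeration concrete, a quick sketch of the super arm count, using the example numbers above:

```python
from math import comb, factorial

def num_super_arms(K, M, S):
    """Number of ways to pick S of K actions and assign them to S of M positions."""
    return comb(K, S) * comb(M, S) * factorial(S)

# Example from the text: K = 100, M = 10, S = 3.
print(num_super_arms(K=100, M=10, S=3))  # 161700 * 120 * 6 = 116,424,000, i.e. ~10^8
```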

The rest of this section is organized as follows. In Section 4.1, we formally consider the optimization problem (1). In order to solve (1) efficiently, we translate it into a minimum-cost maximum-flow problem in Section 4.2. Based on the network flow formulation, we derive practical Thompson sampling methods for the combinatorial bandits. With the potential of working with any probabilistic model, we demonstrate our approach with Gaussian processes and Bayesian logistic regression in Section 4.3 and Section 4.4.

4.1 Action selection as a constrained optimization problem

In order to apply the approach described above, we need a way to compactly represent the maximization over the super arms π_t in Eq. (1) without enumeration. To simplify notation, we first write e_{k,m} for the expected reward E[r | a_{k,m}, x_t, θ_t] of displaying action a_k at position p_m given context x_t and sampled parameter θ_t. We also define an indicator variable f_{k,m} ∈ {0, 1} that indicates whether action a_k is displayed at position p_m. We next translate a valid super arm into mathematical constraints. First, since each action can be displayed at most once, we have the constraint $\sum_m f_{k,m} \le 1$ for all k. Second, no two actions can be placed in the same position, and thus $\sum_k f_{k,m} \le 1$ for all m = 1, ..., M. Finally, exactly S actions should be chosen, which is equivalent to $\sum_k \sum_m f_{k,m} = S$. Since the expected reward of each super arm is the sum of the rewards of each position, it can be written as $\sum_k \sum_m f_{k,m} e_{k,m}$. Now we are ready to represent the maximization over the super arms as the following integer program:

$$
\begin{aligned}
\max_{f} \quad & \sum_{k=1}^{K}\sum_{m=1}^{M} f_{k,m}\, e_{k,m} \\
\text{subject to} \quad & \sum_{m=1}^{M} f_{k,m} \le 1, \quad \forall k = 1, \ldots, K \\
& \sum_{k=1}^{K} f_{k,m} \le 1, \quad \forall m = 1, \ldots, M \\
& \sum_{k=1}^{K}\sum_{m=1}^{M} f_{k,m} = S \\
& f_{k,m} \in \{0, 1\}, \quad \forall k = 1, \ldots, K,\; m = 1, \ldots, M
\end{aligned}
\qquad (2)
$$


In general, integer programming problems cannot be solved efficiently. However, the given formulation can be interpreted as a network flow problem, a class of problems that admits polynomial-time solutions (and enjoys interesting properties such as the max-flow min-cut duality [24]). We elaborate on our solution approach in Section 4.2.

4.2 Network Flow
This section deals with the exact solution of the optimization problem (2). The integer optimization problem (2) can be interpreted as a minimum-cost maximum-flow formulation with edge costs −e_{k,m}, as depicted in Figure 2. The decision variables f_{k,m} represent the amount of flow to be transferred along the edges of a bipartite graph with expected rewards e_{k,m}. In addition, S represents the total size of the network flow. Moreover, the flow capacity of the edges adjacent to the bipartite graph is 1, which implies that those edges can accommodate a flow of at most 1 unit. Furthermore, we can change the integer programming formulation (2) into a linear program by relaxing its last set of constraints with their continuous equivalent. Thus we arrive at the following formulation:

$$
\begin{aligned}
\min_{f} \quad & \sum_{k=1}^{K}\sum_{m=1}^{M} -e_{k,m}\, f_{k,m} \\
\text{subject to} \quad & \sum_{m=1}^{M} f_{k,m} \le 1, \quad \forall k = 1, \ldots, K \\
& \sum_{k=1}^{K} f_{k,m} \le 1, \quad \forall m = 1, \ldots, M \\
& \sum_{k=1}^{K}\sum_{m=1}^{M} f_{k,m} = S \\
& f_{k,m} \in [0, 1], \quad \forall k = 1, \ldots, K,\; m = 1, \ldots, M
\end{aligned}
\qquad (3)
$$

In general, a linear programming relaxation need not have integer optima; however, the constraint matrices of such network flow problems feature special properties.

Theorem 1 ([3]). The node-arc incidence matrix of a directed network is totally unimodular.

Hence, we know that the set of constraints in the linear programming relaxation of problem (3) can be represented in standard form as Ax = b, x ≥ 0 with a totally unimodular constraint matrix A. In addition, we also know that the following equivalence holds:

Theorem 2 ([3]). Let A be an integer matrix with linearly independent rows. Then the following conditions are equivalent:

• A is unimodular.

• Every basic feasible solution defined by the constraints Ax = b, x ≥ 0 is integer for any integer vector b.

Since the incidence matrix of a graph has linearly independent rows and S is an integer, we know that the linear programming relaxation (3) of the super arm selection problem will result in an integer optimal solution f* ∈ {0, 1}^{K×M}. Furthermore, linear programming problems can be solved in polynomial time using interior-point methods [19], and therefore we can solve the super arm selection problem efficiently. Please note that, in general, specialized algorithms for min-cost network flow problems can have better running times than the linear programming approach. However, such specialized methods usually do not allow for the introduction of additional constraints which can arise in practice, and the development and testing of such methods is beyond the scope of the current paper. For these reasons, we used a linear programming solver in our numerical experiments.
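As an illustration, a minimal sketch of solving the relaxation (3) with an off-the-shelf LP solver might look as follows; SciPy's `linprog` is assumed here purely for illustration, and any LP (or min-cost flow) solver would do:

```python
import numpy as np
from scipy.optimize import linprog

def select_super_arm(e, S):
    """Solve the LP relaxation (3) of the super arm selection problem.

    e : (K, M) array of expected rewards e_{k,m}.
    Returns a 0/1 (K, M) assignment matrix; integrality of the LP optimum
    follows from the total unimodularity of the constraint matrix.
    """
    K, M = e.shape
    c = -e.reshape(-1)                        # minimize -sum(e_{k,m} * f_{k,m})

    A_ub = np.zeros((K + M, K * M))
    for k in range(K):                        # each action displayed at most once
        A_ub[k, k * M:(k + 1) * M] = 1.0
    for m in range(M):                        # each position used at most once
        A_ub[K + m, m::M] = 1.0
    b_ub = np.ones(K + M)

    A_eq = np.ones((1, K * M))                # exactly S (action, position) pairs
    b_eq = np.array([float(S)])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (K * M), method="highs")
    return np.rint(res.x).reshape(K, M)

# Example usage with a random reward matrix (hypothetical values).
rng = np.random.default_rng(0)
f = select_super_arm(rng.random((6, 4)), S=3)
print(int(f.sum()), "pairs selected")
```

A dedicated assignment or min-cost flow routine could replace the generic LP call, as noted above, at the cost of making additional side constraints harder to add.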

It is worth noting that we do not assume any specific functional form of the likelihood of the reward Pr(r | x, a, θ), as long as effective Bayesian inference can be performed for the unknown quantities. Hence any probabilistic model of user clicks, e.g. the Cascade Model [11] and the Examination Hypothesis [21], can potentially be incorporated in our combinatorial bandit setting. Besides, even though we are motivated by web user click applications, our approach is not restricted to predicting CTRs with 0/1 rewards. It is generally applicable to any probabilistic model of any reward function. In the next two sections, we demonstrate our approach in both context-free Gaussian process optimization and a contextual setting where the CTR is predicted by Bayesian logistic regression.

4.3 Thompson sampling for the combinatorial bandits with Gaussian processes

Beyond web click problems, many real-world problems fit easily into our combinatorial bandit formulation. A nice area of application of this logic is dynamic pricing. It applies to a small business that has a lot of room to change its pricing decisions, but is not large enough to change the underlying unknown revenue curve resulting from global supply and demand. An interesting example of such a business is recharging stations for mobile phones in a developing country. Cell phone use in rural Africa is currently popular since it is hard to stretch a land line out to every home in a rural area. An entrepreneur with a small power generator, such as a car battery or a solar panel, can serve many users. The question is the pricing decision of the African entrepreneur running cell phone recharging stations. For example, the entrepreneur can purchase car batteries or solar panels, charge them in town and then travel to different villages to charge phones for a fee that can be any value between $0.1 and $0.3. Suppose the entrepreneur has more than one charging device and can go to different places day by day; then the dynamic pricing application can be modeled as a combinatorial bandit. Since not all combinations of places make sense, perhaps due to travel distances, the entrepreneur can easily add other constraints to the linear programming relaxation.

As before, we assume that there is an unknown reward function value g(·) for each (price, position) pair in X. At each time, if we choose to measure a point x ∈ X, we observe its function value perturbed by i.i.d. Gaussian noise: y_t = g(x) + ε_t, ε_t ~ N(0, σ²). Since in dynamic pricing the revenue of placing the recharging station at nearby positions, or the revenue of similar prices, is highly correlated, we can enforce implicit properties like smoothness by modeling the unknown function g as a sample from a Gaussian process (GP) whose realizations consist of random variables associated with each x ∈ X. Every finite collection of those random variables has a multivariate normal distribution. Each Gaussian process GP(µ, K) can be


Figure 2: Network flow problem with a maximum capacity of S. The decision variables f_{k,m} represent the amount of flow to be transferred along the edges of a bipartite graph with expected rewards e_{k,m}. (Edge labels follow a Capacity[cost] convention: the bipartite edges from action nodes 1, ..., K to position nodes 1, ..., M have capacity 1 and cost −e_{k,m}, edges adjacent to the source and sink have capacity 1 and cost 0, and the total flow through the network is S.)

specified by its mean function µ and covariance (or kernel) function K = [k(x, x')]_{x,x'∈X}. We denote the normal distribution N(µ_0, K_0) as the prior distribution over g. We introduce the σ-algebra F_t formed by the previous t observations of super arms Y_t = {y_1, ..., y_t} at points X_t = {π_1, ..., π_t}, where y_t = [y_{t,1}, ..., y_{t,S}]^T. We define µ_t = E[g | F_t] and K_t = Cov[g | F_t]. By the properties of Gaussian processes, conditioning on F_t, g ~ N(µ_t, K_t).

At time t, for any order of the chosen arms {x_{i_1}, ..., x_{i_S}} = {π_t(1), ..., π_t(S)}, we loop over the S arms and update the statistics recursively, starting from µ_{t+1} ← µ_t, K_{t+1} ← K_t:

$$\mu_{t+1} \leftarrow \mu_{t+1} + \frac{y_{t+1,j} - \mu_{t+1, x_{i_j}}}{\sigma^2 + \sigma_{t+1}^2(x_{i_j})} K_{t+1} e_{x_{i_j}},$$

$$K_{t+1} \leftarrow K_{t+1} - \frac{K_{t+1} e_{x_{i_j}} (e_{x_{i_j}})^T K_{t+1}}{\sigma^2 + \sigma_{t+1}^2(x_{i_j})},$$

where σ²_{t+1}(x) = k_{t+1}(x, x) and e_x is a column vector whose entry for element x is one and all other entries are zero.

The Thompson sampling algorithm then works as follows. At each round, it first draws a random parameter θ^t_{k,m} from the posterior N(µ_t, K_t) and then uses the linear program to solve optimization problem (3) and select the super arm π_t with e_{k,m} = θ^t_{k,m}.
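A minimal sketch of one round of this GP-based Thompson sampling, assuming the reward table is indexed by flattened (action, position) pairs and reusing the hypothetical `select_super_arm` LP routine sketched earlier, might look like:

```python
import numpy as np

def gp_thompson_step(mu, Sigma, K, M, S, observe, noise_var, rng):
    """One round of Thompson sampling with a GP posterior over the e_{k,m} values.

    mu      : (K*M,) posterior mean over all (action, position) values.
    Sigma   : (K*M, K*M) posterior covariance.
    observe : callback returning a noisy reward for a flattened index.
    """
    # Step 1: sample a joint realization theta_t ~ N(mu, Sigma).
    theta = rng.multivariate_normal(mu, Sigma)
    # Step 2: pick the super arm by solving LP (3) with e_{k,m} = theta_{k,m}.
    f = select_super_arm(theta.reshape(K, M), S)
    chosen = np.flatnonzero(f.reshape(-1))
    # Step 3: observe rewards and apply the rank-one GP updates, one arm at a time.
    for idx in chosen:
        y = observe(idx)
        var = Sigma[idx, idx] + noise_var          # sigma^2 + sigma_{t+1}^2(x)
        gain = Sigma[:, idx] / var
        mu = mu + gain * (y - mu[idx])             # mean recursion from the text
        Sigma = Sigma - np.outer(gain, Sigma[idx, :])  # covariance recursion
    return mu, Sigma
```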

4.4 Thompson sampling for the combinatorial bandits with logistic regression

In this section, we consider using Bayesian logistic regression to model the likelihood (CTR) of each (action, position) pair. Exact Bayesian inference for logistic regression is intractable, since the posterior distribution comprises a product of logistic sigmoid functions and the integral in the normalization constant is intractable as well. In our model, the posterior distribution on the weights is approximated by a Gaussian distribution with a diagonal covariance matrix. With a Gaussian prior on the weight vector, the Laplace approximation can be obtained by finding the mode of the posterior distribution and then fitting a Gaussian distribution centered at that mode (see Chapter 4.5 of [6]).

In order to continuously refresh the system with new data, the Bayesian logistic regression is extended to allow model updates after each batch of training data [7, 8, 26]. To be more specific, the Laplace-approximated posterior serves as a prior on the weights when the model is updated with a new batch of training data, as described in Algorithm 2. Note that by setting q_j = λ, an ℓ2-regularized logistic regression can be interpreted as a Bayesian model with a Gaussian prior on the weights with standard deviation 1/√λ.

Algorithm 2: Recursive Bayesian Logistic Regression with Batch Updates
input: Regularization parameter λ > 0
m_j = 0, q_j = λ. (Each weight w_j has an independent prior N(m_j, q_j^{-1}))
for t = 1 to T do
    Get a new batch of training data (x_i, y_i), i = 1, ..., n.
    Find w as the maximizer of
        −(1/2) Σ_{j=1}^{d} q_j (w_j − m_j)² − Σ_{i=1}^{n} log(1 + exp(−y_i wᵀx_i)).
    m_j = w_j
    q_j = q_j + Σ_{i=1}^{n} σ(wᵀx_i)(1 − σ(wᵀx_i)) x_{ij}²
end

Each training example is a triplet (x, a, p). Specifically, suppose each context x and action a are represented by feature vectors φ_x and ψ_a, respectively. To reflect the effect of different physical positions on the page, the extended logistic regression model is σ(F(x, a, p)), where

$$F(x, a, p) = \mu + \alpha^T \phi_x + \beta^T \psi_a + \sum_{m=1}^{M} \gamma_m I(p, p_m). \qquad (4)$$

At each round, after drawing a random parameter w^t from the approximated posterior N(m_j, q_j^{-1}), we use the linear program developed in Section 4.2 to select the super arm π with E[r | a_{k,m}, x_t, w^t] = σ(Φᵀw^t), where Φ = [φ_{x_t}; ψ_{a_k}; e_m]. It is worth noting here that our model does not require the action set A to be fixed. This offers great benefit for web applications, in which, for example, the pool


of available news articles for recommendation for each user visit changes over time. The algorithm of Thompson sampling for the combinatorial bandits with Bayesian logistic regression (TSLR) is summarized in Algorithm 3.

Algorithm 3: Thompson sampling for the combinatorial bandits with logistic regression
input: Regularization parameter λ > 0
m_j = 0, q_j = λ.
for t = 1 to T do
    Receive context x_t
    1. Sample from posterior:
        Draw w^t_j from N(m_j, q_j^{-1})
    2. Select action:
        Compute e_{k,m} = E[r | a_{k,m}, x_t, w^t] for all k, m
        Solve the optimization problem (3) and get [f^t_{k,m}]
        Display the super arm according to [f^t_{k,m}]
        Observe rewards r(t)
    3. Update posterior:
        Update m_j, q_j according to Algorithm 2
end
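The step that computes e_{k,m} from the sampled weights can be sketched as follows; the layout of the weight vector into µ, α, β, γ mirrors Eq. (4), and the feature dimensions are hypothetical:

```python
import numpy as np

def expected_rewards(w, phi_x, psi_actions, M):
    """Build e_{k,m} = sigma(F(x, a_k, p_m)) under sampled weights w.

    w is laid out as [mu, alpha (len d_x), beta (len d_a), gamma (len M)],
    matching the extended model F of Eq. (4).
    phi_x       : (d_x,) context features.
    psi_actions : (K, d_a) action features, one row per candidate action.
    """
    d_x, d_a = len(phi_x), psi_actions.shape[1]
    mu = w[0]
    alpha = w[1:1 + d_x]
    beta = w[1 + d_x:1 + d_x + d_a]
    gamma = w[1 + d_x + d_a:1 + d_x + d_a + M]

    base = mu + alpha @ phi_x + psi_actions @ beta   # (K,) context + action terms
    scores = base[:, None] + gamma[None, :]          # add per-position effect gamma_m
    return 1.0 / (1.0 + np.exp(-scores))             # (K, M) matrix of e_{k,m}
```

Feeding this matrix to the LP of Section 4.2 then yields the super arm for the round.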

We close this section by briefly illustrating the flexibility in the choice of user click model. As an example, we consider the extended user click model with content quality features Q_i(a, p) (whether the quality of the link above/below is better, and the number of links above/below which are better) [5]:

$$F(x, a, p) = \mu + \alpha^T \phi_x + \beta^T \psi_a + \sum_{m=1}^{M} \gamma_m I(p, p_m) + \sum_{i=1}^{|Q|} \eta_i Q_i(a, p).$$

As studied in [5], a boosting-style algorithm can be used to learn the parameters of the extended model. In the Bayesian setting, after we receive a new batch of training data, we first set each Q_k = 0 and run one step of recursive Bayesian logistic regression to find m_j and q_j. Next, we update the Q_k values using the updated m_j and q_j. We then re-run recursive Bayesian logistic regression to get new values of m_j and q_j. We iterate this process until the first iteration in which the log-likelihood of the data decreases. We use the final m_j and q_j as the posterior parameters from which we draw the sampled w, and then solve the optimization problem (3) to select the super arm. That is to say, due to the unique properties of Thompson sampling, the user click model is encapsulated in its own probabilistic updates, resulting in the wide applicability of the proposed algorithm.

5. EXPERIMENTS
In this section, we provide experimental results on both synthetic datasets and the Yahoo! Front Page Webscope dataset to demonstrate the ability of our approach to solve real-world problems that can be modeled as our contextual combinatorial bandit problem and that have not been thoroughly studied before.

5.1 Baseline algorithms
As baseline algorithms to compare against our proposed algorithm in Section 4, we consider the following approaches.

Random. Select S actions and S positions uniformly at random.

Unordered ε-Greedy. This approach is a common heuristic which does not take the position bias into consideration. It maintains an estimate of the value of each action (regardless of positions). For the case of learning-to-rank with S = M, the exploitation part selects the top-M actions based on the current estimates and places them in order. For other cases where S < M, or for general layout information which cannot be translated to a ranked list, this approach does not apply since it is not obvious how and where to place the actions.

Exploitation. This is an extension of the pure exploitation algorithm for traditional bandit problems based on our network flow techniques. It maintains an estimate of the value of each (action, position) pair. At each time step, it selects the super arm with the highest estimated value. Instead of applying the naive approach of enumerating the values of all the super arms, or using approximation heuristics, it can benefit from the network flow formulation and use linear programming to find the exact solution of Eq. (3).

ε-Greedy. Use Exploitation with probability 1 − ε and Random with probability ε. The Exploitation part is done by our network flow formulation.

Ranked bandits [20]. This approach works only for learning-to-rank with S = M. It runs a multi-armed bandit (MAB) instance MAB_m for each rank m. Each of the M copies of the multi-armed bandit algorithm maintains a value (or index) for every action. When determining the ranking to display to users, MAB_1 is responsible for selecting which action is displayed at rank 1. MAB_2 then recommends the action to display at rank 2. If this action has already been chosen at a higher rank, the action to display is picked arbitrarily. This procedure is repeated until all M actions are selected. The MAB instance is chosen as the UCB1-Normal algorithm [4] for context-free Gaussian process optimization. More precisely, UCB1-Normal selects the action for which

$$\bar{x}_j + \sqrt{\alpha \cdot \frac{v_j - t_j \bar{x}_j^2}{t_j - 1} \cdot \frac{\ln(t-1)}{t_j}}$$

is maximized, where $\bar{x}_j$ is the average reward for action j, v_j is the sum of squared rewards and t_j is the number of times action j has been played. For other cases where S < M, or for general layout information, this approach does not apply, for the same reason as above.
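For concreteness, the UCB1-Normal index used by each MAB copy could be computed as in the following sketch, with α treated as a tuning parameter as in our experiments:

```python
import math

def ucb1_normal_index(mean_reward, sum_sq_rewards, t_j, t, alpha=1.0):
    """UCB1-Normal index for one action.

    mean_reward    : average reward of action j (x̄_j)
    sum_sq_rewards : sum of squared rewards of action j (v_j)
    t_j            : number of times action j has been played (t_j >= 2)
    t              : current round (t >= 2)
    """
    variance_term = (sum_sq_rewards - t_j * mean_reward ** 2) / (t_j - 1)
    return mean_reward + math.sqrt(alpha * variance_term * math.log(t - 1) / t_j)
```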

GP-UCB [22]. This approach only works for Gaussian process optimization. In order to obtain a valid upper confidence bound, since the standard deviation of the sum of Gaussian random variables is not equal to the sum of the individual standard deviations, a linear programming approach cannot be used to select the super arm with the highest upper confidence bound. As a result, we treat each complex combination as a super arm and apply the traditional GP-UCB algorithm, which involves enumerating the values of all the super arms at each time step. Since the number of super arms explodes quickly, this approach does not scale beyond toy examples. Besides, it is not applicable with logistic regression since, due to the sigmoid function, there is no easy way to compute an upper confidence bound even with enumeration. It is also worth mentioning that the frequentist UCB algorithm is not considered here since it requires playing each super arm at least once, which is computationally intractable.

5.2 Experiments on context-free Gaussian process optimization

We first consider learning-to-rank experimental settings


with S = M. For the true function values of each (action, position) pair, similarly to the Examination Hypothesis [21] that lower-ranked places have a lower probability of being examined, we assume that the value of each action k is discounted by e^{−d_k} at lower ranks. Specifically, we use the arithmetic progression µ_k = 0.5 − 0.025k, k = 1, ..., K, as the true function value of each arm k at rank 1. The value of (a_k, p_m) is then c_{km} = µ_k e^{−(m−1)d_k}. To make the learning setting more interesting, we allow d_k to differ across actions k. In the experiments, d_k is randomly generated from [0.3, 0.8] for each k. Different sampling noise levels σ and different choices of S, M, K are used in the experiments. For the Gaussian process, we start with a mean vector of zeros and choose the squared exponential kernel k(x, x') = α² exp(−β₁(k − k')² − β₂(m − m')²) with α = 100, β₁ = 0.2, β₂ = 0.1 in the experiments.
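A small sketch of how these synthetic (action, position) values could be generated, with the parameter values copied from the text:

```python
import numpy as np

def synthetic_values(K, M, seed=0):
    """True values c_{k,m} = mu_k * exp(-(m - 1) * d_k) for the ranking experiments."""
    rng = np.random.default_rng(seed)
    mu = 0.5 - 0.025 * np.arange(1, K + 1)   # value of arm k at rank 1
    d = rng.uniform(0.3, 0.8, size=K)        # per-action discount rate d_k
    ranks = np.arange(M)                     # m - 1 for m = 1, ..., M
    return mu[:, None] * np.exp(-np.outer(d, ranks))

c = synthetic_values(K=10, M=10)
print(c.shape)  # (K, M) table of true rewards
```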

The experimental results are reported over 100 repetitions of each algorithm. In order to make a fair comparison, in each repetition all the observations are pre-generated and shared across all the competing algorithms. The only difference is the way each strategy selects the actions; all the rest, including the model updates, is identical, as described in Section 4.3.

For Thompson sampling, we also consider the impact of posterior reshaping. In particular, for a normally distributed posterior, decreasing the variance has the effect of increasing exploitation over exploration. In our simulations, we reshape the covariance matrix K_t to α²K_t. This only affects the posterior sampling step and does not change the model updates.

We conduct our experiments with different values of the tuning parameters. The first row in Fig. 3 shows the mean average regret of the different algorithms, at the best value of each tuning parameter, across time. We also report the distribution of the regret at T = 150 for the different algorithms with different parameter values in the second row. On each box, the central red line is the median, the edges of the box are the 25th and 75th percentiles, and outliers are plotted individually. The correspondence between algorithms and boxes is the following:

• Box 1-3: Thompson sampling (TS) with posterior reshaping parameter α = 0.25, 0.5, 1.

• Box 4: Exploitation.

• Box 5: Random.

• Box 6-8: Unordered ε-greedy (U-ε-greedy) with ε =0.02, 0.01, 0.005.

• Box 9-11: ε-greedy with ε = 0.02, 0.01, 0.005.

• Box 12-14: GP-UCB with α = 2, 1, 0.5.

• Box 15-17: RBA with α = 4, 2, 1.

It can be seen from the figure that Thompson sampling clearly outperforms the others. It not only achieves the lowest regret, but also has small variance. Since in our approach a linear program is used to provide single-step exact solutions for selecting the super arms, the good performance of Thompson sampling is consistent with other empirical evaluations in the existing literature on K-armed bandits, in which each super arm can theoretically be treated as one arm.

Without explicitly considering the position bias, the unordered ε-greedy does not yield good performance. RBA performs well on some datasets but poorly on others. One explanation is that even though we run the RBA algorithm with multiple clicks per round, it is designed to allow only one click on the ranked list.

In terms of posterior reshaping, a smaller value of α in general yields lower regret since it favors exploitation over exploration. But the price to pay is higher variance. As for the impact of the noise level, the variance of each algorithm grows with increasing noise. The unordered ε-greedy is the biggest victim and Thompson sampling is the least affected.

In order to understand the behavior of the different algorithms when S < M, we conduct another set of experiments. The ranking view and the Examination Hypothesis are no longer suitable for the case S < M. We instead use the 2-dimensional Camelback test function (see the bottom right panel in Fig. 4) to generate true function values for each (action, position) pair. We flip the sign of the function values since the standard Camelback function is posed as a minimization problem. We also experimented with other standard test functions, including the Griewank and Branin functions. We obtained similar results and thus do not include them in the paper. The noise level is set to 0.01.

We experiment with different choices of K and S. Since we are dealing with general layout information that is not a ranked list, RBA and U-ε-greedy cannot be applied. As for GP-UCB, when K = 50, 100 and M = 8, enumeration over the super arms is computationally unaffordable. We compare our approach with Random, Exploitation and ε-greedy, with the results shown in Fig. 4.

The results are consistent with the previous findings. The average regret is normalized with respect to the number S of selected actions, i.e.

$$\left( \max_{\pi} \mathbb{E}\Big[\sum_{s=1}^{S} r_{\pi(s)}\Big] - \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{s=1}^{S} r_{\pi_t(s)}(t)\Big] \Big/ T \right) \Big/ S.$$
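As a small sketch (variable names are hypothetical), this normalization amounts to:

```python
import numpy as np

def normalized_average_regret(best_super_arm_value, rewards, S):
    """best_super_arm_value: E[sum_s r_{pi*(s)}] of the best super arm;
    rewards: (T, S) array of realized per-position rewards."""
    T = rewards.shape[0]
    return (best_super_arm_value - rewards.sum() / T) / S
```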

It should be noted that the average regret in the case S = 4 decreases slightly faster than in the S = 2 case. At first glance this is counter-intuitive, because S = 4 is a more difficult problem. However, since there are twice as many examples in the S = 4 case for model updates, the parameters at each time step are estimated much better than in the S = 2 case. This benefits the arm selection in each algorithm.

5.3 News Article Recommendation
In this section, we consider applying Thompson sampling for contextual combinatorial bandits to the personalization of news article recommendation on the Yahoo! front page [2, 18, 7]. Our work differs from previous literature in that, rather than recommending only one article (e.g., only at the story position) at each user visit (see Fig. 1 for an illustration), we have the ability to recommend more than one article at different positions. The candidate article pool is dynamic over time, with an average size of around 20. Suppose the user can click on more than one article during each view session. The goal is to choose the most relevant articles for users or, equivalently, to maximize the total number of clicks on the displayed articles. There is a fundamental exploration vs. exploitation trade-off: in order to learn the true CTR of each article, it needs to be displayed for the long-term benefit, leading to a potential drop in the short-term performance.


Figure 3: Comparison of performance: Thompson sampling and various heuristics. Top: average regret as a function of T over 100 repetitions. Bottom: distribution of the regret at T = 150.

In the Yahoo! Webscope dataset [1], for each visit, both the user and each of the candidate articles are associated with a feature vector of dimension 6 (including a constant feature), constructed by conjoint analysis with a bilinear model [10]. We treat each user feature vector as the context information x and the news articles as actions a, with feature vectors denoted by φ_x and ψ_a, respectively. We set the number of possible positions to M = 5. We use the recursive Bayesian logistic regression of Algorithm 2, based on the user click model (4), to predict article CTRs. Note that due to the combinatorial problem considered in this paper, this logistic regression model is different from that in [7]. Since only one article is displayed upon each user visit in [7], it is affordable there to have a different weight vector for each article.

Evaluating an exploration/exploitation policy is difficult since we do not know the reward of an action that was not chosen at each user visit in the log data. We cannot use the unbiased offline evaluation of [18] in our case. Since in the combinatorial problem the number of super arms is gigantic (e.g., $\binom{20}{5} = 15504$), it is rare to have a logged data point that matches the selected super arm. This reduces the effective data size substantially.

Based on the real-world context and article features in the Yahoo! Webscope dataset, we instead simulate the true clicks using a weight vector w*. To make the setting more realistic, we first use all the user click data on May 1, 2009 to obtain a weight vector using logistic regression, and then construct w* by perturbation. Since in our simulated environment we know the actual CTR of each (action, position) pair, we use the expected number of clicks as the reward for each user visit. We experiment with different choices of S. For the reasons explained in Section 5.1, we compare our approach with Random, Exploitation and ε-greedy.

Figure 4: Comparison of performance on the Camelback function: Thompson sampling and various heuristics.

Since in a real-world system it is infeasible to update the model after each user feedback, we model this behavior by refreshing the system every 10 minutes. We report the average reward, normalized with respect to the


number of selected actions, in Fig. 5 and Table 1.

(a) S = 3 (b) S = 5

Figure 5: Average rewards of the competing algorithms on the Yahoo! Webscope dataset.

In Fig. 5, we drop the first 40000 user visits since, with the 10-minute update delay, the system has no knowledge about the CTRs at the very beginning, leading to poor performance for every algorithm.

Thompson sampling yields the best result. Exploitation and ε-greedy have similar performance. This is consistent with the previous finding that the change in context induces some exploration [7]. Thompson sampling with α = 0.5 achieves slightly better performance than the version without posterior reshaping.

When comparing S = 3 and S = 5, we find that there is some slowdown in Fig. 5(b), and even a decrease in reward in Fig. 5(a). One explanation is that with only a small number of system refreshes, the estimate of the parameters w is not stable. Yet since S = 5 provides more training data in each update than S = 3, it yields a smoother reward curve. For the same reason, S = 5 achieves a higher average CTR at the end of the experiments.

Method           TS (0.5)   TS (1)    Exploit   Random    ε-greedy (0.02)   ε-greedy (0.01)
Reward (S = 3)   0.0862     0.0861    0.0846    0.0538    0.0849            0.0845
Reward (S = 5)   0.0120     0.0118    0.0117    0.0072    0.0117            0.0117

Table 1: Average rewards on the Yahoo! Webscope dataset.

We emphasize here that the main goal of the experiments is not to claim the superiority of Thompson sampling, but to demonstrate the ability of our approach to solve real-world problems that can be modeled as our contextual combinatorial bandit problem.

6. CONCLUSION
In this paper, we extend the traditional multi-armed bandit problem to a more general contextual combinatorial setting. This is motivated by many web applications where, instead of choosing only one action at a time, one can choose S actions and place them on S out of M positions. Naive enumeration of the possible combinations is computationally unaffordable. By the adaptation of network flow formulations, a practical algorithm based on Thompson sampling is derived for the (contextual) combinatorial problem. The network flow formulations can also be incorporated in ε-greedy and pure exploitation algorithms to provide an exact single-step solution. Due to the unique properties of Thompson sampling, the system update is encapsulated in the chosen probabilistic model. This allows easy incorporation of any probabilistic (user click) model in our proposed framework. We demonstrate the algorithms' performance on synthetic Gaussian process problems and on a news article recommendation dataset from Yahoo! Front Page Today Module, to illustrate the ability of our approach to solve real-world problems that can be modeled as our contextual combinatorial bandit problem and that have not been thoroughly studied before.

7. REFERENCES
[1] Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-v1-0. http://labs.yahoo.com/Academic Relations.
[2] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17–24, 2009.
[3] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows. 1993.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[5] H. Becker, C. Meek, and D. M. Chickering. Modeling contextual factors of click rates. In AAAI, volume 7, pages 1310–1315, 2007.
[6] C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer New York, 2006.
[7] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[8] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2014.
[9] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[10] W. Chu, S.-T. Park, T. Beaupre, N. Motgi, A. Phadke, S. Chakraborty, and J. Zachariah. A case study of behavior-driven conjoint analysis on Yahoo!: Front Page Today Module. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097–1104. ACM, 2009.
[11] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008.
[12] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
[13] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, pages 100–108, 2014.
[14] L. Jie, S. Lamkhede, R. Sapra, E. Hsu, H. Song, and Y. Chang. A unified search federation system based on online user feedback. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1203. ACM, 2013.
[15] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.
[16] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[17] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
[18] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.
[19] Y. Nesterov, A. Nemirovskii, and Y. Ye. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM, 1994.
[20] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008.
[21] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521–530. ACM, 2007.
[22] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
[23] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[24] V. V. Vazirani. Approximation Algorithms. Springer Science & Business Media, 2013.
[25] Y. Wang, K. G. Reyes, K. A. Brown, C. A. Mirkin, and W. B. Powell. Nested-batch-mode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing, 37(3):B361–B381, 2015.
[26] Y. Wang, C. Wang, and W. Powell. The knowledge gradient with logistic belief models for binary classification. arXiv preprint arXiv:1510.02354, 2015.