The Knowledge Gradient for Sequential DecisionMaking with Stochastic Binary Feedbacks
Yingfei Wang, Chu Wang and Warren B. Powell
Princeton University
Yingfei Wang Optimal Learning Methods June 22, 2016 1 / 21
Motivation
Sequential Decision Problems
• M discrete alternatives
• Unknown truth µ_x
• At each time n, the learner chooses an alternative x^n and receives a reward W^n_{x^n}.
• Offline objective: max E^π[µ_{x^N}]
• Online objective: max E^π ∑_{n=0}^{N−1} µ_{x^n}
Overview
• Numerous communities:
  • Multi-armed bandits
  • Ranking and selection
  • Stochastic search
  • Control theory
  • ...
• Various applications:
  • Recommendations: ads, news
  • Packet routing
  • Revenue management
  • Laboratory experiment guidance
  • ...
Applications with binary outputs

• Revenue management: whether or not a customer books a room.
• Health analytics: success (the patient does not need to return for more treatment) or failure (the patient does need follow-up care).
• Production of single- or double-walled nanotubes. Controllable parameters: catalyst, laser power, hydrogen, pressure, temperature, Ar/CO2, ethylene, etc.
[Figure: single-walled vs. double-walled nanotubes]
Outline
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
3 Experimental Results
Sequential Decision Problems with Binary Outputs
Layout
1 Sequential Decision Problems with Binary Outputs
  Model
  Bayesian linear classification and Laplace approximation
2 The Knowledge Gradient Policy
3 Experimental Results
Sequential Decision Problems with Binary Outputs Model
Model
• A finite set of alternatives x ∈ X = {x_1, . . . , x_M}.
• Binary outcome y ∈ {−1,+1} with unknown probability p(y = +1|x).
• Goal: given a limited budget N, choose the measurement policy (x^0, . . . , x^{N−1}) and the implementation decision x^N that maximize p(y = +1|x^N).
• Generalized linear model for the probability:

    p(y = +1|x, w) = σ(wᵀx),

  where σ(a) = 1/(1 + exp(−a)) (logistic) or σ(a) = Φ(a) = ∫_{−∞}^{a} N(s|0, 1²) ds (probit).
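The two link functions can be written out directly. A minimal sketch (function and variable names are ours, not from the slides):

```python
import math

def logistic(a):
    """Logistic link: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def probit(a):
    """Probit link: sigma(a) = Phi(a), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def predict(w, x, sigma=logistic):
    """p(y = +1 | x, w) = sigma(w^T x)."""
    return sigma(sum(wi * xi for wi, xi in zip(w, x)))
```

Both links are symmetric, σ(−a) = 1 − σ(a), which is what makes the ±1 label encoding σ(y · wᵀx) work below.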
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Logistic and probit regression
• Training set D = {(x_i, y_i)}_{i=1}^{n}.
• Likelihood: p(D|w) = ∏_{i=1}^{n} σ(y_i · wᵀx_i).
• Point estimate: ŵ = arg min_w ∑_{i=1}^{n} −log σ(y_i · wᵀx_i).
Bayesian logistic and probit regression
• Posterior: p(w|D) = (1/Z) p(D|w) p(w) ∝ p(w) ∏_{i=1}^{n} σ(y_i · wᵀx_i).
• Extend to sequential model updates:

    p(w|D^0) --(x^0, y^1)--> p(w|D^1) --(x^1, y^2)--> p(w|D^2) · · ·

• Exact Bayesian inference for linear classifiers is intractable.
• Use Monte Carlo sampling or an analytic approximation to the posterior: the Laplace approximation.
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Laplace approximation
• Ψ(w) = log p(D|w) + log p(w).
• Second-order Taylor expansion of Ψ around its MAP (maximum a posteriori) solution ŵ = arg max_w Ψ(w):

    Ψ(w) ≈ Ψ(ŵ) − (1/2)(w − ŵ)ᵀH(w − ŵ),    H = −∇²Ψ(w)|_{w=ŵ}.

• Laplace approximation to the posterior: p(w|D) ≈ N(w|ŵ, H⁻¹).
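As a concrete sketch, assume a Gaussian prior w ∼ N(0, λ⁻¹I) and the logistic link; then the MAP can be found by Newton iterations and H has a closed form. `laplace_approx` and the Newton solver are our illustration, not the authors' implementation:

```python
import numpy as np

def laplace_approx(X, y, lam=1.0, iters=25):
    """Laplace approximation for Bayesian logistic regression with
    prior w ~ N(0, lam^{-1} I): Newton iterations find the MAP w_hat,
    then p(w | D) ~= N(w_hat, H^{-1}) with H = -grad^2 Psi(w_hat)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-(X @ w)))            # sigma(w^T x_i)
        g = X.T @ (s - (y > 0)) + lam * w             # gradient of -Psi
        H = (X * (s * (1.0 - s))[:, None]).T @ X + lam * np.eye(d)
        w = w - np.linalg.solve(H, g)                 # Newton step
    # recompute the Hessian of -Psi at the MAP: the Gaussian precision
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    H = (X * (s * (1.0 - s))[:, None]).T @ X + lam * np.eye(d)
    return w, H
```

The prior's λI term keeps H positive definite even for linearly separable data, where the unregularized MAP would not exist.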
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Online Bayesian linear classification based on Laplace approximation
• Extend to sequential model updates: the Laplace-approximated posterior serves as the prior for the next observation.
• Factorized Gaussian prior: p(w_j) = N(w_j | m^0_j, (q^0_j)^{−1}).
• Each observation maps (m^n_j, q^n_j) --(x^n, y^{n+1})--> (m^{n+1}_j, q^{n+1}_j):

    m^{n+1} = arg max_w −(1/2) ∑_{i=1}^{d} q^n_i (w_i − m^n_i)² + log σ(y wᵀx),

    q^{n+1}_j = q^n_j − t x_j²,   with t := ∂²log σ(y f)/∂f² |_{f = wᵀx} evaluated at the mode w = m^{n+1}.
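The precision update is a one-line computation once the new mean is known. A sketch for the logistic link (`precision_update` is our name; the new mean `m_new` is assumed to come from the bisection step on the next slide or any other solver):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def precision_update(m_new, q, x, y):
    """Diagonal precision update of the factorized Laplace posterior:
    q_j^{n+1} = q_j^n - t * x_j^2, with
    t = d^2/df^2 log sigma(y f) evaluated at f = m_new^T x.
    For the logistic link t = -sigma(f)(1 - sigma(f)), independent of y,
    since sigma(-f) = 1 - sigma(f)."""
    f = sum(mi * xi for mi, xi in zip(m_new, x))
    s = logistic(f)
    t = -s * (1.0 - s)
    return [qj - t * xj * xj for qj, xj in zip(q, x)]
```

Because t ≤ 0, every observation can only increase the precisions q_j, i.e. the posterior never becomes less certain.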
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Online Bayesian linear classification based on Laplace approximation
The mean update

    arg max_w −(1/2) ∑_{i=1}^{d} q_i (w_i − m_i)² + log σ(y wᵀx)

reduces to a one-dimensional bisection:
• Set ∂Ψ/∂w_i = 0 and define p := σ′(y wᵀx)/σ(y wᵀx). Then w_i = m_i + y p x_i/q_i.
• Substituting back, p satisfies

    p = σ′(p ∑_{i=1}^{d} x_i²/q_i + y mᵀx) / σ(p ∑_{i=1}^{d} x_i²/q_i + y mᵀx).

• This equation has a unique solution in the interval [0, σ′(y mᵀx)/σ(y mᵀx)].
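For the logistic link, σ′(a)/σ(a) = 1 − σ(a), so the fixed-point equation becomes p = 1 − σ(p ∑ x_i²/q_i + y mᵀx) and can be solved by straightforward bisection on the bracketing interval above. A sketch (`mean_update` is our name):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def mean_update(m, q, x, y, tol=1e-10):
    """Solve the d-dimensional MAP problem via the scalar reduction on the
    slide: with p := sigma'(y w^T x)/sigma(y w^T x) = 1 - sigma(y w^T x)
    for the logistic link, stationarity gives w_i = m_i + y p x_i / q_i,
    and p is the unique root of g(p) = p - (1 - sigma(p*c + b)) in
    [0, 1 - sigma(b)], where c = sum_i x_i^2/q_i and b = y m^T x."""
    c = sum(xi * xi / qi for xi, qi in zip(x, q))
    b = y * sum(mi * xi for mi, xi in zip(m, x))
    g = lambda p: p - (1.0 - logistic(p * c + b))
    lo, hi = 0.0, 1.0 - logistic(b)   # g(lo) <= 0 <= g(hi), g increasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    p = 0.5 * (lo + hi)
    return [mi + y * p * xi / qi for mi, xi, qi in zip(m, x, q)]
```

Uniqueness follows because g is strictly increasing in p (its derivative is 1 + c σ′(·) > 0), so bisection cannot miss the root.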
The Knowledge Gradient Policy
Layout
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
  Knowledge Gradient Policy for Lookup Table Model
  Knowledge Gradient Policy for Linear Bayesian Classification
3 Experimental Results
The Knowledge Gradient Policy
Model
Characteristics of our problems
• Expensive experiments.
• Small samples.
• The need to learn from our decisions as quickly as possible.
The Knowledge Gradient Policy Knowledge Gradient Policy for Lookup Table Model
Model
Knowledge gradient policy for the lookup table model [3]

• M discrete alternatives, unknown truth µ_x, observations W_x = µ_x + ε.
• Belief: µ_x|F^n ∼ N(θ^n_x, σ^n_x).
• Knowledge state S^n = (θ^n, σ^n), value V(s) = max_x θ_x.

    ν^{KG}_x(S^n) = E[V(S^{n+1}(x)) − V(S^n)] = E[max_{x′} θ^{n+1}_{x′}(x) − max_{x′} θ^n_{x′} | S^n].

• The Knowledge Gradient (KG) policy: X^{KG}(S^n) = arg max_x ν^{KG}_x(S^n).

[Figure: a measurement of alternative 5 changes its estimated value; the KG value captures the change that produces a change in the decision.]
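For independent normal beliefs with known Gaussian observation noise, ν^{KG}_x has the closed form of Frazier et al. [3]. A sketch of that computation (`kg_lookup` and `noise_sd` are our names):

```python
import math

def kg_lookup(theta, sigma, noise_sd):
    """Knowledge-gradient values for independent beliefs
    mu_x ~ N(theta_x, sigma_x^2) and observation noise N(0, noise_sd^2):
      nu_x = sigma_tilde_x * f(-|theta_x - max_{x'!=x} theta_{x'}| / sigma_tilde_x),
    with sigma_tilde_x = sigma_x^2 / sqrt(sigma_x^2 + noise_sd^2)
    and f(z) = z * Phi(z) + phi(z) (normal CDF and pdf)."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    f = lambda z: z * Phi(z) + phi(z)
    nus = []
    for i, (t, s) in enumerate(zip(theta, sigma)):
        s_tilde = s * s / math.sqrt(s * s + noise_sd * noise_sd)
        best_other = max(t2 for j, t2 in enumerate(theta) if j != i)
        z = -abs(t - best_other) / s_tilde
        nus.append(s_tilde * f(z))
    return nus  # X^KG measures the argmax of these values
```

Note how the formula rewards uncertainty: with equal means, the alternative with larger σ_x gets the larger KG value.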
The Knowledge Gradient Policy Knowledge Gradient Policy for Linear Bayesian Classification
Model
Knowledge gradient policy for the linear Bayesian classification belief model

• y_x|w ∼ Bernoulli(σ(wᵀx)).
• w_j|F^n ∼ N(m^n_j, (q^n_j)^{−1}).
• Knowledge state S^n = (m^n, q^n), value V(s) = max_x p(y_x = +1|x, s).

    ν^{KG}_x(S^n) = E[V(S^{n+1}(x, y)) − V(S^n) | S^n]
                  = E[max_{x′} p(y_{x′} = +1|x′, S^{n+1}(x, y)) − max_{x′} p(y_{x′} = +1|x′, S^n) | S^n].

• The Knowledge Gradient (KG) policy: X^{KG}(S^n) = arg max_x ν^{KG}_x(S^n).
• The knowledge gradient policy works with any choice of link function σ(·) and approximation procedure, by adjusting the transition function S^{n+1}(x, ·) accordingly.
• Online learning [7]: X^{OLKG}(S^n) = arg max_x p(y = +1|x, S^n) + (N − n) ν^{KG}_x(S^n).
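Because y takes only two values, the expectation over outcomes can be computed exactly by enumerating y ∈ {−1, +1} and weighting by the predictive probabilities. A sketch with placeholder `predict`/`update` callables (these stand in for whichever link function and approximate transition S^{n+1}(x, y) is used, e.g. the online Laplace update above):

```python
def kg_binary(alternatives, predict, update, state):
    """Knowledge gradient for binary feedback, computed exactly by
    enumerating the two outcomes y in {-1, +1}.
    predict(x, state) returns p(y = +1 | x, state);
    update(state, x, y) returns the next state S^{n+1}(x, y)."""
    v_now = max(predict(x, state) for x in alternatives)
    nus = []
    for x in alternatives:
        p_plus = predict(x, state)
        nu = 0.0
        for y, p_y in ((+1, p_plus), (-1, 1.0 - p_plus)):
            s_next = update(state, x, y)
            nu += p_y * max(predict(xp, s_next) for xp in alternatives)
        nus.append(nu - v_now)
    return nus  # X^KG measures the argmax of these values
```

For exact Bayesian updates the predictive probabilities form a martingale, so each ν^{KG}_x is nonnegative: measuring can never reduce the expected value of the best implementation decision.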
Experimental Results
Layout
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
3 Experimental Results
  Behavior of the KG policy
  Comparison with other policies
Experimental Results Behavior of the KG policy
Sampling behavior of the KG policy
[Figure: the alternatives measured by the KG policy at six successive stages, plotted in the (x1, x2) plane.]
Experimental Results Behavior of the KG policy
Absolute class distribution error
Figure: Absolute distribution error, shown over (x1, x2) in two dimensions and over (x1, x2, x3) in three dimensions.
Experimental Results Comparison with other Policies
Competing policies
• random sampling (Random)
• a myopic method that selects the most uncertain instance at each step (MostUncertain)
• discriminative batch-mode active learning (Disc) [4] with batch size set to 1
• expected improvement (EI) [8] with an initial fit of 5 examples
• Thompson sampling (TS) [2]
• UCB on the latent function wTx (UCB) [6]
Metric
Opportunity cost (OC):

    OC := max_{x∈X} p(y = +1|x, w∗) − p(y = +1|x^{N+1}, w∗).
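In words: the gap between the best true success probability and that of the alternative the policy finally implements. A trivial sketch, assuming the true probabilities p(y = +1|x, w∗) are available as a lookup (names are ours):

```python
def opportunity_cost(p_true, chosen):
    """OC = max_x p(y=+1|x, w*) - p(y=+1|chosen, w*), where p_true maps
    each alternative to its true success probability under w*."""
    return max(p_true.values()) - p_true[chosen]
```

OC is zero exactly when the policy implements a true-best alternative, so lower curves are better in the plots that follow.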
Experimental Results Comparison with other Policies
Comparison with other Policies
[Figure: opportunity cost vs. number of iterations (5 to 30) for EI, UCB, TS, Random, MostUncertain, Disc, and KG on six UCI datasets: (a) sonar, (b) glass identification, (c) blood transfusion, (d) survival, (e) breast cancer (wpbc), (f) planning relax.]

Figure: Opportunity cost on UCI datasets.
Experimental Results Comparison with other Policies
Thank you! Questions?
Reference
[1] Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 33–42. Morgan Kaufmann Publishers Inc., 1998.
[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[3] Peter I. Frazier, Warren B. Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
[4] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems, pages 593–600, 2008.
[5] Steffen L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.
[6] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[7] Ilya O. Ryzhov, Warren B. Powell, and Peter I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.
[8] Matthew Tesch, Jeff Schneider, and Howie Choset. Expensive function optimization with stochastic binary outcomes. In Proceedings of the 30th International Conference on Machine Learning, pages 1283–1291, 2013.