The Knowledge Gradient for Sequential DecisionMaking with Stochastic Binary Feedbacks
Yingfei Wang, Chu Wang and Warren B. Powell
Princeton University
Yingfei Wang Optimal Learning Methods June 22, 2016 1 / 21
Motivation
Sequential Decision Problems
• M discrete alternatives
• Unknown truth µ_x
• At each time n, the learner chooses an alternative x^n and receives a reward W^n_{x^n}.
• Offline objective: max E^π[µ_{x^N}]
• Online objective: max E^π ∑_{n=0}^{N−1} µ_{x^n}
Overview
• Numerous communities:
  • Multi-armed bandits
  • Ranking and selection
  • Stochastic search
  • Control theory
  • ...
• Various applications:
  • Recommendations: ads, news
  • Packet routing
  • Revenue management
  • Laboratory experiment guidance
  • ...
Applications with binary outputs

• Revenue management: whether or not a customer books a room.
• Health analytics: success (the patient does not need to return for more treatment) or failure (the patient does need follow-up care).
• Production of single- or double-walled nanotubes. Controllable parameters: catalyst, laser power, hydrogen, pressure, temperature, Ar/CO2, ethylene, etc.
[Figure: single-walled vs. double-walled nanotubes]
Outline
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
3 Experimental Results
Sequential Decision Problems with Binary Outputs
Layout
1 Sequential Decision Problems with Binary Outputs
  Model
  Bayesian linear classification and Laplace approximation
2 The Knowledge Gradient Policy
3 Experimental Results
Sequential Decision Problems with Binary Outputs Model
Model
• A finite set of alternatives x ∈ X = {x_1, . . . , x_M}.
• Binary outcome y ∈ {−1,+1} with unknown probability p(y = +1|x).
• Goal: given a limited budget N, choose the measurement policy (x^0, . . . , x^{N−1}) and the implementation decision x^N that maximize p(y = +1|x^N).
• Generalized linear model for the probability:

    p(y = +1|x, w) = σ(wᵀx),

  where σ(a) = 1/(1 + exp(−a)) (logistic) or σ(a) = Φ(a) = ∫_{−∞}^{a} N(s|0, 1²) ds (probit).
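The two link functions can be written out directly. A minimal sketch (function and variable names are ours, not from the slides):

```python
import math

def logistic(a):
    """Logistic link: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def probit(a):
    """Probit link: sigma(a) = Phi(a), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def predict(w, x, sigma=logistic):
    """p(y = +1 | x, w) = sigma(w^T x)."""
    return sigma(sum(wi * xi for wi, xi in zip(w, x)))
```

Both links are symmetric, σ(−a) = 1 − σ(a), which is what makes the ±1 label encoding σ(y · wᵀx) work below.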
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Logistic and probit regression
• Training set D = {(x_i, y_i)}_{i=1}^{n}.
• Likelihood: p(D|w) = ∏_{i=1}^{n} σ(y_i · wᵀx_i).
• Point estimate: ŵ = arg min_w ∑_{i=1}^{n} −log σ(y_i · wᵀx_i).
Bayesian logistic and probit regression
• Posterior: p(w|D) = (1/Z) p(D|w) p(w) ∝ p(w) ∏_{i=1}^{n} σ(y_i · wᵀx_i).
• Extend to sequential model updates:

    p(w|D^0) --(x^0, y^1)--> p(w|D^1) --(x^1, y^2)--> p(w|D^2) · · ·

• Exact Bayesian inference for linear classifiers is intractable.
• Use Monte Carlo sampling or an analytic approximation to the posterior: the Laplace approximation.
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Laplace approximation
• Ψ(w) = log p(D|w) + log p(w).
• Second-order Taylor expansion of Ψ around its MAP (maximum a posteriori) solution ŵ = arg max_w Ψ(w):

    Ψ(w) ≈ Ψ(ŵ) − (1/2)(w − ŵ)ᵀH(w − ŵ),    H = −∇²Ψ(w)|_{w=ŵ}.

• Laplace approximation to the posterior: p(w|D) ≈ N(w|ŵ, H⁻¹).
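As a concrete sketch, assume a Gaussian prior w ∼ N(0, λ⁻¹I) and the logistic link; then the MAP can be found by Newton iterations and H has a closed form. `laplace_approx` and the Newton solver are our illustration, not the authors' implementation:

```python
import numpy as np

def laplace_approx(X, y, lam=1.0, iters=25):
    """Laplace approximation for Bayesian logistic regression with
    prior w ~ N(0, lam^{-1} I): Newton iterations find the MAP w_hat,
    then p(w | D) ~= N(w_hat, H^{-1}) with H = -grad^2 Psi(w_hat)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-(X @ w)))            # sigma(w^T x_i)
        g = X.T @ (s - (y > 0)) + lam * w             # gradient of -Psi
        H = (X * (s * (1.0 - s))[:, None]).T @ X + lam * np.eye(d)
        w = w - np.linalg.solve(H, g)                 # Newton step
    # recompute the Hessian of -Psi at the MAP: the Gaussian precision
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    H = (X * (s * (1.0 - s))[:, None]).T @ X + lam * np.eye(d)
    return w, H
```

The prior's λI term keeps H positive definite even for linearly separable data, where the unregularized MAP would not exist.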
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Online Bayesian linear classification based on Laplace approximation
• Extend to sequential model updates: the Laplace-approximated posterior serves as the prior for the next observation.
• Factorized Gaussian prior: p(w_j) = N(w_j | m^0_j, (q^0_j)^{−1}).
• Each observation maps (m^n_j, q^n_j) --(x^n, y^{n+1})--> (m^{n+1}_j, q^{n+1}_j):

    m^{n+1} = arg max_w −(1/2) ∑_{i=1}^{d} q^n_i (w_i − m^n_i)² + log σ(y wᵀx),

    q^{n+1}_j = q^n_j − t x_j²,   with t := ∂²log σ(y f)/∂f² |_{f = wᵀx} evaluated at the mode w = m^{n+1}.
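The precision update is a one-line computation once the new mean is known. A sketch for the logistic link (`precision_update` is our name; the new mean `m_new` is assumed to come from the bisection step on the next slide or any other solver):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def precision_update(m_new, q, x, y):
    """Diagonal precision update of the factorized Laplace posterior:
    q_j^{n+1} = q_j^n - t * x_j^2, with
    t = d^2/df^2 log sigma(y f) evaluated at f = m_new^T x.
    For the logistic link t = -sigma(f)(1 - sigma(f)), independent of y,
    since sigma(-f) = 1 - sigma(f)."""
    f = sum(mi * xi for mi, xi in zip(m_new, x))
    s = logistic(f)
    t = -s * (1.0 - s)
    return [qj - t * xj * xj for qj, xj in zip(q, x)]
```

Because t ≤ 0, every observation can only increase the precisions q_j, i.e. the posterior never becomes less certain.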
Sequential Decision Problems with Binary Outputs Bayesian linear classification and Laplace approximation
Online Bayesian linear classification based on Laplace approximation
The mean update

    arg max_w −(1/2) ∑_{i=1}^{d} q_i (w_i − m_i)² + log σ(y wᵀx)

reduces to a one-dimensional bisection:
• Set ∂Ψ/∂w_i = 0 and define p := σ′(y wᵀx)/σ(y wᵀx). Then w_i = m_i + y p x_i/q_i.
• Substituting back, p satisfies

    p = σ′(p ∑_{i=1}^{d} x_i²/q_i + y mᵀx) / σ(p ∑_{i=1}^{d} x_i²/q_i + y mᵀx).

• This equation has a unique solution in the interval [0, σ′(y mᵀx)/σ(y mᵀx)].
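For the logistic link, σ′(a)/σ(a) = 1 − σ(a), so the fixed-point equation becomes p = 1 − σ(p ∑ x_i²/q_i + y mᵀx) and can be solved by straightforward bisection on the bracketing interval above. A sketch (`mean_update` is our name):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def mean_update(m, q, x, y, tol=1e-10):
    """Solve the d-dimensional MAP problem via the scalar reduction on the
    slide: with p := sigma'(y w^T x)/sigma(y w^T x) = 1 - sigma(y w^T x)
    for the logistic link, stationarity gives w_i = m_i + y p x_i / q_i,
    and p is the unique root of g(p) = p - (1 - sigma(p*c + b)) in
    [0, 1 - sigma(b)], where c = sum_i x_i^2/q_i and b = y m^T x."""
    c = sum(xi * xi / qi for xi, qi in zip(x, q))
    b = y * sum(mi * xi for mi, xi in zip(m, x))
    g = lambda p: p - (1.0 - logistic(p * c + b))
    lo, hi = 0.0, 1.0 - logistic(b)   # g(lo) <= 0 <= g(hi), g increasing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    p = 0.5 * (lo + hi)
    return [mi + y * p * xi / qi for mi, xi, qi in zip(m, x, q)]
```

Uniqueness follows because g is strictly increasing in p (its derivative is 1 + c σ′(·) > 0), so bisection cannot miss the root.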
The Knowledge Gradient Policy
Layout
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
  Knowledge Gradient Policy for Lookup Table Model
  Knowledge Gradient Policy for Linear Bayesian Classification
3 Experimental Results
The Knowledge Gradient Policy
Model
Characteristics of our problems
• Expensive experiments.
• Small samples.
• The need to learn from our decisions as quickly as possible.
The Knowledge Gradient Policy Knowledge Gradient Policy for Lookup Table Model
Model
Knowledge gradient policy for the lookup table model [3]

• M discrete alternatives, unknown truth µ_x, observations W_x = µ_x + ε.
• Belief: µ_x|F^n ∼ N(θ^n_x, σ^n_x).
• Knowledge state S^n = (θ^n, σ^n), value V(s) = max_x θ_x.

    ν^{KG}_x(S^n) = E[V(S^{n+1}(x)) − V(S^n)] = E[max_{x′} θ^{n+1}_{x′}(x) − max_{x′} θ^n_{x′} | S^n].

• The Knowledge Gradient (KG) policy: X^{KG}(S^n) = arg max_x ν^{KG}_x(S^n).

[Figure: a measurement of alternative 5 changes its estimated value; the KG value captures the change that produces a change in the decision.]
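For independent normal beliefs with known Gaussian observation noise, ν^{KG}_x has the closed form of Frazier et al. [3]. A sketch of that computation (`kg_lookup` and `noise_sd` are our names):

```python
import math

def kg_lookup(theta, sigma, noise_sd):
    """Knowledge-gradient values for independent beliefs
    mu_x ~ N(theta_x, sigma_x^2) and observation noise N(0, noise_sd^2):
      nu_x = sigma_tilde_x * f(-|theta_x - max_{x'!=x} theta_{x'}| / sigma_tilde_x),
    with sigma_tilde_x = sigma_x^2 / sqrt(sigma_x^2 + noise_sd^2)
    and f(z) = z * Phi(z) + phi(z) (normal CDF and pdf)."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    f = lambda z: z * Phi(z) + phi(z)
    nus = []
    for i, (t, s) in enumerate(zip(theta, sigma)):
        s_tilde = s * s / math.sqrt(s * s + noise_sd * noise_sd)
        best_other = max(t2 for j, t2 in enumerate(theta) if j != i)
        z = -abs(t - best_other) / s_tilde
        nus.append(s_tilde * f(z))
    return nus  # X^KG measures the argmax of these values
```

Note how the formula rewards uncertainty: with equal means, the alternative with larger σ_x gets the larger KG value.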
The Knowledge Gradient Policy Knowledge Gradient Policy for Linear Bayesian Classification
Model
Knowledge gradient policy for the linear Bayesian classification belief model

• y_x|w ∼ Bernoulli(σ(wᵀx)).
• w_j|F^n ∼ N(m^n_j, (q^n_j)^{−1}).
• Knowledge state S^n = (m^n, q^n), value V(s) = max_x p(y_x = +1|x, s).

    ν^{KG}_x(S^n) = E[V(S^{n+1}(x, y)) − V(S^n) | S^n]
                  = E[max_{x′} p(y_{x′} = +1|x′, S^{n+1}(x, y)) − max_{x′} p(y_{x′} = +1|x′, S^n) | S^n].

• The Knowledge Gradient (KG) policy: X^{KG}(S^n) = arg max_x ν^{KG}_x(S^n).
• The knowledge gradient policy works with any choice of link function σ(·) and approximation procedure, by adjusting the transition function S^{n+1}(x, ·) accordingly.
• Online learning [7]: X^{OLKG}(S^n) = arg max_x p(y = +1|x, S^n) + (N − n) ν^{KG}_x(S^n).
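Because y takes only two values, the expectation over outcomes can be computed exactly by enumerating y ∈ {−1, +1} and weighting by the predictive probabilities. A sketch with placeholder `predict`/`update` callables (these stand in for whichever link function and approximate transition S^{n+1}(x, y) is used, e.g. the online Laplace update above):

```python
def kg_binary(alternatives, predict, update, state):
    """Knowledge gradient for binary feedback, computed exactly by
    enumerating the two outcomes y in {-1, +1}.
    predict(x, state) returns p(y = +1 | x, state);
    update(state, x, y) returns the next state S^{n+1}(x, y)."""
    v_now = max(predict(x, state) for x in alternatives)
    nus = []
    for x in alternatives:
        p_plus = predict(x, state)
        nu = 0.0
        for y, p_y in ((+1, p_plus), (-1, 1.0 - p_plus)):
            s_next = update(state, x, y)
            nu += p_y * max(predict(xp, s_next) for xp in alternatives)
        nus.append(nu - v_now)
    return nus  # X^KG measures the argmax of these values
```

For exact Bayesian updates the predictive probabilities form a martingale, so each ν^{KG}_x is nonnegative: measuring can never reduce the expected value of the best implementation decision.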
Experimental Results
Layout
1 Sequential Decision Problems with Binary Outputs
2 The Knowledge Gradient Policy
3 Experimental Results
  Behavior of the KG policy
  Comparison with other policies
Experimental Results Behavior of the KG policy
Sampling behavior of the KG policy
[Figure: the alternatives measured by the KG policy at six successive stages, plotted in the (x1, x2) plane.]
Experimental Results Behavior of the KG policy
Absolute class distribution error
Figure: Absolute distribution error, shown over (x1, x2) in two dimensions and over (x1, x2, x3) in three dimensions.
Experimental Results Comparison with other Policies
Competing policies
• random sampling (Random)
• a myopic method that selects the most uncertain instance at each step (MostUncertain)
• discriminative batch-mode active learning (Disc) [4] with batch size set to 1
• expected improvement (EI) [8] with an initial fit of 5 examples
• Thompson sampling (TS) [2]
• UCB on the latent function wTx (UCB) [6]
Metric
Opportunity cost (OC):

    OC := max_{x∈X} p(y = +1|x, w∗) − p(y = +1|x^{N+1}, w∗).
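In words: the gap between the best true success probability and that of the alternative the policy finally implements. A trivial sketch, assuming the true probabilities p(y = +1|x, w∗) are available as a lookup (names are ours):

```python
def opportunity_cost(p_true, chosen):
    """OC = max_x p(y=+1|x, w*) - p(y=+1|chosen, w*), where p_true maps
    each alternative to its true success probability under w*."""
    return max(p_true.values()) - p_true[chosen]
```

OC is zero exactly when the policy implements a true-best alternative, so lower curves are better in the plots that follow.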
Experimental Results Comparison with other Policies
Comparison with other Policies
[Figure: opportunity cost vs. number of iterations (5 to 30) for EI, UCB, TS, Random, MostUncertain, Disc, and KG on six UCI datasets: (a) sonar, (b) glass identification, (c) blood transfusion, (d) survival, (e) breast cancer (wpbc), (f) planning relax.]

Figure: Opportunity cost on UCI datasets.
Experimental Results Comparison with other Policies
Thank you! Questions?
Reference
[1] Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 33–42. Morgan Kaufmann Publishers Inc., 1998.
[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[3] Peter I. Frazier, Warren B. Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
[4] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in Neural Information Processing Systems, pages 593–600, 2008.
[5] Steffen L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.
[6] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[7] Ilya O. Ryzhov, Warren B. Powell, and Peter I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.
[8] Matthew Tesch, Jeff Schneider, and Howie Choset. Expensive function optimization with stochastic binary outcomes. In Proceedings of the 30th International Conference on Machine Learning, pages 1283–1291, 2013.