Vowpal Wabbit: A Machine Learning System
John Langford, Microsoft Research
http://hunch.net/~vw/
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Why does Vowpal Wabbit exist?
1. Prove research.
2. Curiosity.
3. Perfectionist.
4. Solve problem better.
A user base becomes addictive
1. Mailing list of >400
2. The official strawman for large scale logistic regression @ NIPS :-)
An example
wget http://hunch.net/~jl/VW_raw.tar.gz
vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary
provides stellar performance in 12 seconds.
Surface details
1. BSD license, automated test suite, GitHub repository.
2. VW supports all I/O modes: executable, library, port, daemon, service (see next).
3. VW has a reasonable++ input format: sparse, dense, namespaces, etc.
4. Mostly C++, but bindings in other languages of varying maturity (Python, C#, Java good).
5. A substantial user base + developer base. Thanks to many who have helped.
VW service: http://tinyurl.com/vw-azureml
Problem: How to deploy a model for large scale use?
Solution: a hosted VW service (link above).
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems?
4. solve interactive problems?
Using all the data: Step 1
Small RAM + large data ⇒ Online Learning
An active research area; 4-5 papers related to online learning algorithms in VW.
Using all the data: Step 2
1. 3.2 × 10^6 labeled emails.
2. 433167 users.
3. ~ 40 × 10^6 unique tokens.
How do we construct a spam filter which is personalized, yet uses global information?
Bad answer: Construct a global filter + 433167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40 × 10^6 × 4 ≈ 70 Terabytes of RAM.
Using Hashing
Use hashing to predict according to: 〈w, φ(x)〉 + 〈w, φ_u(x)〉
[Figure: a text document (email) is tokenized into a bag of words (NEU, Votre, Apotheke, en, ligne, Euro, ...); each token is duplicated with a user prefix (USER123_NEU, USER123_Votre, ...); both sets are hashed into one sparse vector x_h, and classification is w^T x_h.]
(in VW: specify the userid as a feature and use -q)
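The hashing trick behind this slide can be sketched in a few lines of Python. This is an illustrative sketch only: `hash_features` is a hypothetical helper with CRC32 standing in for VW's actual hash function, not VW's implementation.

```python
import zlib

def hash_features(tokens, user_id, b=26):
    """Hash each token twice: once globally (phi(x)) and once prefixed
    with the user id (phi_u(x)), into a 2^b-dimensional sparse vector."""
    dim = 1 << b
    x = {}
    for tok in tokens:
        for key in (tok, f"{user_id}_{tok}"):  # global + personalized copy
            idx = zlib.crc32(key.encode()) % dim
            x[idx] = x.get(idx, 0) + 1
    return x

# Tokens from the slide's spam email, personalized for a hypothetical user.
sparse = hash_features(["NEU", "Votre", "Apotheke", "en", "ligne", "Euro"],
                       "USER123")
```

Collisions simply add feature counts together; no hashmap of feature names is ever stored, which is what keeps the memory footprint fixed.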
Results
[Figure: spam-filtering performance relative to baseline as a function of the number of bits in the hash table (roughly 18-26), comparing global-only, personalized, and baseline variants.]
2^26 parameters = 64M parameters = 256MB of RAM.
A ~270K-fold savings in RAM requirements.
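A quick arithmetic check of these numbers (the ratio against the 70TB "bad answer" comes out near the ~270K figure):

```python
bits = 26
params = 2 ** bits                       # hashed weight vector entries
hashed_ram = params * 4                  # 4 bytes per float weight
naive_ram = 433167 * 40 * 10**6 * 4      # one dense filter per user
print(params // 2**20, "M parameters,",
      hashed_ram // 2**20, "MB,",
      round(naive_ram / hashed_ram / 1000), "K-fold savings")
```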
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.
Using all the data: Step 3
Given 2.1 Terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
17B examples, 16M parameters, 1K nodes. How long does it take?
70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single machine linear learning algorithms.
MPI-style AllReduce
AllReduce = Reduce + Broadcast
[Figures: seven nodes holding the values 1-7 are organized into a binary tree; partial sums are reduced up the tree step by step until the root holds 28, and 28 is then broadcast back down, so every node's final state is 28.]
Properties:
1. How long does it take? O(1) time(*)
2. How much bandwidth? O(1) bits(*)
3. How hard to program? Very easy
(*) When done right.
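The reduce-then-broadcast pattern in the figures can be simulated in a single process (array-indexed binary tree; in a real deployment each index is a separate machine exchanging messages):

```python
def allreduce_sum(values):
    """Simulate tree AllReduce: reduce partial sums up a binary tree,
    then broadcast the total back down, so every node holds the sum."""
    n = len(values)
    totals = list(values)
    # Reduce: children (2i+1, 2i+2) add their subtree sums into parent i.
    for i in reversed(range(n)):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                totals[i] += totals[child]
    # Broadcast: each parent pushes the global total down to its children.
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                totals[child] = totals[i]
    return totals

# Seven nodes holding 1..7, as in the figures: every node ends with 28.
print(allreduce_sum([1, 2, 3, 4, 5, 6, 7]))  # [28, 28, 28, 28, 28, 28, 28]
```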
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max):
1. While (examples left):
   1.1 Do online update.
2. AllReduce(weights)
3. For each weight: w ← w/n
Code tour
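A single-process sketch of the algorithm above, with in-memory shards standing in for nodes and squared-loss SGD as an assumed online update (the AllReduce steps collapse to a plain sum here; in reality each node runs in parallel):

```python
import numpy as np

def averaged_online_pass(shards, dim, lr=0.1):
    """One pass of the slide's algorithm: each node does online updates
    on its shard, then weights are AllReduced and divided by n."""
    n = len(shards)                      # n = AllReduce(1) across nodes
    local = []
    for shard in shards:                 # each node, in parallel in reality
        w = np.zeros(dim)
        for x, y in shard:               # online squared-loss SGD update
            w -= lr * (w @ x - y) * x
        local.append(w)
    return sum(local) / n                # AllReduce(weights), then w <- w/n

# Hypothetical toy data: 4 shards of 10 examples each.
rng = np.random.default_rng(0)
shards = [[(rng.normal(size=3), 1.0) for _ in range(10)] for _ in range(4)]
w = averaged_online_pass(shards, dim=3)
```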
What is Hadoop AllReduce?
1. "Map" job moves the program to the data.
2. Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce.
3. Speculative execution: In a busy cluster, one node is often slow. Use speculative execution to start additional mappers.
Robustness & Speedup
[Figure: speedup vs. number of nodes (10-100); Average_10, Min_10, and Max_10 plotted against the ideal linear speedup.]
Splice Site Recognition
[Figure: test auPRC vs. iteration for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and plain L-BFGS; a zoomed panel shows auPRC 0.466-0.484 over 20 iterations.]
Splice Site Recognition
[Figure: auPRC vs. effective number of passes over the data for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Applying Machine Learning in Practice
1. Ignore the mismatch. Often faster.
2. Understand the problem and find a more suitable tool. Often better.
Importance-Weighted Classification
- Given training data {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, produce a classifier h : X → {0, 1}.
- Unknown underlying distribution D over X × {0, 1} × [0, ∞).
- Find h with small expected cost:
ℓ(h, D) = E_{(x,y,c)~D}[c · 1(h(x) ≠ y)]
Where does this come up?
1. Spam Prediction (Ham predicted as Spam is much worse than Spam predicted as Ham.)
2. Distribution Shifts (Optimize search engine results for monetizing queries.)
3. Boosting (Reweight problem examples for residual learning.)
4. Large Scale Learning (Downsample the common class and importance weight to compensate.)
Multiclass Classification
Distribution D over X × Y, where Y = {1, ..., k}. Find a classifier h : X → Y minimizing the multi-class loss on D:
ℓ_k(h, D) = Pr_{(x,y)~D}[h(x) ≠ y]
1. Categorization: Which of k things is it?
2. Actions: Which of k choices should be made?
Use in VW
Multiclass label format: Label [Importance] ['Tag]
Methods:
--oaa k  one-against-all prediction. O(k) time. The baseline.
--ect k  error correcting tournament. O(log(k)) time.
--log_multi n  Adaptive log time. O(log(n)) time.
One-Against-All (OAA)
Create k binary problems, one per class. For class i predict "Is the label i or not?"
(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k))
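A toy sketch of the OAA reduction, using logistic updates as an assumed binary learner (function names are illustrative, not VW's API):

```python
import numpy as np

def oaa_train(examples, k, dim, lr=0.5, passes=5):
    """One-against-all: train k binary (logistic) problems, one per class."""
    W = np.zeros((k, dim))
    for _ in range(passes):
        for x, y in examples:
            for i in range(k):             # binary problem: "is it class i?"
                target = 1.0 if y == i else 0.0
                p = 1.0 / (1.0 + np.exp(-W[i] @ x))
                W[i] += lr * (target - p) * x
    return W

def oaa_predict(W, x):
    return int(np.argmax(W @ x))           # most confident binary learner

# Toy 3-class problem: the class is the index of the hot coordinate.
data = [(np.eye(3)[i], i) for i in range(3)]
W = oaa_train(data, k=3, dim=3)
print(oaa_predict(W, np.eye(3)[1]))  # 1
```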
The inconsistency problem
Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier.
Prob(label|features): 1/2 − δ, 1/4 + δ/2, 1/4 + δ/2
Prediction: 1v23 → 0, 2v13 → 0, 3v12 → 0 (every binary problem has majority "not i", so no class is claimed).
Solution: always use one-against-all regression.
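The inconsistency is easy to check numerically: with class probabilities (1/2 − δ, 1/4 + δ/2, 1/4 + δ/2), every one-vs-rest problem's optimal 0/1 prediction is 0, even though class 1 is the clear majority (δ = 0.05 below is an arbitrary choice for the demo):

```python
d = 0.05
p = [0.5 - d, 0.25 + d / 2, 0.25 + d / 2]   # sums to 1

# The optimal "class i vs rest" binary classifier predicts 1 iff p[i] > 1/2.
binary_preds = [int(pi > 0.5) for pi in p]
print(binary_preds)                          # [0, 0, 0]: no learner claims its class

best = max(range(3), key=lambda i: p[i])     # the true best class (index 0)
```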
Cost-sensitive Multiclass
Cost-sensitive multiclass classification: Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices.
Find a classifier h : X → {1, ..., k} minimizing the expected cost:
cost(h, D) = E_{(x,c)~D}[c_{h(x)}]
1. Is this packet {normal, error, attack}?
2. A subroutine used later...
Use in VW
Label information via sparse vector.
A test example:
|Namespace Feature
A test example with only classes 1,2,4 valid:
1: 2: 4: |Namespace Feature
A training example with only classes 1,2,4 valid:
1:0.4 2:3.1 4:2.2 |Namespace Feature
Methods:
--csoaa k  cost-sensitive OAA prediction. O(k) time.
--csoaa_ldf  Label-dependent features OAA.
--wap_ldf  LDF Weighted-all-pairs.
Code Tour
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
The Problem: Joint Prediction
How?
1. Each prediction is independent.
2. Multitask learning.
3. Assume tractable graphical model, optimize.
4. Hand-crafted approaches.
What makes a good solution?
1. Programming complexity. Most complex problems are addressed independently; doing otherwise is too complex.
2. Prediction accuracy. It had better work well.
3. Train speed. Debug/development productivity + maximum data input.
4. Test speed. Application efficiency.
A program complexity comparison
[Figure: lines of code for POS tagging, log scale from 1 to 1000, for CRFSGD, CRF++, S-SVM, and Search.]
POS Tagging (tuned hyperparameters)
[Figure: per-word accuracy (90.7-96.6) vs. training time (1s to 1h, log scale) for OAA, L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2.]
Prediction (test-time) Speed
[Figure: thousands of tokens per second on NER and POS for L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2.]
How do you program?
Sequential_RUN(examples)
1: for i = 1 to len(examples) do
2:   prediction ← predict(examples[i], examples[i].label)
3:   loss(prediction ≠ examples[i].label)
4: end for
In essence, write the decoder, providing a little bit of side information for training.
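The Sequential_RUN pattern can be sketched as ordinary Python, with a stand-in memorizing predictor (hypothetical, purely for illustration) playing the role of predict():

```python
from dataclasses import dataclass

@dataclass
class Example:
    features: str
    label: str

def sequential_run(examples, predict):
    """The slide's imperative pattern: write the decoder once; predict()
    is handed the oracle label as side information for training."""
    mistakes = 0
    for ex in examples:
        prediction = predict(ex.features, ex.label)
        mistakes += int(prediction != ex.label)      # loss(...)
    return mistakes

# Stand-in predictor: tally word -> tag counts, predict the majority tag.
counts = {}
def predict(features, oracle):
    counts.setdefault(features, {}).setdefault(oracle, 0)
    counts[features][oracle] += 1
    return max(counts[features], key=counts[features].get)

data = [Example("the", "DET"), Example("dog", "NOUN"), Example("the", "DET")]
print(sequential_run(data, predict))  # 0
```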
RunParser(sentence)
1: stack S ← {Root}
2: buffer B ← [words in sentence]
3: arcs A ← ∅
4: while B ≠ ∅ or |S| > 1 do
5:   ValidActs ← GetValidActions(S, B)
6:   features ← GetFeat(S, B, A)
7:   ref ← GetGoldAction(S, B)
8:   action ← predict(features, ref, ValidActs)
9:   S, B, A ← Transition(S, B, A, action)
10: end while
11: loss(A[w] ≠ A*[w], ∀w ∈ sentence)
12: return output
How does it work?
An application of "Learning to Search" algorithms (e.g. Searn, DAgger, LOLS [ICML 2015]).
The decoder is run many times at train time to optimize predict(...) for loss(...).
See the tutorial with Hal Daume @ ICML 2015 + the LOLS paper @ ICML 2015.
Named Entity Recognition
Is this word part of an organization, person, or not?
[Figure: per-entity F-score (73.3-80.0) vs. training time (10s to 10m, log scale) for the same methods, tuned hyperparameters.]
Entity Relation
Goal: find the Entities and then find their Relations.

Method          Entity F1  Relation F1  Train Time
Structured SVM  88.00      50.04        300 seconds
L2S             92.51      52.03        13 seconds

Requires about 100 LOC.
Dependency Parsing
Goal: Find the dependency structure of a sentence.
[Figure: UAS (higher = better, 70-95) per language (Ar*, Bu, Ch, Da, Du, En, Ja, Po*, Sl*, Sw, Avg) for L2S, Dyna, and SNN.]
Requires about 300 LOC.
A demonstration
wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip
vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45
  --search_alpha 1e-8 --search_neighbor_features -1:w,1:w
  --affix -1w,+1w -f foo.reg; vw -t -i foo.reg wsj.test.vw
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account)
2. Microsoft chooses information to present (urls, ads, news stories)
3. The user reacts to the presented information (clicks on something, clicks, comes back and clicks again, ...)
Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.
Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for choosing targeted treatments for individual patients.
The Contextual Bandit Setting
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward ra ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
The "Direct method"
Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).
Example: The deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True
      a1          a2
x1    .8/.8/.8    ?/.514/1
x2    .3/.3/.3    .2/.014/.2

Basic observation 1: Generalization is insufficient.
Basic observation 2: Exploration is required.
Basic observation 3: Errors ≠ exploration.
The Evaluation Problem
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: Deploy the algorithm in the world.
Very Expensive!
The Importance Weighting Trick
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
One answer: Collect T exploration samples (x, a, r_a, p_a), where
x = context
a = action
r_a = reward for action
p_a = probability of action a
then evaluate:
Value(π) = Average(r_a · 1(π(x) = a) / p_a)
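The estimator is essentially a one-liner, and simulating the logging policy shows it converging to a policy's true value. The rewards (0.5, 1) and logging probabilities (1/4, 3/4) below are toy values chosen for the demo:

```python
import random

def ips_value(samples, policy):
    """Inverse-propensity-scored estimate of a policy's value from logged
    exploration tuples (x, a, r_a, p_a), matching Value(pi) above."""
    return sum(r * (policy(x) == a) / p
               for (x, a, r, p) in samples) / len(samples)

# Toy problem: actions 1 and 2 with rewards 0.5 and 1, logged by a
# randomized policy playing them with probabilities 1/4 and 3/4.
random.seed(0)
reward = {1: 0.5, 2: 1.0}
logs = []
for _ in range(100_000):
    a = 1 if random.random() < 0.25 else 2
    p = 0.25 if a == 1 else 0.75
    logs.append((None, a, reward[a], p))

print(ips_value(logs, lambda x: 2))  # close to 1.0, the value of "always 2"
```

Note the division by p_a: rarely-logged actions get upweighted, which is exactly what makes the estimate unbiased.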
The Importance Weighting Trick
Theorem: For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:
E_{(x, r⃗)~D}[r_{π(x)}] = E[Value(π)]
Proof: E_{a~p}[r_a · 1(π(x) = a) / p_a] = Σ_a p_a · r_a · 1(π(x) = a) / p_a = r_{π(x)}

Example:
Action       1      2
Reward       0.5    1
Probability  1/4    3/4
Estimate     2 | 0  0 | 4/3
How do you test things?
Use format: action:cost:probability | features
Example:
1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin refin billion telephon time draw basic relat file spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
How do you train?
Reduce to cost-sensitive classification.
vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
Progressive 0/1 loss: 0.04582
vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.05065
vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.04679
Reminder: Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward r_a ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:
Regret = max_{π∈Π} average_t(r_{π(x)} − r_a)
Building an Algorithm
Let Q_1 = uniform distribution
For t = 1, ..., T:
1. The world produces some context x ∈ X
2. Draw π ~ Q_t
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward r_a ∈ [0, 1]
5. Update Q_{t+1}
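This loop can be sketched end-to-end with a toy two-policy class. The Q_t update below is a naive upweighting of policies with high IPS-estimated value, purely to make the loop concrete; the real Q_t computation is a more sophisticated reduction:

```python
import random

def policy_distribution_bandit(policies, contexts, reward_fn, rounds, seed=0):
    """Toy sketch of the loop above: keep weights Q_t over a small policy
    class, draw pi ~ Q_t, act with pi(x), and upweight policies whose
    IPS-estimated value is high."""
    rng = random.Random(seed)
    q = [1.0] * len(policies)                  # Q_1 = uniform
    value = [0.0] * len(policies)              # running IPS value estimates
    for t in range(1, rounds + 1):
        x = rng.choice(contexts)               # 1. world produces context
        i = rng.choices(range(len(policies)), weights=q)[0]   # 2. pi ~ Q_t
        a = policies[i](x)                     # 3. act using pi(x)
        r = reward_fn(x, a)                    # 4. world reacts with reward
        p_a = sum(w for w, pi in zip(q, policies) if pi(x) == a) / sum(q)
        for j, pi in enumerate(policies):      # IPS credit to agreeing pis
            if pi(x) == a:
                value[j] += r / p_a
        q = [1.0 + v / t for v in value]       # 5. update Q_{t+1}
    return q

# Two hypothetical policies: one matches the context, one is constant.
good, bad = (lambda x: x), (lambda x: "a1")
reward = lambda x, a: 1.0 if a == x else 0.0
q = policy_distribution_bandit([good, bad], ["a1", "a2"], reward, rounds=500)
```

After 500 rounds the context-matching policy carries the larger weight, since only it collects reward on "a2" contexts.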
What is a good Q_t?
- Exploration: Q_t allows discovery of good policies
- Exploitation: Q_t small on bad policies
How do you find Q_t?
By reduction ... [details complex, but coded]
How well does this work?
[Figure: losses on the CCAT RCV1 problem (0-0.12) for eps-greedy, tau-first, LinUCB*, and Cover.]
How long does this take?
[Figure: running times on the CCAT RCV1 problem, log scale from 1 to 10^6 seconds, for eps-greedy, tau-first, LinUCB*, and Cover.]
Next for Vowpal Wabbit
Version 8 series just started.
Primary goal: New research (as always) + tackle deployability:
1. Backwards compatibility of model to VW.
2. More serious testing.
3. More serious documentation.
4. More and better library interfaces.
5. Dynamic module loading.
Further reading
VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
Search: NYU large scale learning class
NIPS tutorial on Exploration:http://hunch.net/~jl/interact.pdf
ICML Tutorial on Learning to Search: ... coming soon.
Bibliography
Release: Vowpal Wabbit, 2007, http://github.com/JohnLangford/vowpal_wabbit/wiki.
Terascale: A. Agarwal et al, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198
Reduce: A. Beygelzimer et al, Learning Reductions that Really Work, http://arxiv.org/abs/1502.02704
LOLS: K. Chang et al, Learning to Search Better Than Your Teacher, ICML 2015, http://arxiv.org/abs/1502.02206
Explore: A. Agarwal et al, Taming the Monster..., http://arxiv.org/abs/1402.0555