
Page 1: Vowpal Wabbit A Machine Learning System

Vowpal Wabbit: A Machine Learning System

John Langford, Microsoft Research

http://hunch.net/~vw/

git clone git://github.com/JohnLangford/vowpal_wabbit.git

Page 2: Vowpal Wabbit A Machine Learning System

Why does Vowpal Wabbit exist?

1. Prove research.

2. Curiosity.

3. Perfectionist.

4. Solve problem better.

Page 5: Vowpal Wabbit A Machine Learning System

A user base becomes addictive

1. Mailing list of >400

2. The official strawman for large-scale logistic regression @ NIPS :-)

Page 8: Vowpal Wabbit A Machine Learning System

An example

wget http://hunch.net/~jl/VW_raw.tar.gz

vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary

provides stellar performance in 12 seconds.

Page 9: Vowpal Wabbit A Machine Learning System

Surface details

1. BSD license, automated test suite, github repository.

2. VW supports all I/O modes: executable, library, port, daemon, service (see next).

3. VW has a reasonable++ input format: sparse, dense, namespaces, etc.

4. Mostly C++, but bindings in other languages of varying maturity (python, C#, Java good).

5. A substantial user base + developer base. Thanks to many who have helped.

Page 14: Vowpal Wabbit A Machine Learning System

VW service: http://tinyurl.com/vw-azureml

Problem: How to deploy a model for large-scale use?

Solution: (architecture diagram on the original slide)

Page 16: Vowpal Wabbit A Machine Learning System

This Tutorial in 4 parts

How do you:

1. use all the data?

2. solve the right problem?

3. solve complex joint problems?

4. solve interactive problems?

Page 20: Vowpal Wabbit A Machine Learning System

Using all the data: Step 1

Small RAM + large data ⇒ Online Learning

Active research area; 4-5 papers related to online learning algorithms in VW.

Page 22: Vowpal Wabbit A Machine Learning System

Using all the data: Step 2

1. 3.2 ∗ 10^6 labeled emails.

2. 433167 users.

3. ∼ 40 ∗ 10^6 unique tokens.

How do we construct a spam filter which is personalized, yet uses global information?

Bad answer: Construct a global filter + 433167 personalized filters using a conventional hashmap to specify features. This might require 433167 ∗ 40 ∗ 10^6 ∗ 4 bytes ∼ 70 terabytes of RAM.
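As a quick sanity check on the arithmetic above (a sketch; the 4 bytes per weight assumes float32 parameters):

```python
# all numbers from the slide; assumes 4-byte (float32) weights
users = 433167
tokens = 40 * 10**6
bytes_per_weight = 4

total_bytes = users * tokens * bytes_per_weight
terabytes = total_bytes / 10**12
print(round(terabytes))  # 69, i.e. on the order of 70 terabytes
```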

Page 24: Vowpal Wabbit A Machine Learning System

Using Hashing

Use hashing to predict according to: 〈w , φ(x)〉 + 〈w , φ_u(x)〉

(Figure: a text document (email) is tokenized into a bag of words, duplicated with user-prefixed copies, e.g. NEU / USER123_NEU, Votre / USER123_Votre, Apotheke / USER123_Apotheke, then hashed into one sparse vector used for classification.)

(in VW: specify the userid as a feature and use -q)
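The hashing scheme in the figure can be sketched in a few lines. This is an illustration, not VW's implementation: Python's built-in hash stands in for VW's hash function, and the 2^18 table size is arbitrary.

```python
import re

B = 18          # bits; 2**18 weight slots (illustrative, not VW's default)
DIM = 2 ** B

def features(text, userid):
    tokens = re.findall(r"\w+", text.lower())
    # duplicate the bag of words: one global copy, one user-prefixed copy
    duplicated = tokens + [f"{userid}_{t}" for t in tokens]
    x = {}
    for t in duplicated:
        idx = hash(t) % DIM   # collisions are tolerated by design
        x[idx] = x.get(idx, 0.0) + 1.0
    return x

def predict(w, x):
    # <w, phi(x)> + <w, phi_u(x)> collapse into one sparse dot product
    return sum(w[i] * v for i, v in x.items())

w = [0.0] * DIM
x = features("Votre Apotheke en ligne Euro", "USER123")
print(predict(w, x))  # 0.0 before any training
```

Both the global and the per-user features live in the same fixed-size weight vector, so memory no longer grows with users × tokens.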

Page 25: Vowpal Wabbit A Machine Learning System

Results

!"#$%!"#&% !"##% !"##% !%

!"!'%

#"$'%

#"(#%

#")$% #")(%

#"##%

#"'#%

#"*#%

#")#%

#"$#%

!"##%

!"'#%

!$% '#% ''% '*% ')%

!"#$%$

&!!'(#)*%+(*,#-.*%)/%0#!*,&1*2%

0%0&)!%&1%3#!3')#0,*%

+,-./,01/2134%

5362-7/,8934%

./23,873%

226 parameters = 64M parameters = 256MB ofRAM.An x270K savings in RAM requirements.

Page 26: Vowpal Wabbit A Machine Learning System

Applying for a fellowship in 1997

Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.

Page 29: Vowpal Wabbit A Machine Learning System

Using all the data: Step 3

Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?

17B examples, 16M parameters, 1K nodes. How long does it take?

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.

Page 32: Vowpal Wabbit A Machine Learning System

MPI-style AllReduce

(Figure: allreduce initial state; seven nodes in a tree holding the values 7, 5, 1, 2, 3, 4, 6. The following slides animate the process: create a binary tree, reduce the values up the tree, then broadcast the total back down until every node holds the sum, 28.)

AllReduce = Reduce + Broadcast

Properties:

1. How long does it take? O(1) time(*)

2. How much bandwidth? O(1) bits(*)

3. How hard to program? Very easy

(*) When done right.
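A toy simulation of the reduce-then-broadcast pattern, assuming a complete binary tree laid out in an array (purely illustrative; VW's actual implementation communicates over sockets):

```python
# Each node starts with one number; reduce sums up the tree, broadcast
# sends the total back down, so every node ends with the global sum.
def allreduce(values):
    n = len(values)
    total = [None] * n

    def reduce_up(i):          # returns the sum of the subtree rooted at i
        s = values[i]
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                s += reduce_up(child)
        return s

    def broadcast_down(i, s):  # every node receives the global sum
        total[i] = s
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                broadcast_down(child, s)

    broadcast_down(0, reduce_up(0))
    return total

print(allreduce([7, 5, 1, 2, 3, 4, 6]))  # [28, 28, 28, 28, 28, 28, 28]
```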

Page 41: Vowpal Wabbit A Machine Learning System

An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max):

1. While (examples left): do online update.

2. AllReduce(weights)

3. For each weight: w ← w/n
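The algorithm above can be sketched on toy data. The perceptron-style learner, the shards, and the learning rate are illustrative stand-ins, with plain list arithmetic playing the role of AllReduce:

```python
def online_pass(w, shard, lr=0.1):
    # one online pass of mistake-driven (perceptron-style) updates
    for x, y in shard:            # x: feature list, y: +1 or -1
        pred = sum(wi * xi for wi, xi in zip(w, x))
        if y * pred <= 0:
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

shards = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1)],   # node 1's data
    [([1.0, 1.0], 1), ([0.0, 2.0], -1)],   # node 2's data
]
n = len(shards)                            # n = AllReduce(1)
local = [online_pass([0.0, 0.0], s) for s in shards]
summed = [sum(ws) for ws in zip(*local)]   # AllReduce(weights)
avg = [wi / n for wi in summed]            # w <- w / n
print(avg)  # [0.1, -0.1]
```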

Code tour

Page 43: Vowpal Wabbit A Machine Learning System

What is Hadoop AllReduce?

1. A "Map" job moves the program to the data.

2. Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce.

3. Speculative execution: In a busy cluster, one node is often slow. Use speculative execution to start additional mappers.

Page 46: Vowpal Wabbit A Machine Learning System

Robustness & Speedup

(Figure: speedup vs. number of nodes (10 to 100) for Average_10, Min_10, and Max_10, compared against linear speedup.)

Page 47: Vowpal Wabbit A Machine Learning System

Splice Site Recognition

(Figure: two panels of auPRC vs. iteration, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS; the second panel zooms into the 0.466 to 0.484 range.)

Page 48: Vowpal Wabbit A Machine Learning System

Splice Site Recognition

(Figure: auPRC vs. effective number of passes over the data, comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.)

Page 49: Vowpal Wabbit A Machine Learning System

This Tutorial in 4 parts

How do you:

1. use all the data?

2. solve the right problem?

3. solve complex joint problems easily?

4. solve interactive problems?

Page 50: Vowpal Wabbit A Machine Learning System

Applying Machine Learning in Practice

1. Ignore the mismatch. Often faster.

2. Understand the problem and find a more suitable tool. Often better.

Page 52: Vowpal Wabbit A Machine Learning System

Importance Weighted Classi�cation

Importance-Weighted Classification

- Given training data {(x1, y1, c1), . . . , (xn, yn, cn)}, produce a classifier h : X → {0, 1}.
- Unknown underlying distribution D over X × {0, 1} × [0,∞).
- Find h with small expected cost:

ℓ(h,D) = E_(x,y,c)∼D[c · 1(h(x) ≠ y)]

Page 53: Vowpal Wabbit A Machine Learning System

Where does this come up?

1. Spam Prediction (Ham predicted as Spam is much worse than Spam predicted as Ham.)

2. Distribution Shifts (Optimize search engine results for monetizing queries.)

3. Boosting (Reweight problem examples for residual learning.)

4. Large Scale Learning (Downsample the common class and importance-weight to compensate.)
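Item 4 can be made concrete. The class counts and keep-probability below are invented for illustration; the point is that weighting survivors by 1/p keeps weighted statistics close to those of the full data:

```python
import random

random.seed(0)
data = [("common", 1.0)] * 9000 + [("rare", 1.0)] * 1000

p = 0.1   # keep 10% of the common class
sample = []
for label, wgt in data:
    if label == "common":
        if random.random() < p:
            sample.append((label, wgt / p))   # importance weight 1/p compensates
    else:
        sample.append((label, wgt))

full_common = sum(w for l, w in data if l == "common")
est_common = sum(w for l, w in sample if l == "common")
print(full_common, round(est_common))  # weighted count stays close to the true count
```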

Page 54: Vowpal Wabbit A Machine Learning System

Multiclass Classi�cation

Distribution D over X × Y , where Y = {1, . . . , k}. Find a classifier h : X → Y minimizing the multi-class loss on D:

ℓ_k(h,D) = Pr_(x,y)∼D[h(x) ≠ y]

1. Categorization: Which of k things is it?

2. Actions: Which of k choices should be made?

Page 56: Vowpal Wabbit A Machine Learning System

Use in VW

Multiclass label format: Label [Importance] ['Tag]

Methods:
--oaa k: one-against-all prediction. O(k) time. The baseline.
--ect k: error-correcting tournament. O(log(k)) time.
--log_multi n: adaptive log time. O(log(n)) time.

Page 57: Vowpal Wabbit A Machine Learning System

One-Against-All (OAA)

Create k binary problems, one per class. For class i predict "Is the label i or not?"

(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), . . . , (x, 1(y = k))
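The reduction above, sketched as a data transformation (the dataset and class count are illustrative):

```python
def oaa_datasets(data, k):
    # data: list of (x, y) with y in {1, ..., k}
    # one binary dataset per class: "is the label i or not?"
    return {i: [(x, 1.0 if y == i else 0.0) for x, y in data]
            for i in range(1, k + 1)}

data = [([1.0, 0.0], 1), ([0.0, 1.0], 2), ([1.0, 1.0], 3)]
per_class = oaa_datasets(data, 3)
print(per_class[2])  # [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0)]
```

At prediction time, each of the k learned scorers is evaluated and the highest-scoring class wins.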

Page 58: Vowpal Wabbit A Machine Learning System

The inconsistency problem

Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier.

Example with 3 classes:
Prob(label|features): 1/2 − δ, 1/4 + δ/2, 1/4 + δ/2

Each binary problem (1v23, 2v13, 3v12) has its positive-class probability below 1/2, so every optimal binary classifier predicts 0 and no class is chosen.

Solution: always use one-against-all regression.

Page 60: Vowpal Wabbit A Machine Learning System

Cost-sensitive Multiclass

Cost-sensitive multiclass classification: Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices.

Find a classifier h : X → {1, . . . , k} minimizing the expected cost:

cost(h,D) = E_(x,c)∼D[c_h(x)]

1. Is this packet {normal, error, attack}?

2. A subroutine used later...

Page 62: Vowpal Wabbit A Machine Learning System

Use in VW

Label information via sparse vector.

A test example:
|Namespace Feature

A test example with only classes 1, 2, 4 valid:
1: 2: 4: |Namespace Feature

A training example with only classes 1, 2, 4 valid:
1:0.4 2:3.1 4:2.2 |Namespace Feature

Methods:
--csoaa k: cost-sensitive OAA prediction. O(k) time.
--csoaa_ldf: label-dependent features OAA.
--wap_ldf: LDF weighted-all-pairs.

Page 64: Vowpal Wabbit A Machine Learning System

Code Tour

Page 65: Vowpal Wabbit A Machine Learning System

This Tutorial in 4 parts

How do you:

1. use all the data?

2. solve the right problem?

3. solve complex joint problems easily?

4. solve interactive problems?

Page 66: Vowpal Wabbit A Machine Learning System

The Problem: Joint Prediction

How?

1. Each prediction is independent.
2. Multitask learning.
3. Assume a tractable graphical model, optimize.
4. Hand-crafted approaches.

Page 71: Vowpal Wabbit A Machine Learning System

What makes a good solution?

1. Programming complexity. Most complex problems are addressed independently; too complex to do otherwise.

2. Prediction accuracy. It had better work well.

3. Train speed. Debug/development productivity + maximum data input.

4. Test speed. Application efficiency.

Page 76: Vowpal Wabbit A Machine Learning System

A program complexity comparison

(Figure: lines of code for POS tagging, log scale from 1 to 1000, comparing CRFSGD, CRF++, S-SVM, and Search.)

Page 77: Vowpal Wabbit A Machine Learning System

(Figure: POS tagging with tuned hyperparameters: per-word accuracy vs. training time (10^-3 to 10^3 minutes) for OAA, L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, and StrSVM2; accuracies range from 90.7 to 96.6.)

Page 78: Vowpal Wabbit A Machine Learning System

(Figure: prediction (test-time) speed in thousands of tokens per second on NER and POS for L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, and StrSVM2.)

Page 79: Vowpal Wabbit A Machine Learning System

How do you program?

Sequential_RUN(examples)

1: for i = 1 to len(examples) do
2:   prediction ← predict(examples[i], examples[i].label)
3:   loss(prediction ≠ examples[i].label)
4: end for

In essence, write the decoder, providing a little bit of side information for training.

Page 81: Vowpal Wabbit A Machine Learning System

RunParser(sentence)

1: stack S ← {Root}
2: buffer B ← [words in sentence]
3: arcs A ← ∅
4: while B ≠ ∅ or |S| > 1 do
5:   ValidActs ← GetValidActions(S, B)
6:   features ← GetFeat(S, B, A)
7:   ref ← GetGoldAction(S, B)
8:   action ← predict(features, ref, ValidActs)
9:   S, B, A ← Transition(S, B, A, action)
10: end while
11: loss(A[w] ≠ A∗[w], ∀w ∈ sentence)
12: return output

Page 82: Vowpal Wabbit A Machine Learning System

How does it work?

An application of "Learning to Search" algorithms (e.g. Searn, DAgger, LOLS [ICML2015]).

The decoder is run many times at train time to optimize predict(...) for loss(...). See the tutorial with Hal Daume @ ICML2015 + the LOLS paper @ ICML2015.

Page 83: Vowpal Wabbit A Machine Learning System

Named Entity Recognition

Is this word part of an organization, person, or not?

(Figure: named entity recognition with tuned hyperparameters: per-entity F-score vs. training time (0.1 to 10 minutes); F-scores range from 73.3 to 80.0 across the same methods as the POS comparison.)

Page 84: Vowpal Wabbit A Machine Learning System

Entity Relation

Goal: find the Entities and then find their Relations.

Method          Entity F1   Relation F1   Train Time
Structured SVM  88.00       50.04         300 seconds
L2S             92.51       52.03         13 seconds

Requires about 100 LOC.

Page 85: Vowpal Wabbit A Machine Learning System

Dependency Parsing

Goal: Find the dependency structure of a sentence.

(Figure: UAS (higher = better) from 70 to 95 across the languages Ar*, Bu, Ch, Da, Du, En, Ja, Po*, Sl*, Sw, and Avg, comparing L2S, Dyna, and SNN.)

Requires about 300 LOC.

Page 86: Vowpal Wabbit A Machine Learning System

A demonstration

wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip

vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45 --search_alpha 1e-8 --search_neighbor_features -1:w,1:w --affix -1w,+1w -f foo.reg; vw -t -i foo.reg wsj.test.vw

Page 87: Vowpal Wabbit A Machine Learning System

This Tutorial in 4 parts

How do you:

1. use all the data?

2. solve the right problem?

3. solve complex joint problems easily?

4. solve interactive problems?

Page 88: Vowpal Wabbit A Machine Learning System

Examples of Interactive Learning

Repeatedly:

1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account)

2. Microsoft chooses information to present (urls, ads, news stories)

3. The user reacts to the presented information (clicks on something, clicks, comes back and clicks again, ...)

Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.

Page 89: Vowpal Wabbit A Machine Learning System

Another Example: Clinical Decision Making

Repeatedly:

1. A patient comes to a doctor with symptoms, medical history, test results

2. The doctor chooses a treatment

3. The patient responds to it

The doctor wants a policy for choosing targeted treatments for individual patients.

Page 90: Vowpal Wabbit A Machine Learning System

The Contextual Bandit Setting

For t = 1, . . . ,T :

1. The world produces some context x ∈ X

2. The learner chooses an action a ∈ A

3. The world reacts with reward ra ∈ [0, 1]

Goal: Learn a good policy for choosing actions given context.
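One plausible way to generate exploration data in this setting is epsilon-greedy logging, recording (x, a, r_a, p_a) for each round so that off-policy estimators apply later. Everything here (the reward function, the fixed policy, epsilon) is an invented stand-in, not VW's exploration code:

```python
import random

random.seed(1)
ACTIONS = [1, 2]
EPS = 0.25        # exploration rate (illustrative)

def reward(x, a):
    # hypothetical world, unknown to the learner
    return 1.0 if a == (1 if x == "x1" else 2) else 0.2

log = []
for t in range(1000):
    x = random.choice(["x1", "x2"])     # 1. the world produces a context
    greedy = 1                          # a fixed "current policy" for the sketch
    if random.random() < EPS:
        a = random.choice(ACTIONS)      # 2. explore uniformly with prob. EPS
    else:
        a = greedy                      #    otherwise act greedily
    p = EPS / len(ACTIONS) + (1 - EPS) * (a == greedy)
    log.append((x, a, reward(x, a), p)) # 3. the world reacts with a reward

print(len(log), log[0])
```

Logging the action probability p alongside the reward is what makes the later importance-weighted evaluation possible.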

Page 97: Vowpal Wabbit A Machine Learning System

The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True
      a1          a2
x1    .8/.8/.8    ?/.514/1
x2    .3/.3/.3    .2/.014/.2

Page 98: Vowpal Wabbit A Machine Learning System

The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Observed/Estimated/True
      a1          a2
x1    .8/.8/.8    ?/.514/1
x2    .3/.3/.3    .2/.014/.2

Basic observation 1: Generalization insufficient.

Page 99: Vowpal Wabbit A Machine Learning System

The "Direct method" (continued)

Basic observation 2: Exploration required.

Page 100: Vowpal Wabbit A Machine Learning System

The "Direct method" (continued)

Basic observation 3: Errors ≠ exploration.

Page 101: Vowpal Wabbit A Machine Learning System

The Evaluation Problem

Let π : X → A be a policy mapping features to actions. How do we evaluate it?

Method 1: Deploy algorithm in the world.

Very Expensive!

Page 103: Vowpal Wabbit A Machine Learning System

The Importance Weighting Trick

Let π : X → A be a policy mapping features to actions. How do we evaluate it?

One answer: Collect T exploration samples (x, a, ra, pa), where
x = context
a = action
ra = reward for action
pa = probability of action a

then evaluate:

Value(π) = Average( ra · 1(π(x) = a) / pa )
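The estimator can be written directly. The two logged samples below reuse the 1/4 and 3/4 action probabilities from the worked example on a later slide; the policies themselves are illustrative:

```python
def value(pi, samples):
    # samples: list of (x, a, r_a, p_a) exploration tuples
    return sum(r * (pi(x) == a) / p for x, a, r, p in samples) / len(samples)

# two logged samples with action probabilities 1/4 and 3/4
samples = [("x", 1, 0.5, 0.25), ("x", 2, 1.0, 0.75)]

always_1 = lambda x: 1
always_2 = lambda x: 2
print(value(always_1, samples))  # (0.5/0.25 + 0) / 2 = 1.0
print(value(always_2, samples))  # (0 + 1.0/0.75) / 2, about 0.667
```

Dividing by the logged probability p_a is what removes the bias introduced by the exploration distribution.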

Page 105: Vowpal Wabbit A Machine Learning System

The Importance Weighting Trick

Theorem: For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:

E_(x,~r)∼D[ r_π(x) ] = E[ Value(π) ]

Proof: E_a∼p[ ra · 1(π(x) = a) / pa ] = Σ_a pa · ra · 1(π(x) = a) / pa = r_π(x)

Example:
Action       1       2
Reward       0.5     1
Probability  1/4     3/4
Estimate     2 | 0   0 | 4/3

(If action 1 is sampled, the estimate is 0.5/(1/4) = 2 for policies choosing action 1 and 0 otherwise; if action 2 is sampled, it is 1/(3/4) = 4/3 or 0.)

Page 108: Vowpal Wabbit A Machine Learning System

How do you test things?

Use the format:

action:cost:probability | features

Example:

1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin refin billion telephon time draw basic relat file spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
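Producing this format from logged exploration data is a one-liner (a sketch; the feature tokens below are illustrative):

```python
def cb_line(action, cost, prob, features):
    """Format one logged contextual-bandit example for VW's --cb input:
    action:cost:probability followed by the feature namespace."""
    return f"{action}:{cost}:{prob} | " + " ".join(features)

print(cb_line(1, 1, 0.5, ["tuesday", "year", "million"]))
# → 1:1:0.5 | tuesday year million
```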

Page 109: Vowpal Wabbit A Machine Learning System

How do you train?

Reduce to cost-sensitive classification.

vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
Progressive 0/1 loss: 0.04582

vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.05065

vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.04679

Page 112: Vowpal Wabbit A Machine Learning System

Reminder: Contextual Bandit Setting

For t = 1, . . . ,T :

1. The world produces some context x ∈ X

2. The learner chooses an action a ∈ A

3. The world reacts with reward ra ∈ [0, 1]

Goal: Learn a good policy for choosing actions given context.

What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:

Regret = max_{π∈Π} average_t ( r_{π(x)} − r_a )
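When full reward vectors are known (as in a simulation), this regret can be computed directly. A toy sketch, with a made-up two-round log and a two-policy reference class:

```python
# Each round: (context, full reward vector over actions, chosen action).
# Full reward vectors are only available in simulation, not in deployment.
rounds = [
    ({"x": 0}, {1: 0.2, 2: 0.9}, 1),
    ({"x": 1}, {1: 0.8, 2: 0.1}, 2),
]
policies = [lambda x: 1, lambda x: 2]  # reference class Π

def regret(rounds, policies):
    # max over π ∈ Π of the average gap r_{π(x)} − r_a
    return max(
        sum(r[pi(x)] - r[a] for x, r, a in rounds) / len(rounds)
        for pi in policies
    )

print(regret(rounds, policies))  # both constant policies gain 0.7 on one round: 0.35
```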

Page 113: Vowpal Wabbit A Machine Learning System

Building an Algorithm

Let Q1 = uniform distribution

For t = 1, . . . ,T :

1. The world produces some context x ∈ X

2. Draw π ∼ Qt

3. The learner chooses an action a ∈ A using π(x).

4. The world reacts with reward ra ∈ [0, 1]

5. Update Qt+1
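The loop on this slide can be sketched abstractly. This is a toy illustration only: the environment, policies, and the multiplicative weight update are placeholders, not the actual reduction VW implements for choosing Q_t:

```python
import random

class ToyWorld:
    """Stub environment: two contexts, reward 1 iff the action matches."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def context(self):
        return self.rng.choice([0, 1])
    def reward(self, x, a):
        return 1.0 if a == x else 0.0

def run(T, policies, world, rng=random.Random(1)):
    weights = [1.0] * len(policies)       # Q_1 = uniform distribution
    log = []
    for t in range(T):
        x = world.context()               # 1. world produces context x
        i = rng.choices(range(len(policies)), weights)[0]  # 2. draw π ∼ Q_t
        a = policies[i](x)                # 3. choose action a = π(x)
        r = world.reward(x, a)            # 4. observe reward r_a ∈ [0, 1]
        weights[i] *= 1.0 + r             # 5. placeholder update of Q_{t+1}
        log.append((x, a, r))
    return log

log = run(100, [lambda x: 0, lambda x: 1, lambda x: x], ToyWorld())
```

With this update, policies that earn reward are drawn more often, so the exploration log concentrates on good policies over time.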

Page 115: Vowpal Wabbit A Machine Learning System

What is a good Qt?

▶ Exploration: Qt allows discovery of good policies

▶ Exploitation: Qt is small on bad policies

Page 116: Vowpal Wabbit A Machine Learning System

How do you find Qt?

by Reduction ... [details complex, but coded]

Page 118: Vowpal Wabbit A Machine Learning System

How well does this work?

[Bar chart: losses on CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; y-axis: loss, from 0 to 0.12.]

Page 119: Vowpal Wabbit A Machine Learning System

How long does this take?

[Bar chart: running times on CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; y-axis: seconds, log scale from 1 to 1e+06.]

Page 120: Vowpal Wabbit A Machine Learning System

Next for Vowpal Wabbit

Version 8 series just started.
Primary goal: New research (as always) + tackle deployability:

1. Backwards compatibility of models across VW versions.

2. More serious testing.

3. More serious documentation.

4. More and better library interfaces.

5. Dynamic module loading.

Page 121: Vowpal Wabbit A Machine Learning System

Further reading

VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki

Search: NYU large scale learning class

NIPS tutorial on Exploration: http://hunch.net/~jl/interact.pdf

ICML Tutorial on Learning to Search: ... coming soon.

Page 122: Vowpal Wabbit A Machine Learning System

Bibliography

Release: Vowpal Wabbit, 2007, http://github.com/JohnLangford/vowpal_wabbit/wiki

Terascale: A. Agarwal et al., A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198

Reduce: A. Beygelzimer et al., Learning Reductions that Really Work, http://arxiv.org/abs/1502.02704

LOLS: K.-W. Chang et al., Learning to Search Better Than Your Teacher, ICML 2015, http://arxiv.org/abs/1502.02206

Explore: A. Agarwal et al., Taming the Monster..., http://arxiv.org/abs/1402.0555