
Page 1: Learning CRFs with Hierarchical Features: An Application to Go

Scott Sanner (University of Toronto)
Thore Graepel, Ralf Herbrich, Tom Minka (Microsoft Research)

(with thanks to David Stern and Mykel Kochenderfer)

Page 2: The Game of Go

• Started about 4000 years ago in ancient China
• About 60 million players worldwide
• 2 Players: Black and White
• Board: 19×19 grid
• Rules:
  – Turn: One stone placed on a vertex.
  – Capture: Surrounded stones are removed.
• Aim: Gather territory by surrounding it

[Figure: example board position annotated with White territory and Black territory]

Page 3: Territory Prediction

• Goal: predict the territory distribution $P(\vec{s} \mid \vec{c})$ given the board configuration $\vec{c}$

• How to predict territory?
  – Could use search:
    • Full α-β search impractical (avg. 180 moves/turn, >100 moves/game)
    • Monte Carlo averaging is a reasonable estimator, but costly (sketched below)
  – We learn to directly predict territory:
    • Learn from expert data
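To make the Monte Carlo baseline concrete, here is a minimal Python sketch of territory estimation by playout averaging. It is illustrative only, not the implementation evaluated in the talk; the playout function is an assumed, caller-supplied helper.

```python
def monte_carlo_territory(board, playout, num_rollouts=100):
    """Estimate the expected territory E[s_i] at each vertex by
    averaging final ownership over many random playouts.

    `board` is any position representation accepted by `playout`;
    `playout(board)` is an assumed caller-supplied function that plays
    random legal moves to the end of the game and returns one +1
    (black) / -1 (white) ownership label per vertex.
    """
    totals = None
    for _ in range(num_rollouts):
        ownership = playout(board)
        if totals is None:
            totals = [0.0] * len(ownership)
        for i, s in enumerate(ownership):
            totals[i] += s
    return [t / num_rollouts for t in totals]
```

The cost is one full game per rollout per position, which is consistent with Monte Carlo being reported later as one of the slowest estimators.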

Page 4: Talk Outline

• Hierarchical pattern features

• Independent classifiers
  – What is the best way to combine features?

• CRF models
  – Coupling factor model (w/ patterns)
  – Exact inference intractable
  – What is the best training / inference method to circumvent intractability?

• Evaluation and Conclusions

Page 5: Hierarchical Patterns

• Centered on a single position
• Exact configuration of stones
• Fixed match region (template)
• 8 nested templates (keying sketched below)
• 3.8 million patterns mined
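As an illustration of how nested templates might be keyed, the following sketch extracts the exact stone configuration inside each of 8 nested regions around a vertex. The diamond-shaped region and the board encoding are assumptions for illustration; the slides do not specify the template shapes here.

```python
def pattern_keys(board, x, y, size=19, num_templates=8):
    """Return one hashable key per nested template centred on (x, y).
    Each key encodes the exact configuration of stones inside that
    template's fixed match region; identical keys = identical patterns.
    `board[px][py]` is 'B', 'W', or '.' (empty) -- an assumed encoding."""
    keys = []
    for k in range(1, num_templates + 1):
        cells = []
        for dx in range(-k, k + 1):
            for dy in range(-k, k + 1):
                if abs(dx) + abs(dy) <= k:            # assumed diamond region
                    px, py = x + dx, y + dy
                    if 0 <= px < size and 0 <= py < size:
                        cells.append(board[px][py])
                    else:
                        cells.append('#')             # off-board marker
    return keys
        keys.append((k, ''.join(cells)))
    return keys
```

Pattern mining can then be done by counting key frequencies over expert games, though the talk's exact mining criteria are not given on this slide.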

Page 6: Models

(a) Independent pattern-based classifiers
(b) (Coupling) CRF
(c) Pattern CRF

Vertex variables: $s_i$, $c_i$

Unary pattern-based factors:
$$\psi_i(s_i = +1, \vec{c}) = \exp\left(\sum_{\tilde{\pi} \in \tilde{\Pi}} \lambda_{\tilde{\pi}} \cdot I_{\tilde{\pi}}(\vec{c}, i)\right)$$

Coupling factors:
$$\psi_{(i,j)}(s_i, s_j, c_i, c_j) = \exp\left(\sum_{k=1}^{36} \lambda_k \cdot I_k(s_i, s_j, c_i, c_j)\right)$$
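A minimal sketch of evaluating both factor types, assuming pattern weights have already been learned; the weight containers and the 36-way configuration indexing are illustrative assumptions consistent with the counts above.

```python
import math

def unary_factor(matched_patterns, pattern_weights):
    """psi_i(s_i = +1, c): exponentiate the summed weights of all
    patterns matching at vertex i. `matched_patterns` is a set of
    pattern ids; `pattern_weights` maps id -> lambda (assumed inputs)."""
    return math.exp(sum(pattern_weights[p] for p in matched_patterns))

def coupling_factor(s_i, s_j, c_i, c_j, coupling_weights):
    """psi_(i,j): exponentiate the weight of the single indicator I_k
    that fires for this joint configuration. With s in {+1, -1} and
    c in {'B', 'W', '.'}, there are 2*2*3*3 = 36 configurations,
    matching the 36 weights in the sum above."""
    s_idx = {+1: 0, -1: 1}
    c_idx = {'B': 0, 'W': 1, '.': 2}
    k = ((s_idx[s_i] * 2 + s_idx[s_j]) * 3 + c_idx[c_i]) * 3 + c_idx[c_j]
    return math.exp(coupling_weights[k])
```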

Page 7: Independent Pattern-based Classifiers

Page 8: Inference and Training

• Up to 8 patterns may match at any vertex

• Which pattern to use?
  – Smallest pattern: $\psi_i(s_i = +1, \vec{c}) = P(s_i \mid \tilde{\pi}_{\min}(\vec{c}, i))$
  – Largest pattern: $\psi_i(s_i = +1, \vec{c}) = P(s_i \mid \tilde{\pi}_{\max}(\vec{c}, i))$

• Or, combine all patterns (each option is sketched below):
  – Logistic regression:
$$\psi_i(s_i = +1, \vec{c}) = \exp\left(\sum_{\tilde{\pi} \in \tilde{\Pi}} \lambda_{\tilde{\pi}} \cdot I_{\tilde{\pi}}(\vec{c}, i)\right)$$
  – Bayesian model averaging…
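The three strategies can be sketched as follows; `cond_prob` and `weights` are assumed lookup tables, and the sigmoid form for logistic regression assumes the factor for $s_i = -1$ is normalized to one.

```python
import math

def predict_smallest(matched, cond_prob):
    """Use only the smallest matching pattern: P(s_i | pattern_min).
    `matched` is a list of (template_size, pattern_id) pairs; `cond_prob`
    maps pattern_id -> P(s_i = +1 | pattern) (assumed inputs)."""
    _, smallest = min(matched)
    return cond_prob[smallest]

def predict_largest(matched, cond_prob):
    """Use only the largest matching pattern: P(s_i | pattern_max)."""
    _, largest = max(matched)
    return cond_prob[largest]

def predict_logistic(matched, weights):
    """Combine all matching patterns with logistic regression:
    P(s_i = +1 | c) = sigmoid(sum of lambda over matched patterns)."""
    z = sum(weights[pid] for _, pid in matched)
    return 1.0 / (1.0 + math.exp(-z))
```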

Page 9: Bayesian Model Averaging

• Bayesian approach to combining models (sketched below):
$$P(s_j \mid \vec{c}, D) = \sum_{\tau \in \Upsilon} P(s_j \mid \tau, \vec{c}, D)\, P(\tau \mid \vec{c}, D)$$

• Now examine the model "weight":
$$P(\tau \mid \vec{c}, D) = \frac{P(D \mid \tau, \vec{c})\, P(\tau \mid \vec{c})}{\sum_{\tau' \in \Upsilon} P(D \mid \tau', \vec{c})\, P(\tau' \mid \vec{c})}$$

• Model τ must apply to all data!
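A sketch of the averaging computation under assumed inputs: the per-model data likelihoods, priors, and predictions are all taken as given, caller-supplied functions.

```python
def bma_predict(models, data_likelihood, prior, predict):
    """Bayesian model averaging: weight each model's prediction
    P(s_j | tau, c, D) by its posterior P(tau | c, D), which is
    proportional to P(D | tau, c) * P(tau | c).

    `data_likelihood(m)`, `prior(m)`, and `predict(m)` are assumed
    caller-supplied functions over models m."""
    unnorm = [data_likelihood(m) * prior(m) for m in models]
    z = sum(unnorm)                       # normalizer over all models
    return sum((u / z) * predict(m) for u, m in zip(unnorm, models))
```

In practice $P(D \mid \tau, \vec{c})$ underflows for large $D$, so log-space weights would be used; that detail is omitted from the sketch.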

Page 10: Hierarchical Tree Models

• Arrange patterns into decision trees τ

• Model τ provides predictions on all data (see the sketch below)
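One way to realize such a tree model is to fall back through the nested templates until some pattern the tree knows matches; since the innermost template always matches, the model is defined on all data. This is an illustrative sketch, not the authors' exact tree-construction procedure.

```python
def tree_predict(matched, tree_depth, cond_prob, prior=0.5):
    """Predict with the largest pattern the tree tau conditions on,
    falling back to smaller templates (and finally a prior) so that
    tau yields a prediction for every vertex.

    `matched` maps template_size -> pattern_id at this vertex;
    `cond_prob` maps pattern_id -> P(s_i = +1 | pattern); `tree_depth`
    is the largest template tau uses (all assumed inputs)."""
    for size in range(tree_depth, 0, -1):
        pid = matched.get(size)
        if pid is not None and pid in cond_prob:
            return cond_prob[pid]
    return prior  # no pattern in the tree matched
```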

Page 11: Coupling CRF & Pattern CRF

Page 12: Inference and Training

• Inference
  – Intractable for 19×19 grids
  – Loopy BP is biased, but an option
  – Sampling is unbiased, but much slower

• Training
  – Max likelihood requires inference!
$$\frac{\partial l}{\partial \lambda_j} = \sum_{d \in D} \left( I_j(\vec{s}^{(d)}, \vec{c}^{(d)}) - \sum_{\vec{s}} I_j(\vec{s}, \vec{c}^{(d)})\, P(\vec{s} \mid \vec{c}^{(d)}) \right)$$
  – Other approximate methods…
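The gradient above is the familiar "empirical minus expected feature counts" form; here is a sketch with the intractable expectation handed off to an assumed approximator (e.g. loopy BP beliefs or Swendsen-Wang samples).

```python
def loglik_gradient_j(data, feature_j, expected_feature_j):
    """d l / d lambda_j = sum over data of (observed feature count
    minus the model's expected feature count given the board).

    `feature_j(s, c)` computes I_j(s, c) on labeled data;
    `expected_feature_j(c)` approximates E_{P(s|c)}[I_j(s, c)], e.g.
    via loopy BP or sampling (both assumed caller-supplied)."""
    return sum(feature_j(s_d, c_d) - expected_feature_j(c_d)
               for s_d, c_d in data)
```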

Page 13: Pseudolikelihood

• Standard log likelihood:
$$l(\vec{\lambda}) = \sum_{d \in D} \log P(\vec{s}^{(d)} \mid \vec{c}^{(d)})$$

• Pseudo log-likelihood (clamp the Markov blanket to its observed labels; sketched below):
$$pl(\vec{\lambda}) = \sum_{d \in D} \sum_{f \in F} \log P\big(\vec{s}^{(d)}_f \mid \vec{c}^{(d)}_f, MB_F(f)^{(d)}\big)$$

• Then inference during training is purely local
• Long-range effects are captured in the data
• Note: only valid for training, in the presence of fully labeled data
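A sketch of the pseudolikelihood objective; the factorization helpers are assumed inputs, and the key point is that each term conditions on labels observed on the Markov blanket, so no global inference is needed.

```python
import math

def pseudo_log_likelihood(data, factors, local_prob):
    """pl(lambda) = sum over examples d and factors f of
    log P(s_f^(d) | c_f^(d), MB(f)^(d)), with the Markov blanket
    clamped to the labels observed in the data.

    `factors(d)` yields (f, blanket_labels) pairs for example d;
    `local_prob(f, d, blanket_labels)` evaluates the local conditional
    (both assumed caller-supplied)."""
    pl = 0.0
    for d in data:
        for f, blanket_labels in factors(d):
            pl += math.log(local_prob(f, d, blanket_labels))
    return pl
```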

Page 14: Local Training

• Break the CRF into max-likelihood-trained pieces… (see the sketch below)
• Piecewise:
• Shared Unary Piecewise:
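A sketch of the piecewise idea under assumed inputs: each factor becomes its own locally normalized model, so the CRF's intractable global partition function never appears during training.

```python
import math

def piecewise_log_likelihood(factor_tables, observed_counts):
    """Sum of per-factor local log likelihoods. `factor_tables[f]` maps
    each joint assignment a of factor f to its unnormalized value
    psi_f(a); `observed_counts[f][a]` counts how often a was observed
    for f in training (both assumed inputs). The shared-unary variant
    would additionally tie unary parameters across pieces."""
    ll = 0.0
    for f, table in factor_tables.items():
        z_f = sum(table.values())          # local normalizer: cheap
        for a, count in observed_counts[f].items():
            ll += count * math.log(table[a] / z_f)
    return ll
```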

Page 15: Evaluation

Page 16: Models & Algorithms

• Model & algorithm specification:
  – Model / Training (/ Inference, if not obvious)

• Models & algorithms evaluated:
  – Indep / {Smallest, Largest} Pattern
  – Indep / BMA-Tree {Uniform, Exp}
  – Indep / Log Regr
  – CRF / ML Loopy BP (/ Swendsen-Wang)
  – Pattern CRF / Pseudolikelihood (Edge)
  – Pattern CRF / (S. U.) Piecewise
  – Monte Carlo

Page 17: Training Time

• Approximate time for various models and algorithms to reach convergence:

  Algorithm                          Training Time
  Indep / Largest Pattern            < 45 min
  Indep / BMA-Tree                   < 45 min
  Pattern CRF / Piecewise            ~ 2 hrs
  Indep / Log Regr                   ~ 5 hrs
  Pattern CRF / Pseudolikelihood     ~ 12 hrs
  CRF / ML Loopy BP                  > 2 days

Page 18: Inference Time

• Average time to evaluate $P(\vec{s} \mid \vec{c})$ for various models and algorithms on a 19×19 board:

  Algorithm                          Inference Time
  Indep / Sm. & Largest Pattern      1.7 ms
  Indep / BMA-Tree & Log Regr        6.0 ms
  CRF / Loopy BP                     101.0 ms
  Pattern CRF / Loopy BP             214.6 ms
  Monte Carlo                        2,967.5 ms
  CRF / Swend.-Wang Sampling         10,568.7 ms

Page 19: Performance Metrics

• Vertex Error (classification error):
$$\frac{1}{|G|} \sum_{i=1}^{|G|} I\big(\mathrm{sgn}(E_{P(\vec{s} \mid \vec{c}^{(d)})}[s_i]) \neq \mathrm{sgn}(s^{(d)}_i)\big)$$

• Net Error (score error):
$$\frac{1}{2|G|} \left| \sum_{i=1}^{|G|} E_{P(\vec{s} \mid \vec{c}^{(d)})}[s_i] - \sum_{i=1}^{|G|} s^{(d)}_i \right|$$

• Log Likelihood (model fit):
$$\log P(\vec{s}^{(d)} \mid \vec{c}^{(d)})$$
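The first two metrics reduce to simple arithmetic on the per-vertex expectations $E[s_i]$; a direct transcription (treating a zero expectation as a positive sign, an assumed tie-breaking convention):

```python
def vertex_error(expected_s, true_s):
    """Fraction of vertices where sgn(E[s_i]) disagrees with the true
    label s_i^(d) (classification error)."""
    wrong = sum(1 for e, t in zip(expected_s, true_s)
                if (e >= 0) != (t >= 0))
    return wrong / len(true_s)

def net_error(expected_s, true_s):
    """|predicted net territory - true net territory| / (2|G|): the
    error in the overall score, normalized to [0, 1] (score error)."""
    return abs(sum(expected_s) - sum(true_s)) / (2 * len(true_s))
```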

Page 20: Performance Tradeoffs I

[Figure: Net Error vs. Vertex Error tradeoff. x-axis: Vertex Error (0.18–0.34); y-axis: Net Error (0–0.12). Series: Indep / Smallest Pattern, Indep / Largest Pattern, Indep / BMA-Tree Uniform, Indep / BMA-Tree Exp, Indep / Log Regr, CRF / ML Loopy BP, CRF / ML Loopy BP / Swendsen-Wang, Pattern CRF / Pseudolikelihood Edge, Pattern CRF / Pseudolikelihood, Pattern CRF / Piecewise, Pattern CRF / S.U. Piecewise, Monte Carlo]

Page 21: Why is Vertex Error better for CRFs?

• Coupling factors help realize stable configurations
• Compare the previous unary-only independent model to the unary-and-coupling model:
  – Independent models make inconsistent predictions
  – Loopy BP smoothes these predictions (but too much?)

[Figure panels: BMA-Tree Model | Coupling Model with Loopy BP]

Page 22: Why is Net Error worse for CRFs?

• Use sampling to examine the bias of Loopy BP
  – Unbiased inference in the limit
  – Can be run over all test data, but still too costly for training
• Smoothing gets rid of local inconsistencies
• But errors reinforce each other!

[Figure panels: Loopy Belief Propagation | Swendsen-Wang Sampling]

Page 23: Bias of Local Training

• Problems with Piecewise training:
  – Very biased when used in conjunction with Loopy BP
  – Predictions still good, just saturated
  – Accounts for poor Log Likelihood & Net Error…

[Figure panels: ML Trained | Piecewise Trained]

Page 24: Performance Tradeoffs II

[Figure: Net Error vs. -Log Likelihood tradeoff. x-axis: -Log Likelihood (0.4–1.3); y-axis: Net Error (0–0.12). Series: Indep / Smallest Pattern, Indep / Largest Pattern, Indep / BMA-Tree Uniform, Indep / BMA-Tree Exp, Indep / Log Regr, CRF / ML Loopy BP, CRF / ML Loopy BP / Swendsen-Wang, Pattern CRF / Pseudolikelihood Edge, Pattern CRF / Pseudolikelihood, Pattern CRF / Piecewise, Pattern CRF / S.U. Piecewise, Monte Carlo]

Page 25: Conclusions

Two general messages:

(1) CRFs vs. Independent Models:
• CRFs should theoretically be better
• However, the time cost is high
• Time can be saved with approximate training / inference
• But then CRFs may perform worse than independent classifiers, depending on the metric

(2) For Independent Models:
• The problem of choosing an appropriate neighborhood can be finessed by Bayesian model averaging