
Section 6: Approximation-aware Training

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!


Modern NLP
[Figure: diagram connecting Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization to NLP]

Machine Learning for NLP
Linguistics inspires the structures we want to predict
[Figure: example linguistic structures; note: "no semantic interpretation"]

Machine Learning for NLP
Mathematical Modeling: our model defines a score for each structure, and it also tells us what to optimize
[Figure: the model assigns each candidate structure a probability, e.g. pθ(·) = 0.50, 0.25, 0.10, 0.01]

Machine Learning for NLP
Machine Learning: learning tunes the parameters of the model
• Given training instances {(x1, y1), (x2, y2), …, (xn, yn)}
• Find the best model parameters, θ


Machine Learning for NLP
Combinatorial Optimization: inference finds the best structure for a new sentence
• Given a new sentence, xnew
• Search over the set of all possible structures (often exponential in size of xnew)
• Return the highest-scoring structure, y*
(Inference is usually called as a subroutine in learning)

Machine Learning for NLP
Combinatorial Optimization: inference finds the best structure for a new sentence
• Given a new sentence, xnew
• Search over the set of all possible structures (often exponential in size of xnew)
• Return the Minimum Bayes Risk (MBR) structure, y* (see the formula below)
(Inference is usually called as a subroutine in learning)
[Figure: inference problems range from easy, to polynomial time, to NP-hard]
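The MBR rule itself is not spelled out in the slides; in standard form, writing ℓ for the task's loss function, the MBR structure minimizes expected loss under the model:

    y^{*} = \operatorname{argmin}_{\hat{y}} \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x_{\text{new}})} \big[ \ell(\hat{y}, y) \big]

For a Hamming-style loss that counts incorrect variables, this reduces to setting each variable to its most probable value under its marginal.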

Modern NLP
• Linguistics inspires the structures we want to predict
• Our model defines a score for each structure; it also tells us what to optimize
• Learning tunes the parameters of the model
• Inference finds the best structure for a new sentence (inference is usually called as a subroutine in learning)
[Figure: Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization feeding into NLP]

An Abstraction for Modeling
Now we can work at this level of abstraction.

Training
Thus far, we've seen how to compute (approximate) marginals, given a factor graph…
…but where do the potential tables ψα come from?
– Some have a fixed structure (e.g. Exactly1, CKYTree)
– Others could be trained ahead of time (e.g. TrigramHMM)
– For the rest, we define them parametrically and learn the parameters!
Two ways to learn:
1. Standard CRF Training (very simple; often yields state-of-the-art results)
2. ERMA (less simple; but takes approximations and loss function into account)

Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions:
[Figure: a factor graph with observed variables and predicted variables]
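The defining equation appears only as an image here; the standard log-linear form that the surrounding text describes, with parameters θ and feature functions fα, is:

    \psi_\alpha(y_\alpha, x) = \exp\big( \theta \cdot f_\alpha(y_\alpha, x) \big),
    \qquad
    p_\theta(y \mid x) = \frac{1}{Z(x)} \prod_\alpha \psi_\alpha(y_\alpha, x)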

Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions:
[Figure: linear-chain factor graph for "time flies like an arrow" with tag variables n, v, p, d, n; unary factors ψ1, ψ3, ψ5, ψ7, ψ9; and adjacent-tag factors ψ2, ψ4, ψ6, ψ8]

Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions:
[Figure: the same factor graph for "time flies like an arrow", extended with constituent variables np, pp, vp, s and their factors ψ10, ψ11, ψ12, ψ13]


What is Training?

That’s easy:

Training = picking good model parameters!

But how do we know if the model parameters are any “good”?


Conditional Log-likelihood Training
1. Choose model (such that the derivative in step 3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute the derivative by hand using the chain rule
4. Replace exact inference by approximate inference
We can approximate the factor marginals by the (normalized) factor beliefs from BP! (See the formulas below.)
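The objective and derivative are shown as equations in the original slides; the standard CRF conditional log-likelihood and its gradient, writing f for the total feature vector (the sum of the factor feature vectors), are:

    \ell(\theta) = \sum_i \log p_\theta\big(y^{(i)} \mid x^{(i)}\big),
    \qquad
    \nabla_\theta \ell = \sum_i \Big( f\big(x^{(i)}, y^{(i)}\big) - \mathbb{E}_{y \sim p_\theta(\cdot \mid x^{(i)})}\big[ f\big(x^{(i)}, y\big) \big] \Big)

The expectation decomposes over factors, so it only requires factor marginals, which loopy BP approximates with its (normalized) factor beliefs.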


What’s wrong with the usual approach?

If you add too many factors, your predictions might get worse!

• The model might be richer, but we replace the true marginals with approximate marginals (e.g. beliefs computed by BP)

• Approximate inference can cause gradients for structured learning to go awry! (Kulesza & Pereira, 2008).

What's wrong with the usual approach?
Mistakes made by Standard CRF Training:
1. Using BP (approximate)
2. Not taking the loss function into account
3. Should be doing MBR decoding
It's a big pile of approximations… which has tunable parameters.
Treat it like a neural net, and run backprop!


Empirical Risk Minimization
1. Given training data: {(x1, y1), (x2, y2), …, (xn, yn)}
2. Choose each of these:
– Decision function (Examples: linear regression, logistic regression, neural network)
– Loss function (Examples: mean-squared error, cross entropy)
3. Define the goal (see the formula sketch below)
4. Train with SGD (take small steps opposite the gradient)
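The goal and the update rule are images in the original slides; in standard form, writing hθ for the decision function, ℓ for the loss, and η for the step size (symbols introduced here for illustration), they are:

    \theta^{*} = \operatorname{argmin}_\theta \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( h_\theta(x^{(i)}),\, y^{(i)} \big),
    \qquad
    \theta \leftarrow \theta - \eta \, \nabla_\theta\, \ell\big( h_\theta(x^{(i)}),\, y^{(i)} \big)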

Page 29: Section 6: Approximation-aware Training 1. Outline Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been

30

Conditional Log-likelihood Training

1. Choose model Such that derivative in #3 is easy

2. Choose objective: Assign high probability to the things we observe and low probability to everything else3. Compute derivative by hand using the chain rule

4. Replace true inference by approximate inference

Machine Learning

What went wrong?
How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!

Error Back-Propagation
[Figure sequence: backpropagating the loss on a belief such as P(y3 = noun | x) back through the BP messages, e.g. μ(y1→y2) = μ(y3→y1) · μ(y4→y1), all the way down to the parameters θ]
Slide from (Stoyanov & Eisner, 2012)

Error Back-Propagation
• Applying the chain rule of differentiation over and over.
• Forward pass: regular computation (inference + decoding) in the model (+ remember intermediate quantities).
• Backward pass: replay the forward pass in reverse, computing gradients.
(A minimal sketch of this pattern follows.)
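As a toy illustration of that pattern (my own example, not the tutorial's system), here is a three-step pipeline standing in for inference, decoding, and the loss; the forward pass caches intermediate quantities, and the backward pass replays the steps in reverse while multiplying local derivatives:

    import math

    def forward(theta):
        a = math.tanh(theta)       # stand-in for "inference"
        b = a * a                  # stand-in for "decoding"
        loss = (b - 0.5) ** 2      # stand-in for the loss
        return loss, (a, b)        # remember intermediate quantities

    def backward(cache):
        a, b = cache
        dloss_db = 2 * (b - 0.5)   # replay the last step first...
        db_da = 2 * a
        da_dtheta = 1 - a * a      # ...back to the parameter theta
        return dloss_db * db_da * da_dtheta

    loss, cache = forward(0.3)
    print(loss, backward(cache))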

Background: Backprop through time
Recurrent neural network (Robinson & Fallside, 1987; Werbos, 1988; Mozer, 1995)
BPTT:
1. Unroll the computation over time
2. Run backprop through the resulting feed-forward network (a toy sketch follows)
[Figure: a recurrent cell with parameters a computing bt from xt, unrolled over x1, …, x4 into a feed-forward chain with hidden states b1, b2, b3 and output y4]
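A minimal BPTT sketch on an invented scalar recurrence b_t = tanh(w·b_{t-1} + x_t) with loss (b_T − y)^2; it only illustrates the unroll-then-backprop pattern, not any particular model from the tutorial:

    import math

    def bptt(w, xs, y, b0=0.0):
        # Forward: unroll the recurrence over time, keeping every hidden state.
        bs = [b0]
        for x in xs:
            bs.append(math.tanh(w * bs[-1] + x))
        loss = (bs[-1] - y) ** 2
        # Backward: walk the unrolled feed-forward chain in reverse.
        dL_db = 2 * (bs[-1] - y)
        dL_dw = 0.0
        for t in range(len(xs), 0, -1):
            dtanh = 1 - bs[t] ** 2               # derivative of tanh at step t
            dL_dw += dL_db * dtanh * bs[t - 1]   # w's direct contribution at this step
            dL_db *= dtanh * w                   # continue back to b_{t-1}
        return loss, dL_dw

    print(bptt(w=0.5, xs=[1.0, -0.5, 0.25], y=0.2))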


ERMA
Empirical Risk Minimization under Approximations (ERMA):
• Apply backprop through time to loopy BP
• Unrolls the BP computation graph
• Includes inference, decoding, loss and all the approximations along the way (a toy sketch follows)
(Stoyanov, Ropson, & Eisner, 2011)
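To make "unroll the BP computation graph" concrete, here is a toy sketch (entirely my own construction, not the authors' ERMA code): loopy BP on a three-variable binary cycle is unrolled for a fixed number of iterations, a loss is computed on the resulting beliefs, and autodiff backpropagates through the whole approximate pipeline, so the gradient of this approximate objective is exact. The agreement potential, unary scores, and gold labels are all invented for illustration:

    import jax
    import jax.numpy as jnp

    EDGES = [(0, 1), (1, 2), (2, 0)]                 # a cycle, so BP is loopy/approximate
    DIRECTED = EDGES + [(j, i) for (i, j) in EDGES]
    T = 10                                           # number of unrolled BP iterations

    def beliefs(theta, unary):
        psi = jnp.eye(2) * (jnp.exp(theta) - 1.0) + 1.0   # shared pairwise "agreement" potential
        msgs = {e: jnp.ones(2) for e in DIRECTED}
        for _ in range(T):                           # unroll: one "layer" per BP iteration
            new = {}
            for (i, j) in DIRECTED:
                incoming = unary[i]
                for (k, i2) in DIRECTED:             # messages into i, except the one from j
                    if i2 == i and k != j:
                        incoming = incoming * msgs[(k, i)]
                m = psi.T @ incoming                 # sum over the values of variable i
                new[(i, j)] = m / m.sum()            # normalize for numerical stability
            msgs = new
        out = []
        for i in range(3):                           # variable beliefs
            b = unary[i]
            for (k, i2) in DIRECTED:
                if i2 == i:
                    b = b * msgs[(k, i)]
            out.append(b / b.sum())
        return jnp.stack(out)

    def training_loss(theta, unary, gold):
        return -jnp.sum(gold * jnp.log(beliefs(theta, unary)))   # loss on the approximate beliefs

    unary = jnp.array([[2.0, 1.0], [1.0, 2.0], [1.5, 1.0]])
    gold = jnp.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    print(jax.grad(training_loss)(0.5, unary, gold))  # exact gradient of the approximate computation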

ERMA
Key idea: open up the black box!
1. Choose the model to be the computation with all its approximations
2. Choose the objective to likewise include the approximations
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network)
4. Make no approximations! (Our gradient is exact.)
(Stoyanov, Ropson, & Eisner, 2011)

ERMA
Key idea: open up the black box!
[Figure: empirical risk minimization wrapped around the full pipeline, including the Minimum Bayes Risk (MBR) decoder]
(Stoyanov, Ropson, & Eisner, 2011)

Approximation-aware Learning
What if we're using structured BP instead of regular BP?
• No problem, the same approach still applies!
• The only difference is that we embed dynamic programming algorithms inside our computation graph (see the sketch below).
Key idea: open up the black box!
(Gormley, Dredze, & Eisner, 2015)
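As a minimal illustration of putting a dynamic program inside a differentiable computation graph (again my own toy example, not the tutorial's structured factors), the forward algorithm for a linear chain can be written as ordinary differentiable code; autodiff through the DP then yields the transition marginals:

    import jax
    import jax.numpy as jnp
    from jax.scipy.special import logsumexp

    def log_partition(scores):
        # Forward algorithm: scores[t, i, j] scores tag i at position t
        # followed by tag j at position t+1.
        alpha = jnp.zeros(scores.shape[1])            # log-forward values at position 0
        for t in range(scores.shape[0]):              # the DP is just more layers in the graph
            alpha = logsumexp(alpha[:, None] + scores[t], axis=0)
        return logsumexp(alpha)

    scores = jnp.zeros((4, 3, 3))                     # invented chain: 5 positions, 3 tags
    print(jax.grad(log_partition)(scores))            # = expected transition counts (marginals)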

Connection to Deep Learning
[Figure: a model that scores each output y in proportion to exp(Θy · f(x))]
(Gormley, Yu, & Dredze, in submission)

Empirical Risk Minimization under Approximations (ERMA)
Where training methods fall on the two axes, Loss Aware × Approximation Aware:
• Neither loss aware nor approximation aware: MLE
• Loss aware only: M3N [Taskar et al., 2003], Softmax-margin [Gimpel & Smith, 2010]
• Both loss aware and approximation aware: SVMstruct [Finley and Joachims, 2008], ERMA
Figure from (Stoyanov & Eisner, 2012)