Multi-Task Learning for NLP 2017/04/17 Parsing Group Motoki Sato


Page 1: Multi-Task Learning for NLP

Multi-Task Learning for NLP

2017/04/17 Parsing Group

Motoki Sato

Page 2: Multi-Task Learning for NLP

What is Multi-task Learning?

• Single task
  – Model 1: Input (sentence) → POS (task 1)
  – Model 2: Input (sentence) → Chunking (task 2)

• Multi task
  – Model: Input (sentence) → POS (task 1) + Chunking (task 2)
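As a rough illustration of the multi-task setup above (not from the slides), here is a minimal PyTorch-style sketch with one shared encoder and one output head per task; the class name, tag-set sizes, and dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """One shared sentence encoder with one output head per task."""

    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=100,
                 n_pos_tags=45, n_chunk_tags=23):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared representation used by both tasks.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Task-specific output layers.
        self.pos_head = nn.Linear(2 * hidden_dim, n_pos_tags)      # task 1
        self.chunk_head = nn.Linear(2 * hidden_dim, n_chunk_tags)  # task 2

    def forward(self, word_ids):
        h, _ = self.encoder(self.embed(word_ids))  # (batch, seq, 2 * hidden)
        return self.pos_head(h), self.chunk_head(h)

# One input sentence (as word ids) -> two predictions from a single model.
model = MultiTaskTagger()
pos_logits, chunk_logits = model(torch.randint(0, 10000, (1, 6)))
```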

Page 3: Multi-Task Learning for NLP

Multi-task learning Paper (1)

• (Søgaard, 2016), ACL 2016 short paper: "Deep multi-task learning with low level tasks supervised at lower layers".
• Tasks:
  – POS (low-level task)
  – Chunking (high-level task)

Page 4: Multi-Task Learning for NLP

Multi-task learning Paper (2)

• (Hashimoto, 2016), arXiv preprint: "A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks".
• Tasks (many tasks):
  – POS, Chunking, Dependency parsing,
  – Semantic relatedness, Textual entailment

Page 5: Multi-Task Learning for NLP

Dataset

Task                   (Søgaard, 2016)   (Hashimoto, 2016)
POS                    Penn Treebank     Penn Treebank
Chunking               Penn Treebank     Penn Treebank
CCG                    Penn Treebank     -
Dependency parsing     -                 Penn Treebank
Semantic relatedness   -                 SICK
Textual entailment     -                 SICK

Page 6: Multi-Task Learning for NLP

(Søgaard, 2016)

Task       Level
POS        Low-level task
Chunking   High-level task
CCG        High-level task

Each task takes the words of a sentence as input and predicts one tag per word.

Page 7: Multi-Task Learning for NLP

Multi-task for Vision?

•  Cha Zhang, et al., "Improving Multiview Face Detection with Multi-Task Deep Convolutional Neural Networks"

Share hidden layers (shared representation).

Page 8: Multi-Task Learning for NLP

Multi-task for NLP?

•  Collobert, et al., "Natural Language Processing (Almost) from Scratch"

Share hidden layers; individual layer for each task.

Page 9: Multi-Task Learning for NLP

(Søgaard, 2016) Outermost ver.

A stack of three Bi-LSTM layers over the input words w0, w1, w2, w3. In the outermost version, both the POS tags and the chunk tags are predicted from the 3rd (outermost) layer, as in previous multi-task learning, which shared the hidden layers across tasks.

Page 10: Multi-Task Learning for NLP

(Søgaard, 2016) lower-layer ver.

The same three-layer Bi-LSTM stack, but in the lower-layer version the POS tags (low-level task) are predicted from the innermost layer, while the chunk tags (high-level task) are still predicted from the 3rd (outermost) layer.
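A minimal sketch of the lower-layer idea, assuming a PyTorch implementation: three stacked Bi-LSTMs, with the POS head reading the 1st layer's states and the chunking head reading the 3rd layer's states. The names and sizes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class LowerLayerMTL(nn.Module):
    """Three stacked Bi-LSTMs; POS supervised at layer 1, chunking at layer 3."""

    def __init__(self, vocab_size=10000, emb_dim=100, hidden=100,
                 n_pos=45, n_chunk=23):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)      # low-level task, inner layer
        self.chunk_head = nn.Linear(2 * hidden, n_chunk)  # high-level task, outermost layer

    def forward(self, word_ids):
        h1, _ = self.lstm1(self.embed(word_ids))   # 1st (innermost) layer
        h2, _ = self.lstm2(h1)                     # 2nd layer
        h3, _ = self.lstm3(h2)                     # 3rd (outermost) layer
        return self.pos_head(h1), self.chunk_head(h3)

# POS is read off the lowest layer, chunking off the top layer.
model = LowerLayerMTL()
pos_logits, chunk_logits = model(torch.randint(0, 10000, (1, 6)))
```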

Page 11: Multi-Task Learning for NLP

Experiments

Results compare single-task and multi-task training on the low-level task (POS) and the high-level task (chunking).

It is consistently better to have POS supervision at the innermost rather than the outermost layer.

Page 12: Multi-Task Learning for NLP

(Søgaard, 2016) Domain Adaptation

• What is domain adaptation?

A model trained on a source domain (e.g., the news domain) is adapted to a target domain (e.g., the Twitter domain).

Page 13: Multi-Task Learning for NLP

(Søgaard, 2016) Source Training

Source-domain training (WSJ newswire): the three-layer Bi-LSTM is trained with POS supervision at the lower layer and chunking supervision at the 3rd (outermost) layer.

Page 14: Multi-Task Learning for NLP

(Søgaard, 2016) Target Training

Target-domain training (broadcast and weblogs domains): only the POS tagger is re-trained on target-domain data; there is no chunking training in the target domain.
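A hedged sketch of this target-domain step, reusing the hypothetical LowerLayerMTL model from the earlier sketch; adapt_pos_to_target and its data iterator are assumed names, not the paper's code. Only the POS loss is back-propagated on target-domain data.

```python
import torch
import torch.nn.functional as F

def adapt_pos_to_target(model, target_pos_batches, lr=1e-3):
    """Re-train POS on target-domain batches; no chunking loss is computed there."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for word_ids, pos_gold in target_pos_batches:   # (batch, seq) word ids and POS labels
        pos_logits, _ = model(word_ids)              # chunk head output is ignored
        loss = F.cross_entropy(pos_logits.flatten(0, 1), pos_gold.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

For example, a model first trained on WSJ with both losses would be passed in as `model`, together with an iterator over broadcast/weblogs sentences that carry only POS labels.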

Page 15: Multi-Task Learning for NLP

Domain Adaptation Experiments


High-level task supervision in the source domain, lower-level task supervision in the target domain.

Page 16: Multi-Task Learning for NLP

(Hashimoto, 2016)


Page 17: Multi-Task Learning for NLP

(Hashimoto, 2016)


Page 18: Multi-Task Learning for NLP

(Hashimoto, 2016)


Page 19: Multi-Task Learning for NLP

Training Loss for Multi-Task Learning

• In (Hashimoto, 2016), the training loss for each task includes an L2-norm (successive) regularization term of the form δ‖θ_e − θ_e′‖², where θ_e′ is the embedding parameter after training the final task in the top-most layer at the previous training epoch.
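A small sketch of such a successive L2 penalty (assumed names; δ is a hyperparameter and prev_embedding_weight is a snapshot of the embedding matrix saved at the end of the previous epoch):

```python
import torch

def successive_l2_penalty(embedding: torch.nn.Embedding,
                          prev_embedding_weight: torch.Tensor,
                          delta: float = 1e-2) -> torch.Tensor:
    """delta * ||theta_e - theta_e'||^2, where theta_e' is a detached copy of the
    embedding parameters kept from the end of the previous training epoch."""
    return delta * (embedding.weight - prev_embedding_weight).pow(2).sum()

# After finishing the final (top-most) task of an epoch, refresh the snapshot:
#   prev_weight = model.embed.weight.detach().clone()
# and add the penalty to each task loss in the next epoch:
#   loss = task_loss + successive_l2_penalty(model.embed, prev_weight)
```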

Page 20: Multi-Task Learning for NLP

Dataset

Task                   (Søgaard, 2016)   (Hashimoto, 2016)
POS                    Penn Treebank     Penn Treebank
Chunking               Penn Treebank     Penn Treebank
CCG                    Penn Treebank     -
Dependency parsing     -                 Penn Treebank
Semantic relatedness   -                 SICK
Textual entailment     -                 SICK

Since (Søgaard, 2016) uses the same dataset (same input) for all of its tasks, it can train on the sum of the per-task losses.
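For example, with gold POS and chunk tags available for the same sentence, the two cross-entropy losses can be summed in a single forward pass; a hypothetical helper (not from either paper):

```python
import torch.nn.functional as F

def joint_loss(model, word_ids, pos_gold, chunk_gold):
    """Sum of the two cross-entropy losses for one batch of shared-input sentences."""
    pos_logits, chunk_logits = model(word_ids)   # e.g. the MultiTaskTagger sketched earlier
    return (F.cross_entropy(pos_logits.flatten(0, 1), pos_gold.flatten())
            + F.cross_entropy(chunk_logits.flatten(0, 1), chunk_gold.flatten()))
```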

Page 21: Multi-Task Learning for NLP

Catastrophic Forgetting

•  "Overcoming Catastrophic Forgetting in Neural Networks", James Kirkpatrick, Raia Hadsell, et al. https://arxiv.org/abs/1612.00796

•  https://theneuralperspective.com/2017/04/01/overcoming-catastrophic-forgetting-in-neural-networks/