Multi-Task Learning for NLP
2017/04/17 Parsing Group
Motoki Sato
What is Multi-task?
• Single task: one model per task.
  – Model 1: Input (sentence) → POS (task 1)
  – Model 2: Input (sentence) → Chunking (task 2)
• Multi-task: one model for both tasks.
  – Model: Input (sentence) → POS (task 1) + Chunking (task 2)
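As a rough illustration (this note's own sketch, assuming PyTorch; the class and parameter names such as MultiTaskTagger, pos_head, and chunk_head are made up, not from any of the papers below), a multi-task tagger lets both tasks read one shared encoder:

import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """One shared Bi-LSTM encoder feeding two task-specific classifiers."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, n_pos_tags, n_chunk_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared hidden layer (shared representation) used by both tasks.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Individual output layer for each task.
        self.pos_head = nn.Linear(2 * hidden_dim, n_pos_tags)      # task 1: POS
        self.chunk_head = nn.Linear(2 * hidden_dim, n_chunk_tags)  # task 2: chunking

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer word indices.
        h, _ = self.encoder(self.embed(word_ids))  # (batch, seq_len, 2 * hidden_dim)
        return self.pos_head(h), self.chunk_head(h)

In the single-task setting, two such models would be trained independently, one keeping only pos_head and the other only chunk_head.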
Multi-task Learning Paper (1)
• (Søgaard, 2016), ACL 2016 short paper.
• Tasks:
  – POS (low-level task)
  – Chunking (high-level task)
Multi-task Learning Paper (2)
• (Hashimoto, 2016), arXiv.
• Tasks (many tasks):
  – POS, Chunking, Dependency parsing
  – Semantic relatedness, Textual entailment
Dataset

Task                  | (Søgaard, 2016) | (Hashimoto, 2016)
POS                   | Penn Treebank   | Penn Treebank
Chunking              | Penn Treebank   | Penn Treebank
CCG                   | Penn Treebank   | -
Dependency parsing    | -               | Penn Treebank
Semantic relatedness  | -               | SICK
Textual entailment    | -               | SICK

Task levels in (Søgaard, 2016):
POS       | Low-level task
Chunking  | High-level task
CCG       | High-level task
Input Words and Predicted Tag Examples (figure omitted)
Multi-task for Vision?
• Cha Zhang, et al. "Improving Multiview Face Detection with Multi-Task Deep Convolutional Neural Networks"
• Share hidden layers (shared representation).
Multi-task for NLP?
• Collobert, et al. "Natural Language Processing (Almost) from Scratch"
• Share hidden layers; individual layer for each task.
(Søgaard, 2016) Outermost ver.
• Three stacked Bi-LSTM layers (1st, 2nd, 3rd) over the input words w0 w1 w2 w3.
• Both POS tags and chunk tags are predicted from the outermost (3rd) layer.
• As in previous multi-task learning, the hidden layers are shared between the tasks.
(Søgaard, 2016) lower-layer ver.
• Same stacked Bi-LSTM architecture over the input words w0 w1 w2 w3, with shared hidden layers.
• POS tags (the low-level task) are predicted from the 1st (innermost) layer; chunk tags (the high-level task) from the 3rd (outermost) layer, as sketched below.
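A minimal sketch of this lower-layer supervision (again a hypothetical PyTorch illustration, not the authors' code; layer and head names are this note's assumptions): the POS classifier reads the 1st Bi-LSTM layer while the chunking classifier reads the 3rd.

import torch.nn as nn

class CascadedTagger(nn.Module):
    """Lower-layer version: POS supervised at the innermost Bi-LSTM layer,
    chunking supervised at the outermost layer."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, n_pos_tags, n_chunk_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.pos_head = nn.Linear(2 * hidden_dim, n_pos_tags)      # reads the 1st layer
        self.chunk_head = nn.Linear(2 * hidden_dim, n_chunk_tags)  # reads the 3rd layer

    def forward(self, word_ids):
        h1, _ = self.lstm1(self.embed(word_ids))  # 1st (innermost) layer -> POS
        h2, _ = self.lstm2(h1)
        h3, _ = self.lstm3(h2)                    # 3rd (outermost) layer -> chunking
        return self.pos_head(h1), self.chunk_head(h3)

The outermost version differs only in that pos_head would also read h3.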
Experiments
• Single-task vs. multi-task results on the low-level task (POS) and the high-level task (chunking). (results table omitted)
• "It is consistently better to have POS supervision at the innermost rather than the outermost layer."
(Søgaard, 2016) Domain Adaptation
• What is domain adaptation?
• A model trained on a source domain (e.g. news) is adapted to a target domain (e.g. Twitter).
(Søgaard, 2016) Source Training
• Source domain: WSJ newswire.
• Train on the source domain with both tasks: POS tags predicted from the 1st layer, chunk tags from the 3rd layer (the lower-layer architecture above).
(Søgaard, 2016) Target Training
• Target domain: broadcast, weblogs.
• Re-train POS (the low-level task) on the target domain.
• No chunk training in the target domain; chunk tags are still predicted from the 3rd layer.
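A sketch of this two-stage schedule, reusing the hypothetical CascadedTagger above (an illustration of this write-up, not the authors' training code):

import torch.nn.functional as F

def train_source(model, optimizer, source_batches):
    # Source domain (e.g. WSJ newswire): supervise both POS (1st layer)
    # and chunking (3rd layer).
    for words, pos_gold, chunk_gold in source_batches:
        pos_logits, chunk_logits = model(words)
        loss = (F.cross_entropy(pos_logits.flatten(0, 1), pos_gold.flatten())
                + F.cross_entropy(chunk_logits.flatten(0, 1), chunk_gold.flatten()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def adapt_target(model, optimizer, target_batches):
    # Target domain (e.g. broadcast, weblogs): only POS labels are available,
    # so only the low-level task is re-trained; chunking gets no target supervision.
    for words, pos_gold in target_batches:
        pos_logits, _ = model(words)
        loss = F.cross_entropy(pos_logits.flatten(0, 1), pos_gold.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()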
Domain Adaptation Experiments
• High-level task supervision in the source domain, lower-level task supervision in the target domain.
(Hashimoto, 2016)
• Joint many-task model covering POS, chunking, dependency parsing, semantic relatedness, and textual entailment. (model figures omitted)
Training Loss for Multi-Task Learning
• In (Hashimoto, 2016), each task's loss adds an L2-norm regularization term that penalizes the distance from the embedding parameters obtained after training the final task in the top-most layer at the previous training epoch (successive regularization).
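Putting the annotations above together, the per-task objective has roughly the following form (a reconstruction in this note's own notation, not necessarily the paper's exact symbols):

J_{\text{task}}(\theta) \;=\; \sum_{(x,\,y)} -\log p(y \mid x;\,\theta) \;+\; \delta \,\lVert \theta_e - \theta_e' \rVert^2

where the second term is the L2-norm regularization term and \theta_e' denotes the embedding parameters after training the final task in the top-most layer at the previous training epoch.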
Dataset

Task                  | (Søgaard, 2016) | (Hashimoto, 2016)
POS                   | Penn Treebank   | Penn Treebank
Chunking              | Penn Treebank   | Penn Treebank
CCG                   | Penn Treebank   | -
Dependency parsing    | -               | Penn Treebank
Semantic relatedness  | -               | SICK
Textual entailment    | -               | SICK

Since (Søgaard, 2016) uses the same dataset (the same input) for all of its tasks, the sum of the task losses can be used for multi-task training.
Catastrophic Forgetting
• "Overcoming Catastrophic Forgetting in Neural Networks", James Kirkpatrick, Raia Hadsell, et al. https://arxiv.org/abs/1612.00796
• https://theneuralperspective.com/2017/04/01/overcoming-catastrophic-forgetting-in-neural-networks/