BERT:
Pre-training of
Deep Bidirectional Transformers for
Language Understanding
Source : NAACL-HLT 2019
Speaker : Ya-Fang, Hsiao
Advisor : Jia-Ling, Koh
Date : 2019/09/02
CONTENTS
1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion
1 Introduction
Introduction
Bidirectional Encoder Representations from Transformers
Language Model
P(w_1, w_2, ..., w_T) = ∏_{t=1}^{T} P(w_t | w_1, w_2, ..., w_{t-1})
Pre-trained Language Model
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
2 Related Work
Related Work
Pre-trained Language Model
Feature-based: ELMo
Fine-tuning: OpenAI GPT
Limitations shared by both approaches:
1. Unidirectional language models
2. Same objective function during pre-training
Bidirectional Encoder Representations
from Transformers
Masked Language Model (MLM)
Next Sentence Prediction (NSP)
《Attention is all you need》Vaswani et al. (NIPS 2017)
Transformers
Sequence-to-sequence Encoder-Decoder architecture
RNN: hard to parallelize
《Attention is all you need》Vaswani et al. (NIPS 2017)
Transformers
Encoder and Decoder: a stack of 6 identical layers each
Self-attention layers can be computed in parallel
《Attention is all you need》Vaswani et al. (NIPS 2017)
Transformers
Self-Attention:
query (to match the others)
key (to be matched)
value (information to be extracted)
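To make the query/key/value roles concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch (the function and variable names are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single sequence.
    x: (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_k) projection matrices"""
    q = x @ w_q                                # queries: used to match the others
    k = x @ w_k                                # keys: to be matched against
    v = x @ w_v                                # values: the information to be extracted
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)  # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)        # attention distribution for each position
    return weights @ v                         # weighted sum of values

x = torch.randn(3, 8)                          # toy example: 3 tokens, d_model = 8
w_q, w_k, w_v = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([3, 4])
```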
《Attention is all you need》Vaswani et al. (NIPS2017)
Transformers
Multi-Head Attention
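Multi-head attention runs several such attention functions in parallel on lower-dimensional projections and concatenates their outputs. A minimal sketch using PyTorch's built-in nn.MultiheadAttention (the sizes match BERT_BASE from the next slide; the toy input is made up):

```python
import torch
import torch.nn as nn

# 12 heads, each of size 768 / 12 = 64 (the H and A values of BERT_BASE)
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12)

# nn.MultiheadAttention's default layout is (seq_len, batch, embed_dim)
x = torch.randn(128, 1, 768)              # one sequence of 128 token vectors
out, attn_weights = mha(x, x, x)          # self-attention: query = key = value = x
print(out.shape)                          # torch.Size([128, 1, 768])
```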
Transformers 《Attention is all you need》Vaswani et al. (NIPS 2017)
BERT_BASE (L=12, H=768, A=12, Parameters=110M)
BERT_LARGE (L=24, H=1024, A=16, Parameters=340M)
L = number of Transformer layers, H = hidden size, A = number of self-attention heads; feed-forward size = 4H
BERT: Bidirectional Encoder Representations from Transformers
3 Method
Framework
Pre-training: the model is trained on unlabeled data over different pre-training tasks.
Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
Input
Token Embedding: WordPiece embeddings with a 30,000-token vocabulary.
[CLS]: classification token
[SEP]: separator token
Segment Embedding: learned embeddings indicating whether a token belongs to sentence A or sentence B.
Position Embedding: learned positional embeddings.
Pre-training corpus: BooksCorpus, English Wikipedia
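A minimal sketch of how the three embeddings combine into the input representation (sizes follow BERT_BASE; the token ids are arbitrary illustrative values, not real WordPiece ids):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb    = nn.Embedding(vocab_size, hidden)  # WordPiece token embeddings
segment_emb  = nn.Embedding(2, hidden)           # sentence A (0) vs. sentence B (1)
position_emb = nn.Embedding(max_len, hidden)     # learned positional embeddings

# "[CLS] <sentence A tokens> [SEP] <sentence B tokens> [SEP]" (arbitrary ids)
input_ids    = torch.tensor([[101, 2023, 2003, 102, 2009, 2001, 102]])
segment_ids  = torch.tensor([[0,    0,    0,    0,   1,    1,    1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# the input representation is the element-wise sum of the three embeddings
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(x.shape)   # torch.Size([1, 7, 768])
```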
Pre-training
Two unsupervised tasks:
1. Masked Language Model (MLM)
2. Next Sentence Prediction (NSP)
Task 1. MLM (Masked Language Model)
Hung-Yi Lee - BERT ppt
Mask 15% of all WordPiece tokens in each sequence at random for prediction.
Replace each selected token with:
(1) the [MASK] token 80% of the time.
(2) a random token 10% of the time.
(3) the unchanged i-th token 10% of the time.
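A minimal sketch of this 80/10/10 rule applied to a list of token ids (special-token handling and other preprocessing details are omitted; the [MASK] id and vocabulary size are placeholders):

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30000   # illustrative [MASK] id and vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted input, prediction targets); -1 marks positions not predicted."""
    inputs, targets = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:          # only 15% of tokens are selected
            continue
        targets[i] = tok                          # the model must predict the original token
        r = random.random()
        if r < 0.8:                               # 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                             # 10%: replace with a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: keep the token unchanged
    return inputs, targets

print(mask_tokens([2023, 2003, 1037, 7953, 6251]))
```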
Task 2. NSP (Next Sentence Prediction)
Hung-Yi Lee - BERT ppt
Input = [CLS] the man went to [MASK] store [SEP]
he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP]
penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
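Sentence pairs for NSP can be generated with a simple 50/50 rule; a sketch assuming the corpus is a list of documents, each a list of sentences (this is my own simplification, not the original preprocessing code):

```python
import random

def make_nsp_pair(doc, all_docs):
    """50% of the time return a sentence and its true successor (IsNext),
    otherwise pair it with a random sentence from another document (NotNext)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"
    other = random.choice([d for d in all_docs if d is not doc])
    return sent_a, random.choice(other), "NotNext"

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_pair(docs[0], docs))
```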
Fine-Tuning
Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
Task 1 (b) Single Sentence Classification Tasks
Input: single sentence; output: class
The [CLS] output embedding is fed to a linear classifier (trained from scratch) while BERT is fine-tuned.
Example: sentiment analysis, document classification
Hung-Yi Lee - BERT ppt
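A sketch of this setup with the Hugging Face transformers library (the checkpoint name and label count are illustrative assumptions): the [CLS] output vector feeds a new linear classifier trained from scratch, while BERT's own weights are fine-tuned.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768, 2)       # trained from scratch: e.g. positive / negative

enc = tokenizer("this movie was great", return_tensors="pt")
cls_vec = bert(**enc).last_hidden_state[:, 0]   # output embedding of the [CLS] token
logits = classifier(cls_vec)                    # (1, 2) class scores

# during fine-tuning both the classifier and BERT's own parameters are updated
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))
loss.backward()
```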
Task 2 (d) Single Sentence Tagging Tasks
Input: single sentence; output: a class for each token
Each token's output embedding is fed to a linear classifier.
Example: slot filling
Hung-Yi Lee - BERT ppt
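The tagging case applies the same idea per token: every output embedding goes through a shared linear classifier. A minimal sketch (the label count and example sentence are made up):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
tagger = nn.Linear(768, 5)        # e.g. 5 slot labels, trained from scratch

enc = tokenizer("book a flight to taipei", return_tensors="pt")
hidden = bert(**enc).last_hidden_state   # (1, seq_len, 768): one vector per token
logits = tagger(hidden)                  # (1, seq_len, 5): label scores per token
print(logits.argmax(-1))                 # predicted slot label for every token
```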
Task 3 (a) Sentence Pair Classification Tasks
Input: two sentences ([CLS] sentence 1 [SEP] sentence 2); output: class
The [CLS] output embedding is fed to a linear classifier.
Example: natural language inference
Hung-Yi Lee - BERT ppt
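Sentence pairs reuse the same [CLS] classifier; the only change is that the tokenizer packs both sentences into one sequence with [SEP] and segment ids. A sketch (the three NLI labels are an assumption for illustration):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(768, 3)    # e.g. NLI: entailment / neutral / contradiction

# passing two texts produces "[CLS] sentence 1 [SEP] sentence 2 [SEP]" plus token_type_ids
enc = tokenizer("a man is playing a guitar", "a person is making music",
                return_tensors="pt")
logits = classifier(bert(**enc).last_hidden_state[:, 0])
print(logits.softmax(-1))         # probabilities over the three classes
```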
Task 4 (c) Question Answering Tasks
Document: D = {d1, d2, ..., dN}
Query: Q = {q1, q2, ..., qM}
QA model output: two integers (s, e)
Answer: A = {ds, ..., de}
Example: s = 17, e = 17 or s = 77, e = 79
Hung-Yi Lee - BERT ppt
Task 4 (c) Question Answering Tasks
Input: [CLS] q1 q2 [SEP] d1 d2 d3 (question, then document)
A start vector, learned from scratch, is dot-producted with each document token's output embedding; a softmax over the scores (e.g. 0.3, 0.5, 0.2) peaks on d2, so s = 2.
Hung-Yi Lee - BERT ppt
Task 4 (c) Question Answering Tasks
An end vector, also learned from scratch, is dot-producted with each document token's output embedding in the same way; its softmax scores (e.g. 0.1, 0.2, 0.7) peak on d3, so e = 3.
The predicted answer span is therefore "d2 d3" (s = 2, e = 3).
Hung-Yi Lee - BERT ppt
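A minimal sketch of the span prediction above: the start and end vectors, both learned from scratch, are dot-producted with every document token's output embedding, and a softmax over each score vector picks s and e. The BERT outputs are stubbed with random tensors so the example stays self-contained:

```python
import torch
import torch.nn.functional as F

hidden = 768
doc_out = torch.randn(3, hidden)    # stand-in for BERT's output embeddings of d1 d2 d3

# start and end vectors, learned from scratch during fine-tuning
start_vec = torch.randn(hidden, requires_grad=True)
end_vec   = torch.randn(hidden, requires_grad=True)

start_scores = F.softmax(doc_out @ start_vec, dim=0)   # e.g. [0.3, 0.5, 0.2] -> s = 2
end_scores   = F.softmax(doc_out @ end_vec,   dim=0)   # e.g. [0.1, 0.2, 0.7] -> e = 3
s = int(start_scores.argmax()) + 1   # +1: the slide indexes tokens from 1
e = int(end_scores.argmax()) + 1
print(s, e)                          # the predicted answer span is d_s ... d_e
```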
4 Experiment
Experiments: fine-tuning results on 11 NLP tasks
Implementation: LeeMeng - 進擊的BERT (PyTorch)
5 Conclusion
References
Development of language models: http://bit.ly/nGram2NNLM
Language model pre-training methods: http://bit.ly/ELMo_OpenAIGPT_BERT
Attention Is All You Need: http://bit.ly/AttIsAllUNeed
BERT paper: http://bit.ly/BERTpaper
Hung-Yi Lee - Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
The Illustrated Transformer: http://bit.ly/illustratedTransformer
Detailed explanation of the Transformer: http://bit.ly/explainTransformer
github/codertimo - BERT (PyTorch): http://bit.ly/BERT_pytorch
Fake-news pair classification implementation: http://bit.ly/implementpaircls
pytorch.org BERT: http://bit.ly/pytorchorgBERT