Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
Introduction to Machine Learning


Page 1: Language Models and Transfer Learning (title slide)

Page 2: What is a Language Model

o A statistical language model is a probability distribution over sequences of words.
o Given such a sequence, say of length m, it assigns a probability P(w_1, …, w_m) to the whole sequence.
o Main problem: data sparsity.

[Slide from https://en.wikipedia.org/wiki/Language_model.]

Page 3: Unigram model: Bag of words

o General probability distribution (chain rule): P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, …, w_{i-1})
o Unigram model assumption: P(w_1, …, w_m) ≈ ∏_{i=1}^{m} P(w_i)
o Essentially, a bag-of-words model.
o Estimation of unigram parameters: count word frequencies in the document.

[Slide from https://en.wikipedia.org/wiki/Language_model.]
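
The counting step above can be sketched in a few lines of plain Python. This is a toy maximum-likelihood estimate (the corpus and function names are illustrative, not from the slides); note how any unseen word drives the whole sequence probability to zero, which is exactly the data-sparsity problem:

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram estimate: P(w) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sequence_prob(model, tokens):
    """Probability of a sequence under the unigram (bag-of-words) assumption."""
    p = 1.0
    for w in tokens:
        p *= model.get(w, 0.0)  # unseen words get zero probability (data sparsity)
    return p

corpus = "the cat sat on the mat".split()
model = train_unigram(corpus)
```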

Page 4: n-gram model

o n-gram assumption: P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-(n-1)}, …, w_{i-1})
o Estimation of n-gram parameters (maximum likelihood): P(w_i | w_{i-(n-1)}, …, w_{i-1}) = count(w_{i-(n-1)}, …, w_i) / count(w_{i-(n-1)}, …, w_{i-1})

[Slide from https://en.wikipedia.org/wiki/Language_model.]
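
For n = 2 the estimation formula above reduces to counting word pairs. A minimal sketch (toy corpus, illustrative names):

```python
from collections import Counter

def train_bigram(tokens):
    """MLE bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])  # count(w1) over positions that have a successor
    return {(w1, w2): c / history_counts[w1] for (w1, w2), c in pair_counts.items()}

corpus = "the cat sat on the mat".split()
bigram = train_bigram(corpus)
# "the" is followed once by "cat" and once by "mat", so each gets probability 1/2
```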

Page 5: Word2Vec

o Word2Vec: learns distributed representations of words.
o Continuous bag-of-words (CBOW): predicts the current word from a window of surrounding context words.
o Continuous skip-gram: uses the current word to predict the surrounding window of context words; slower, but does a better job for infrequent words.

[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
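
The two objectives differ only in which direction the prediction runs. A sketch of how skip-gram builds its (center, context) training pairs from a sliding window (toy sentence, illustrative names; CBOW would instead group each center word's contexts into one contexts-to-center example):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        # every position within `window` of the center, excluding the center itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
```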

Page 6: Skip-gram Word2Vec

o Vocabulary of all words: w_1, …, w_V.
o Parameters of the skip-gram word2vec model: a word (center) embedding v_w and a context embedding u_w for each word w.
o Assumption (softmax over the vocabulary): P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c)

[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
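
The softmax assumption above can be computed directly for a toy vocabulary. A minimal sketch (the 2-d embedding values are hypothetical, for illustration only):

```python
import math

def skipgram_prob(center, context, v, u):
    """Skip-gram softmax: P(context | center) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    v holds the word (center) embeddings, u the context embeddings."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(u[w], v[center])) for w in u}
    return scores[context] / sum(scores.values())

# Toy 2-d embeddings (hypothetical numbers, not trained values)
v = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
u = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
p = skipgram_prob("cat", "dog", v, u)
```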

Page 7: Distributed Representations of Words

o The trained word-embedding parameters of the skip-gram word2vec model serve as the distributed representations.
o Semantic relationships between words are reflected as geometric relationships in the embedding space.

[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]

Page 8: Word Embeddings in Transfer Learning

o Transfer learning setting: labeled data are limited, while unlabeled text corpora are enormous.
o Word embeddings pretrained on the unlabeled corpus can be transferred to other supervised tasks, e.g., POS tagging, NER, question answering, machine translation, and sentiment classification.

[Slide from Matt Gormley.]
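
One simple form of this transfer is to use the pretrained embeddings as fixed features, e.g. averaging them into a sentence vector that a downstream supervised classifier consumes. A minimal sketch (the `pretrained` dict stands in for vectors loaded from a trained word2vec model; names and numbers are illustrative):

```python
def sentence_features(tokens, pretrained):
    """Average pretrained word embeddings into one fixed-size feature vector
    for a downstream supervised model (POS, NER, sentiment, ...)."""
    dim = len(next(iter(pretrained.values())))
    vecs = [pretrained[t] for t in tokens if t in pretrained]  # skip OOV words
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

pretrained = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # toy stand-in vectors
feats = sentence_features("a good movie".split(), pretrained)
```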

Page 9: SOTA Language Models: ELMo

o ELMo: Embeddings from Language Models.
o Fits the full conditional probability in the forward direction: p(t_1, …, t_N) = ∏_{k=1}^{N} p(t_k | t_1, …, t_{k-1})
o Fits the conditional probabilities in both directions using LSTMs; the backward model factorizes as p(t_1, …, t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, …, t_N)

[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]

Page 10: SOTA Language Models: OpenAI GPT & BERT

o Use the Transformer rather than the LSTM to model language.
o OpenAI GPT: single direction (left-to-right).
o BERT: bi-directional.

[Slide from https://arxiv.org/abs/1810.04805.]
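
One way to picture the single-direction vs. bi-direction distinction is through the attention mask each model applies: GPT's causal mask lets a position see only earlier positions, while BERT's mask lets every position see every other. A simplified sketch (real implementations add the mask to the attention score matrix before the softmax):

```python
def attention_mask(n, causal):
    """n x n attention mask: mask[i][j] == 1 means position i may attend to position j.
    causal=True  -> GPT-style left-to-right mask (only j <= i allowed)
    causal=False -> BERT-style bidirectional mask (attend everywhere)"""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

gpt_mask = attention_mask(3, causal=True)    # lower-triangular
bert_mask = attention_mask(3, causal=False)  # all ones
```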

Page 11: SOTA Language Models: BERT

o Additional language-modeling task (next-sentence prediction): predict whether the second of two sentences actually follows the first in the original text.

[Slide from https://arxiv.org/abs/1810.04805.]

Page 12: SOTA Language Models: BERT

o Instead of only extracting embeddings and hidden-layer outputs, the pretrained model can be fine-tuned end-to-end on specific supervised learning tasks.

[Slide from https://arxiv.org/abs/1810.04805.]

Page 13: The Transformer and Attention Mechanism

o An encoder-decoder structure.
o Our focus: the encoder and the attention mechanism.

[Slide from https://jalammar.github.io/illustrated-transformer/.]

Page 14: The Transformer and Attention Mechanism

o Self-attention: ignores the positions of words and assigns weights globally.
o Can be parallelized, in contrast to the LSTM.
o E.g., the attention weights related to the word "it_" (illustrated in the original slide figure).

[Slide from https://jalammar.github.io/illustrated-transformer/.]
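
The global weighting described above is scaled dot-product attention, softmax(Q·Kᵀ/√d_k)·V. A self-contained sketch on plain Python lists (toy numbers; real implementations use batched matrix libraries):

```python
import math

def self_attention(Q, K, V):
    """Scaled dot-product attention on plain Python lists.
    Each row of Q/K/V is the query/key/value vector of one position."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # score this position against every position (global weights, no locality)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # output = attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 2 positions, 2-d vectors (illustrative numbers only)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = self_attention(Q, K, V)
```

Because each query is independent of the others, the loop over `Q` can run in parallel, which is the parallelization advantage over the sequential LSTM noted on the slide.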

Page 15: Self-attention Mechanism

[Slide from https://jalammar.github.io/illustrated-transformer/.]

Page 16: Self-attention Mechanism

o More: https://jalammar.github.io/illustrated-transformer/

[Slide from https://jalammar.github.io/illustrated-transformer/.]

Page 17: Take home message

o Language models suffer from data sparsity.
o Word2vec portrays language probability using distributed word-embedding parameters.
o ELMo, OpenAI GPT, and BERT model language using deep neural networks.
o Pre-trained language models or their parameters can be transferred to supervised learning problems in NLP.
o Self-attention has the advantage over the LSTM that it can be parallelized and considers interactions across the whole sentence.

Page 18: References

o Wikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
o TensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
o Matt Gormley. 10-601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Matthew E. Peters et al. Deep contextualized word representations: https://arxiv.org/pdf/1802.05365.pdf
o Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
o Jay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/