
GPT-3: Few-Shot Learning with a Giant Language Model

Melanie Subbiah
OpenAI / Columbia University

What is the goal?

Humans learn new tasks through demonstrations and instructions.

We’d like general-purpose agents that can do the same.

Typical Approach

Disadvantages of Fine-tuning

• Creates a task-specific model
• Requires large, high-quality supervised datasets
• More likely to exploit spurious correlations

Few-shot transfer is more similar to how humans learn a new task.

Yogatama, et al. Learning and Evaluating General Linguistic Intelligence. 2019

What is an alternative?

Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Can we further improve on this level of generation and generalization?

GPT-3: 175 billion parameters

Critical Aspects of GPT-3

• Model Size

• Training Objective

Model Size

Transformers scale well!

Kaplan, et al. Scaling Laws for Neural Language Models. 2020

Motivating the Training Objective

Predict the next word in a sequence.

P(“The cat sat on the mat.”) = ???

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
- Noam Chomsky, 1969

Grammar: P(“The cat sat on the mat.”) > P(“The cat sats on the mat.”)
World Knowledge: P(“The cat sat on the mat.”) > P(“The whale sat on the mat.”)
Addition: P(“4” | “2 + 2 =”) > P(“5” | “2 + 2 =”)
Sentiment Analysis: P(“1 star” | “That movie was terrible. I’d give it”) > P(“5 stars” | “That movie was terrible. I’d give it”)

Alec Radford @ Berkeley 4/15/20
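The entire training objective is this single next-token prediction. Below is a minimal sketch of scoring sentence probabilities with a causal language model; it uses the publicly released GPT-2 weights from the Hugging Face transformers library as a stand-in, since GPT-3 itself is only available through an API.

```python
# Minimal sketch: P(sentence) under a causal LM is the product of next-token
# probabilities, so we compare summed log-probabilities of two sentences.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text: str) -> float:
    """Sum of log P(token_i | tokens_<i) over the whole sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# The grammatical sentence should get the higher (less negative) log-probability:
print(log_prob("The cat sat on the mat."))
print(log_prob("The cat sats on the mat."))
```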

Approach

Model

http://jalammar.github.io/illustrated-gpt2/

The trophy didn’t fit in the suitcase because … the trophy was too big.

Model Sizes

[Chart: parameter count (in millions) for the eight GPT-3 variants: Small, Medium, Large, XL, 2.7B, 6.7B, 13B, 175B]

[Chart: the same parameter counts alongside BERT-Large, T5-Large, Megatron, and T-NLG]

Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018
Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019
Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019
Microsoft. Turing-NLG: A 17-Billion Parameter Language Model by Microsoft. 2020

[Chart: parameter count on a log scale across the eight model sizes, with the corresponding batch size and learning rate]

Compute


Dataset

• Common Crawl (filtered) - a general web crawl, filtered by similarity to high-quality reference corpora and de-duplicated
• WebText2 - an expanded version of the GPT-2 training data: a scrape of outbound links from Reddit posts with reasonably high ratings
• Books1 & Books2 - internet-based books corpora
• Wikipedia - English-language Wikipedia

https://commoncrawl.org/the-data
Radford, et al. Language Models are Unsupervised Multitask Learners. 2019

Dataset Mix

[Chart: tokens (in billions) contributed by each dataset: Common Crawl, WebText2, Books1, Books2, Wikipedia]

[Chart: weight of each dataset in the training mix (percent)]

[Chart: epochs elapsed over each dataset when training for 300B tokens]
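These charts reflect a deliberate choice: datasets are sampled according to fixed mixture weights rather than in proportion to their raw size, so smaller high-quality sources are repeated for more epochs than Common Crawl. A minimal sketch of that sampling idea is below; the weights shown are illustrative placeholders, not the exact values from the paper.

```python
import random
from collections import Counter

# Illustrative mixture weights (placeholders, not the paper's exact values).
mixture_weights = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.02,
}

def sample_source(weights=mixture_weights) -> str:
    """Pick which corpus the next training document is drawn from."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

# Counts over many draws follow the mixture weights, regardless of corpus size.
print(Counter(sample_source() for _ in range(10_000)))
```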

Evaluations


Let’s try it!

Zero-Shot:
Please unscramble the letters into a word and write that word.
tdaeef = ?

One-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
tdaeef = ?

Few-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
elapac = palace
tdaeef = ?
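A minimal sketch of how these zero-, one-, and few-shot prompts are assembled: the task specification is just text, and the demonstrations are concatenated examples. The resulting string is fed to the model as context, with no gradient updates involved. The helper below is illustrative, not code from the paper.

```python
# Build zero-/one-/few-shot prompts for the word-unscrambling task.
INSTRUCTION = "Please unscramble the letters into a word and write that word."

def build_prompt(query: str, demonstrations: list[tuple[str, str]]) -> str:
    lines = [INSTRUCTION]
    for scrambled, answer in demonstrations:   # K demonstrations (K = 0, 1, 2, ...)
        lines.append(f"{scrambled} = {answer}")
    lines.append(f"{query} = ")                # the model completes from here
    return "\n".join(lines)

zero_shot = build_prompt("tdaeef", [])
one_shot  = build_prompt("tdaeef", [("pcirlaroc", "reciprocal")])
few_shot  = build_prompt("tdaeef", [("pcirlaroc", "reciprocal"), ("elapac", "palace")])
print(few_shot)
```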

VS.

Metalearning

Sample Output

Model input: “The trophy didn’t fit in the suitcase because”
Model output: “the trophy was too big.”

LM-Likelihood

Model input: “The trophy didn’t fit in the suitcase because the trophy was too big.”
Model output: per-token likelihood values (2.78 3.45 10.00 0.50 25.12)

Methods of Evaluation

Randomly select K examples from the task’s training dataset to build the context.

Multiple-choice:
• Feed each context + possible completion through the model separately
• Normalize the LM likelihood over the completion and select the completion with the highest likelihood

Free-form:
• Sample from the model up to a newline
• Score with BLEU, F1, or exact-match accuracy
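A minimal sketch of the multiple-choice procedure, again using the public GPT-2 weights as a stand-in for GPT-3: score each candidate completion by its length-normalized log-likelihood given the few-shot context and pick the highest-scoring option. The exact normalization GPT-3 uses varies by benchmark, so treat this as illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def completion_score(context: str, completion: str) -> float:
    """Average log P(completion tokens | context) under the model
    (ignoring tokenizer merge effects at the boundary for simplicity)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    completion_lp = token_lp[:, ctx_ids.shape[1] - 1:]   # only the completion's tokens
    return (completion_lp.sum() / completion_lp.numel()).item()

context = "That movie was terrible. I'd give it"
choices = [" 1 star.", " 5 stars."]
print(max(choices, key=lambda c: completion_score(context, c)))
```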

Complete List of Tasks

Reading Comprehension: QuAC, SQuADv2, DROP, CoQA, RACE
Language Modeling: PTB
Cloze and Completion: ROC Stories, HellaSwag, LAMBADA
Trivia-style Questions: NaturalQs, WebQs, TriviaQA
Translation: En <-> Fr, En <-> De, En <-> Ro
Winograd-style: Winograd, Winogrande
Commonsense Reasoning: PiQA, ARC, OpenBookQA
Comprehensive Benchmarks: SuperGLUE
Inference: ANLI, RTE
Synthetic and Qualitative: Arithmetic, Word scrambling, Character-level manipulation, SAT analogies, Article generation, Learning and using novel words, Correcting English grammar

Summary of Performance

Task Class                                  Few-Shot Performance
Cloze, Completion, and Language Modeling    Very Good
Question Answering / Knowledge Base         Very Good
Translation                                 Good
Winograd / Winogrande                       Good
Commonsense Reasoning                       Mixed
Reading Comprehension                       Mixed
SuperGLUE                                   Mixed
NLI                                         Poor
Bias Probes                                 Poor

Dario Amodei @ NeurIPS 12/7/20

Strengths

Joshi, et al. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. 2017
Paperno, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. 2016

Strengths

Task                  Accuracy Without Commas    Accuracy With Commas
4 Digit Addition      25.5%                      91.1%
4 Digit Subtraction   26.9%                      89.7%
5 Digit Addition      9.3%                       90.2%
5 Digit Subtraction   9.9%                       82.2%
6 Digit Addition      3%                         78.5%
6 Digit Subtraction   3%                         73.9%

(Comma formatting: 3456 -> 3,456)

Dario Amodei @ NeurIPS 12/7/20, and Gwern Branwen!
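The only change between the two columns is how the prompt writes the numbers. A tiny sketch of that formatting trick is below; the surrounding question wording is an illustrative choice, not the exact prompt from these experiments.

```python
# Writing operands with thousands separators (3456 -> 3,456) matches how large
# numbers usually appear in web text, which is credited with the accuracy jump above.
def arithmetic_prompt(a: int, b: int, op: str = "+") -> str:
    # The "Q:/A:" framing is illustrative; only the comma formatting is the point.
    return f"Q: What is {a:,} {op} {b:,}?\nA:"

print(arithmetic_prompt(13459, 7283))
# Q: What is 13,459 + 7,283?
# A:
```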

Limitations

Nie, et al. Adversarial NLI: A new benchmark for natural language understanding. 2019
Choi, et al. QuAC: Question answering in context. 2018
Dua, et al. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. 2019

Key Insights

• Few-shot transfer to new tasks is possible without any gradient updates, and it presents a flexible framework for specifying new tasks to a model.
• Bigger models can learn more from context.
• Bigger models have more emergent abilities.
• More context helps, up to a point.
• Performance continues to scale with compute.

Wang, et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. 2019

Lingering Questions

• Methods of Evaluation
• Training Datasets and Memorization
• Real-World Applications

Methods of Evaluation

• What would it take to feel confident that a model possessed a complex ability?
• Can we build comprehensive benchmarks so that we can identify the set of abilities a model possesses?
• How do we evaluate one of the model’s biggest strengths: creative generation?

Training Datasets and Memorization

• Quality of Data
• Duplication of Benchmarks

Training Datasets and Memorization - Data Quality

CommonCrawl filtering (a sketch of the resampling idea follows below):

1. Train a classifier to distinguish between unfiltered CommonCrawl and WebText/Books/Wikipedia.
2. Sample filtered CommonCrawl so that documents the classifier scores as higher quality are more likely to be selected.
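A minimal sketch of that two-step filtering, under the assumption of a simple bag-of-words logistic-regression classifier; the paper's actual features and implementation differ, and the Pareto-based keep rule is modeled on the stochastic resampling rule described in its appendix.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: train a classifier where label 1 = high-quality reference corpora
# (WebText/Books/Wikipedia) and label 0 = raw, unfiltered CommonCrawl.
reference_docs = ["An example paragraph from a curated, high-quality source."]          # placeholders
commoncrawl_docs = ["an example noisy page scraped from the open web !!! click here"]   # placeholders

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(reference_docs + commoncrawl_docs)
y = np.array([1] * len(reference_docs) + [0] * len(commoncrawl_docs))
classifier = LogisticRegression(max_iter=1000).fit(X, y)

# Step 2: keep a CommonCrawl document with probability that rises with its quality score.
def keep_document(doc: str, alpha: float = 9.0) -> bool:
    score = classifier.predict_proba(vectorizer.transform([doc]))[0, 1]
    return np.random.pareto(alpha) > 1 - score
```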

How can we better define and identify high-quality data?

Training Datasets and Memorization - Harmful Data

• Gender
• Race
• Religion

“The detective was a _______” -> 83% male
“The competent detective was a _______”
“The incompetent detective was a _______”

Male-biased descriptive words: Large, Mostly, Lazy, Fantastic, Eccentric, Protect, Jolly, Stable, Personable, Survive
Female-biased descriptive words: Optimistic, Bubbly, Naughty, Easy-going, Petite, Tight, Pregnant, Gorgeous, Sucked, Beautiful

Training Datasets and Memorization - Eval Memorization

How do we make sure models trained on huge amounts of web data don’t get the chance to memorize eval benchmarks?

Removing benchmarks from training data (a sketch of the overlap check follows below):

1. Look for overlapping phrases between benchmarks and training documents.
2. A quarter of benchmarks were found to have over 50% overlap with the training dataset!
3. Remove training documents that overlap with eval benchmarks.
4. Compare performance on each benchmark between the full test set and only the test examples that don’t appear in the training data.
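A minimal sketch of the overlap check behind steps 1-3: flag a benchmark example as contaminated if it shares a long n-gram with any training document. The whitespace tokenization and the choice of n = 13 here are simplifying assumptions.

```python
from typing import Iterable, Set

N = 13  # assumed n-gram length for "overlapping phrases"

def ngrams(text: str, n: int = N) -> Set[tuple]:
    toks = text.lower().split()          # whitespace tokenization is a simplification
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_training_index(training_docs: Iterable[str]) -> Set[tuple]:
    index: Set[tuple] = set()
    for doc in training_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(example: str, train_index: Set[tuple]) -> bool:
    """True if the benchmark example shares any long n-gram with a training document."""
    return not ngrams(example).isdisjoint(train_index)

# The "clean" benchmark keeps only examples with no long n-gram overlap:
# clean = [ex for ex in benchmark if not is_contaminated(ex, train_index)]
```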

Real-World Applications

Important considerations:

1. Potential for harmful outputs
2. Reliability of performance

• Semantic search
• Turn a script into a novel
• Turn a sentence into an email
• Smart formatting and code generation
• Emoji storytelling

https://andrewmayneblog.wordpress.com

Real-World Applications - Emoji Storytelling

https://andrewmayneblog.wordpress.com

Real-World Applications

• What are the useful applications of a model like GPT-3?
• Are there times when GPT-3 can be convincing enough, even if not perfectly reliable?

Real-World Applications - Writing News

Conclusion

• Language modeling performance appears to continue to scale with compute.
• Large models can transfer few-shot to new tasks without any fine-tuning.
• There are many complexities to evaluations, training datasets, and applications for large models.

Questions?