GPT-3: Few-Shot Learning with a Giant Language Model
Melanie Subbiah
OpenAI | Columbia University
What is the goal?
Humans learn new tasks through demonstrations and instructions.
We’d like general-purpose agents that can do the same.
Typical Approach
Disadvantages of Fine-tuning
• Creates a task-specific model
• Requires large, high-quality supervised datasets
• More likely to exploit spurious correlations
Few-shot transfer is more similar to how humans learn a new task.
Yogatama, et al. Learning and Evaluating General Linguistic Intelligence. 2019
What is an alternative?
Radford, et al. Language Models are Unsupervised Multitask Learners. 2019
Can we further improve on this level of generation and generalization?
GPT-3: 175 billion parameters
Critical Aspects of GPT-3
• Model Size
• Training Objective
Model Size
Transformers scale well!
Kaplan, et al. Scaling Laws for Neural Language Models. 2020
Motivating the Training Objective
Predict the next word in a sequence.

P(“The cat sat on the mat.”) = ???

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
- Noam Chomsky, 1969

Yet ranking these probabilities correctly requires a surprising range of abilities:
Grammar: P(“The cat sat on the mat.”) > P(“The cat sats on the mat.”)
World Knowledge: P(“The cat sat on the mat.”) > P(“The whale sat on the mat.”)
Addition: P(“4” | “2 + 2 =”) > P(“5” | “2 + 2 =”)
Sentiment Analysis: P(“1 star” | “That movie was terrible. I’d give it”) > P(“5 stars” | “That movie was terrible. I’d give it”)

Alec Radford @ Berkeley 4/15/20
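In symbols, the objective behind all of these comparisons is the standard autoregressive factorization of sequence probability:

$$P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$$

so training a model to predict the next word is exactly training it to assign these relative probabilities.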
Approach
Model
http://jalammar.github.io/illustrated-gpt2/

Given the prompt “The trophy didn’t fit in the suitcase because”, the model generates “the trophy was too big.”
Model Sizes
[Chart: parameter count (in millions), linear scale from 0 to 18,000, across the eight GPT-3 sizes: Small, Medium, Large, XL, 2.7B, 6.7B, 13B, 175B]
Model Sizes
[Chart: the same parameter counts with BERT-Large, T5-Large, Megatron, and T-NLG marked for comparison]
Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018
Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2019
Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019
Microsoft. Turing-NLG: A 17-Billion Parameter Language Model by Microsoft. 2020
Model Sizes
[Chart: parameter count (in millions), linear scale from 0 to 180,000; at this scale 175B dwarfs every smaller model]
Model Sizes
[Chart: parameter count (in millions), log scale from 1 to 1,000,000, across the eight model sizes]
Model Sizes
[Chart: the same log-scale parameter counts, annotated with the batch size and learning rate used for each model]
Compute
Dataset
• Common Crawl (filtered) - general web crawl, filtered for similarity to high-quality reference corpora and de-duplicated
• WebText2 - an expanded version of the GPT-2 training data: a scrape of outbound links from Reddit posts with reasonably high ratings
• Books1 & Books2 - internet-based books corpora
• Wikipedia - English-language Wikipedia
https://commoncrawl.org/the-data
Radford, et al. Language Models are Unsupervised Multitask Learners. 2019
Dataset Mix
[Chart: tokens (in billions), 0 to 500, for Common Crawl, WebText2, Books1, Books2, and Wikipedia]
Dataset Mix
[Chart: weight in training mix (percent), 0 to 100, for the same five datasets]
Dataset Mix
[Chart: epochs elapsed over 300B training tokens, 0 to 4, for the same five datasets]
Evaluations
Let’s try it!

tdaeef = ?

Zero-Shot:
Please unscramble the letters into a word and write that word.
tdaeef = ?

One-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
tdaeef = ?

Few-Shot:
Please unscramble the letters into a word and write that word.
pcirlaroc = reciprocal
elapac = palace
tdaeef = ?
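A minimal sketch of how these k-shot prompts might be assembled; the helper name and formatting here are illustrative, not the paper's exact template:

```python
def build_prompt(instruction, examples, query, k):
    """Assemble a k-shot prompt: instruction, k solved examples, then the query."""
    lines = [instruction]
    for scrambled, word in examples[:k]:
        lines.append(f"{scrambled} = {word}")
    lines.append(f"{query} = ")
    return "\n".join(lines)

instruction = "Please unscramble the letters into a word and write that word."
examples = [("pcirlaroc", "reciprocal"), ("elapac", "palace")]

print(build_prompt(instruction, examples, "tdaeef", k=0))  # zero-shot
print(build_prompt(instruction, examples, "tdaeef", k=1))  # one-shot
print(build_prompt(instruction, examples, "tdaeef", k=2))  # few-shot
```

The only thing that changes across the three settings is the prompt text; the model's weights are never updated.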
[Figure: few-shot prompting vs. fine-tuning]
Metalearning
Sample Output
Prompt: “The trophy didn’t fit in the suitcase because” -> the model generates: “the trophy was too big.”
LM-Likelihood
Instead of generating, the model can score text: each token of “The trophy didn’t fit in the suitcase because the trophy was too big.” receives a likelihood (the slide illustrates per-token scores such as 2.78, 3.45, 10.00, 0.50, 25.12).
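GPT-3 itself sits behind an API, but the scoring idea is easy to make concrete with the publicly released GPT-2 as a stand-in; a sketch:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(text: str) -> float:
    """Sum of log P(token | preceding tokens) under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so shift to align with targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

print(sentence_log_likelihood("The cat sat on the mat."))
print(sentence_log_likelihood("The cat sats on the mat."))  # should score lower
```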
Methods of Evaluation
Randomly select K examples from the training dataset to build the context.

Multiple-choice tasks:
• Feed the context + each possible completion through the model separately
• Normalize the LM likelihood over the completion and select the completion with the highest likelihood

Free-form tasks:
• Sample from the model up to a newline
• Score with BLEU, F1, or exact-match accuracy

(A sketch of the multiple-choice scoring follows below.)
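For the multiple-choice case, a sketch of the selection rule. The `completion_logprob` callable is a hypothetical helper assumed to behave like the likelihood function sketched earlier, and the paper's actual normalization varies by task (token count, or the completion's unconditional likelihood), so the word-count division here is a simplification:

```python
def pick_answer(context, choices, completion_logprob):
    """Select the completion with the highest length-normalized LM likelihood.

    completion_logprob(context, choice) -> summed log-probability of `choice`
    given `context` (hypothetical helper; see the likelihood sketch above).
    """
    best, best_score = None, float("-inf")
    for choice in choices:
        score = completion_logprob(context, choice) / max(len(choice.split()), 1)
        if score > best_score:
            best, best_score = choice, score
    return best
```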
Complete List of Tasks
Reading Comprehension: QuAC, SQuADv2, DROP, CoQA, RACE
Language Modeling: PTB
Cloze and Completion: ROC Stories, HellaSwag, LAMBADA
Trivia-style Questions: NaturalQs, WebQs, TriviaQA
Translation: En <-> Fr, En <-> De, En <-> Ro
Winograd-style: Winograd, Winogrande
Commonsense Reasoning: PIQA, ARC, OpenBookQA
Comprehensive Benchmarks: SuperGLUE
Inference: ANLI, RTE
Synthetic and Qualitative: arithmetic, word scrambling, character-level manipulation, SAT analogies, article generation, learning and using novel words, correcting English grammar
Summary of Performance

Task Class | Few-Shot Performance
Cloze, Completion, and Language Modeling | Very Good
Question Answering / Knowledge Base | Very Good
Translation | Good
Winograd / Winogrande | Good
Commonsense Reasoning | Mixed
Reading Comprehension | Mixed
SuperGLUE | Mixed
NLI | Poor
Bias Probes | Poor

Dario Amodei @ NeurIPS 12/7/20
Strengths
Joshi, et al. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. 2017
Paperno, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. 2016
Strengths
Arithmetic accuracy improves dramatically when digits are grouped with commas (3456 -> 3,456):

Task | Accuracy Without Commas | Accuracy With Commas
4-Digit Addition | 25.5% | 91.1%
4-Digit Subtraction | 26.9% | 89.7%
5-Digit Addition | 9.3% | 90.2%
5-Digit Subtraction | 9.9% | 82.2%
6-Digit Addition | 3% | 78.5%
6-Digit Subtraction | 3% | 73.9%

Dario Amodei @ NeurIPS 12/7/20, and Gwern Branwen!
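The reformatting itself is trivial; what matters is that commas change how the numbers are tokenized. A one-line illustration (the surrounding prompt format is an assumption, not the exact one used in the experiment):

```python
def with_commas(n: int) -> str:
    """Group digits with commas, e.g. 3456 -> '3,456'."""
    return f"{n:,}"

# Hypothetical prompt format for a comma-grouped addition query.
prompt = f"Q: What is {with_commas(48234)} plus {with_commas(31126)}? A:"
```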
Limitations
Nie, et al. Adversarial NLI: A new benchmark for natural language understanding. 2019
Choi, et al. QuAC: Question answering in context. 2018
Dua, et al. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. 2019
Key Insights
• Few-shot transfer to new tasks is possible without any gradient updates, and it presents a flexible framework for specifying new tasks to a model.
• Bigger models can learn more from context.
• Bigger models have more emergent abilities.
• More context helps, up to a point. (Wang, et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. 2019)
• Performance continues to scale with compute.
Lingering Questions
• Methods of Evaluation
• Training Datasets and Memorization
• Real-World Applications
Methods of Evaluation
Looking back over the complete list of tasks:
• What would it take to feel confident that a model possessed a complex ability?
• Can we build comprehensive benchmarks that identify the full set of abilities a model possesses?
• How do we evaluate one of the model’s biggest strengths - creative generation?
Training Datasets and Memorization
• Quality of Data
• Duplication of Benchmarks
Training Datasets and Memorization - Data Quality
CommonCrawl filtering:
1. Train a classifier to distinguish unfiltered CommonCrawl from the high-quality WebText/Books/Wikipedia corpora.
2. Re-sample CommonCrawl, keeping each document with a probability that rises with its classifier quality score (see the sketch below).
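The paper describes keeping a Common Crawl document when a Pareto-distributed random draw exceeds one minus the classifier's quality score, so high-scoring documents are almost always kept while low-scoring ones occasionally survive. A sketch under that description (the shape parameter is the value I recall from the paper's appendix; treat it as an assumption):

```python
import numpy as np

ALPHA = 9  # Pareto shape parameter (assumed from the GPT-3 paper's appendix)

def keep_document(quality_score: float,
                  rng: np.random.Generator = np.random.default_rng()) -> bool:
    """Stochastically keep a document, favoring high classifier scores.

    quality_score in [0, 1] is the classifier's probability that the document
    resembles the high-quality reference corpora (WebText/Books/Wikipedia).
    """
    return rng.pareto(ALPHA) > 1.0 - quality_score
```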
Training Datasets and Memorization - Data Quality
How can we better define and identify high-quality data?
Training Datasets and Memorization - Harmful Data
• Gender
• Race
• Religion

Example gender probe: “The detective was a _______” -> 83% male
Related probes: “The competent detective was a _______” / “The incompetent detective was a _______”

Male-biased descriptive words: large, mostly, lazy, fantastic, eccentric, protect, jolly, stable, personable, survive
Female-biased descriptive words: optimistic, bubbly, naughty, easy-going, petite, tight, pregnant, gorgeous, sucked, beautiful

(A toy version of such a probe is sketched below.)
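A toy version of the probe idea, far cruder than the paper's methodology: sample continuations and count gendered words. GPT-2 again stands in for the API-only GPT-3, and the word lists are purely illustrative:

```python
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

MALE = {"man", "male", "he", "his", "him", "gentleman"}
FEMALE = {"woman", "female", "she", "her", "hers", "lady"}

def gender_probe(prompt: str, n_samples: int = 50) -> Counter:
    """Sample short continuations of `prompt` and tally gendered words."""
    counts = Counter()
    outputs = generator(prompt, num_return_sequences=n_samples,
                        max_new_tokens=8, do_sample=True)
    for out in outputs:
        continuation = out["generated_text"][len(prompt):].lower()
        for word in continuation.replace(".", " ").replace(",", " ").split():
            if word in MALE:
                counts["male"] += 1
            elif word in FEMALE:
                counts["female"] += 1
    return counts

print(gender_probe("The detective was a"))
```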
Training Datasets and Memorization - Eval Memorization
How do we make sure models trained on huge amounts of web data don’t get the chance to memorize eval benchmarks?

Removing benchmarks from training data:
1. Look for overlapping phrases between benchmarks and training documents (a toy overlap check is sketched below).
2. They found that a quarter of benchmarks had over 50% overlap with the training dataset!
3. Remove training documents that overlap with eval benchmarks.
4. Compare benchmark performance between the full eval sets and only the test examples that don’t appear in the training data.
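A toy overlap check in the spirit of step 1. The paper works at the level of n-gram collisions after normalization (13-grams, if I recall its appendix correctly; treat that value as an assumption), and real deduplication at this scale needs hashing and indexing rather than per-pair set intersection:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(train_doc: str, eval_example: str, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with an eval example."""
    return bool(ngrams(train_doc.lower().split(), n) &
                ngrams(eval_example.lower().split(), n))
```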
Real-World Applications
Important considerations
1. Potential for harmful outputs
2. Reliability of performance
Real-World Applications
• Semantic search
• Turn a script into a novel
• Turn a sentence into an email
• Smart formatting and code generation
• Emoji storytelling
https://andrewmayneblog.wordpress.com
Real-World Applications - Emoji Storytelling
[Example emoji stories]
https://andrewmayneblog.wordpress.com
Real-World Applications
• What are the useful applications of a model like GPT-3?
• Are there times when GPT-3 can be convincing enough, even if not perfectly reliable?
Real-World Applications - Writing News
Conclusion
• Language modeling performance appears to continue to scale with compute
• Large models can transfer few-shot to new tasks without any fine-tuning
• There are many complexities to evaluations, training datasets, and applications for large models
Questions?