Estimating | Querying | Decoding
Language Model Algorithms
Kenneth Heafield
Carnegie Mellon, University of Edinburgh
March 7, 2013
Kenneth Heafield Language Model Algorithms
MT is Expensive
- “Since decoding is very time-intensive” [Jehl et al., 2012]
- “Based on the amount of memory we can afford” [Wuebker et al., 2012]

Much of the computational cost is due to the language model.
Desiderata
- More data
- Less CPU time
- Less RAM
- High quality
  - Search accuracy
  - BLEU
Language Model Algorithms
Estimating: text → ARPA file with probabilities.
Querying: ARPA file → efficient data structure.
Decoding: searching for high-scoring translations.
Sorting | Modified Kneser-Ney
Estimating: The Problem
SRILM uses too much memory ⇒ limit on data size. Also, it is annoying to compile.
IRSTLM does not estimate modified Kneser-Ney models. Also, it segfaults.
Stupid Backoff [Brants et al., 2007]
1. Count n-grams offline:

   count(w_1^n)

2. Compute pseudo-probabilities at runtime:

   s(w_n | w_1^{n-1}) = count(w_1^n) / count(w_1^{n-1})   if count(w_1^n) > 0
   s(w_n | w_1^{n-1}) = 0.4 · s(w_n | w_2^{n-1})          if count(w_1^n) = 0

NB: s does not sum to 1.
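The recursion can be sketched in Python. This is a toy in-memory version (the actual systems stream counts from disk); the counts below are hypothetical, and only α = 0.4 comes from the slide.

```python
# Toy in-memory sketch of stupid backoff (hypothetical counts).
counts = {
    ("one",): 2, ("of",): 2, ("the",): 1,
    ("one", "of"): 2, ("of", "the"): 1,
}

def stupid_backoff(ngram, counts, alpha=0.4):
    """Pseudo-probability s(w_n | w_1^{n-1}); deliberately does not sum to 1."""
    if len(ngram) == 1:
        # Base case: unigram relative frequency.
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        # Seen n-gram: relative frequency against its context count
        # (assumes the context count is present, as it is in real data).
        return counts[ngram] / counts[ngram[:-1]]
    # Unseen n-gram: back off to the shorter context with penalty alpha.
    return alpha * stupid_backoff(ngram[1:], counts, alpha)
```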
Counting n-grams
<s> Australia is one of the few
5-gram                    Count
<s> Australia is one of   1
Australia is one of the   1
is one of the few         1

Hash table? Runs out of RAM.
Spill to Disk When RAM Runs Out
[Diagram: text streams into in-memory hash tables; when RAM runs out, each table is sorted and spilled to a file on disk, and the files are combined with a merge sort.]
Fanout: Merging More Than 2 Files
Merging takes log_fanout |data| passes, each of which reads and writes |data|.
Small fanout ⇒ more passes. Large fanout ⇒ more disk seeks.
Compromise: a 64 MB buffer per file, so fanout = (RAM / 64 MB) − 1.
Optimizing Merge Sort
- Laziness: do the last merge as the file is being read.
- Sharding: partition the data by key.
- Combining: merge counts of identical n-grams during merge passes.
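The combining optimization can be sketched as a lazy k-way merge that sums counts of identical n-grams as they stream past. The sorted runs below are hypothetical stand-ins for the spilled files.

```python
import heapq

# Sketch: merge sorted runs of (ngram, count) records, combining counts
# for identical n-grams during the merge pass (hypothetical data).
def merge_counts(runs):
    merged = heapq.merge(*runs)  # lazy k-way merge of sorted iterators
    current, total = None, 0
    for ngram, count in merged:
        if ngram == current:
            total += count          # combine instead of emitting twice
        else:
            if current is not None:
                yield current, total
            current, total = ngram, count
    if current is not None:
        yield current, total

run_a = [(("of", "the"), 1), (("one", "of"), 2)]
run_b = [(("of", "the"), 3), (("the", "few"), 1)]
print(list(merge_counts([run_a, run_b])))
```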
Now we can estimate Stupid Backoff models.
What about modified Kneser-Ney?
Modified Kneser-Ney

Text
  → Extract 5-Grams
  → Sort in Suffix Order → Adjust Counts
  → Sort in Context Order → Normalize
  → Sort in Suffix Order → Interpolate
  → ARPA file
Sorting Orders
Suffix Order (sort key: words 5, 4, 3, 2, 1)
A A A A A
C A A A A
A Y A B A
A Y Z B A
A A A A Y
C A B A Z

Context Order (sort key: words 4, 3, 2, 1, 5)
A A A A A
A A A A Y
C A A A A
C A B A Z
A Y A B A
A Y Z B A
Each step is a streaming algorithm in suffix or context order.
Adjust Counts
adjusted(w_1^n) = count(w_1^n)                 if n = 5 or w_1 = <s>
adjusted(w_1^n) = |{v : count(v w_1^n) > 0}|   otherwise

Suffix order makes v w_1^n consecutive.
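The otherwise-branch can be sketched as one streaming pass: because suffix order makes all (n+1)-grams sharing the suffix w_1^n adjacent, counting distinct left-extensions v needs no random access. The stream below is hypothetical.

```python
from itertools import groupby

# Sketch: adjusted(w_1^n) = number of distinct words v with count(v w_1^n) > 0.
# The input is (n+1)-grams in suffix order, so equal suffixes are adjacent
# and a single streaming pass suffices (hypothetical data).
def adjusted_counts(suffix_sorted_ngrams):
    # Group (n+1)-grams by their length-n suffix; count distinct first words.
    for suffix, group in groupby(suffix_sorted_ngrams, key=lambda ng: ng[1:]):
        yield suffix, len({ng[0] for ng in group})

stream = [
    ("a", "one", "of"), ("the", "one", "of"),   # two extensions of "one of"
    ("is", "of", "the"),                        # one extension of "of the"
]
print(dict(adjusted_counts(stream)))
```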
Normalize
normalized(w_1^n) = (adjusted(w_1^n) − discount(adjusted(w_1^n))) / Σ_v adjusted(w_1^{n-1} v)

Context order makes w_1^{n-1} v consecutive.
The denominator is computed by a thread that reads ahead.
Interpolate
p(w_n | w_1^{n-1}) = normalized(w_1^n) + backoff(w_1^{n-1}) · p(w_n | w_2^{n-1})

Suffix order means p(w_n | w_2^{n-1}) and p(w_n | w_1^{n-1}) are close.
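The interpolation recursion can be sketched with hypothetical normalized and backoff values (in a real run these come from the Normalize step); the uniform base case is also an assumption for illustration.

```python
# Sketch of the interpolation step with hypothetical values: the order-n
# probability interpolates with the order-(n-1) probability.
normalized = {("of",): 0.3, ("one", "of"): 0.5}
backoff = {(): 0.1, ("one",): 0.2}
P_UNIFORM = 0.05  # hypothetical base case for the empty context

def interpolate(ngram):
    if not ngram:
        return P_UNIFORM
    context = ngram[:-1]   # w_1^{n-1}
    lower = ngram[1:]      # drop the leftmost context word: w_2^n
    return normalized.get(ngram, 0.0) + backoff.get(context, 0.0) * interpolate(lower)
```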
Estimation Results
lmplz: 132 billion tokens, 4.3 days, one machine (140 GB RAM), lossless.
Google 2007: 31 billion tokens, 2 days, 400 machines, pruned singleton words.
Lookup | Optimizing
Querying: The Problem
The LM is queried ≈2.7 million times per sentence translated. ⇒ Represent the LM to make queries fast.

The University of Edinburgh owns a machine with 1 TB RAM. ⇒ Find more training data!
Example Queries

Unigrams
Words   log p   log b
<s>     -∞      -2.0
iran    -4.1    -0.8
is      -2.5    -1.4
one     -3.3    -0.9
of      -2.5    -1.1

Bigrams
Words      log p   log b
<s> iran   -3.3    -1.2
iran is    -1.7    -0.4
is one     -2.0    -0.9
one of     -1.4    -0.6

Trigrams
Words         log p
<s> iran is   -1.1
iran is one   -2.0
is one of     -0.3

Query: <s> iran is
log p(is | <s> iran) = -1.1

Query: iran is of
log p(of)                -2.5
log backoff(is)          -1.4
log backoff(iran is)   + -0.4
log p(of | iran is)    = -4.3
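The two queries can be reproduced with a small sketch: the tables transcribed into Python dicts and a lookup that drops context words from the left, charging the backoff of each context that fails to match.

```python
import math

# The example model from the tables above, keyed by word tuples.
log_p = {
    ("<s>",): -math.inf, ("iran",): -4.1, ("is",): -2.5, ("one",): -3.3, ("of",): -2.5,
    ("<s>", "iran"): -3.3, ("iran", "is"): -1.7, ("is", "one"): -2.0, ("one", "of"): -1.4,
    ("<s>", "iran", "is"): -1.1, ("iran", "is", "one"): -2.0, ("is", "one", "of"): -0.3,
}
log_b = {
    ("<s>",): -2.0, ("iran",): -0.8, ("is",): -1.4, ("one",): -0.9, ("of",): -1.1,
    ("<s>", "iran"): -1.2, ("iran", "is"): -0.4, ("is", "one"): -0.9, ("one", "of"): -0.6,
}

def query(ngram):
    """log p(w_n | w_1^{n-1}) with backoff. Assumes every unigram is present."""
    penalty = 0.0
    while ngram not in log_p:
        penalty += log_b.get(ngram[:-1], 0.0)  # charge backoff of the failed context
        ngram = ngram[1:]                      # shorten the context from the left
    return log_p[ngram] + penalty

print(query(("<s>", "iran", "is")))  # found directly in the trigram table
print(query(("iran", "is", "of")))   # backs off twice, as in the walkthrough
```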
Lookup I: Giant Hash Table
Hash every n-gram to a 64-bit integer; ignore collisions.
Store the n-grams in custom linear-probing hash tables.
Fastest.
Lookup II: Minimal Perfect Hash [Talbot and Brants, 2008]
A function from seen n-grams to integers:

Perfect: each seen n-gram has a unique integer.
Minimal: the integers are in [0, |n-grams|).

Use the integer to index an array of probability and backoff.

False Positives
Problem: unseen n-grams collide with seen n-grams.
Solution: store an f-bit signature ⇒ 2^−f false-positive rate.
Lowest memory but has false positives.
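The signature check can be sketched as follows. The hash function and the list index standing in for the minimal perfect hash are both assumptions for illustration; only the f-bit signature idea comes from the slide.

```python
import hashlib

F = 8  # signature bits; false-positive rate ≈ 2^-F

def signature(ngram, f=F):
    # Hypothetical signature: low f bits of a stable hash of the n-gram.
    digest = hashlib.blake2b(" ".join(ngram).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)

# Store only the f-bit signature per seen n-gram, indexed here by position
# in a list (a toy stand-in for the minimal perfect hash value).
seen = [("iran", "is"), ("is", "one"), ("one", "of")]
signatures = [signature(ng) for ng in seen]

def maybe_seen(ngram, index):
    """True if the stored signature matches; an unseen n-gram mapped to this
    slot passes anyway with probability ≈ 2^-F (the false positive)."""
    return signatures[index] == signature(ngram)
```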
Lookup III: Reverse Trie
[Diagram: a reverse trie over the example n-grams. Each n-gram is stored last word first, so n-grams sharing a suffix (e.g. those ending in “is Australia <s>”) share a path.]
The most popular format. Exact, with memory usage between the other two.
Optimizing Storage
Bit-Level Packing
Use only as many bits as needed.
Chop Bits [Raj and Whittaker, 2003]
In a sorted array, encode bits by the offset where they roll over.
Quantization [Whittaker and Raj, 2001] [IRSTLM]
Cluster floats into 2^q bins, then store q bits per float.
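A minimal sketch of the idea: map each float to a q-bit bin index and keep only the index. Equal-width binning is a simplification here; the cited work clusters the values, and the sample values are the example log probabilities.

```python
# Sketch: quantize floats into 2^q bins and store only the q-bit bin index.
# Equal-width binning is an assumption; the cited work clusters values.
def build_bins(values, q):
    lo, hi = min(values), max(values)
    width = (hi - lo) / (2 ** q)
    centers = [lo + width * (i + 0.5) for i in range(2 ** q)]
    return lo, width, centers

def quantize(x, lo, width, q):
    # Clamp the top value into the last bin.
    return min(int((x - lo) / width), 2 ** q - 1) if width else 0

values = [-4.1, -2.5, -3.3, -2.5, -1.4]  # example log probabilities
lo, width, centers = build_bins(values, q=2)
indices = [quantize(x, lo, width, 2) for x in values]   # 2 bits each
restored = [centers[i] for i in indices]                # lossy reconstruction
```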
Storing Less
Filtering
Remove n-grams that will not be queried during decoding.
Pruning
Remove low-count n-grams.
Encode Less Than Probability and Backoff
- Stupid backoff has one value: the count.
- Collapse probability and backoff into one value (requires more CPU time).
Moses Benchmarks: Single Threaded
[Plot: memory (GB, 0–12) versus single-threaded decoding time (h, 0–4) for the Trie, Probing, Chop, SRI, IRST, and Rand backoff (2^−8 false-positive) models.]
Querying Takeaways
- Optimizing the LM optimizes the decoder.
- Approximations can have negligible impact on quality: 27.29 BLEU → 27.09 BLEU after quantizing to 4 bits.
- There is no single best data structure.
Decoding
March 5: Cube Pruning
Can we do better?
Recall: Cube Pruning
- Each constituent is visited in bottom-up (topological) order.
- A fixed number of hypotheses is generated per constituent.
- Probabilities for words on constituent edges are estimates.
A Beam
X:X

Left State … Right State
countries that a maintain diplomatic relations | with North Korea .
countries that have a an embassy in | DPR Korea .
country a that maintains some diplomatic ties | in North Korea .
nations which has a some diplomatic ties | with DPR Korea .
country a that maintains some diplomatic ties | with DPR Korea .
(| marks where the right state begins.)

Rule “is a X:X”
Cube pruning tests variants of the same bad idea.
A Beam With State
X:X

Left State ⋯ Right State                   Score
(countries that a ⋯ with North Korea .)    -2
(nations which has a ⋯ with DPR Korea .)   -4
(countries that have a ⋯ DPR Korea .)      -5
(country a ⋯ in North Korea .)             -8
(country a ⋯ with DPR Korea .)             -9

⋯ denotes words omitted by state.
Rule “is a X:X”
Cube pruning tests variants of the same bad idea.
High Level Idea of Incremental Search
Group hypotheses by common outer words. Score the outer words first and see how well they do.
Make a tree
(ε ⋯ ε)
  (country a ⋯ Korea .)
    (country a ⋯ with DPR Korea .)
    (country a ⋯ in North Korea .)
  (nations which has a ⋯ with DPR Korea .)
  (countries that ⋯ Korea .)
    (countries that have a ⋯ DPR Korea .)
    (countries that a ⋯ with North Korea .)
Use the Tree
Try “in (country a ⋯ Korea .)” once and see if you like it.
Pop the top candidate off the priority queue, expand its children, and push them.
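The pop/expand/push loop can be sketched with a min-heap over negated scores. The tree and leaf scores are adapted from the earlier beam table; taking each group's score as its best member's score is an assumption for illustration.

```python
import heapq

# Hypothetical tree of grouped hypotheses: (score, label, children).
# Group scores are assumed to be the best score among their members.
tree = (-2, "(ε ⋯ ε)", [
    (-8, "(country a ⋯ Korea .)", [
        (-9, "(country a ⋯ with DPR Korea .)", []),
        (-8, "(country a ⋯ in North Korea .)", []),
    ]),
    (-4, "(nations which has a ⋯ with DPR Korea .)", []),
    (-2, "(countries that ⋯ Korea .)", [
        (-5, "(countries that have a ⋯ DPR Korea .)", []),
        (-2, "(countries that a ⋯ with North Korea .)", []),
    ]),
])

def best_first(root, k):
    """Pop the top candidate, expand its children, push them; the first k
    leaves popped are the best ones under these (hypothetical) scores."""
    queue = [(-root[0], root)]  # heapq is a min-heap, so negate scores
    leaves = []
    while queue and len(leaves) < k:
        _, (score, label, children) = heapq.heappop(queue)
        if not children:
            leaves.append((score, label))
        for child in children:
            heapq.heappush(queue, (-child[0], child))
    return leaves
```

Bad groups like “(country a ⋯ Korea .)” are scored once and never expanded unless everything better runs out, which is the point of the tree.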
Results