Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Jun Wu
Advisor: Sanjeev Khudanpur

Department of Computer Science, Johns Hopkins University
Baltimore, MD 21218

April 2001
NSF STIMULATE Grant No. IRI-9618874
Outline
Language modeling in speech recognition
The maximum entropy (ME) principle
Semantic (topic) dependencies in natural language
Syntactic dependencies in natural language
ME models with topic and syntactic dependencies
Conclusion and future work
Topic assignment during test (15 min)
Role of syntactic head (15 min)
Training ME models in an efficient way (1 hour)
Motivation
Example: "A research team led by two Johns Hopkins scientists ___ found the strongest evidence yet that a virus may ..."
Candidate words for the blank: have / has / his.
Language Models in Speech Recognition
[Diagram: the source-channel model of speech recognition. The speaker's mind produces a word string W*, which passes through the speech producer and acoustic channel as speech A; the acoustic processor and linguistic decoder (together, the speech recognizer) output the hypothesis Ŵ.]
Role of language models
$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)} = \arg\max_W P(A \mid W)\,P(W) \approx \arg\max_W P_A(A \mid W)\,P_L(W)$$

where $P_A(A \mid W)$ is the acoustic model and $P_L(W)$ is the language model.
Language Modeling in Speech Recognition
N-gram models:
$$P(W) = P(w_1 w_2 \cdots w_m) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_m \mid w_1 \cdots w_{m-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1} \cdots w_{i-1})$$

In practice, N = 1, 2, 3 or 4. Even these values of N pose a data-sparseness problem: for $|V| \approx 20\mathrm{K}$, a trigram model has $|V|^3 \approx 8$ trillion free parameters. There are millions of unseen bigrams $(v, w)$ and billions of unseen trigrams $(u, v, w)$ for which we need an estimate of the probability $P_3(w \mid u, v)$.
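To make the sparseness point concrete, here is a minimal Python sketch (my own illustration, not from the talk) of a relative-frequency trigram estimator over a toy corpus; the function names are hypothetical. Any trigram that never occurs in training gets probability zero, which is why the smoothing techniques on the next slide are needed.

```python
from collections import defaultdict

def count_ngrams(tokens, n):
    """Count all n-grams in a token sequence."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def mle_trigram_prob(w, u, v, tri_counts, bi_counts):
    """Relative-frequency estimate f(w | u, v) = #(u, v, w) / #(u, v)."""
    if bi_counts[(u, v)] == 0:
        return 0.0          # unseen history: the MLE assigns zero probability
    return tri_counts[(u, v, w)] / bi_counts[(u, v)]

corpus = "a research team led by two scientists found the evidence".split()
tri, bi = count_ngrams(corpus, 3), count_ngrams(corpus, 2)
print(mle_trigram_prob("found", "two", "scientists", tri, bi))  # seen trigram -> 1.0
print(mle_trigram_prob("virus", "two", "scientists", tri, bi))  # unseen trigram -> 0.0
```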
Smoothing Techniques

Relative frequency estimate:
$$P(w \mid u,v) = f(w \mid u,v) = \frac{\#(u,v,w)}{\#(u,v)}$$

Deleted interpolation (Jelinek, et al. 1980):
$$P_3(w \mid u,v) = \lambda_3\, f(w \mid u,v) + \lambda_2\, f(w \mid v) + \lambda_1\, f(w) + \lambda_0$$

Back-off (Katz 1987; Witten-Bell 1990; Ney, et al. 1994):
$$P(w \mid u,v) = \begin{cases} f(w \mid u,v) & \text{if } \#(u,v,w) > T \\[4pt] \dfrac{\mathrm{discount}[\#(u,v,w)]}{\#(u,v)} & \text{if } 0 < \#(u,v,w) \le T \\[4pt] \alpha(u,v)\, P_2(w \mid v) & \text{if } \#(u,v,w) = 0 \end{cases}$$
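A minimal sketch of the deleted-interpolation formula above (illustrative only, not the thesis code); it reuses count dictionaries like those in the previous sketch, and the interpolation weights are fixed here, whereas in practice the lambdas are estimated on held-out data.

```python
def interpolated_trigram_prob(w, u, v, tri, bi, uni, vocab_size,
                              lambdas=(0.5, 0.3, 0.15, 0.05)):
    """Deleted interpolation:
    P(w|u,v) = l3*f(w|u,v) + l2*f(w|v) + l1*f(w) + l0*(1/|V|)."""
    l3, l2, l1, l0 = lambdas
    n_uv, n_uvw = bi.get((u, v), 0), tri.get((u, v, w), 0)
    n_v, n_vw = uni.get((v,), 0), bi.get((v, w), 0)
    total = sum(uni.values())
    f3 = n_uvw / n_uv if n_uv else 0.0               # trigram relative frequency
    f2 = n_vw / n_v if n_v else 0.0                  # bigram relative frequency
    f1 = uni.get((w,), 0) / total if total else 0.0  # unigram relative frequency
    return l3 * f3 + l2 * f2 + l1 * f1 + l0 / vocab_size
```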
Measuring the Quality of Language Models

Word error rate:
$$\mathrm{WER} = \frac{\#S + \#D + \#I}{\#C + \#S + \#D}$$

Reference:  The contract ended with a loss of *** seven cents.
Hypothesis: A contract ended with * loss of some even cents.
Scores: S C C C D C C I S C

Perplexity:
$$\mathrm{PPL} = 2^{H(P_L)}, \qquad H(P_L) = -\sum_{W} P(W) \log_2 P_L(W)$$

Perplexity measures the average number of words that can follow a given history under a language model.
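A minimal sketch of both metrics (my own illustration): WER from a word-level edit distance between reference and hypothesis, and perplexity from per-word log2-probabilities supplied by some language model.

```python
import math

def wer(ref, hyp):
    """Word error rate = (substitutions + deletions + insertions) / len(ref),
    computed with dynamic-programming edit distance over words."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def perplexity(word_log2_probs):
    """PPL = 2 ** (average negative log2 probability per word)."""
    return 2 ** (-sum(word_log2_probs) / len(word_log2_probs))

ref = "the contract ended with a loss of seven cents".split()
hyp = "a contract ended with loss of some even cents".split()
print(round(wer(ref, hyp), 3))            # 4 errors / 9 reference words ~= 0.444
print(perplexity([math.log2(1 / 79)] * 5))  # uniform 1/79 per word -> PPL 79.0
```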
Experimental Setup for Switchboard
American English conversations over the telephone.
Vocabulary: 22K (closed). LM training set: 1100 conversations, 2.1M words.
Test set: WS97 dev-test set; 19 conversations (2 hours), 18K words; PPL = 79 (back-off trigram model). State-of-the-art systems: 30-35% WER.
Evaluation: 100-best list rescoring.
[Diagram: speech is decoded by the speech recognizer with the baseline LM into a 100-best hypothesis list, which is rescored with the new LM to select 1 hypothesis.]
Experimental Setup for Broadcast News
American English television broadcast. Vocabulary: open (>100K). LM training set: 125K stories, 130M words.
Test set: Hub-4 '96 dev-test set; 21K words; PPL = 174 (back-off trigram model). State-of-the-art systems: 25% WER.
The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.
The Maximum Entropy Principle

When we make inferences based on incomplete information, we should choose the probability distribution which has the maximum entropy permitted by the information we do have.

Example (dice): Let $p_i$, $i = 1, 2, \ldots, 6$, be the probability that the facet with $i$ dots faces up. Seek the model $P = (p_1, p_2, \ldots, p_6)$ that maximizes
$$H(P) = -\sum_i p_i \log p_i \quad \text{subject to} \quad \sum_{i=1}^{6} p_i = 1.$$
From the Lagrangian
$$L(P, \lambda) = -\sum_i p_i \log p_i + \lambda \Big( \sum_i p_i - 1 \Big), \qquad \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda = 0,$$
we get $p_1 = p_2 = \cdots = p_6 = e^{\lambda - 1}$. So, choose $p_1 = \cdots = p_6 = 1/6$.
The Maximum Entropy Principle (Cont.)

Example 2: Seek the probability distribution with the constraint $\hat{p}_2 = 1/4$ ($\hat{p}$ is the empirical distribution).

The feature:
$$f(i) = \begin{cases} 1 & \text{if } i = 2 \\ 0 & \text{otherwise} \end{cases}$$

Empirical expectation:
$$E_{\hat{P}}[f] = \sum_i \hat{p}(i) f(i) = \tfrac{1}{4}$$

Maximize $H(P) = -\sum_i p_i \log p_i$ subject to $E_P[f] = E_{\hat{P}}[f]$ and $\sum_i p_i = 1$. From the Lagrangian
$$L(P, \lambda) = -\sum_i p_i \log p_i + \lambda_1 \Big( \sum_i p_i - 1 \Big) + \lambda_2 \Big( \sum_i p_i f(i) - \tfrac{1}{4} \Big),$$
we get
$$p_2 = \tfrac{1}{4}, \qquad p_1 = p_3 = \cdots = p_6 = \tfrac{3}{20}.$$
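As a numerical sanity check on Example 2 (my own sketch, not from the talk): the ME solution has the exponential form p_i proportional to exp(lambda * f(i)), and a one-dimensional bisection on lambda that matches the constraint E_P[f] = 1/4 recovers p_2 = 1/4 and p_i = 3/20 for the remaining faces.

```python
import math

def maxent_dice(target=0.25, feature=lambda i: 1.0 if i == 2 else 0.0):
    """Maximum-entropy distribution over a die with one expectation constraint.
    ME solution: p_i proportional to exp(lam * f(i)); solve for lam by bisection."""
    def dist(lam):
        weights = [math.exp(lam * feature(i)) for i in range(1, 7)]
        z = sum(weights)
        return [w / z for w in weights]

    def expectation(lam):
        return sum(p * feature(i) for i, p in enumerate(dist(lam), start=1))

    lo, hi = -20.0, 20.0            # the expectation is monotone in lam
    for _ in range(100):
        mid = (lo + hi) / 2
        if expectation(mid) < target:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

print([round(p, 3) for p in maxent_dice()])  # [0.15, 0.25, 0.15, 0.15, 0.15, 0.15]
```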
Maximum Entropy Language Modeling

Use the short-hand notation $x = (u, v)$, $y = w$. For words $u, v, w$, define a collection of binary features $f_1, f_2, \ldots, f_K$, e.g.
$$f_k(x, y) = \begin{cases} 1 & \text{if } x = \text{``John likes''} \text{ and } y = \text{``apples''} \\ 0 & \text{otherwise} \end{cases}$$

Obtain their target expectations $a_1, a_2, \ldots, a_K$ from the training data.

Find
$$P^* = \arg\max_{\{P \,:\, E_P[f_i] = a_i\}} H(P).$$

It can be shown that
$$P^* = \arg\max_{Q \in \mathcal{Q}} \prod_i Q(y_i \mid x_i), \qquad \mathcal{Q} = \Big\{ Q : Q(y \mid x) = \frac{\alpha_1^{f_1(x,y)} \alpha_2^{f_2(x,y)} \cdots \alpha_K^{f_K(x,y)}}{Z(x)} \Big\}.$$
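A minimal sketch of how a conditional ME model of this form assigns probabilities (illustrative names and toy features, not the thesis code): each candidate word y is scored by the weights of the features it activates, and Z(x) normalizes over the vocabulary.

```python
import math

def maxent_prob(x, y, vocab, features, weights):
    """Q(y|x) = exp(sum_k weights[k] * f_k(x, y)) / Z(x), with Z(x) summing over vocab."""
    def score(cand):
        return math.exp(sum(w * f(x, cand) for f, w in zip(features, weights)))
    return score(y) / sum(score(cand) for cand in vocab)

# Two toy binary features over history x = (w_{i-2}, w_{i-1}) and prediction y.
features = [
    lambda x, y: 1.0 if x == ("John", "likes") and y == "apples" else 0.0,  # trigram feature
    lambda x, y: 1.0 if y == "apples" else 0.0,                             # unigram feature
]
weights = [1.2, 0.4]   # the log-alphas / lambdas learned in training
vocab = ["apples", "oranges", "the"]
print(round(maxent_prob(("John", "likes"), "apples", vocab, features, weights), 3))
```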
Advantages and Disadvantage of Maximum Entropy Language Modeling
Advantages:
Creating a "smooth" model that satisfies all empirical constraints.
Incorporating various sources of information in a unified language model.

Disadvantage:
Computational complexity of the model parameter estimation procedure.
Training an ME Model

Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
Della Pietra, et al. 1996: Unigram Caching and Improved Iterative Scaling (IIS).
Wu and Khudanpur 2000: Hierarchical Training Methods.
For N-gram models and many other models, the training time per iteration is strictly bounded by $O(L)$, which is the same as that of training a back-off model.
A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
See Wu and Khudanpur ICSLP 2000 for details.
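For reference, a toy sketch of a GIS-style update (my own illustration, not the hierarchical training method of the thesis): each iteration moves every weight by (1/C) log(empirical expectation / model expectation). The usual correction feature that makes the feature sums exactly constant is omitted, so this is only an approximate GIS on toy data, and it assumes every feature fires at least once in the training set.

```python
import math

def gis_train(data, vocab, features, iters=200):
    """Approximate Generalized Iterative Scaling for a conditional ME model P(y|x).
    data: list of (x, y) training pairs; features: list of f_k(x, y) returning 0/1."""
    histories = [x for x, _ in data]
    C = max(sum(f(x, y) for f in features) for x in histories for y in vocab) or 1.0
    weights = [0.0] * len(features)

    def model_prob(x, y):
        def score(c):
            return math.exp(sum(w * f(x, c) for f, w in zip(features, weights)))
        return score(y) / sum(score(c) for c in vocab)

    # Empirical feature expectations from the training data.
    emp = [sum(f(x, y) for x, y in data) / len(data) for f in features]

    for _ in range(iters):
        # Model feature expectations under the current weights.
        mod = [sum(model_prob(x, y) * f(x, y) for x in histories for y in vocab) / len(data)
               for f in features]
        weights = [w if e == 0 or m == 0 else w + (1.0 / C) * math.log(e / m)
                   for w, e, m in zip(weights, emp, mod)]
    return weights

vocab = ["apples", "oranges"]
features = [lambda x, y: 1.0 if y == "apples" else 0.0]
data = [(("John", "likes"), "apples"), (("Mary", "likes"), "oranges"),
        (("John", "likes"), "apples")]
print(gis_train(data, vocab, features))  # weight pushes P(apples) toward 2/3
```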
Motivation for Exploiting Semantic and Syntactic Dependencies
N-gram models only take local correlation between words into account.
Several dependencies in natural language with longer and sentence-structure dependent spans may compensate for this deficiency.
Need a model that exploits topic and syntax.
Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.
Training a Topic-Sensitive Model

Cluster the training data by topic: TF-IDF vectors (excluding stop words), cosine similarity, K-means clustering.

Select topic-dependent words:
$$f_t(w) \log \frac{f_t(w)}{f(w)} > \text{threshold}$$

Estimate an ME model with topic unigram constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, topic) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(topic, w_i)}}{Z(w_{i-2}, w_{i-1}, topic)}$$
where
$$\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid topic) = \frac{\#[topic, w_i]}{\#[topic]}.$$
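A minimal sketch of the clustering step using scikit-learn (my own illustration of the TF-IDF / cosine-similarity / K-means recipe; the actual stop-word list, number of topics and implementation used in the thesis are not shown here).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_by_topic(documents, n_topics, seed=0):
    """Cluster documents into topics: TF-IDF vectors with English stop words removed,
    L2-normalized so that Euclidean k-means approximates cosine similarity."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    vectors = normalize(vectors)          # unit length -> cosine geometry
    km = KMeans(n_clusters=n_topics, random_state=seed, n_init=10)
    return km.fit_predict(vectors), km

docs = ["the court ruled on the contract dispute",
        "researchers found evidence that a virus spreads",
        "futures exchange revival in hong kong"]
labels, _ = cluster_by_topic(docs, n_topics=2)
print(labels)
```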
Recognition Using a Topic-Sensitive Model
Detect the current topic from the recognizer's N-best hypotheses vs. reference transcriptions.
Using N-best hypotheses causes little degradation (in perplexity and WER).
Assign a new topic for each conversation vs. each utterance.
Topic assignment for each utterance is better than topic assignment for the whole conversation.
See the Khudanpur and Wu ICASSP'99 paper and Florian and Yarowsky ACL'99 for details.
Performance of the Topic Model
The ME model with only N-gram constraints duplicates the performance of the corresponding back-off model.
The topic-dependent ME model reduces perplexity by 7% and WER by 0.7% absolute.
[Charts: perplexity and WER by model]
Model      PPL    WER
BO 3gram   79.0   38.5%
ME 3gram   79.0   38.3%
ME Topic   73.5   37.8%
Content Words vs. Stop Words
[Chart: WER on stop words vs. content words]
Model     Stop words   Content words
Trigram   37.6%        42.2%
Topic     37.0%        40.8%
1/5 of tokens in the test data are content-bearing words.
The WER of the baseline trigram model is relatively high for content words.
Topic dependencies are much more helpful in reducing WER of content words (1.4%) than they are for stop words (0.6%).
[Pie chart: 78% of test tokens are stop words, 22% are content words]
A Syntactic Parse and Syntactic Heads
The contract ended with a loss of 7 cents after ...
DT  NN       VBD   IN   DT NN  IN CD NNS   ...

[Parse tree with lexical head words percolated up: "contract" heads the subject NP, "ended" heads the VP and the sentence S', "with" and "loss" head the PP and NP "with a loss", and "of" and "cents" head the PP and NP "of 7 cents".]
Exploiting Syntactic Dependencies
A stack $S_i$ of parse trees $T_i$ for each sentence prefix is generated.
All sentences in the training set are parsed by a left-to-right parser.

[Diagram: a partial parse of the prefix "The contract ended with a loss of 7 cents after ...", showing the two preceding words w_{i-2}, w_{i-1}, the two exposed head words h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP), and their non-terminal labels nt_{i-2}, nt_{i-1}, which are used to predict w_i.]
Exploiting Syntactic Dependencies (Cont.)

A probability is assigned to each word as:
$$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_{i-1}, T_i)\, P(T_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})\, P(T_i \mid W_{i-1})$$
It is assumed that most of the useful information is embedded in the 2 preceding words and 2 preceding heads.
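A schematic sketch (my own, with hypothetical data structures) of how this sum is computed: each candidate partial parse on the stack contributes its conditional word probability, weighted by the parse's posterior P(T_i | W_{i-1}).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PartialParse:
    """One candidate parse T_i of the sentence prefix W_{i-1} (hypothetical structure)."""
    h1: str            # last exposed head word,    h_{i-1}
    h2: str            # second-to-last head word,  h_{i-2}
    nt1: str           # non-terminal label of h_{i-1}
    nt2: str           # non-terminal label of h_{i-2}
    posterior: float   # P(T_i | W_{i-1}), normalized over the stack

def syntactic_word_prob(word, w1, w2, stack: List[PartialParse],
                        cond_model: Callable[..., float]) -> float:
    """P(w_i | W_{i-1}) = sum_{T_i} P(w_i | w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2},
    nt_{i-1}, nt_{i-2}) * P(T_i | W_{i-1})."""
    return sum(cond_model(word, w1, w2, t.h1, t.h2, t.nt1, t.nt2) * t.posterior
               for t in stack)

# Toy conditional model and a two-parse stack, for illustration only.
def toy_cond_model(w, w1, w2, h1, h2, nt1, nt2):
    return 0.9 if (h1 == "ended" and w == "with") else 0.1

stack = [PartialParse("ended", "contract", "VP", "NP", 0.7),
         PartialParse("contract", "team", "NP", "NP", 0.3)]
print(syntactic_word_prob("with", "ended", "contract", stack, toy_cond_model))  # 0.66
```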
Training a Syntactic ME Model

Estimate an ME model with syntactic constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1},w_i)}\, e^{\lambda(w_{i-2},w_{i-1},w_i)}\, e^{\lambda(h_{i-1},w_i)}\, e^{\lambda(h_{i-2},h_{i-1},w_i)}\, e^{\lambda(nt_{i-1},w_i)}\, e^{\lambda(nt_{i-2},nt_{i-1},w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})}$$

where the marginal constraints are
$$\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}$$

See Chelba and Jelinek ACL'98 and Wu and Khudanpur ICASSP'00 for details.
Experimental Results of Syntactic LMs

[Charts: perplexity and WER by model]
Model    PPL    WER
3gram    79.0   38.5%
Syntax   74.0   37.5%
Non-terminal constraints and syntactic constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute compared to those of trigrams.
Head Words inside vs. outside 3gram Range
[Diagrams: two partial parses of "The contract ended with a loss of 7 cents after ...". In one, the exposed head words h_{i-2}, h_{i-1} coincide with the two preceding words, i.e. they lie inside trigram range; in the other, the exposed heads lie further back in the sentence, outside trigram range.]
Syntactic Heads inside vs. outside Trigram Range
[Chart: WER by position of the head words]
Model      Inside   Outside
Trigram    37.8%    40.3%
Syntactic  36.9%    38.9%

[Pie chart: 73% of head words fall inside trigram range, 27% outside]
1/4 of syntactic heads are outside trigram range.
The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.
Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.4%) than when they are within trigram range.
Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
Probabilities are assigned as:
$$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)\, P(T_i \mid W_{i-1})$$

Only marginal constraints are necessary. The ME composite model is trained as:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1},w_i)}\, e^{\lambda(w_{i-2},w_{i-1},w_i)}\, e^{\lambda(h_{i-1},w_i)}\, e^{\lambda(h_{i-2},h_{i-1},w_i)}\, e^{\lambda(nt_{i-1},w_i)}\, e^{\lambda(nt_{i-2},nt_{i-1},w_i)}\, e^{\lambda(topic,w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)}$$
Overall Experimental Results

[Charts: perplexity and WER by model]
Model    PPL    WER
3gram    79.0   38.5%
Topic    73.5   37.8%
Syntax   74.0   37.5%
Comp.    67.9   37.0%

Baseline trigram WER is 38.5%.
Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
Syntactic heads result in a 6% reduction in perplexity and a 1.0% absolute reduction in WER.
Topic-dependent constraints and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute.
The gains from topic and syntactic dependencies are nearly additive.
Content Words vs. Stop words
The topic sensitive model reduces WER by 1.4% on content words, which is twice as much as the overall improvement (0.7%).
The syntactic model improves WER on both content words and stop words evenly.
The composite model has the advantage of both models and reduces WER on content words more significantly (2.1%).
[Chart: WER on stop words vs. content words]
Model       Stop words   Content words
Trigram     37.6%        42.2%
Topic       37.0%        40.8%
Syntactic   36.3%        41.9%
Composite   36.2%        40.1%

[Pie chart: 78% of test tokens are stop words, 22% are content words]
Head Words inside vs. outside 3gram Range
The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
The topic model helps when the trigram is inappropriate.
The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.
The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are outside trigram range.
[Chart: WER by position of the head words]
Model       Inside   Outside
Trigram     37.8%    40.3%
Topic       37.3%    39.1%
Syntactic   36.9%    38.9%
Composite   36.5%    38.1%

[Pie chart: 73% of head words fall inside trigram range, 27% outside]
Nominal Speed-up

$$\text{Nominal speed-up} = \frac{\#(\text{operations in IIS})}{\#(\text{operations in the new method})}$$

The hierarchical training methods achieve a nominal speed-up of two orders of magnitude for Switchboard, and three orders of magnitude for Broadcast News.

[Charts (log scale): number of operations for the new method vs. IIS, for the 3gram, topic and composite models on Switchboard and for the 3gram and topic models on Broadcast News]
Real Speed-up

The real speed-up is 15-30 fold for the Switchboard task: 30 for the trigram model, 25 for the topic model, 15 for the composite model.
This simplification of the training procedure makes it practical to implement ME models for large corpora: 40 minutes for the trigram model, 2.3 hours for the topic model.

[Charts (log scale): training time for the new method vs. IIS, for the 3gram, topic and composite models on Switchboard and for the 3gram and topic models on Broadcast News]
More Experimental Results: Topic-Dependent Models for Broadcast News
ME models are created for the Broadcast News corpus (130M words).
The topic-dependent model reduces perplexity by 10% and WER by 0.6% absolute.
The ME method is an effective means of integrating topic-dependent and topic-independent constraints.
[Charts and table: perplexity, WER and model size by model]
Model           PPL   WER     Size
3gram (none)    174   34.6%   9.1M
+topic 1-gram   168   34.5%   +100 x 64K
+topic 2-gram   164   34.3%   +100 x 400K
+topic 3-gram   156   34.1%   +100 x 600K
ME              157   34.0%   +250K
Concluding Remarks
Non-local and syntactic dependencies have been successfully integrated with N-grams. Their benefit has been demonstrated in the speech recognition application.
Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER. (Eurospeech '99 best student paper award.)
Broadcast News: 10% reduction in PPL, 0.6% in WER. (Topic constraints only; syntactic constraints in progress.)
The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models. Nominal speed-up: 100-1000 fold. "Real" speed-up: 15+ fold.
A general-purpose toolkit for ME models is being developed for public release.
Acknowledgement

I thank my advisor, Sanjeev Khudanpur, who led me to this field and always gave me wise advice and help when necessary, and David Yarowsky, who gave generous help during my Ph.D. program.
I thank Radu Florian and David Yarowsky for their help on topic detection and data clustering, Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the SWBD experimental results reported here, and Shankar Kumar and Vlasios Doumpiotis for their help on generating N-best lists for the BN experiments.
I thank all people in the NLP lab and CLSP for their assistance in my thesis work.
This work was supported by the National Science Foundation through a STIMULATE grant (IRI-9618874).