Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Jun Wu
Advisor: Sanjeev Khudanpur

Department of Computer Science, Johns Hopkins University
Baltimore, MD 21218

April 2001
NSF STIMULATE Grant No. IRI-9618874
Outline
Language modeling in speech recognition
The maximum entropy (ME) principle
Semantic (topic) dependencies in natural language
Syntactic dependencies in natural language
ME models with topic and syntactic dependencies
Conclusion and future work
Topic assignment during test (15 min)
Role of syntactic head (15 min)
Training ME models in an efficient way (1 hour)
Motivation
Example: "A research team led by two Johns Hopkins scientists ___ found the strongest evidence yet that a virus may ..."
Candidate words for the blank: have / has / his.
Language Models in Speech Recognition
[Diagram: the source-channel model of speech recognition. The speaker's mind produces a word string W*, which passes through the speech producer and acoustic channel as speech A; the acoustic processor and linguistic decoder (together, the speech recognizer) output the hypothesis Ŵ.]
Role of language models
$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)} = \arg\max_W P(A \mid W)\,P(W) \approx \arg\max_W P_A(A \mid W)\,P_L(W)$$

where $P_A(A \mid W)$ is the acoustic model and $P_L(W)$ is the language model.
Language Modeling in Speech Recognition
N-gram models:
$$P(W) = P(w_1 w_2 \cdots w_m) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_m \mid w_1 \cdots w_{m-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1} \cdots w_{i-1})$$

In practice, N = 1, 2, 3 or 4. Even these values of N pose a data-sparseness problem: for $|V| \approx 20\mathrm{K}$, a trigram model has $|V|^3 \approx 8$ trillion free parameters. There are millions of unseen bigrams $(v, w)$ and billions of unseen trigrams $(u, v, w)$ for which we need an estimate of the probability $P_3(w \mid u, v)$.
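To make the sparseness point concrete, here is a minimal Python sketch (my own illustration, not from the talk) of a relative-frequency trigram estimator over a toy corpus; the function names are hypothetical. Any trigram that never occurs in training gets probability zero, which is why the smoothing techniques on the next slide are needed.

```python
from collections import defaultdict

def count_ngrams(tokens, n):
    """Count all n-grams in a token sequence."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def mle_trigram_prob(w, u, v, tri_counts, bi_counts):
    """Relative-frequency estimate f(w | u, v) = #(u, v, w) / #(u, v)."""
    if bi_counts[(u, v)] == 0:
        return 0.0          # unseen history: the MLE assigns zero probability
    return tri_counts[(u, v, w)] / bi_counts[(u, v)]

corpus = "a research team led by two scientists found the evidence".split()
tri, bi = count_ngrams(corpus, 3), count_ngrams(corpus, 2)
print(mle_trigram_prob("found", "two", "scientists", tri, bi))  # seen trigram -> 1.0
print(mle_trigram_prob("virus", "two", "scientists", tri, bi))  # unseen trigram -> 0.0
```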
Smoothing Techniques

Relative frequency estimate:
$$P(w \mid u,v) = f(w \mid u,v) = \frac{\#(u,v,w)}{\#(u,v)}$$

Deleted interpolation (Jelinek, et al. 1980):
$$P_3(w \mid u,v) = \lambda_3\, f(w \mid u,v) + \lambda_2\, f(w \mid v) + \lambda_1\, f(w) + \lambda_0$$

Back-off (Katz 1987; Witten-Bell 1990; Ney, et al. 1994):
$$P(w \mid u,v) = \begin{cases} f(w \mid u,v) & \text{if } \#(u,v,w) > T \\[4pt] \dfrac{\mathrm{discount}[\#(u,v,w)]}{\#(u,v)} & \text{if } 0 < \#(u,v,w) \le T \\[4pt] \alpha(u,v)\, P_2(w \mid v) & \text{if } \#(u,v,w) = 0 \end{cases}$$
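A minimal sketch of the deleted-interpolation formula above (illustrative only, not the thesis code); it reuses count dictionaries like those in the previous sketch, and the interpolation weights are fixed here, whereas in practice the lambdas are estimated on held-out data.

```python
def interpolated_trigram_prob(w, u, v, tri, bi, uni, vocab_size,
                              lambdas=(0.5, 0.3, 0.15, 0.05)):
    """Deleted interpolation:
    P(w|u,v) = l3*f(w|u,v) + l2*f(w|v) + l1*f(w) + l0*(1/|V|)."""
    l3, l2, l1, l0 = lambdas
    n_uv, n_uvw = bi.get((u, v), 0), tri.get((u, v, w), 0)
    n_v, n_vw = uni.get((v,), 0), bi.get((v, w), 0)
    total = sum(uni.values())
    f3 = n_uvw / n_uv if n_uv else 0.0               # trigram relative frequency
    f2 = n_vw / n_v if n_v else 0.0                  # bigram relative frequency
    f1 = uni.get((w,), 0) / total if total else 0.0  # unigram relative frequency
    return l3 * f3 + l2 * f2 + l1 * f1 + l0 / vocab_size
```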
Measuring the Quality of Language Models

Word error rate:
$$\mathrm{WER} = \frac{\#S + \#D + \#I}{\#C + \#S + \#D}$$

Reference:  The contract ended with a loss of *** seven cents.
Hypothesis: A contract ended with * loss of some even cents.
Scores: S C C C D C C I S C

Perplexity:
$$\mathrm{PPL} = 2^{H(P_L)}, \qquad H(P_L) = -\sum_{W} P(W) \log_2 P_L(W)$$

Perplexity measures the average number of words that can follow a given history under a language model.
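A minimal sketch of both metrics (my own illustration): WER from a word-level edit distance between reference and hypothesis, and perplexity from per-word log2-probabilities supplied by some language model.

```python
import math

def wer(ref, hyp):
    """Word error rate = (substitutions + deletions + insertions) / len(ref),
    computed with dynamic-programming edit distance over words."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def perplexity(word_log2_probs):
    """PPL = 2 ** (average negative log2 probability per word)."""
    return 2 ** (-sum(word_log2_probs) / len(word_log2_probs))

ref = "the contract ended with a loss of seven cents".split()
hyp = "a contract ended with loss of some even cents".split()
print(round(wer(ref, hyp), 3))            # 4 errors / 9 reference words ~= 0.444
print(perplexity([math.log2(1 / 79)] * 5))  # uniform 1/79 per word -> PPL 79.0
```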
Experimental Setup for Switchboard
American English conversations over the telephone.
Vocabulary: 22K (closed). LM training set: 1100 conversations, 2.1M words.
Test set: WS97 dev-test set; 19 conversations (2 hours), 18K words; PPL = 79 (back-off trigram model). State-of-the-art systems: 30-35% WER.
Evaluation: 100-best list rescoring.
[Diagram: speech is decoded by the speech recognizer with the baseline LM into a 100-best hypothesis list, which is rescored with the new LM to select 1 hypothesis.]
Experimental Setup for Broadcast News
American English television broadcast. Vocabulary: open (>100K). LM training set: 125K stories, 130M words.
Test set: Hub-4 '96 dev-test set; 21K words; PPL = 174 (back-off trigram model). State-of-the-art systems: 25% WER.
The evaluation is based on rescoring 100-best lists from the first-pass speech recognition.
The Maximum Entropy Principle

When we make inferences based on incomplete information, we should choose the probability distribution which has the maximum entropy permitted by the information we do have.

Example (dice): Let $p_i$, $i = 1, 2, \ldots, 6$, be the probability that the facet with $i$ dots faces up. Seek the model $P = (p_1, p_2, \ldots, p_6)$ that maximizes
$$H(P) = -\sum_i p_i \log p_i \quad \text{subject to} \quad \sum_{i=1}^{6} p_i = 1.$$
From the Lagrangian
$$L(P, \lambda) = -\sum_i p_i \log p_i + \lambda \Big( \sum_i p_i - 1 \Big), \qquad \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda = 0,$$
we get $p_1 = p_2 = \cdots = p_6 = e^{\lambda - 1}$. So, choose $p_1 = \cdots = p_6 = 1/6$.
The Maximum Entropy Principle (Cont.)

Example 2: Seek the probability distribution with the constraint $\hat{p}_2 = 1/4$ ($\hat{p}$ is the empirical distribution).

The feature:
$$f(i) = \begin{cases} 1 & \text{if } i = 2 \\ 0 & \text{otherwise} \end{cases}$$

Empirical expectation:
$$E_{\hat{P}}[f] = \sum_i \hat{p}(i) f(i) = \tfrac{1}{4}$$

Maximize $H(P) = -\sum_i p_i \log p_i$ subject to $E_P[f] = E_{\hat{P}}[f]$ and $\sum_i p_i = 1$. From the Lagrangian
$$L(P, \lambda) = -\sum_i p_i \log p_i + \lambda_1 \Big( \sum_i p_i - 1 \Big) + \lambda_2 \Big( \sum_i p_i f(i) - \tfrac{1}{4} \Big),$$
we get
$$p_2 = \tfrac{1}{4}, \qquad p_1 = p_3 = \cdots = p_6 = \tfrac{3}{20}.$$
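As a numerical sanity check on Example 2 (my own sketch, not from the talk): the ME solution has the exponential form p_i proportional to exp(lambda * f(i)), and a one-dimensional bisection on lambda that matches the constraint E_P[f] = 1/4 recovers p_2 = 1/4 and p_i = 3/20 for the remaining faces.

```python
import math

def maxent_dice(target=0.25, feature=lambda i: 1.0 if i == 2 else 0.0):
    """Maximum-entropy distribution over a die with one expectation constraint.
    ME solution: p_i proportional to exp(lam * f(i)); solve for lam by bisection."""
    def dist(lam):
        weights = [math.exp(lam * feature(i)) for i in range(1, 7)]
        z = sum(weights)
        return [w / z for w in weights]

    def expectation(lam):
        return sum(p * feature(i) for i, p in enumerate(dist(lam), start=1))

    lo, hi = -20.0, 20.0            # the expectation is monotone in lam
    for _ in range(100):
        mid = (lo + hi) / 2
        if expectation(mid) < target:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

print([round(p, 3) for p in maxent_dice()])  # [0.15, 0.25, 0.15, 0.15, 0.15, 0.15]
```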
Maximum Entropy Language Modeling

Use the short-hand notation $x = (u, v)$, $y = w$. For words $u, v, w$, define a collection of binary features $f_1, f_2, \ldots, f_K$, e.g.
$$f_k(x, y) = \begin{cases} 1 & \text{if } x = \text{``John likes''} \text{ and } y = \text{``apples''} \\ 0 & \text{otherwise} \end{cases}$$

Obtain their target expectations $a_1, a_2, \ldots, a_K$ from the training data.

Find
$$P^* = \arg\max_{\{P \,:\, E_P[f_i] = a_i\}} H(P).$$

It can be shown that
$$P^* = \arg\max_{Q \in \mathcal{Q}} \prod_i Q(y_i \mid x_i), \qquad \mathcal{Q} = \Big\{ Q : Q(y \mid x) = \frac{\alpha_1^{f_1(x,y)} \alpha_2^{f_2(x,y)} \cdots \alpha_K^{f_K(x,y)}}{Z(x)} \Big\}.$$
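A minimal sketch of how a conditional ME model of this form assigns probabilities (illustrative names and toy features, not the thesis code): each candidate word y is scored by the weights of the features it activates, and Z(x) normalizes over the vocabulary.

```python
import math

def maxent_prob(x, y, vocab, features, weights):
    """Q(y|x) = exp(sum_k weights[k] * f_k(x, y)) / Z(x), with Z(x) summing over vocab."""
    def score(cand):
        return math.exp(sum(w * f(x, cand) for f, w in zip(features, weights)))
    return score(y) / sum(score(cand) for cand in vocab)

# Two toy binary features over history x = (w_{i-2}, w_{i-1}) and prediction y.
features = [
    lambda x, y: 1.0 if x == ("John", "likes") and y == "apples" else 0.0,  # trigram feature
    lambda x, y: 1.0 if y == "apples" else 0.0,                             # unigram feature
]
weights = [1.2, 0.4]   # the log-alphas / lambdas learned in training
vocab = ["apples", "oranges", "the"]
print(round(maxent_prob(("John", "likes"), "apples", vocab, features, weights), 3))
```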
Advantages and Disadvantage of Maximum Entropy Language Modeling
Advantages:
Creating a "smooth" model that satisfies all empirical constraints.
Incorporating various sources of information in a unified language model.

Disadvantage:
Computational complexity of the model parameter estimation procedure.
Training an ME Model

Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
Della Pietra, et al. 1996: Unigram Caching and Improved Iterative Scaling (IIS).
Wu and Khudanpur 2000: Hierarchical Training Methods.
For N-gram models and many other models, the training time per iteration is strictly bounded by $O(L)$, which is the same as that of training a back-off model.
A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
See Wu and Khudanpur ICSLP 2000 for details.
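For reference, a toy sketch of a GIS-style update (my own illustration, not the hierarchical training method of the thesis): each iteration moves every weight by (1/C) log(empirical expectation / model expectation). The usual correction feature that makes the feature sums exactly constant is omitted, so this is only an approximate GIS on toy data, and it assumes every feature fires at least once in the training set.

```python
import math

def gis_train(data, vocab, features, iters=200):
    """Approximate Generalized Iterative Scaling for a conditional ME model P(y|x).
    data: list of (x, y) training pairs; features: list of f_k(x, y) returning 0/1."""
    histories = [x for x, _ in data]
    C = max(sum(f(x, y) for f in features) for x in histories for y in vocab) or 1.0
    weights = [0.0] * len(features)

    def model_prob(x, y):
        def score(c):
            return math.exp(sum(w * f(x, c) for f, w in zip(features, weights)))
        return score(y) / sum(score(c) for c in vocab)

    # Empirical feature expectations from the training data.
    emp = [sum(f(x, y) for x, y in data) / len(data) for f in features]

    for _ in range(iters):
        # Model feature expectations under the current weights.
        mod = [sum(model_prob(x, y) * f(x, y) for x in histories for y in vocab) / len(data)
               for f in features]
        weights = [w if e == 0 or m == 0 else w + (1.0 / C) * math.log(e / m)
                   for w, e, m in zip(weights, emp, mod)]
    return weights

vocab = ["apples", "oranges"]
features = [lambda x, y: 1.0 if y == "apples" else 0.0]
data = [(("John", "likes"), "apples"), (("Mary", "likes"), "oranges"),
        (("John", "likes"), "apples")]
print(gis_train(data, vocab, features))  # weight pushes P(apples) toward 2/3
```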
Motivation for Exploiting Semantic and Syntactic Dependencies
N-gram models only take local correlation between words into account.
Several dependencies in natural language with longer and sentence-structure dependent spans may compensate for this deficiency.
Need a model that exploits topic and syntax.
Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange.
Training a Topic-Sensitive Model

Cluster the training data by topic: TF-IDF vectors (excluding stop words), cosine similarity, K-means clustering.

Select topic-dependent words:
$$f_t(w) \log \frac{f_t(w)}{f(w)} > \text{threshold}$$

Estimate an ME model with topic unigram constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, topic) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1}, w_i)}\, e^{\lambda(w_{i-2}, w_{i-1}, w_i)}\, e^{\lambda(topic, w_i)}}{Z(w_{i-2}, w_{i-1}, topic)}$$
where
$$\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid topic) = \frac{\#[topic, w_i]}{\#[topic]}.$$
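A minimal sketch of the clustering step using scikit-learn (my own illustration of the TF-IDF / cosine-similarity / K-means recipe; the actual stop-word list, number of topics and implementation used in the thesis are not shown here).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_by_topic(documents, n_topics, seed=0):
    """Cluster documents into topics: TF-IDF vectors with English stop words removed,
    L2-normalized so that Euclidean k-means approximates cosine similarity."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    vectors = normalize(vectors)          # unit length -> cosine geometry
    km = KMeans(n_clusters=n_topics, random_state=seed, n_init=10)
    return km.fit_predict(vectors), km

docs = ["the court ruled on the contract dispute",
        "researchers found evidence that a virus spreads",
        "futures exchange revival in hong kong"]
labels, _ = cluster_by_topic(docs, n_topics=2)
print(labels)
```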
Recognition Using a Topic-Sensitive Model
Detect the current topic from the recognizer's N-best hypotheses vs. reference transcriptions.
Using N-best hypotheses causes little degradation (in perplexity and WER).
Assign a new topic for each conversation vs. each utterance.
Topic assignment for each utterance is better than topic assignment for the whole conversation.
See the Khudanpur and Wu ICASSP'99 paper and Florian and Yarowsky ACL'99 for details.
Performance of the Topic Model
The ME model with only N-gram constraints duplicates the performance of the corresponding back-off model.
The topic-dependent ME model reduces perplexity by 7% and WER by 0.7% absolute.
[Charts: perplexity and WER by model]
Model      PPL    WER
BO 3gram   79.0   38.5%
ME 3gram   79.0   38.3%
ME Topic   73.5   37.8%
Content Words vs. Stop Words
[Chart: WER on stop words vs. content words]
Model     Stop words   Content words
Trigram   37.6%        42.2%
Topic     37.0%        40.8%
1/5 of tokens in the test data are content-bearing words.
The WER of the baseline trigram model is relatively high for content words.
Topic dependencies are much more helpful in reducing WER of content words (1.4%) than they are for stop words (0.6%).
[Pie chart: 78% of test tokens are stop words, 22% are content words]
A Syntactic Parse and Syntactic Heads
The contract ended with a loss of 7 cents after ...
DT  NN       VBD   IN   DT NN  IN CD NNS   ...

[Parse tree with lexical head words percolated up: "contract" heads the subject NP, "ended" heads the VP and the sentence S', "with" and "loss" head the PP and NP "with a loss", and "of" and "cents" head the PP and NP "of 7 cents".]
Exploiting Syntactic Dependencies
A stack $S_i$ of parse trees $T_i$ for each sentence prefix is generated.
All sentences in the training set are parsed by a left-to-right parser.

[Diagram: a partial parse of the prefix "The contract ended with a loss of 7 cents after ...", showing the two preceding words w_{i-2}, w_{i-1}, the two exposed head words h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP), and their non-terminal labels nt_{i-2}, nt_{i-1}, which are used to predict w_i.]
Exploiting Syntactic Dependencies (Cont.)

A probability is assigned to each word as:
$$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_{i-1}, T_i)\, P(T_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})\, P(T_i \mid W_{i-1})$$
It is assumed that most of the useful information is embedded in the 2 preceding words and 2 preceding heads.
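A schematic sketch (my own, with hypothetical data structures) of how this sum is computed: each candidate partial parse on the stack contributes its conditional word probability, weighted by the parse's posterior P(T_i | W_{i-1}).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PartialParse:
    """One candidate parse T_i of the sentence prefix W_{i-1} (hypothetical structure)."""
    h1: str            # last exposed head word,    h_{i-1}
    h2: str            # second-to-last head word,  h_{i-2}
    nt1: str           # non-terminal label of h_{i-1}
    nt2: str           # non-terminal label of h_{i-2}
    posterior: float   # P(T_i | W_{i-1}), normalized over the stack

def syntactic_word_prob(word, w1, w2, stack: List[PartialParse],
                        cond_model: Callable[..., float]) -> float:
    """P(w_i | W_{i-1}) = sum_{T_i} P(w_i | w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2},
    nt_{i-1}, nt_{i-2}) * P(T_i | W_{i-1})."""
    return sum(cond_model(word, w1, w2, t.h1, t.h2, t.nt1, t.nt2) * t.posterior
               for t in stack)

# Toy conditional model and a two-parse stack, for illustration only.
def toy_cond_model(w, w1, w2, h1, h2, nt1, nt2):
    return 0.9 if (h1 == "ended" and w == "with") else 0.1

stack = [PartialParse("ended", "contract", "VP", "NP", 0.7),
         PartialParse("contract", "team", "NP", "NP", 0.3)]
print(syntactic_word_prob("with", "ended", "contract", stack, toy_cond_model))  # 0.66
```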
Training a Syntactic ME Model

Estimate an ME model with syntactic constraints:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1},w_i)}\, e^{\lambda(w_{i-2},w_{i-1},w_i)}\, e^{\lambda(h_{i-1},w_i)}\, e^{\lambda(h_{i-2},h_{i-1},w_i)}\, e^{\lambda(nt_{i-1},w_i)}\, e^{\lambda(nt_{i-2},nt_{i-1},w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})}$$

where the marginal constraints are
$$\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}, w_i \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]}$$
$$\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, w_i \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}$$

See Chelba and Jelinek ACL'98 and Wu and Khudanpur ICASSP'00 for details.
Experimental Results of Syntactic LMs

[Charts: perplexity and WER by model]
Model    PPL    WER
3gram    79.0   38.5%
Syntax   74.0   37.5%
Non-terminal constraints and syntactic constraints together reduce the perplexity by 6.3% and WER by 1.0% absolute compared to those of trigrams.
Head Words inside vs. outside 3gram Range
[Diagrams: two partial parses of "The contract ended with a loss of 7 cents after ...". In one, the exposed head words h_{i-2}, h_{i-1} coincide with the two preceding words, i.e. they lie inside trigram range; in the other, the exposed heads lie further back in the sentence, outside trigram range.]
Syntactic Heads inside vs. outside Trigram Range
[Chart: WER by position of the head words]
Model      Inside   Outside
Trigram    37.8%    40.3%
Syntactic  36.9%    38.9%

[Pie chart: 73% of head words fall inside trigram range, 27% outside]
1/4 of syntactic heads are outside trigram range.
The WER of the baseline trigram model is relatively high when syntactic heads are beyond trigram range.
Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.4%) than when they are within trigram range.
Combining Topic, Syntactic and N-gram Dependencies in an ME Framework
Probabilities are assigned as:
$$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)\, P(T_i \mid W_{i-1})$$

Only marginal constraints are necessary. The ME composite model is trained as:
$$P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic) = \frac{e^{\lambda(w_i)}\, e^{\lambda(w_{i-1},w_i)}\, e^{\lambda(w_{i-2},w_{i-1},w_i)}\, e^{\lambda(h_{i-1},w_i)}\, e^{\lambda(h_{i-2},h_{i-1},w_i)}\, e^{\lambda(nt_{i-1},w_i)}\, e^{\lambda(nt_{i-2},nt_{i-1},w_i)}\, e^{\lambda(topic,w_i)}}{Z(w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}, topic)}$$
Overall Experimental Results

[Charts: perplexity and WER by model]
Model    PPL    WER
3gram    79.0   38.5%
Topic    73.5   37.8%
Syntax   74.0   37.5%
Comp.    67.9   37.0%

Baseline trigram WER is 38.5%.
Topic-dependent constraints alone reduce perplexity by 7% and WER by 0.7% absolute.
Syntactic heads result in a 6% reduction in perplexity and a 1.0% absolute reduction in WER.
Topic-dependent constraints and syntactic constraints together reduce perplexity by 13% and WER by 1.5% absolute.
The gains from topic and syntactic dependencies are nearly additive.
Content Words vs. Stop words
The topic sensitive model reduces WER by 1.4% on content words, which is twice as much as the overall improvement (0.7%).
The syntactic model improves WER on both content words and stop words evenly.
The composite model has the advantage of both models and reduces WER on content words more significantly (2.1%).
[Chart: WER on stop words vs. content words]
Model       Stop words   Content words
Trigram     37.6%        42.2%
Topic       37.0%        40.8%
Syntactic   36.3%        41.9%
Composite   36.2%        40.1%

[Pie chart: 78% of test tokens are stop words, 22% are content words]
Head Words inside vs. outside 3gram Range
The WER of the baseline trigram model is relatively high when head words are beyond trigram range.
The topic model helps when the trigram is inappropriate.
The WER reduction for the syntactic model (1.4%) is more than the overall reduction (1.0%) when head words are outside trigram range.
The WER reduction for the composite model (2.2%) is more than the overall reduction (1.5%) when head words are outside trigram range.
[Chart: WER by position of the head words]
Model       Inside   Outside
Trigram     37.8%    40.3%
Topic       37.3%    39.1%
Syntactic   36.9%    38.9%
Composite   36.5%    38.1%

[Pie chart: 73% of head words fall inside trigram range, 27% outside]
Nominal Speed-up

$$\text{Nominal speed-up} = \frac{\#(\text{operations in IIS})}{\#(\text{operations in the new method})}$$

The hierarchical training methods achieve a nominal speed-up of two orders of magnitude for Switchboard, and three orders of magnitude for Broadcast News.

[Charts (log scale): number of operations for the new method vs. IIS, for the 3gram, topic and composite models on Switchboard and for the 3gram and topic models on Broadcast News]
Real Speed-up

The real speed-up is 15-30 fold for the Switchboard task: 30 for the trigram model, 25 for the topic model, 15 for the composite model.
This simplification of the training procedure makes it practical to implement ME models for large corpora: 40 minutes for the trigram model, 2.3 hours for the topic model.

[Charts (log scale): training time for the new method vs. IIS, for the 3gram, topic and composite models on Switchboard and for the 3gram and topic models on Broadcast News]
More Experimental Results: Topic-Dependent Models for Broadcast News
ME models are created for the Broadcast News corpus (130M words).
The topic-dependent model reduces perplexity by 10% and WER by 0.6% absolute.
The ME method is an effective means of integrating topic-dependent and topic-independent constraints.
[Charts and table: perplexity, WER and model size by model]
Model           PPL   WER     Size
3gram (none)    174   34.6%   9.1M
+topic 1-gram   168   34.5%   +100 x 64K
+topic 2-gram   164   34.3%   +100 x 400K
+topic 3-gram   156   34.1%   +100 x 600K
ME              157   34.0%   +250K
Concluding Remarks
Non-local and syntactic dependencies have been successfully integrated with N-grams. Their benefit has been demonstrated in the speech recognition application.
Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER. (Eurospeech '99 best student paper award.)
Broadcast News: 10% reduction in PPL, 0.6% in WER. (Topic constraints only; syntactic constraints in progress.)
The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models. Nominal speed-up: 100-1000 fold. "Real" speed-up: 15+ fold.
A general-purpose toolkit for ME models is being developed for public release.
Acknowledgement

I thank my advisor, Sanjeev Khudanpur, who led me to this field and always gave me wise advice and help when necessary, and David Yarowsky, who gave generous help during my Ph.D. program.
I thank Radu Florian and David Yarowsky for their help on topic detection and data clustering, Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the SWBD experimental results reported here, and Shankar Kumar and Vlasios Doumpiotis for their help on generating N-best lists for the BN experiments.
I thank all people in the NLP lab and CLSP for their assistance in my thesis work.
This work was supported by the National Science Foundation through a STIMULATE grant (IRI-9618874).