
Maximum Entropy Language Modeling with Syntactic, Semantic and Collocational Dependencies

Jun Wu
Advisor: Sanjeev Khudanpur

Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218

April 2001
NSF STIMULATE Grant No. IRI-9618874

Outline

Language modeling in speech recognition
The maximum entropy (ME) principle
Semantic (topic) dependencies in natural language
Syntactic dependencies in natural language
ME models with topic and syntactic dependencies
Conclusion and future work
Topic assignment during test (15 min)
Role of the syntactic head (15 min)
Training ME models in an efficient way (1 hour)



Motivation

Example: "A research team led by two Johns Hopkins scientists ___ found the strongest evidence yet that a virus may ..."

The candidate words for the blank are "have", "has" and "his". Choosing the correct one, "has", requires agreement with the head word "team", which lies outside the trigram window.

Language Models in Speech Recognition

[Block diagram: the Speaker's Mind drives a Speech Producer, which sends speech A through the Acoustic Channel; the Speech Recognizer's Acoustic Processor and Linguistic Decoder map A to the hypothesis Ŵ for the intended word string W*.]

Role of language models:

\hat{W} = \arg\max_W P(W \mid A)
        = \arg\max_W \frac{P(A \mid W) P(W)}{P(A)}
        = \arg\max_W P_A(A \mid W) \, P_L(W)

Language Modeling in Speech Recognition

N-gram models:

P(W) = P(w_1 w_2 \cdots w_m)
     = P(w_1) P(w_2 \mid w_1) \cdots P(w_m \mid w_1 \cdots w_{m-1})
     \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})

In practice N = 1, 2, 3 or 4, and even these values of N pose a data sparseness problem. For |V| ~ 20K, a trigram model P_3(w \mid u, v) has |V|^3 ~ 8 trillion free parameters. There are millions of unseen bigrams (v, w) and billions of unseen trigrams (u, v, w) for which we need an estimate of the probability.
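To make the trigram factorization concrete, here is a minimal sketch that counts trigrams in a toy corpus and scores a sentence with the chain-rule approximation. It is an illustration only (the toy corpus, padding symbols and function names are assumptions, not part of the thesis); unsmoothed relative frequencies assign zero probability to unseen trigrams, which is exactly what the smoothing techniques on the next slide address.

    from collections import defaultdict
    from math import log, exp

    BOS, EOS = "<s>", "</s>"

    def count_ngrams(sentences):
        """Collect trigram and bigram-history counts with sentence padding."""
        tri, bi = defaultdict(int), defaultdict(int)
        for words in sentences:
            padded = [BOS, BOS] + words + [EOS]
            for u, v, w in zip(padded, padded[1:], padded[2:]):
                tri[(u, v, w)] += 1
                bi[(u, v)] += 1
        return tri, bi

    def trigram_prob(tri, bi, u, v, w):
        """Maximum-likelihood estimate f(w | u, v) = #(u, v, w) / #(u, v)."""
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    def sentence_logprob(tri, bi, words):
        """Chain rule with the trigram approximation P(w_i | w_{i-2}, w_{i-1})."""
        padded = [BOS, BOS] + words + [EOS]
        total = 0.0
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            p = trigram_prob(tri, bi, u, v, w)
            total += log(p) if p > 0 else float("-inf")  # unseen trigram -> zero probability
        return total

    corpus = [["the", "contract", "ended"],
              ["the", "contract", "ended", "with", "a", "loss"]]
    tri, bi = count_ngrams(corpus)
    print(exp(sentence_logprob(tri, bi, ["the", "contract", "ended"])))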

Smoothing Techniques

Relative frequency estimates:

P_3(w \mid u, v) = f(w \mid u, v) = \frac{\#(u, v, w)}{\#(u, v)}

Deleted interpolation (Jelinek, et al. 1980):

P_3(w \mid u, v) = \lambda_3 f(w \mid u, v) + \lambda_2 f(w \mid v) + \lambda_1 f(w) + \lambda_0 \frac{1}{|V|}

Back-off (Katz 1987, Witten-Bell 1990, Ney, et al. 1994):

P_3(w \mid u, v) =
  \begin{cases}
    f(w \mid u, v) & \text{if } \#(u, v, w) > T, \\
    \frac{\text{discount}[\#(u, v, w)]}{\#(u, v)} & \text{if } 0 < \#(u, v, w) \le T, \\
    \alpha(u, v) \, P_2(w \mid v) & \text{if } \#(u, v, w) = 0.
  \end{cases}
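As an illustration of the deleted-interpolation formula above, the following sketch mixes trigram, bigram, unigram and uniform relative frequencies with fixed example weights. The toy token stream, weight values and vocabulary size are assumptions for illustration; in practice the lambda weights are estimated on held-out data.

    from collections import Counter

    def relative_freqs(tokens):
        """Raw relative-frequency estimates f(w), f(w|v), f(w|u,v) from one token stream."""
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        return uni, bi, tri

    def interpolated_prob(w, u, v, uni, bi, tri, vocab_size,
                          lambdas=(0.5, 0.3, 0.15, 0.05)):
        """P(w|u,v) = l3*f(w|u,v) + l2*f(w|v) + l1*f(w) + l0/|V|, with example weights."""
        l3, l2, l1, l0 = lambdas
        f3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        f2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
        f1 = uni[w] / sum(uni.values())
        return l3 * f3 + l2 * f2 + l1 * f1 + l0 / vocab_size

    tokens = "the contract ended with a loss of seven cents".split()
    uni, bi, tri = relative_freqs(tokens)
    print(interpolated_prob("ended", "the", "contract", uni, bi, tri, vocab_size=20000))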

Measuring the Quality of Language Models

Word Error Rate:

Reference:  The contract ended with a loss of *** seven cents.
Hypothesis: A   contract ended with *  loss of some even  cents.
Scores:     S   C        C     C    D  C    C  I    S     C

WER = \frac{\#S + \#D + \#I}{\#C + \#S + \#D}

Perplexity:

PPL = 2^{H(P_L)}, \quad H(P_L) = -\sum_W P_L(W) \log_2 P_L(W)

Perplexity measures the average number of words that can follow a given history under a language model.
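The WER in the example can be computed mechanically with a Levenshtein alignment between reference and hypothesis. Here is a small illustrative sketch (not the scoring tool used in the experiments):

    def word_error_rate(reference, hypothesis):
        """WER = (#S + #D + #I) / len(reference), via dynamic-programming edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / correct
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    ref = "the contract ended with a loss of seven cents"
    hyp = "a contract ended with loss of some even cents"
    print(round(word_error_rate(ref, hyp), 3))   # 4 errors / 9 reference words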


Experimental Setup for Switchboard

American English conversations over the telephone.
Vocabulary: 22K (closed).
LM training set: 1100 conversations, 2.1M words.
Test set: WS97 dev-test set; 19 conversations (2 hours), 18K words; PPL = 79 (back-off trigram model); state-of-the-art systems: 30-35% WER.
Evaluation: 100-best list rescoring.

[Diagram: Speech goes into the Speech Recognizer (baseline LM), which outputs the 100 best hypotheses; rescoring with the new LM selects 1 hypothesis.]

Experimental Setup for Broadcast News

American English television broadcasts.
Vocabulary: open (>100K).
LM training set: 125K stories, 130M words.
Test set: Hub-4 96 dev-test set; 21K words; PPL = 174 (back-off trigram model); state-of-the-art systems: 25% WER.
Evaluation: rescoring 100-best lists from the first-pass speech recognition.

The Maximum Entropy Principle

The maximum entropy (ME) principle: when we make inferences based on incomplete information, we should choose the probability distribution which has the maximum entropy permitted by the information we do have.

Example (dice): Let p_i, i = 1, ..., 6, be the probability that the facet with i dots faces up. Seek the model P = (p_1, p_2, ..., p_6) that maximizes

H(P) = -\sum_i p_i \log p_i.

From the Lagrangian

L(P, \lambda) = -\sum_i p_i \log p_i + \lambda \left( \sum_i p_i - 1 \right),

\frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda = 0, so p_1 = p_2 = \cdots = p_6 = e^{\lambda - 1}.

Since \sum_{i=1}^{6} p_i = 1, choose p_1 = p_2 = \cdots = p_6 = 1/6.

The Maximum Entropy Principle (Cont.)

Example 2: Seek the probability distribution with the constraint \hat{p}_2 = 1/4, where \hat{p} is the empirical distribution.

The feature: f(i) = 1 if i = 2, and 0 otherwise.

Empirical expectation: E_{\hat{P}}[f] = \sum_i \hat{p}(i) f(i) = 1/4.

Maximize H(P) = -\sum_i p_i \log p_i subject to E_P[f] = E_{\hat{P}}[f].

From the Lagrangian

L(P, \lambda_1, \lambda_2) = -\sum_i p_i \log p_i + \lambda_1 \left( \sum_i p_i - 1 \right) + \lambda_2 \left( \sum_i p_i f(i) - \tfrac{1}{4} \right),

the solution is p_2 = 1/4 and p_1 = p_3 = \cdots = p_6 = 3/20.
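As a numerical sanity check of this worked example (my own sketch, not from the talk), directly maximizing the entropy under the two constraints with scipy recovers the same answer: the constrained outcome gets 1/4 and the remaining mass spreads uniformly as 3/20 over the other five outcomes.

    import numpy as np
    from scipy.optimize import minimize

    def neg_entropy(p):
        """Negative entropy -H(P); minimizing it maximizes H(P)."""
        return float(np.sum(p * np.log(p)))

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # probabilities sum to one
        {"type": "eq", "fun": lambda p: p[1] - 0.25},      # E[f] = p_2 = 1/4
    ]
    x0 = np.full(6, 1.0 / 6.0)
    result = minimize(neg_entropy, x0, bounds=[(1e-9, 1.0)] * 6,
                      constraints=constraints, method="SLSQP")
    print(np.round(result.x, 4))   # approximately [0.15, 0.25, 0.15, 0.15, 0.15, 0.15]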

Maximum Entropy Language Modeling

Use the short-hand notation x \equiv (u, v), y \equiv w.

For words u, v, w, define a collection of binary features f_1, f_2, ..., f_K, e.g.

f_k(x, y) = 1 if x = "John likes" and y = "apples", and 0 otherwise.

Obtain their target expectations a_1, a_2, ..., a_K from the training data.

Find

P^* = \arg\max_P \{ H(P) : E_P[f_i] = a_i \}.

It can be shown that

P^* = \arg\max_{Q \in \mathcal{Q}} \prod_n Q(y_n \mid x_n), \quad \mathcal{Q} = \left\{ Q : Q(y \mid x) = \frac{\alpha_1^{f_1(x,y)} \alpha_2^{f_2(x,y)} \cdots \alpha_K^{f_K(x,y)}}{Z(x)} \right\},

i.e. the ME model coincides with the maximum likelihood model within this exponential family.
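To make the exponential form concrete, here is a tiny illustrative sketch that evaluates Q(y | x) = (1/Z(x)) prod_k alpha_k^{f_k(x,y)} over a toy vocabulary; the vocabulary, features and alpha values are invented for illustration and are not from the thesis.

    from math import prod   # Python 3.8+

    VOCAB = ["apples", "oranges", "beer"]

    # Each feature is a predicate on (x, y); alphas are the corresponding multipliers.
    FEATURES = [
        (lambda x, y: x == ("John", "likes") and y == "apples", 2.5),
        (lambda x, y: y == "beer", 0.5),
    ]

    def unnormalized(x, y):
        """Product of alpha_k over the features that fire on (x, y)."""
        return prod(alpha for feat, alpha in FEATURES if feat(x, y))

    def q(y, x):
        """Q(y | x) = unnormalized(x, y) / Z(x), with Z(x) summing over the vocabulary."""
        z = sum(unnormalized(x, w) for w in VOCAB)
        return unnormalized(x, y) / z

    x = ("John", "likes")
    print({w: round(q(w, x), 3) for w in VOCAB})   # "apples" gets the largest share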

Advantages and Disadvantage of Maximum Entropy Language Modeling

Advantages:
It creates a "smooth" model that satisfies all empirical constraints.
It incorporates various sources of information in a unified language model.

Disadvantage:
The computational complexity of the model parameter estimation procedure.

Training an ME Model

Darroch and Ratcliff 1972: Generalized Iterative Scaling (GIS).
Della Pietra, et al. 1996: Unigram Caching and Improved Iterative Scaling (IIS).
Wu and Khudanpur 2000: Hierarchical Training Methods.
For N-gram models and many other models, the training time per iteration is strictly bounded by O(L), which is the same as that of training a back-off model.
A real running-time speed-up of one to two orders of magnitude is achieved compared to IIS.
See Wu and Khudanpur ICSLP2000 for details.
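For reference, a toy GIS loop over a fully enumerable event space might look like the sketch below. This is my own illustration of the Darroch-Ratcliff update, not the hierarchical trainer of the thesis: each iteration moves lambda_i by (1/C) log(empirical expectation / model expectation), where C is a constant bound on the total feature count per event (enforced here with a slack feature).

    import math

    EVENTS = ["a", "b", "c", "d"]
    FEATURES = [lambda y: 1.0 if y in ("a", "b") else 0.0,
                lambda y: 1.0 if y == "a" else 0.0]
    TARGET = [0.7, 0.5]          # desired expectations E_emp[f_i] (toy values)
    C = 2.0                      # maximum total feature count per event

    def slack(y):
        """Pads every event so the total feature count equals C (GIS requirement)."""
        return C - sum(f(y) for f in FEATURES)

    def model_probs(lambdas, slack_lambda):
        scores = [math.exp(sum(l * f(y) for l, f in zip(lambdas, FEATURES))
                           + slack_lambda * slack(y)) for y in EVENTS]
        z = sum(scores)
        return [s / z for s in scores]

    lambdas, slack_lambda = [0.0, 0.0], 0.0
    slack_target = C - sum(TARGET)          # empirical expectation of the slack feature
    for _ in range(200):
        p = model_probs(lambdas, slack_lambda)
        model_exp = [sum(p[i] * f(y) for i, y in enumerate(EVENTS)) for f in FEATURES]
        slack_exp = sum(p[i] * slack(y) for i, y in enumerate(EVENTS))
        lambdas = [l + math.log(t / e) / C for l, t, e in zip(lambdas, TARGET, model_exp)]
        slack_lambda += math.log(slack_target / slack_exp) / C

    final = model_probs(lambdas, slack_lambda)
    print([round(sum(final[i] * f(y) for i, y in enumerate(EVENTS)), 3)
           for f in FEATURES])   # approaches the targets [0.7, 0.5]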

Motivation for Exploiting Semantic and Syntactic Dependencies

N-gram models take only local correlations between words into account.
Several dependencies in natural language have longer, sentence-structure-dependent spans and may compensate for this deficiency.
We need a model that exploits topic and syntax.

Example: "Analysts and financial officials in the former British colony consider the contract essential to the revival of the Hong Kong futures exchange."


Training a Topic-Sensitive Model

Cluster the training data by topic:
TF-IDF vectors (excluding stop words).
Cosine similarity.
K-means clustering.

Select topic-dependent words:

\sum_t f_t(w) \log \frac{f_t(w)}{f(w)} > \text{threshold}

Estimate an ME model with topic unigram constraints:

P(w_i \mid w_{i-1}, w_{i-2}, topic) = \frac{e^{\lambda(w_i)} \, e^{\lambda(w_{i-1}, w_i)} \, e^{\lambda(w_{i-2}, w_{i-1}, w_i)} \, e^{\lambda(topic, w_i)}}{Z(w_{i-1}, w_{i-2}, topic)}

where the topic unigram target expectations are

\sum_{w_{i-2}, w_{i-1}} P(w_{i-2}, w_{i-1}, w_i \mid topic) = \frac{\#[w_i, topic]}{\#[topic]}.
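A minimal sketch of the clustering step, assuming scikit-learn is available (the actual system need not have used it): L2-normalized TF-IDF vectors make Euclidean k-means behave approximately like cosine-similarity clustering, and the resulting labels define the topics.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import normalize

    def cluster_by_topic(documents, n_topics=5, seed=0):
        """Cluster training documents into topics with TF-IDF vectors and k-means."""
        vectorizer = TfidfVectorizer(stop_words="english")   # exclude stop words
        X = normalize(vectorizer.fit_transform(documents))   # unit-length rows ~ cosine
        km = KMeans(n_clusters=n_topics, n_init=10, random_state=seed).fit(X)
        return vectorizer, km                                 # km.labels_ holds topic ids

    docs = ["the futures exchange revival in hong kong",
            "a virus found by the research team",
            "the contract ended with a loss of seven cents"]
    vectorizer, km = cluster_by_topic(docs, n_topics=2)
    print(km.labels_)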

Recognition Using a Topic-Sensitive Model

Detect the current topic from the recognizer's N-best hypotheses vs. the reference transcriptions.
Using N-best hypotheses causes little degradation (in perplexity and WER).

Assign a new topic for each conversation vs. each utterance.
Topic assignment for each utterance is better than topic assignment for the whole conversation.

See the Khudanpur and Wu ICASSP'99 paper and Florian and Yarowsky ACL'99 for details.
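Continuing the hypothetical clustering sketch above, topic assignment at test time can be sketched as picking the centroid most similar (by cosine) to the TF-IDF vector of the recognizer's N-best text for the utterance or conversation. This is an illustration, not the exact procedure of the ICASSP'99 paper, and it reuses the vectorizer and km objects from the previous sketch.

    import numpy as np

    def assign_topic(hypothesis_text, vectorizer, km):
        """Return the topic whose k-means centroid is most similar to the N-best text."""
        v = vectorizer.transform([hypothesis_text]).toarray()[0]
        v = v / (np.linalg.norm(v) + 1e-12)
        centroids = km.cluster_centers_
        sims = centroids @ v / (np.linalg.norm(centroids, axis=1) + 1e-12)
        return int(np.argmax(sims))

    nbest_text = "the team found evidence that a virus may have caused it"
    print(assign_topic(nbest_text, vectorizer, km))   # topic id from the clustering above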

Performance of the Topic Model

Model       PPL    WER
BO 3-gram   79.0   38.5%
ME 3-gram   79.0   38.3%
ME Topic    73.5   37.8%

The ME model with only N-gram constraints duplicates the performance of the corresponding back-off model.
The topic-dependent ME model reduces the perplexity by 7% and the WER by 0.7% absolute.

Content Words vs. Stop Words

WER        Stop words   Content words
Trigram    37.6%        42.2%
Topic      37.0%        40.8%

About 1/5 of the tokens in the test data are content-bearing words (78% stop words, 22% content words).
The WER of the baseline trigram model is relatively high for content words.
Topic dependencies are much more helpful in reducing the WER of content words (1.4% absolute) than that of stop words (0.6%).

A Syntactic Parse and Syntactic Heads

The contract ended with a  loss of 7  cents after ...
DT  NN       VBD   IN   DT NN   IN CD NNS   ...

[Parse tree: "contract" heads the NP "The contract"; "ended" heads the VP and the sentence S'; "with" heads the PP whose NP object "a loss of 7 cents" is headed by "loss"; inside it, "of" heads the PP "of 7 cents", whose NP is headed by "cents".]

Exploiting Syntactic Dependencies

All sentences in the training set are parsed by a left-to-right parser.
A stack of parse trees T_i is generated for each sentence prefix.

[Figure: partial parse of "The contract ended with a loss of 7 cents after ...". When predicting w_i = "after", the two preceding words are w_{i-2} = "7" and w_{i-1} = "cents", while the two exposed head words are h_{i-2} = "contract" (NP) and h_{i-1} = "ended" (VP); nt_{i-2} and nt_{i-1} are the corresponding non-terminal labels.]

Exploiting Syntactic Dependencies (Cont.)

A probability is assigned to each word as:

P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_{i-1}, T_i) \, P(T_i \mid W_{i-1})
                    = \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}) \, P(T_i \mid W_{i-1})

where S_i is the stack of parse trees for the prefix W_{i-1}.


It is assumed that most of the useful information is embedded in the 2 preceding words and 2 preceding heads.
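The mixture over the parse stack can be sketched as below; the data structures and the stub conditional model are hypothetical illustrations (the real system uses the ME model of the next slide and the parser's probabilities for P(T | W_{i-1})).

    def syntactic_word_prob(word, parse_stack, cond_model):
        """P(w_i | W_{i-1}) = sum_T P(w_i | context exposed by T) * P(T | W_{i-1}).

        parse_stack: list of (weight, context) pairs, where weight is P(T | W_{i-1})
        and context = (w_prev2, w_prev1, h_prev2, h_prev1, nt_prev2, nt_prev1).
        cond_model(word, context) returns P(w_i | context); both are assumed interfaces.
        """
        total = sum(weight for weight, _ in parse_stack)
        return sum(weight / total * cond_model(word, context)
                   for weight, context in parse_stack)

    def stub_model(word, context):
        """Stand-in for the conditional ME model defined on the next slide."""
        return 0.2 if word == "after" and context[3] == "ended" else 0.01

    stack = [(0.7, ("7", "cents", "contract", "ended", "NP", "VP")),
             (0.3, ("7", "cents", "loss", "ended", "NP", "VP"))]
    print(syntactic_word_prob("after", stack, stub_model))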

Training a Syntactic ME Model

Estimate an ME model with syntactic constraints:

P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})
  = \frac{e^{\lambda(w_i)} e^{\lambda(w_{i-1}, w_i)} e^{\lambda(w_{i-2}, w_{i-1}, w_i)} e^{\lambda(h_{i-1}, w_i)} e^{\lambda(h_{i-2}, h_{i-1}, w_i)} e^{\lambda(nt_{i-1}, w_i)} e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)}}{Z(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2})}

where the marginal constraints are

\sum_{h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1}} P(w_i, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1} \mid w_{i-2}, w_{i-1}) = \frac{\#[w_{i-2}, w_{i-1}, w_i]}{\#[w_{i-2}, w_{i-1}]},

\sum_{w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1}} P(w_i, w_{i-2}, w_{i-1}, nt_{i-2}, nt_{i-1} \mid h_{i-2}, h_{i-1}) = \frac{\#[h_{i-2}, h_{i-1}, w_i]}{\#[h_{i-2}, h_{i-1}]},

\sum_{w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}} P(w_i, w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1} \mid nt_{i-2}, nt_{i-1}) = \frac{\#[nt_{i-2}, nt_{i-1}, w_i]}{\#[nt_{i-2}, nt_{i-1}]}.

See Chelba and Jelinek ACL'98 and Wu and Khudanpur ICASSP'00 for details.

Experimental Results of Syntactic LMs

Model    PPL    WER
3-gram   79.0   38.5%
Syntax   74.0   37.5%

Non-terminal constraints and syntactic (head word) constraints together reduce the perplexity by 6.3% and the WER by 1.0% absolute compared to the trigram model.

Head Words inside vs. outside Trigram Range

[Figure: two partial parses of "The contract ended with a loss of 7 cents after ...". In the first, the exposed heads "contract" and "ended" lie outside the trigram window when predicting the word following "of 7 cents". In the second, when predicting "loss" after "with a", the exposed heads ("ended", "with") fall within or next to the trigram window.]

Syntactic Heads inside vs. outside Trigram Range

WER        Inside   Outside
Trigram    37.8%    40.3%
Syntactic  36.9%    38.9%

About 1/4 of the syntactic heads are outside trigram range (73% inside, 27% outside).
The WER of the baseline trigram model is relatively high when the syntactic heads are beyond trigram range.
Lexical head words are much more helpful in reducing WER when they are outside trigram range (1.4% absolute) than when they are within trigram range.

Combining Topic, Syntactic and N-gram Dependencies in an ME Framework

Probabilities are assigned as:

P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic) \, P(T_i \mid W_{i-1})

Only marginal constraints are necessary. The ME composite model is trained as:

P(w_i \mid w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic)
  = \frac{e^{\lambda(w_i)} e^{\lambda(w_{i-1}, w_i)} e^{\lambda(w_{i-2}, w_{i-1}, w_i)} e^{\lambda(h_{i-1}, w_i)} e^{\lambda(h_{i-2}, h_{i-1}, w_i)} e^{\lambda(nt_{i-1}, w_i)} e^{\lambda(nt_{i-2}, nt_{i-1}, w_i)} e^{\lambda(topic, w_i)}}{Z(w_{i-1}, w_{i-2}, h_{i-1}, h_{i-2}, nt_{i-1}, nt_{i-2}, topic)}
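To show how the composite model's feature set lines up with the eight exponential factors above, here is a small illustrative sketch (the feature names and the example topic label are my own) that lists the active features for one prediction position; an ME trainer attaches one lambda to each such feature.

    def composite_features(w, w1, w2, h1, h2, nt1, nt2, topic):
        """Active features for predicting w from the two preceding words, heads,
        non-terminal labels and the topic, mirroring the eight exponential factors."""
        return [
            ("unigram", w),
            ("bigram", w1, w),
            ("trigram", w2, w1, w),
            ("head-bigram", h1, w),
            ("head-trigram", h2, h1, w),
            ("nt-bigram", nt1, w),
            ("nt-trigram", nt2, nt1, w),
            ("topic-unigram", topic, w),
        ]

    feats = composite_features(w="after", w1="cents", w2="7",
                               h1="ended", h2="contract",
                               nt1="VP", nt2="NP", topic="finance")
    for f in feats:
        print(f)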


Overall Experimental Results

Model      PPL    WER
3-gram     79.0   38.5%
Topic      73.5   37.8%
Syntax     74.0   37.5%
Composite  67.9   37.0%

Baseline trigram WER is 38.5%.
Topic-dependent constraints alone reduce the perplexity by 7% and the WER by 0.7% absolute.
Syntactic heads result in a 6% reduction in perplexity and 1.0% absolute in WER.
Topic-dependent and syntactic constraints together reduce the perplexity by 13% and the WER by 1.5% absolute.
The gains from the topic and syntactic dependencies are nearly additive.

Content Words vs. Stop Words

WER         Stop words   Content words
Trigram     37.6%        42.2%
Topic       37.0%        40.8%
Syntactic   36.3%        41.9%
Composite   36.2%        40.1%

(78% of the test tokens are stop words, 22% are content words.)

The topic-sensitive model reduces the WER on content words by 1.4%, twice the overall improvement (0.7%).
The syntactic model improves the WER on both content words and stop words.
The composite model has the advantages of both models and reduces the WER on content words more significantly (2.1%).

Head Words inside vs. outside Trigram Range

WER         Inside   Outside
Trigram     37.8%    40.3%
Topic       37.3%    39.1%
Syntactic   36.9%    38.9%
Composite   36.5%    38.1%

(73% of head words are inside trigram range, 27% are outside.)

The WER of the baseline trigram model is relatively high when the head words are beyond trigram range.
The topic model helps where the trigram is inappropriate.
The WER reduction from the syntactic model when head words are outside trigram range (1.4%) exceeds its overall reduction (1.0%).
The WER reduction from the composite model when head words are outside trigram range (2.2%) exceeds its overall reduction (1.5%).

Nominal Speed-up

Nominal speed-up = \frac{\#(\text{operations in IIS})}{\#(\text{operations in the new method})}

The hierarchical training methods achieve a nominal speed-up of:
two orders of magnitude for Switchboard, and
three orders of magnitude for Broadcast News.

[Charts: nominal speed-up (log scale) of the new method over IIS for the 3-gram, topic and composite models on Switchboard, and for the 3-gram and topic models on Broadcast News.]

Real Speed-up

The real speed-up is 15-30 fold for the Switchboard task:
30x for the trigram model,
25x for the topic model,
15x for the composite model.

This simplification of the training procedure makes it possible to implement ME models for large corpora: 40 minutes for the trigram model and 2.3 hours for the topic model.

[Charts: real speed-up (log scale) of the new method over IIS for the 3-gram, topic and composite models on Switchboard, and for the 3-gram and topic models on Broadcast News.]

More Experimental Results: Topic-Dependent Models for Broadcast News

ME models are created for the Broadcast News corpus (130M words).
The topic-dependent model reduces the perplexity by 10% and the WER by 0.6% absolute.
The ME method is an effective means of integrating topic-dependent and topic-independent constraints.

Model   3-gram   +topic 1-gram   +topic 2-gram   +topic 3-gram   ME
PPL     174      168             164             156             157
WER     34.6%    34.5%           34.3%           34.1%           34.0%
Size    9.1M     +100*64K        +100*400K       +100*600K       +250K

Concluding Remarks

Non-local and syntactic dependencies have been successfully integrated with N-grams, and their benefit has been demonstrated in the speech recognition application.
Switchboard: 13% reduction in PPL, 1.5% (absolute) in WER. (Eurospeech99 best student paper award.)
Broadcast News: 10% reduction in PPL, 0.6% in WER. (Topic constraints only; syntactic constraints in progress.)
The computational requirements for the estimation and use of maximum entropy techniques have been vastly simplified for a large class of ME models.
Nominal speed-up: 100-1000 fold. "Real" speed-up: 15+ fold.
A general-purpose toolkit for ME models is being developed for public release.


Acknowledgement

I thank my advisor Sanjeev Khudanpur, who led me to this field and always gave me wise advice and help when necessary, and David Yarowsky, who gave generous help throughout my Ph.D. program.

I thank Radu Florian and David Yarowsky for their help on topic detection and data clustering, Ciprian Chelba and Frederick Jelinek for providing the syntactic model (parser) for the Switchboard experimental results reported here, and Shankar Kumar and Vlasios Doumpiotis for their help on generating N-best lists for the Broadcast News experiments.

I thank all the people in the NLP lab and CLSP for their assistance in my thesis work.

This work is supported by the National Science Foundation through STIMULATE grant IRI-9618874.