Minimum Rank Error Training for Language Modeling
Meng-Sung Wu
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, TAIWAN


Page 1: Minimum Rank Error Training for Language Modeling

Minimum Rank Error Training for Language Modeling

Meng-Sung Wu

Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, TAIWAN

Page 2: Minimum Rank Error Training for Language Modeling

Contents

Introduction

Language Model for Information Retrieval

Discriminative Language Model

Average Precision versus Classification Accuracy

Evaluation of IR Systems

Minimum Rank Error Training

Summary and Discussion

Page 3: Minimum Rank Error Training for Language Modeling

Introduction

Language modeling provides linguistic constraints on a text sequence W, based on statistical N-gram language models.

Speech recognition systems are usually evaluated by word error rate.

Discriminative learning methods: maximum mutual information (MMI) and minimum classification error (MCE).

Classification error rate is not a suitable metric for measuring the rank of an input document.

Page 4: Minimum Rank Error Training for Language Modeling

Language Model for Information Retrieval

Page 5: Minimum Rank Error Training for Language Modeling

Standard Probabilistic IR

[Diagram: an information need is expressed as a query, which is matched against each document d1 … dn of the document collection by estimating P(R | Q, d).]

Page 6: Minimum Rank Error Training for Language Modeling

IR based on LM

[Diagram: an information need is expressed as a query; each document d1 … dn of the collection induces a language model M_d, and documents are ranked by the generation probability P(Q | M_d).]

Page 7: Minimum Rank Error Training for Language Modeling

Language Models

Mathematical model of text generation.

Particularly important for speech recognition, information retrieval and machine translation.

N-gram models are commonly used to estimate word probabilities: unigram, bigram and trigram. An N-gram model is equivalent to an (n-1)th-order Markov model.

Estimates must be smoothed by interpolating combinations of n-gram estimates:

P(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2}, w_{n-1})
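The interpolation above can be sketched in a few lines. This is a toy illustration with a hypothetical nine-word corpus and made-up interpolation weights, not the estimator used in the experiments:

```python
# Toy sketch of interpolated trigram smoothing (hypothetical corpus and lambdas).
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w2, w1, w, l1=0.1, l2=0.3, l3=0.6):
    """lambda1*P(w) + lambda2*P(w|w1) + lambda3*P(w|w2,w1)."""
    p1 = uni[w] / N
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

p = p_interp("the", "cat", "sat")  # mixes unigram, bigram and trigram estimates
```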

Page 8: Minimum Rank Error Training for Language Modeling

Using Language Models in IR

Treat each document as the basis for a model (e.g., unigram sufficient statistics).

Rank document d based on P(d | q):

P(d | q) = P(q | d) x P(d) / P(q)

P(q) is the same for all documents, so it can be ignored. P(d), the prior, is often treated as the same for all documents, but we could use criteria like authority, length or genre. P(q | d) is the probability of q given d's model.

Very general formal approach.
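The query-likelihood ranking just described can be sketched as follows. The documents, query and the linear smoothing weight are all hypothetical; the point is only that dropping P(q) and a uniform P(d) leaves ranking by P(q | d) under each document's unigram model:

```python
# Sketch: rank documents by query likelihood P(q | d) under per-document
# unigram models, linearly smoothed against the collection model.
import math
from collections import Counter

docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "speech recognition with n-gram language models".split(),
}
collection = [w for d in docs.values() for w in d]
coll = Counter(collection)

def log_p_query(query, doc, lam=0.5):
    tf = Counter(doc)
    return sum(
        math.log(lam * tf[w] / len(doc) + (1 - lam) * coll[w] / len(collection))
        for w in query
    )

query = "language retrieval".split()
ranked = sorted(docs, key=lambda d: log_p_query(query, docs[d]), reverse=True)
```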

Page 9: Minimum Rank Error Training for Language Modeling

Using Language Models in IR

Principle 1: Document D: language model P(w | M_D). Query Q = sequence of words q1, q2, ..., qn (unigrams). Matching: P(Q | M_D).

Principle 2: Document D: language model P(w | M_D). Query Q: language model P(w | M_Q). Matching: comparison between P(. | M_D) and P(. | M_Q).

Principle 3: Translate D to Q.

Page 10: Minimum Rank Error Training for Language Modeling

Problems

Limitation to unigrams: no dependence between words.

Problems with bigrams: considering all adjacent word pairs introduces noise; more distant dependencies cannot be captured; word order is not always important for IR.

Entirely data-driven, no external knowledge (e.g., programming vs. computer).

Direct comparison between D and Q: despite smoothing, it requires that D and Q contain identical words (except in the translation model); it cannot deal with synonymy and polysemy.

Page 11: Minimum Rank Error Training for Language Modeling

Discriminative Language Model

Page 12: Minimum Rank Error Training for Language Modeling

Minimum Classification Error

The advent of powerful computing devices and the success of statistical approaches led to a renewed pursuit of more powerful methods to reduce the recognition error rate.

Although MCE-based discriminative methods are rooted in classical Bayes decision theory, instead of reducing the classification task to a distribution estimation problem, they take a discriminant-function-based statistical pattern classification approach.

For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters which minimizes the empirical pattern recognition error rate.

Page 13: Minimum Rank Error Training for Language Modeling

Minimum Classification Error LM

MCE classifier design is based on three steps.

Discriminant function:

g(X, W; \Lambda) = \log p(X \mid W) + \log P(W)

Misclassification measure (the score of the target hypothesis against the scores of the competing hypotheses):

d(X) = -g(X, W; \Lambda) + \log \left[ \frac{1}{N-1} \sum_{n,\, W_n \ne W} \exp\big( \eta\, g(X, W_n; \Lambda) \big) \right]^{1/\eta}

Loss function:

l(d(X)) = \frac{1}{1 + \exp(-\gamma\, d(X))}

Expected loss:

E_X[l(d(X))] = \frac{1}{R} \sum_{r=1}^{R} l(d(X_r))
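The three MCE quantities can be checked numerically. This is a toy sketch with made-up log-scores, not a trained recognizer: when the target hypothesis scores above its competitors the measure d is negative and the sigmoid loss is near 0, and vice versa:

```python
# Numeric sketch of the MCE misclassification measure and sigmoid loss.
import math

def d_measure(g_target, g_competing, eta=1.0):
    """-g_target + (1/eta) * log( mean over competitors of exp(eta * g) )."""
    n = len(g_competing)
    lse = math.log(sum(math.exp(eta * g) for g in g_competing) / n)
    return -g_target + lse / eta

def sigmoid_loss(d, gamma=1.0):
    return 1.0 / (1.0 + math.exp(-gamma * d))

d_good = d_measure(-10.0, [-20.0, -25.0])  # target well above competitors: d < 0
d_bad = d_measure(-20.0, [-10.0, -25.0])   # target below a competitor: d > 0
```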

Page 14: Minimum Rank Error Training for Language Modeling

The MCE approach has several advantages in classifier design:

It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier.

If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier will approximate the minimum Bayes risk.

Page 15: Minimum Rank Error Training for Language Modeling

Average Precision versus Classification Accuracy

Page 16: Minimum Rank Error Training for Language Modeling

Example

The same classification accuracy but different average precision.

Number of relevant documents = 5

Ranking 1:
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5
AvgPrec = 62.2%

Ranking 2:
Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5
AvgPrec = 52.0%

Accuracy = 50.0% for both rankings
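The two rankings above can be recomputed directly (1 = relevant, 0 = non-relevant, 5 relevant documents in total), confirming the same accuracy but different average precision:

```python
# Average precision for the two example rankings.
def average_precision(rels, n_relevant):
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / n_relevant

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # AvgPrec ~ 0.622
ranking2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # AvgPrec ~ 0.519 (the slide's 52.0%)
```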

Page 17: Minimum Rank Error Training for Language Modeling

Evaluation of IR Systems

Page 18: Minimum Rank Error Training for Language Modeling

Measures of Retrieval Effectiveness

Precision and recall

Single-valued P/R measures

Significance tests

Page 19: Minimum Rank Error Training for Language Modeling

Precision and Recall

Precision: the proportion of a retrieved set that is relevant.
Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)

Recall: the proportion of all relevant documents in the collection included in the retrieved set.
Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)

Precision and recall are well-defined for sets.
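The set definitions can be computed directly; a minimal sketch with hypothetical document IDs:

```python
# Set-based precision and recall, straight from the definitions.
def precision_recall(retrieved, relevant):
    hit = len(retrieved & relevant)
    return hit / len(retrieved), hit / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved, relevant)  # p = 2/4, r = 2/3
```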

Page 20: Minimum Rank Error Training for Language Modeling

Average Precision

We often want a single-number effectiveness measure, e.g., for a machine-learning algorithm to detect improvement. Average precision is widely used in IR.

Average precision at relevant ranks: calculated by averaging the precision at each rank where recall increases.

Number of relevant documents = 5

Ranking 1:
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5
AvgPrec = 62.2%

Ranking 2:
Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5
AvgPrec = 52.0%

Page 21: Minimum Rank Error Training for Language Modeling

Trec-eval demo

Queryid (Num): 225
Total number of documents over all queries
  Retrieved: 179550
  Relevant:    1838
  Rel_ret:     1110
Interpolated Recall - Precision Averages:
  at 0.00   0.6139
  at 0.10   0.5743
  at 0.20   0.4437
  at 0.30   0.3577
  at 0.40   0.2952
  at 0.50   0.2603
  at 0.60   0.2037
  at 0.70   0.1374
  at 0.80   0.1083
  at 0.90   0.0722
  at 1.00   0.0674
Average precision (non-interpolated) for all rel docs (averaged over queries): 0.2680
Precision:
  At    5 docs: 0.3173
  At   10 docs: 0.2089
  At   15 docs: 0.1564
  At   20 docs: 0.1262
  At   30 docs: 0.0948
  At  100 docs: 0.0373
  At  200 docs: 0.0210
  At  500 docs: 0.0095
  At 1000 docs: 0.0049
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.2734

Page 22: Minimum Rank Error Training for Language Modeling

Significance tests

System A beats system B on one query. Is it just a lucky query for system A? Maybe system B does better on some other query. We need as many queries as possible.

Empirical research suggests 25 queries is the minimum needed; TREC tracks generally aim for at least 50.

If systems A and B are identical on all but one query, and system A beats system B by enough on that one query, the average will make A look better than B.

Page 23: Minimum Rank Error Training for Language Modeling

Sign Test Example

For methods A and B, compare the average precision of each pair of results generated by the queries in the test collection. If the difference is large enough, count it as + or -; otherwise ignore it. Use the number of +'s and the number of significant differences to determine the significance level.

E.g., for 40 queries, method A produced a better result than B 12 times, B was better than A 3 times, and 25 were the "same": p < 0.035, and method A is significantly better than B.

If A > B 18 times and B > A 9 times, p < 0.1222 and A is not significantly better than B at the 5% level.
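The two p-values in this example can be checked with a two-sided binomial sign test (ties dropped, null hypothesis P(+) = 0.5):

```python
# Two-sided binomial sign test for the 12-vs-3 and 18-vs-9 examples.
from math import comb

def sign_test_p(wins, losses):
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return 2 * tail  # two-sided

p1 = sign_test_p(12, 3)  # ~0.0352: significant at the 5% level
p2 = sign_test_p(18, 9)  # ~0.1221: not significant at the 5% level
```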

Page 24: Minimum Rank Error Training for Language Modeling

Wilcoxon Test

Compute differences.

Rank the differences by absolute value.

Sum the + ranks and the - ranks separately.

Two-tailed test: T = min(sum of + ranks, sum of - ranks). Reject the null hypothesis if T < T0, where T0 is found in a table.

Page 25: Minimum Rank Error Training for Language Modeling

Wilcoxon Test Example

A    B    diff  rank  signed rank
97   96   -1    1.5   -1.5
88   86   -2    3     -3
75   79    4    4      4
90   89   -1    1.5   -1.5
85   91    6    6.5    6.5
94   89   -5    5     -5
77   86    9    8      8
89   99   10    9      9
82   94   12    10     10
90   96    6    6.5    6.5

+ ranks = 44
- ranks = 11
T = 11
T0 = 8 (from table)
Conclusion: not significant
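The Wilcoxon statistic for this example can be reproduced in plain Python (average ranks for tied absolute differences, zero differences dropped):

```python
# Wilcoxon signed-rank statistic T for the ten (A, B) pairs above.
pairs = [(97, 96), (88, 86), (75, 79), (90, 89), (85, 91),
         (94, 89), (77, 86), (89, 99), (82, 94), (90, 96)]
diffs = [b - a for a, b in pairs if b != a]
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0.0] * len(diffs)
i = 0
while i < len(order):
    j = i
    while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
        j += 1
    avg = (i + j) / 2 + 1  # average rank for the tie group (1-based)
    for k in range(i, j + 1):
        ranks[order[k]] = avg
    i = j + 1
plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
T = min(plus, minus)
```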

Page 26: Minimum Rank Error Training for Language Modeling

Minimum Rank Error Training

Page 27: Minimum Rank Error Training for Language Modeling

Document ranking principle

A ranking algorithm aims at estimating a scoring function.

The problem can be described as follows: given two disjoint sets S_R (relevant) and S_I (irrelevant), a ranking function f(x) assigns to each document d of the document collection a score value. x_r \succ x_i denotes that x_r is ranked higher than x_i.

The objective function:

x_r \succ x_i \iff f(x_r) > f(x_i), \quad \forall x_r \in S_R,\; x_i \in S_I

Page 28: Minimum Rank Error Training for Language Modeling

Document ranking principle

There are different ways to measure the ranking error of a scoring function f.

The natural criterion might be the proportion of misordered pairs over the total number of pairs. This criterion is an estimate of the probability of misordering a pair, f(x_i) \ge f(x_r):

Error(X) = P\big( f(x_i) \ge f(x_r) \big)
         = \sum_{q \in Q} \sum_{x_r, x_i} I\big[ f(q, x_i; \Lambda) \ge f(q, x_r; \Lambda) \big]
         = \sum_{q \in Q} \sum_{x_r, x_i} I\big[ P(q \mid x_i; \Lambda) \ge P(q \mid x_r; \Lambda) \big]

where I[y] = 1 for y \ge 0, and 0 otherwise.
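The pairwise misordering criterion can be sketched with toy scores (the score values here are hypothetical; any relevant document scoring at or below an irrelevant one counts as a misordered pair via the 0/1 indicator):

```python
# Fraction of misordered (relevant, irrelevant) pairs for toy scores.
def rank_error(rel_scores, irr_scores):
    pairs = [(r, i) for r in rel_scores for i in irr_scores]
    misordered = sum(1 for r, i in pairs if i - r >= 0)  # I[f(x_i) - f(x_r)]
    return misordered / len(pairs)

err = rank_error([0.9, 0.6], [0.7, 0.2])  # 1 of 4 pairs misordered
```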

Page 29: Minimum Rank Error Training for Language Modeling

Document ranking principle

The total distance measure is defined as

d_r(X; \Lambda) = -\log P(Q \mid X_r; \Lambda) + \max_i \log P(Q \mid X_i; \Lambda)

Page 30: Minimum Rank Error Training for Language Modeling

Illustration of the metric of average precision

Page 31: Minimum Rank Error Training for Language Modeling

Intuition and Theory

Precision is the ratio of relevant documents retrieved to documents retrieved at a given rank:

\mathrm{prec@rank}(d_r) = \frac{1}{r} \sum_{k=1}^{r} \mathrm{rel}(s_k)

where r is the number of returned documents, s_k is the document at rank k, and \mathrm{rel}(d) = 1 if document d is relevant to query q and 0 otherwise.

Average precision is the average of the precision at the ranks of the relevant documents:

AP = \frac{1}{N_R} \sum_{r=1}^{|S|} \mathrm{rel}(s_r) \cdot \mathrm{prec@rank}(d_r)
   = \frac{1}{N_R} \sum_{r=1}^{|S|} \frac{\mathrm{rel}(s_r)}{r} \sum_{k=1}^{r} \mathrm{rel}(s_k)

Page 32: Minimum Rank Error Training for Language Modeling

Discriminative ranking algorithms

Maximizing the average precision is tightly related to minimizing the following ranking error loss:

L_{AP}(X; \Lambda) = 1 - AP
 = 1 - \frac{1}{N_R} \sum_{r=1}^{|S|} \frac{\mathrm{rel}(s_r)}{r} \sum_{k=1}^{r} \mathrm{rel}(s_k)
 = \frac{1}{N_R} \sum_{r=1}^{|S|} \frac{\mathrm{rel}(s_r)}{r} \sum_{k=1}^{r} \big( 1 - \mathrm{rel}(s_k) \big)

Page 33: Minimum Rank Error Training for Language Modeling

Discriminative ranking algorithms

Similar to the MCE algorithm, the ranking loss function L_{AP} is expressed as a differentiable objective. The error count n_{ir} is approximated by the differentiable loss function defined as

n_{ir} = \mathrm{rel}(s_r) \sum_{k=1}^{r} \big( 1 - \mathrm{rel}(s_k) \big) \approx l\big( d_r(X; \Lambda) \big)

Page 34: Minimum Rank Error Training for Language Modeling

Discriminative ranking algorithms

The differentiation of the ranking loss function turns out to be

\frac{\partial L_{AP}(X; \Lambda)}{\partial \Lambda} = \sum_{r} \frac{\partial L_{AP}}{\partial l_{ir}} \frac{\partial l_{ir}}{\partial d_r} \frac{\partial d_r}{\partial \Lambda}

with

\frac{\partial L_{AP}}{\partial l_{ir}} = \frac{1}{N_R} \sum_{r=1}^{|S|} \frac{\mathrm{rel}(s_r)\, \mathrm{rel}(s_k)}{r^2}, \qquad
\frac{\partial l_{ir}}{\partial d_r} = \gamma\, l(d_r) \big( 1 - l(d_r) \big)

The differentiation of the ranking loss function turns The differentiation of the ranking loss function turns out to beout to be

Page 35: Minimum Rank Error Training for Language Modeling

Discriminative ranking algorithmsDiscriminative ranking algorithms

We use a bigram language model as an exampleWe use a bigram language model as an example

Using the steepest descent algorithm, the parameters of Using the steepest descent algorithm, the parameters of language model are adjusted iteratively by language model are adjusted iteratively by

)(

))(;()()1(

t

tXLtt AP

, );|(),( max

);|(),();(

,

,,

itnmid

Qqrtnmr

nm

r

xqPwxn

xqPwxnXd

i

t
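The gradient machinery behind the update can be sanity-checked numerically: the sigmoid loss l(d) = 1/(1 + exp(-gamma d)) has derivative gamma l (1 - l), which is what the steepest-descent step multiplies into the parameter gradient. A toy check, with arbitrary d and gamma:

```python
# Verify the sigmoid-loss derivative used in the steepest-descent update
# against a central finite difference.
import math

def loss(d, gamma=1.0):
    return 1.0 / (1.0 + math.exp(-gamma * d))

def dloss(d, gamma=1.0):
    l = loss(d, gamma)
    return gamma * l * (1.0 - l)

d, h = 0.7, 1e-6
numeric = (loss(d + h) - loss(d - h)) / (2 * h)  # matches dloss(d) closely
```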

Page 36: Minimum Rank Error Training for Language Modeling

Experiments

Page 37: Minimum Rank Error Training for Language Modeling

Experimental Setup

We evaluated our model with two different TREC collections: Wall Street Journal 1987 (WSJ87) and Associated Press Newswire 1988 (AP88).

Page 38: Minimum Rank Error Training for Language Modeling

Language Modeling

We used the WSJ87 dataset as training data for language model estimation; the AP88 dataset is used as the test data.

During the MRE training procedure, the parameters are set to 1, 1, and 0.5.

Comparison of perplexity:

           ML       MRE
Unigram  1781.03  1775.65
Bigram    443.55   440.38

Page 39: Minimum Rank Error Training for Language Modeling

Experiments on Information Retrieval

Two query sets and the corresponding relevant documents in this collection: TREC topics 51-100 as training queries, and TREC topics 101-150 as test queries.

Queries were sampled from the 'title' and 'description' fields of the topics.

The ML language model is used as the baseline system.

To test the significance of improvement, the Wilcoxon test was employed in the evaluation.

Page 40: Minimum Rank Error Training for Language Modeling

Comparison of Average Precision

Collection   ML      MRE     Improvement  Wilcoxon
WSJ87        0.1012  0.1122  10.9%        0.0163*
AP88         0.1692  0.1913  13.1%        0*

Page 41: Minimum Rank Error Training for Language Modeling

Comparison of Precision at Document Level

Documents Retrieved  ML (I)  MCE (II)  MRE (III)  Wilcoxon (I-II)  Wilcoxon (II-III)
5 docs               0.416   0.464     0.472      0.0275*          0.1163
10 docs              0.406   0.426     0.456      0.0208*          0.0449*
15 docs              0.387   0.409     0.419      0.0251*          0.0447*
20 docs              0.373   0.397     0.405      0.0239*          0.0656
30 docs              0.339   0.361     0.371      0.0030*          0.0330*
100 docs             0.237   0.254     0.255      0*               0.0561
200 docs             0.165   0.173     0.175      0*               0.0622
500 docs             0.083   0.086     0.087      0*               0.0625
1000 docs            0.045   0.046     0.046      0*               0.0413*
R-Precision          0.232   0.247     0.273      0*               0.0096*

Page 42: Minimum Rank Error Training for Language Modeling

Summary

Page 43: Minimum Rank Error Training for Language Modeling

Ranking learning requires considering non-relevance information.

We will extend this method to spoken document retrieval.

Future work is focused on the area under the ROC curve (AUC).

Page 44: Minimum Rank Error Training for Language Modeling

References

M. Collins, "Discriminative reranking for natural language parsing", in Proc. 17th International Conference on Machine Learning, pp. 175-182, 2000.
J. Gao, H. Qi, X. Xia, and J.-Y. Nie, "Linear discriminant model for information retrieval", in Proc. ACM SIGIR, pp. 290-297, 2005.
D. Hull, "Using statistical testing in the evaluation of retrieval experiments", in Proc. ACM SIGIR, pp. 329-338, 1993.
B. H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition", IEEE Trans. Speech and Audio Processing, pp. 257-265, 1997.
B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification", IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992.
H.-K. J. Kuo, E. Fosler-Lussier, H. Jiang, and C.-H. Lee, "Discriminative training of language models for speech recognition", in Proc. ICASSP, pp. 325-328, 2002.
R. Nallapati, "Discriminative models for information retrieval", in Proc. ACM SIGIR, pp. 64-71, 2004.
J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval", in Proc. ACM SIGIR, pp. 275-281, 1998.
J.-N. Vittaut and P. Gallinari, "Machine learning ranking for structured information retrieval", in Proc. 28th European Conference on IR Research, pp. 338-349, 2006.

Page 45: Minimum Rank Error Training for Language Modeling

Thank You for Your Attention