Minimum Rank Error Training for Language Modeling

Meng-Sung Wu

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, TAIWAN

Contents

- Introduction
- Language Model for Information Retrieval
- Discriminative Language Model
- Average Precision versus Classification Accuracy
- Evaluation of IR Systems
- Minimum Rank Error Training
- Summarization and Discussion
Introduction

Language Modeling:
- Provides linguistic constraints to the text sequence W.
- Based on statistical N-gram language models.

Speech recognition systems are usually evaluated by the word error rate. Discriminative learning methods:
- maximum mutual information (MMI)
- minimum classification error (MCE)

Classification error rate is not a suitable metric for measuring the rank of an input document.
Language Model for Information Retrieval

Standard Probabilistic IR

[Figure: an information need is expressed as a query, which is matched against each document d1, d2, …, dn of the document collection to estimate P(R | Q, d).]
IR based on LM

[Figure: each document d1, d2, …, dn of the document collection is associated with a language model Md1, Md2, …, Mdn; the query expressing the information need is scored by the generation probability P(Q | Md).]
Language Models

Mathematical model of text generation.
- Particularly important for speech recognition, information retrieval and machine translation.
- N-gram models are commonly used to estimate probabilities of words: unigram, bigram and trigram.
- An N-gram model is equivalent to an (n-1)th order Markov model.

Estimates must be smoothed by interpolating combinations of n-gram estimates:

  P(w_n | w_{n-2}, w_{n-1}) = λ1 P(w_n) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n | w_{n-2}, w_{n-1})
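The interpolated estimate above can be sketched as follows. This is a minimal illustration, not the training setup used later in the talk; the toy corpus and the interpolation weights λ1 = 0.1, λ2 = 0.3, λ3 = 0.6 are assumptions for demonstration.

```python
from collections import Counter

def train_interpolated_trigram(tokens, l1=0.1, l2=0.3, l3=0.6):
    """Interpolated trigram model:
    P(w_n | w_{n-2}, w_{n-1}) = l1*P(w_n) + l2*P(w_n | w_{n-1}) + l3*P(w_n | w_{n-2}, w_{n-1})."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = len(tokens)

    def prob(w2, w1, w):
        p1 = uni[w] / total
        p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
        p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    return prob

# Toy corpus (hypothetical): the unigram term keeps the estimate nonzero
# even for histories where the bigram or trigram count is zero.
tokens = "the cat sat on the mat the cat ran".split()
p = train_interpolated_trigram(tokens)
print(p("the", "cat", "sat"))
```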
Using Language Models in IR

- Treat each document as the basis for a model (e.g., unigram sufficient statistics).
- Rank document d based on P(d | q).
- P(d | q) = P(q | d) × P(d) / P(q)
  - P(q) is the same for all documents, so it can be ignored.
  - P(d), the prior, is often treated as the same for all documents, but we could use criteria like authority, length, or genre.
  - P(q | d) is the probability of q given d's model.
- A very general formal approach.
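A minimal sketch of this query-likelihood ranking, assuming unigram document models smoothed against the collection model (the mixing weight lam = 0.5 and the two toy documents are illustrative assumptions, not values from the talk):

```python
from collections import Counter
import math

def score(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """log P(q | M_d): unigram query likelihood, linearly smoothed with the
    collection model so unseen query words do not zero out the score."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    s = 0.0
    for w in query:
        p_doc = tf[w] / dl
        p_coll = coll_counts[w] / coll_len
        s += math.log(lam * p_doc + (1 - lam) * p_coll)
    return s

docs = {"d1": "language model for retrieval".split(),
        "d2": "speech recognition error rate".split()}
coll = [w for d in docs.values() for w in d]
cc, cl = Counter(coll), len(coll)
query = "language model".split()
# Rank documents by descending query likelihood.
ranking = sorted(docs, key=lambda d: score(query, docs[d], cc, cl), reverse=True)
print(ranking)
```

Ignoring P(q) and assuming a uniform prior P(d), sorting by log P(q | d) is exactly sorting by P(d | q).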
Using Language Models in IR

Principle 1:
- Document D: language model P(w | M_D)
- Query Q = sequence of words q1, q2, …, qn (unigrams)
- Matching: P(Q | M_D)

Principle 2:
- Document D: language model P(w | M_D)
- Query Q: language model P(w | M_Q)
- Matching: comparison between P(· | M_D) and P(· | M_Q)

Principle 3:
- Translate D to Q.
Problems

- Limitation to unigrams: no dependence between words.
- Problems with bigrams:
  - Consider all the adjacent word pairs (noise).
  - Cannot consider more distant dependencies.
  - Word order is not always important for IR.
- Entirely data-driven, no external knowledge (e.g., "programming" vs. "computer").
- Direct comparison between D and Q: despite smoothing, requires that D and Q contain identical words (except in the translation model); cannot deal with synonymy and polysemy.
Discriminative Language Model

Minimum Classification Error

- The advent of powerful computing devices and the success of statistical approaches led to a renewed pursuit of more powerful methods to reduce the recognition error rate.
- Although MCE-based discriminative methods are rooted in classical Bayes decision theory, instead of converting the classification task into a distribution-estimation problem, they take a discriminant-function-based statistical pattern classification approach.
- For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters which minimizes the empirical pattern recognition error rate.
Minimum Classification Error LM

MCE classifier design is based on three steps.

Discriminant function:

  g(X, W; Λ) = log p(X | W) + log p(W)

Misclassification measure (the score of the target hypothesis against the scores of the competing hypotheses):

  d(X; Λ) = -g(X, W; Λ) + log [ (1/(N-1)) Σ_{n: Wn ≠ W} exp(η g(X, Wn; Λ)) ]^(1/η)
Loss function (a sigmoid that smooths the 0-1 error count):

  l(d(X)) = 1 / (1 + exp(-γ d(X)))

Expected loss:

  E_X[l(d(X))] ≈ (1/R) Σ_{r=1}^{R} l(d(X_r))
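The two quantities above can be sketched numerically. This is a toy illustration, assuming η = γ = 1 and made-up log-likelihood scores; it only shows that when the target hypothesis scores well above its competitors, d(X) is negative and the sigmoid loss is near zero.

```python
import math

def misclassification(g_target, g_competing, eta=1.0):
    """d(X) = -g(X,W) + log[(1/(N-1)) * sum_n exp(eta * g(X,W_n))]^(1/eta)."""
    n = len(g_competing)
    comp = math.log(sum(math.exp(eta * g) for g in g_competing) / n) / eta
    return -g_target + comp

def sigmoid_loss(d, gamma=1.0):
    """Smoothed 0-1 loss: l(d) = 1 / (1 + exp(-gamma * d))."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

# Hypothetical log-scores: the correct hypothesis (-10) beats all competitors,
# so d < 0 and the loss is close to 0 (a correct, confident decision).
d = misclassification(-10.0, [-14.0, -15.0, -16.0])
print(d, sigmoid_loss(d))
```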
The MCE approach has several advantages in classifier design:
- It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier.
- If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier will approximate the minimum Bayes risk.
Average Precision versus Classification Accuracy

Example

The same classification accuracy but different average precision.
The number of relevant documents = 5

System 1 (AvgPrec = 62.2%, Accuracy = 50.0%):
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5

System 2 (AvgPrec = 52.0%, Accuracy = 50.0%):
Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5

Both systems place 5 relevant documents in the top 10, so the classification accuracy is the same, but the rankings differ.
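The two average-precision figures can be reproduced from the relevance patterns implied by the recall rows (1 = relevant at that rank):

```python
def average_precision(rels):
    """AP = mean of precision at the ranks of relevant documents."""
    n_rel = sum(rels)
    hits, ap = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / n_rel

# Relevance patterns recovered from the recall rows in the table above.
system1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
system2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(average_precision(system1))  # ~0.622
print(average_precision(system2))  # ~0.519
```

Both lists contain five 1s, so set-based accuracy is identical; only the positions of the relevant documents differ.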
Evaluation of IR Systems

Measures of Retrieval Effectiveness
- Precision and Recall
- Single-valued P/R measure
- Significance tests
Precision and Recall

Precision:
- Proportion of a retrieved set that is relevant.
- Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)

Recall:
- Proportion of all relevant documents in the collection included in the retrieved set.
- Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)

Precision and recall are well-defined for sets.
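The two set-based definitions translate directly into code (the document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall from the definitions above."""
    retrieved, relevant = set(retrieved), set(relevant)
    inter = retrieved & relevant
    return len(inter) / len(retrieved), len(inter) / len(relevant)

# Hypothetical example: 4 retrieved documents, 3 relevant ones, overlap of 2.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```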
Average Precision

- Often want a single-number effectiveness measure, e.g., for a machine-learning algorithm to detect improvement.
- Average precision is widely used in IR: average precision at relevant ranks, calculated by averaging precision whenever recall increases.
The number of relevant documents = 5

System 1 (AvgPrec = 62.2%):
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5

System 2 (AvgPrec = 52.0%):
Recall:    0.0  0.2  0.2  0.2  0.4  0.6  0.8  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.63 0.55 0.5
Trec-eval demo

Queryid (Num): 225
Total number of documents over all queries
  Retrieved: 179550
  Relevant:  1838
  Rel_ret:   1110
Interpolated Recall - Precision Averages:
  at 0.00  0.6139
  at 0.10  0.5743
  at 0.20  0.4437
  at 0.30  0.3577
  at 0.40  0.2952
  at 0.50  0.2603
  at 0.60  0.2037
  at 0.70  0.1374
  at 0.80  0.1083
  at 0.90  0.0722
  at 1.00  0.0674
Average precision (non-interpolated) for all rel docs (averaged over queries): 0.2680
Precision:
  At 5 docs:    0.3173
  At 10 docs:   0.2089
  At 15 docs:   0.1564
  At 20 docs:   0.1262
  At 30 docs:   0.0948
  At 100 docs:  0.0373
  At 200 docs:  0.0210
  At 500 docs:  0.0095
  At 1000 docs: 0.0049
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.2734
Significance tests

- System A beats system B on one query. Is it just a lucky query for system A? Maybe system B does better on some other query. We need as many queries as possible.
- Empirical research suggests 25 queries is the minimum needed; TREC tracks generally aim for at least 50 queries.
- If systems A and B are identical on all but one query, and system A beats system B by enough on that one query, the average will make A look better than B.
Sign Test Example

- For methods A and B, compare the average precision for each pair of results generated by queries in the test collection.
- If the difference is large enough, count it as + or -; otherwise ignore it.
- Use the number of +'s and the number of significant differences to determine the significance level.
- E.g., for 40 queries: method A produced a better result than B 12 times, B was better than A 3 times, and 25 were the "same"; p < 0.035, so method A is significantly better than B.
- If A > B 18 times and B > A 9 times, p < 0.1222 and A is not significantly better than B at the 5% level.
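The two p-values in the example can be checked with a two-sided binomial sign test (ties dropped; under the null hypothesis each non-tied query is a fair coin flip between A and B):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test p-value on the non-tied queries."""
    n, k = wins_a + wins_b, max(wins_a, wins_b)
    # Probability of seeing k or more wins for the better system by chance.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(12, 3))   # ~0.035 -> significant at the 5% level
print(sign_test_p(18, 9))   # ~0.122 -> not significant at the 5% level
```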
Wilcoxon Test

- Compute differences.
- Rank differences by absolute value.
- Sum the + ranks and the - ranks separately.
- Two-tailed test: T = min(+ranks, -ranks); reject the null hypothesis if T < T0, where T0 is found in a table.
Wilcoxon Test Example

A    B    diff (B-A)   rank   signed rank
97   96   -1           1.5    -1.5
88   86   -2           3      -3
75   79   +4           4      +4
90   89   -1           1.5    -1.5
85   91   +6           6.5    +6.5
94   89   -5           5      -5
77   86   +9           8      +8
89   99   +10          9      +9
82   94   +12          10     +10
90   96   +6           6.5    +6.5

+ ranks = 44
- ranks = 11
T = 11
T0 = 8 (from table)

Conclusion: not significant (T > T0).
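The table can be reproduced in code; the sketch below ranks |B - A| with average ranks for ties and recovers T = 11 from the same scores:

```python
def wilcoxon_T(a_scores, b_scores):
    """Wilcoxon signed-rank statistic: rank |B - A| (zeros dropped, ties get
    average ranks), then T = min(sum of + ranks, sum of - ranks)."""
    diffs = [b - a for a, b in zip(a_scores, b_scores) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied block
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(plus, minus), plus, minus

A = [97, 88, 75, 90, 85, 94, 77, 89, 82, 90]
B = [96, 86, 79, 89, 91, 89, 86, 99, 94, 96]
print(wilcoxon_T(A, B))  # (11.0, 44.0, 11.0)
```

Since T = 11 exceeds the critical value T0 = 8, the null hypothesis is not rejected.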
Minimum Rank Error Training

Document ranking principle

- A ranking algorithm aims at estimating a scoring function.
- The problem can be described as follows:
  - Two disjoint sets S_R (relevant) and S_I (irrelevant).
  - A ranking function f(x) assigns a score value to each document d of the document collection.
  - x_r ≻ x_i denotes that x_r is ranked higher than x_i.
- The objective function:

  f(x_r) > f(x_i)  ⇔  x_r ≻ x_i,  for x_r ∈ S_R, x_i ∈ S_I
Document ranking principle

- There are different ways to measure the ranking error of a scoring function f.
- The natural criterion might be the proportion of misordered pairs over the total number of pairs.
- This criterion is an estimate of the probability of misordering a pair, f(x_i) ≥ f(x_r):

  Error(X) = P[ f(x_i) ≥ f(x_r) ]
           = Σ_{q ∈ Q} Σ_{(x_r, x_i)} I[ f(q, x_i; Λ) ≥ f(q, x_r; Λ) ]
           = Σ_{q ∈ Q} Σ_{(x_r, x_i)} I[ P(q | x_i; Λ) ≥ P(q | x_r; Λ) ]

  where I[y] = 1 for y ≥ 0, and 0 otherwise.
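The pairwise criterion is easy to compute exactly; a minimal sketch with hypothetical scores:

```python
def pairwise_rank_error(scores_rel, scores_irr):
    """Fraction of (relevant, irrelevant) pairs that are misordered,
    i.e. pairs with f(x_i) >= f(x_r)."""
    bad = sum(1 for fr in scores_rel for fi in scores_irr if fi >= fr)
    return bad / (len(scores_rel) * len(scores_irr))

# Hypothetical scores f(x) for documents in S_R and S_I: the irrelevant
# document scored 0.8 outranks two of the three relevant ones... no,
# it outranks the ones scored 0.7 and 0.4, giving 2 misordered pairs of 6.
print(pairwise_rank_error([0.9, 0.7, 0.4], [0.8, 0.3]))
```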
Document ranking principle

The total distance measure is defined as

  d_r(X; Λ) = -log P(Q_r | X_r; Λ) + max_i log P(Q_r | X_i; Λ)
Illustration of the metric of average precision
Intuition and Theory

- Precision is the ratio of relevant documents retrieved to documents retrieved at a given rank:

    prec@rank(d_r) = (1/r) Σ_{k=1}^{r} rel(s_k) = (r - n_ir) / r

  where r is the number of returned documents, s_k is the relevance of the document at rank k, n_ir is the number of irrelevant documents among the top r, and

    rel(d) = 1 if document d is relevant to q, 0 otherwise.

- Average precision is the average of precision at the ranks of relevant documents:

    AP = (1/N_R) Σ_{r=1}^{|S|} rel(s_r) · prec@rank(d_r)
       = (1/N_R) Σ_{r=1}^{|S|} rel(s_r) (1/r) Σ_{k=1}^{r} rel(s_k)
Discriminative ranking algorithms

Maximizing the average precision is tightly related to minimizing the following ranking error loss:

  L_AP(X; Λ) = 1 - AP
             = 1 - (1/N_R) Σ_{r=1}^{|S|} (1/r) rel(s_r) Σ_{k=1}^{r} rel(s_k)
             = (1/N_R) Σ_{r=1}^{|S|} (1/r) rel(s_r) Σ_{k=1}^{r} (1 - rel(s_k))
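The identity L_AP = 1 - AP can be verified numerically (assuming all relevant documents appear in the ranked list, so N_R equals the number of 1s):

```python
def ranking_loss(rels):
    """L_AP = (1/N_R) * sum_r rel(s_r) * (1/r) * sum_{k<=r} (1 - rel(s_k)):
    the normalized, rank-discounted count of irrelevant documents
    above each relevant one."""
    n_rel = sum(rels)
    loss = 0.0
    irrelevant_above = 0
    for rank, rel in enumerate(rels, start=1):
        irrelevant_above += 1 - rel
        if rel:
            loss += irrelevant_above / rank
    return loss / n_rel

def average_precision(rels):
    n_rel, hits, ap = sum(rels), 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / n_rel

# Same relevance pattern as the earlier AvgPrec = 62.2% example.
rels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(ranking_loss(rels), 1 - average_precision(rels))  # equal
```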
Discriminative ranking algorithms

- Similar to the MCE algorithm, the ranking loss function L_AP is expressed as a differentiable objective.
- The error count n_ir is approximated by the differentiable loss function defined as

  n_ir ≈ Σ_{k=1}^{r} (1 - rel(s_k)) rel(s_r) l(d_r(X; Λ))
Discriminative ranking algorithms

Substituting the smoothed error count gives the differentiable ranking loss

  L_AP(X; Λ) ≈ (1/N_R) Σ_{r=1}^{|S|} (1/r) rel(s_r) Σ_{k=1}^{r} (1 - rel(s_k)) l(d_r)

The differentiation of the ranking loss function turns out to be

  ∂L_AP(X; Λ)/∂Λ = Σ_r (∂L_AP/∂l_ir) (∂l_ir/∂d_r) (∂d_r/∂Λ)

with the sigmoid derivative

  ∂l(d_r)/∂d_r = γ l(d_r) (1 - l(d_r))
Discriminative ranking algorithms

- We use a bigram language model as an example.
- Using the steepest-descent algorithm, the parameters of the language model are adjusted iteratively by

  Λ(t+1) = Λ(t) - ε_t ∂L_AP(X; Λ)/∂Λ |_{Λ = Λ(t)}

- with the distance measure

  d_r(X; Λ) = -Σ_{m,n} n(x_r, w_{m,n}) log P(q_t | x_r; Λ) + max_i Σ_{m,n} n(x_i, w_{m,n}) log P(q_t | x_i; Λ)

  where n(x, w_{m,n}) is the count of the bigram (w_m, w_n) in document x.
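The update rule can be sketched on a toy problem. This is not the MRE training procedure itself: the scalar parameter lam, the linear toy distance d(lam), and the step size eps = 0.5 are all illustrative assumptions. It only demonstrates that descending along ∂l/∂d = γ l (1 - l) times ∂d/∂Λ drives the smoothed ranking loss down.

```python
import math

def smooth_loss(d, gamma=1.0):
    """Sigmoid-smoothed ranking loss l(d)."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

def descent_step(lam, d_of, eps=0.5, gamma=1.0, h=1e-6):
    """One steepest-descent update lam <- lam - eps * (dl/dd) * (dd/dlam),
    using dl/dd = gamma * l * (1 - l) and a finite-difference dd/dlam."""
    d = d_of(lam)
    l = smooth_loss(d, gamma)
    dl_dd = gamma * l * (1 - l)
    dd_dlam = (d_of(lam + h) - d_of(lam - h)) / (2 * h)
    return lam - eps * dl_dd * dd_dlam

# Toy distance: the relevant document's score grows with lam, so
# d(lam) = score_irrelevant - score_relevant = 1.0 - 2.0 * lam.
d_of = lambda lam: 1.0 - 2.0 * lam
lam = 0.0
for _ in range(20):
    lam = descent_step(lam, d_of)
print(lam, smooth_loss(d_of(lam)))
```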
Experiments

Experimental Setup

We evaluated our model with two different TREC collections: Wall Street Journal 1987 (WSJ87) and Associated Press Newswire 1988 (AP88).
Language Modeling

- We used the WSJ87 dataset as training data for language model estimation. The AP88 dataset is used as the test data.
- During the MRE training procedure, the parameters are adopted as η = 1, γ = 1, θ = 0 and ε = 0.5.

Comparison of perplexity:

          ML       MRE
Unigram   1781.03  1775.65
Bigram    443.55   440.38
Experiments on Information Retrieval

- Two query sets and the corresponding relevant documents in this collection:
  - TREC topics 51-100 as training queries.
  - TREC topics 101-150 as test queries.
- Queries were sampled from the 'title' and 'description' fields of the topics.
- The ML language model is used as the baseline system.
- To test the significance of improvement, the Wilcoxon test was employed in the evaluation.
Comparison of Average Precision

Collection  ML      MRE     Improvement  Wilcoxon
WSJ87       0.1012  0.1122  10.9%        0.0163*
AP88        0.1692  0.1913  13.1%        0*
Comparison of Precision at Document Level

Documents Retrieved  ML (I)  MCE (II)  MRE (III)  Wilcoxon (I-III)  Wilcoxon (II-III)
5 docs               0.416   0.464     0.472      0.0275*           0.1163
10 docs              0.406   0.426     0.456      0.0208*           0.0449*
15 docs              0.387   0.409     0.419      0.0251*           0.0447*
20 docs              0.373   0.397     0.405      0.0239*           0.0656
30 docs              0.339   0.361     0.371      0.0030*           0.0330*
100 docs             0.237   0.254     0.255      0*                0.0561
200 docs             0.165   0.173     0.175      0*                0.0622
500 docs             0.083   0.086     0.087      0*                0.0625
1000 docs            0.045   0.046     0.046      0*                0.0413*
R-Precision          0.232   0.247     0.273      0*                0.0096*
Summary

- Ranking learning requires considering non-relevance information.
- We will extend this method to spoken document retrieval.
- Future work will focus on the area under the ROC curve (AUC).
References

- M. Collins, "Discriminative reranking for natural language parsing", in Proc. 17th International Conference on Machine Learning, pp. 175-182, 2000.
- J. Gao, H. Qi, X. Xia, and J.-Y. Nie, "Linear discriminant model for information retrieval", in Proc. ACM SIGIR, pp. 290-297, 2005.
- D. Hull, "Using statistical testing in the evaluation of retrieval experiments", in Proc. ACM SIGIR, pp. 329-338, 1993.
- B. H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition", IEEE Trans. Speech and Audio Processing, pp. 257-265, 1997.
- B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification", IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992.
- H.-K. J. Kuo, E. Fosler-Lussier, H. Jiang, and C.-H. Lee, "Discriminative training of language models for speech recognition", in Proc. ICASSP, pp. 325-328, 2002.
- R. Nallapati, "Discriminative models for information retrieval", in Proc. ACM SIGIR, pp. 64-71, 2004.
- J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval", in Proc. ACM SIGIR, pp. 275-281, 1998.
- J.-N. Vittaut and P. Gallinari, "Machine learning ranking for structured information retrieval", in Proc. 28th European Conference on IR Research, pp. 338-349, 2006.
Thank You for Your Attention Thank You for Your Attention