
Learn to Weight Terms in Information Retrieval Using Category Information

Rong Jin¹, Joyce Y. Chai¹, Luo Si²

¹ Department of Computer Science and Engineering, Michigan State University
² School of Computer Science, Carnegie Mellon University

    International Conference on Machine Learning, 2005


Outline

1 Overview of Term Weighting Methods in Information Retrieval
  Term Weighting based on TF.IDF
  Term Weighting based on Language Models
  Problems with Existing Term Weighting Methods
2 Learn Term Weights Using Category Information
  A Framework for Learning Term Weights Using Category Information
  A Regression Approach
  A Probabilistic Approach
3 Experiment
  Experimental Design
  Baseline Approaches
  Experimental Results
4 Summary


Term Weighting Methods based on TF.IDF

The most popular family of methods in information retrieval, consisting of three factors:

Term frequency (TF): f(w, d)
- How frequently the term w appears in document d

Inverse document frequency (IDF):
- How rare the term w is in a collection C

  idf(w) = \log \frac{N + 0.5}{N(w)}

  N : the total number of documents in collection C
  N(w) : the number of documents in C containing word w

Document normalization factor, e.g. \|d\|_2
- Reduces the bias toward long documents


Okapi: An Example of TF.IDF Term Weighting

The similarity between query q and document d is:

sim(d, q) = \sum_{w \in q} \frac{k \, f(w, q) \, f(w, d)}{f(w, d) + k \left(1 - b + b \, \frac{|d|}{\bar{d}}\right)} \log \frac{N + 0.5}{N(w)}

where

f(w, q) : term frequency of w in query q
f(w, d) : term frequency of w in document d
\bar{d} : the average document length of collection C
k, b : weight parameters determined empirically
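
As a concrete reading of the formula, here is a minimal Python sketch; the function and variable names, and the defaults k = 1.2 and b = 0.75, are illustrative assumptions rather than values given on the slides.

```python
import math
from collections import Counter

def okapi_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
                k=1.2, b=0.75):
    """Okapi similarity from the slide. doc_freq maps w -> N(w)."""
    f_q = Counter(query_terms)                     # f(w, q)
    f_d = Counter(doc_terms)                       # f(w, d)
    score = 0.0
    for w, fwq in f_q.items():
        fwd = f_d.get(w, 0)
        if fwd == 0 or doc_freq.get(w, 0) == 0:
            continue                               # term contributes nothing
        norm = fwd + k * (1 - b + b * len(doc_terms) / avg_doc_len)
        idf = math.log((num_docs + 0.5) / doc_freq[w])
        score += k * fwq * fwd / norm * idf
    return score
```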


Term Weighting Methods based on Language Models

Assume each document d is generated by a statistical model \theta_d.
Estimate \theta_d by maximizing the likelihood p(d|\theta_d).
Usually a smoothing technique, such as Jelinek-Mercer smoothing or Dirichlet smoothing, is used to deal with the sparse-data problem.


An Example of Language Models for Information Retrieval

The unigram language model p(w|d) based on Jelinek-Mercer smoothing:

p(w|d) = (1 - \lambda) \, p(w|C) + \lambda \, \frac{f(w, d)}{|d|}
       = (1 - \lambda) \, p(w|C) \left(1 + \frac{\lambda}{1 - \lambda} \cdot \frac{f(w, d)}{|d| \, p(w|C)}\right)

where \lambda is a smoothing parameter. The similarity of query q to document d is estimated as

sim(q, d) \propto p(q|d) = \prod_{w \in q} [p(w|d)]^{f(w, q)}
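
A minimal sketch of this query-likelihood scoring, computed in log space; the names and the lam = 0.5 default are illustrative, and p(w|C) is assumed precomputed.

```python
import math
from collections import Counter

def jm_query_likelihood(query_terms, doc_terms, collection_lm, lam=0.5):
    """log p(q|d) under Jelinek-Mercer smoothing.
    collection_lm maps w -> p(w|C)."""
    f_d = Counter(doc_terms)
    doc_len = max(len(doc_terms), 1)               # |d|
    log_p = 0.0
    for w, fwq in Counter(query_terms).items():    # f(w, q)
        p_wc = collection_lm.get(w, 1e-9)          # small floor for unseen w
        p_wd = (1 - lam) * p_wc + lam * f_d.get(w, 0) / doc_len
        log_p += fwq * math.log(p_wd)              # product taken in log space
    return log_p
```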


Problems with Existing Term Weighting Methods

The essential difficulty in determining term weights is the lack of supervision.

Problems with TF.IDF methods:
- Neither IDF nor TF alone is sufficient to determine whether a word is informative.
- The IDF factor assumes that rare words are informative words; but typos are usually rare and uninformative.

Problems with language modeling approaches:
- They are generative models, and are insufficient to distinguish informative words from uninformative ones.


Learn Term Weights Using Category Information

Given: each document is assigned to a set of categories.
Goal: learn term weights from the assigned categories of documents.
Main idea:
- Each document is represented by both a bag of words and a set of categories.
- Compute document similarity based on words: s_w(d_i, d_j).
- Compute document similarity based on categories: s_c(d_i, d_j).
- Find term weights such that s_w(d_i, d_j) \approx s_c(d_i, d_j).


A Framework for Learning Term Weights Using Category Information

For each document d_i, we have

Word-based representation:     w_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,m})^T
Category-based representation: c_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,n})^T

Word-based document similarity:

s_w(d_i, d_j; \alpha) = \sum_{k=1}^{m} \alpha_k \, w_{i,k} \, w_{j,k}

Category-based document similarity:

s_c(d_i, d_j; \beta) = \sum_{k=1}^{n} \beta_k \, c_{i,k} \, c_{j,k}


A Framework for Learning Term Weights Using Category Information (Cont'd)

Find weights \alpha and \beta such that s_w(d_i, d_j; \alpha) \approx s_c(d_i, d_j; \beta) for any two documents d_i and d_j:

(\alpha^*, \beta^*) = \arg\min_{\alpha, \beta} \sum_{i \neq j} l\big(s_c(d_i, d_j; \beta), \, s_w(d_i, d_j; \alpha)\big)

where l(x, y) is a loss function that measures the difference between x and y.
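
A direct numpy transcription of the framework; the O(N²) loop over pairs and all names are illustrative, and the paper folds the same sums into the quadratic form shown on the next slide.

```python
import numpy as np

def s_w(w_i, w_j, alpha):
    """Word-based similarity s_w(d_i, d_j; alpha) = sum_k alpha_k w_ik w_jk."""
    return np.sum(alpha * w_i * w_j)

def s_c(c_i, c_j, beta):
    """Category-based similarity s_c(d_i, d_j; beta) = sum_k beta_k c_ik c_jk."""
    return np.sum(beta * c_i * c_j)

def framework_loss(W, C, alpha, beta, loss=lambda x, y: (x - y) ** 2):
    """Sum of l(s_c, s_w) over all document pairs i != j. W is an (N, m) matrix
    of word representations, C an (N, n) matrix of category representations;
    the squared-error default anticipates the regression approach."""
    total = 0.0
    n_docs = W.shape[0]
    for i in range(n_docs):
        for j in range(n_docs):
            if i != j:
                total += loss(s_c(C[i], C[j], beta), s_w(W[i], W[j], alpha))
    return total
```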


A Regression Approach Toward Learning Term Weights

Define the loss function l(s_c, s_w) = (s_c - s_w)^2.

The objective function F_{reg} can then be written as the quadratic form

F_{reg} = (\beta^T, \alpha^T) \begin{pmatrix} Q_c & -P^T \\ -P & Q_w \end{pmatrix} \begin{pmatrix} \beta \\ \alpha \end{pmatrix}

where

[Q_w]_{i,j} = (u_i^T u_j)^2, \quad [Q_c]_{i,j} = (v_i^T v_j)^2, \quad [P]_{i,j} = (u_i^T v_j)^2

u_i : frequency vector for the i-th term
v_j : frequency vector for the j-th category


The Regression Approach: Constraints

The trivial solution \alpha = \beta = 0 gives F_{reg} = 0, so a normalization constraint is needed.

L2 constraint:

\|\alpha\|_2^2 + \|\beta\|_2^2 \leq 1

Problem: it permits negative term weights \alpha_i < 0, which would mean that when two documents share word w_i, they are less likely to be similar.

L1 constraint:

\alpha_i \geq 0; \quad \beta_j \geq 0; \quad \sum_{i=1}^{m} \alpha_i + \sum_{j=1}^{n} \beta_j \leq 1


The Regression Approach: Final Form

\min_{\alpha, \beta} \; (\beta^T, \alpha^T) \begin{pmatrix} Q_c & -P^T \\ -P & Q_w \end{pmatrix} \begin{pmatrix} \beta \\ \alpha \end{pmatrix}

s.t. \; \alpha \geq 0, \; \beta \geq 0, \; \|\alpha\|_1 + \|\beta\|_1 \leq 1

Solved by standard quadratic programming techniques.
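
Any generic QP solver handles this directly; as a self-contained illustration only, here is a projected-gradient sketch under the L1 constraint. The step size, iteration count, and helper names are arbitrary assumptions, not the paper's solver.

```python
import numpy as np

def project_capped_simplex(x):
    """Euclidean projection onto {x >= 0, sum(x) <= 1}."""
    x = np.maximum(x, 0.0)
    if x.sum() <= 1.0:
        return x
    # Otherwise project onto the probability simplex (sort-based algorithm).
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def solve_regression_qp(M, dim, steps=2000, lr=1e-3):
    """Projected gradient descent for min z^T M z s.t. z >= 0, ||z||_1 <= 1."""
    z = np.full(dim, 1.0 / dim)            # feasible starting point
    M_sym = 0.5 * (M + M.T)                # symmetrize for a clean gradient
    for _ in range(steps):
        grad = 2.0 * M_sym @ z             # gradient of z^T M z
        z = project_capped_simplex(z - lr * grad)
    return z
```

Here z stacks (\beta, \alpha), M is the block matrix above with dim = n + m, and the solution splits back as \beta = z[:n], \alpha = z[n:].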


A Probabilistic Approach Toward Learning Term Weights

Probability for two documents to be similar based on words:

p^w_{i,j} = \frac{1}{1 + \exp(-s_w(d_i, d_j; \alpha) + \alpha_0)}

Probability for two documents to be similar based on categories:

p^c_{i,j} = \frac{1}{1 + \exp(-s_c(d_i, d_j; \beta) + \beta_0)}

Loss function: the cross entropy

l\big(s_c(d_i, d_j; \beta), \, s_w(d_i, d_j; \alpha)\big) = -p^c_{i,j} \log p^w_{i,j} - (1 - p^c_{i,j}) \log(1 - p^w_{i,j})
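
A minimal sketch of the pairwise loss, assuming the sign conventions reconstructed above; all names are illustrative.

```python
import numpy as np

def sim_prob(s, bias):
    """Logistic map from a similarity score to a similarity probability."""
    return 1.0 / (1.0 + np.exp(-s + bias))

def pairwise_cross_entropy(sw_ij, sc_ij, alpha0, beta0):
    """Cross-entropy loss for one document pair, treating the category-based
    probability as the target distribution."""
    pw = sim_prob(sw_ij, alpha0)   # p^w_{i,j}
    pc = sim_prob(sc_ij, beta0)    # p^c_{i,j}
    return -pc * np.log(pw) - (1.0 - pc) * np.log(1.0 - pw)
```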


The Probabilistic Approach: Final Form

Objective function F_{prob}:

F_{prob} = \sum_{i \neq j}^{N} \left[ p^c_{i,j} \log p^w_{i,j} + (1 - p^c_{i,j}) \log(1 - p^w_{i,j}) \right]

The final form for the probabilistic approach:

\arg\max_{\alpha, \beta, \alpha_0, \beta_0} \; F_{prob} - \lambda_w \sum_{i=1}^{m} \alpha_i - \lambda_c \sum_{j=1}^{n} \beta_j
\quad \text{s.t.} \; \alpha \geq 0, \; \beta \geq 0

where \lambda_w > 0 and \lambda_c > 0 are regularization parameters.


The Probabilistic Approach: Optimization Strategy

Alternating optimization:

Learn the term weights \alpha with the category weights \beta fixed:
- Decouple the correlation among the \alpha_i by lower-bounding the change in the objective,

  F_{prob}(\alpha', \beta) - F_{prob}(\alpha, \beta) \geq \sum_{i=1}^{m} g_i(\alpha'_i - \alpha_i)

  where \alpha and \alpha' are the term weights of two consecutive iterations.
- Solve g'_i(\delta_i) = 0 for each i and update \alpha'_i = \alpha_i + \delta_i.

Learn the category weights \beta with the term weights \alpha fixed:
- A similar procedure, optimizing \beta with \alpha fixed.
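
The slides give only the shape of the decoupled updates, not the g_i themselves, so the sketch below substitutes plain projected gradient steps; only the alternating structure and the nonnegativity projection are taken from the slides, and the gradient callables are assumed supplied by the caller.

```python
import numpy as np

def alternating_ascent(F_grad_alpha, F_grad_beta, alpha, beta,
                       outer=20, inner=100, lr=1e-3):
    """Generic alternating ascent on F_prob: update alpha with beta fixed,
    then beta with alpha fixed, projecting onto alpha, beta >= 0 after
    each step. Plain gradient steps stand in for the paper's decoupled
    bound-optimization updates."""
    for _ in range(outer):
        for _ in range(inner):                       # alpha step, beta fixed
            alpha = np.maximum(alpha + lr * F_grad_alpha(alpha, beta), 0.0)
        for _ in range(inner):                       # beta step, alpha fixed
            beta = np.maximum(beta + lr * F_grad_beta(alpha, beta), 0.0)
    return alpha, beta
```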


Experimental Design

Document collection:
- A document collection from the ad hoc retrieval task of ImageCLEF
- 28,133 documents and 933 categories in total
- Average document length: 50
- Average number of categories per document: 5

Evaluation queries:
- 5 queries from ImageCLEF 2003 for training \lambda_w and \lambda_c
- 25 queries from ImageCLEF 2004 for testing

Evaluation metrics:
- Average precision for the top retrieved documents
- Average precision across 11 recall points
- Precision-recall curves


Baseline Approaches

State-of-the-art information retrieval methods:
- The Okapi method (Okapi)
- The language model with JM smoothing (LM)

Inverse category frequency (ICF):

icf(w) = \log \frac{m}{m(w)}

m : the total number of categories
m(w) : the number of categories containing word w

Replace idf(w) with icf(w) in the Okapi method.
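
A small sketch of icf(w); the category dictionary layout is an assumption, and the result can be dropped in for idf in the okapi_score sketch above.

```python
import math

def icf(word, categories):
    """Inverse category frequency. `categories` maps a category id to the
    set of words occurring in that category."""
    m = len(categories)                                  # total categories
    m_w = sum(1 for words in categories.values() if word in words)
    return math.log(m / m_w) if m_w else 0.0
```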


Baseline Approaches (Cont'd)

Category-based query expansion (CQE):

1 Retrieve the top k = 100 documents for query q using Okapi.
2 Expand query q to include category information:

  q' = \{f(w_1, q), \ldots, f(w_n, q); \; f(c_1, q), \ldots, f(c_m, q)\}

  f(c_i, q) : the number of top-k documents in category c_i

3 Retrieve documents using the expanded query q':

  \log p(q'|d) = \lambda \, \frac{\sum_{i=1}^{n} f(w_i, q) \log p(w_i|d)}{\sum_{i=1}^{n} f(w_i, q)} + (1 - \lambda) \, \frac{\sum_{i=1}^{m} f(c_i, q) \log p(c_i|d)}{\sum_{i=1}^{m} f(c_i, q)}
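
A sketch of step 3; \lambda is left unspecified on the slide, so lam = 0.5 here is an arbitrary choice, and all names are illustrative.

```python
import numpy as np

def cqe_log_likelihood(word_weights, word_log_probs,
                       cat_weights, cat_log_probs, lam=0.5):
    """log p(q'|d) as a lambda-weighted mix of the normalized word and
    category components. Assumes both weight vectors are non-empty with
    positive sums."""
    w = np.asarray(word_weights, dtype=float)      # f(w_i, q)
    c = np.asarray(cat_weights, dtype=float)       # f(c_i, q)
    word_term = np.dot(w, word_log_probs) / w.sum()
    cat_term = np.dot(c, cat_log_probs) / c.sum()
    return lam * word_term + (1.0 - lam) * cat_term
```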


Precision-Recall Curves

[Figure: precision-recall curves over recall 0 to 1 for the Regression approach, the Discriminative (probabilistic) approach, Inverse Category Frequency, Category-based Query Expansion, Okapi, and the Language Model]

The probabilistic approach outperforms both the Language Model and Okapi.


Average Precision

                    Using Category               No Category
                 Reg.   Prob.   ICF    CQE      Okapi   LM
Avg. Prec.       0.45   0.48    0.38   0.42     0.41    0.41
Prec @ 5 docs    0.55   0.56    0.40   0.50     0.47    0.50
Prec @ 10 docs   0.48   0.52    0.40   0.48     0.45    0.48
Prec @ 20 docs   0.46   0.46    0.39   0.42     0.39    0.38
Prec @ 100 docs  0.21   0.21    0.19   0.19     0.20    0.20

Reg. and Prob. > Okapi and LM: category information is useful.
ICF and CQE < Okapi and LM: category information needs to be exploited wisely.


Retrieval Precision for Individual Queries

[Figure: average precision per query (query IDs 1-25) for the Language Model, Inverse Category Frequency, Category-based Query Expansion, and the Probabilistic approach]

On 16 of the 25 queries, the probabilistic approach beats the language model.
On 5 of the 25 queries, the probabilistic approach falls below the language model.


Summary

Proposed two algorithms for learning term weights using category information:
- A regression approach
- A probabilistic approach

Empirical studies with the ImageCLEF dataset verify the effectiveness of the proposed algorithms.

Future work:
- Improve learning efficiency for large numbers of documents and large vocabularies
- Extend to image retrieval for annotated images
