Latent Semantic Approaches for Information Retrieval and Language Modeling
Berlin Chen, Department of Computer Science & Information Engineering
National Taiwan Normal University
References
• G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R. Harshman, L.A. Streeter, K.E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," ACM SIGIR Conference on R&D in Information Retrieval, 1988
• J.R. Bellegarda, "Latent semantic mapping," IEEE Signal Processing Magazine, September 2005
• J.R. Bellegarda, Latent Semantic Mapping: Principles and Applications. Morgan and Claypool, 2007
• C.X. Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers, 2008
• T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning 42, 2001
• M. Steyvers, T. Griffiths, "Probabilistic topic models," in T.K. Landauer, D.S. McNamara, S. Dennis, W. Kintsch (eds.), Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum, 2007
• B. Chen, "Word topic models for spoken document retrieval and transcription," ACM Transactions on Asian Language Information Processing 8(1), pp. 2:1-2:27, 2009
• B. Chen, "Latent topic modeling of word co-occurrence information for spoken document retrieval," ICASSP 2009
• H.-S. Chiu, B. Chen, "Word topical mixture models for dynamic language model adaptation," ICASSP 2007
• D.M. Blei, A.Y. Ng, M.I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003
• W. Kim, S. Khudanpur, "Lexical triggers and latent semantic analysis for cross-lingual language model adaptation," ACM Transactions on Asian Language Information Processing 3(2), 2004
• D. Gildea, T. Hofmann, "Topic-based language models using EM," Eurospeech 1999
• L.K. Saul, F.C.N. Pereira, "Aggregate and mixed-order Markov models for statistical language processing," EMNLP 1997
Taxonomy of Classic IR Models

• User Task
  – Retrieval: Ad hoc, Filtering
  – Browsing
• Classic Models: Boolean, Vector, Probabilistic
  – Set Theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Analysis (LSA), Neural Networks
  – Probabilistic: Inference Network, Belief Network, Language Model (Unigram, Probabilistic LSA, Latent Dirichlet Allocation, Word Topic Model)
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
Classification of IR Models Along Two Axes

• Matching Strategy
  – Literal term matching
    • E.g., Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)
  – Concept matching
    • E.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Word Topic Model (WTM)
• Learning Capability
  – Heuristic approaches for term weighting, query expansion, document expansion, etc.
    • E.g., Vector Space Model, Latent Semantic Analysis
    • Most approaches are based on linear algebra operations
  – Solid statistical foundations (optimization algorithms)
    • E.g., Unigram or Hidden Markov Model (HMM), Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation (LDA), Word Topic Model (WTM)
    • Most models belong to the language modeling approach
Two Perspectives for IR Models (cont.)

• Literal Term Matching vs. Concept Matching
  – Example query (Mandarin broadcast news): 中國解放軍蘇愷戰機 ("Chinese PLA Sukhoi fighters")
  – Relevant document (an errorful automatic transcript on the topic 中共新一代空軍戰力, "Communist China's new-generation air force strength"): A Hong Kong Sing Tao Daily report quotes military observers as saying that Taiwan will have completely lost air superiority by 2005, because mainland Chinese fighters will surpass Taiwan's in both quantity and performance. The report notes that while importing advanced Russian weapons in large quantities, China is also speeding up the development of indigenous weapon systems; the improved Flying Leopard fighter at the Xi'an aircraft plant is about to be deployed. According to Japanese media, with a cross-strait war liable to break out at any time, Beijing's basic policy is to wage a high-tech local war; the PLA therefore plans to field two hundred more Sukhoi fighters, including the Su-30, before 2004. (Parts of the transcript contain speech recognition errors.)
  – There are usually many ways to express a given concept (an information need), so the literal terms in a user's query may not match those of a relevant document
Latent Semantic Analysis (LSA)

• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM), or Two-Mode Factor Analysis
  – Originally formulated in the context of information retrieval
    • Users tend to retrieve documents on the basis of conceptual content
    • Individual terms (units) provide unreliable evidence about the conceptual topic or meaning of a document (composition)
    • There are many ways to express a given concept
  – LSA attempts to explore some underlying latent semantic structure in the data (documents), which is partially obscured by the randomness of word choices
  – LSA results in a parsimonious description of terms and documents
    • Contextual or positional information for words in documents is discarded (the so-called bag-of-words assumption)
Applications of LSA

• Information Retrieval
• Word/Document/Topic Clustering
• Language Modeling
• Automatic Call Routing
• Language Identification
• Pronunciation Modeling
• Speaker Verification (Prosody Analysis)
• Utterance Verification
• Text/Speech Summarization
• Automatic Image Annotation
• ...
LSA: Schematic Depiction

• Dimension Reduction and Feature Extraction
  – PCA: given an orthonormal basis $\{\varphi_1,\ldots,\varphi_k\}$, a vector $X$ is projected as $y_i = \varphi_i^T X$ and reconstructed as $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$, where the basis minimizes $\|X - \hat{X}\|^2$ for a given $k$
  – SVD (in LSA): the $m \times n$ word-document matrix $A$ is approximated by $A' = U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^T$, with $k \le r \le \min(m,n)$, minimizing $\|A - A'\|_F^2$ for a given $k$; the retained columns of $U$ and $V$ span the latent semantic space
LSA: An Example

– Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction
– Projection of a vector $x$ onto a unit basis vector $\varphi_1$:
  $y_1 = \|x\| \cos\theta_1 = x^T \varphi_1$, where $\|\varphi_1\| = 1$
LSA: Latent Structure Space

• Two alternative frameworks to circumvent vocabulary mismatch
  – Document expansion: map the document's terms through the latent structure model before matching
  – Query expansion: map the query's terms through the latent structure model before matching
  – In either case, literal term matching is replaced by latent semantic structure retrieval
LSA: Another Example (1/2)

(Figure: a word-document matrix built from twelve example document titles.)

LSA: Another Example (2/2)

Query: "human computer interaction"
(one of the query terms is an OOV word with respect to the indexed vocabulary)
LSA: Theoretical Foundation

• Singular Value Decomposition (SVD)
  – The word-document matrix $A_{m\times n}$ (rows: words $w_1,\ldots,w_m$ as units; columns: documents $d_1,\ldots,d_n$ as compositions) is decomposed as
    $A = U_{m\times r}\,\Sigma_r\,V_{r\times n}^T$
  – Row space of $A \subseteq R^n$; column space of $A \subseteq R^m$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r\times r}$, $V^T V = I_{r\times r}$
  – $r \le \min(m,n)$, and $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 \ge \|A'\|_F^2$
  – Keeping only the $k \le r$ largest singular values gives $A' = U'_{m\times k}\,\Sigma_k\,V'^T_{k\times n}$
  – Docs and queries are represented in a $k$-dimensional space; the quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
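For concreteness, here is a minimal numpy sketch of the rank-k approximation (not part of the original slides); the toy matrix reuses the 4-term, 3-document example from the exercise later in this deck.

```python
import numpy as np

# Toy 4-term x 3-doc word-document matrix
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

k = 2                                             # retained dimensions
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

A_prime = U_k @ S_k @ Vt_k                        # best rank-k approximation
print(np.linalg.norm(A - A_prime, "fro"))         # ||A - A'||_F
```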
LSA: Theoretical Foundation

• The "word-document" matrix $A$ captures the co-occurrences between terms (units) and documents (compositions)
  – Contextual or positional information for words in documents is discarded: "bag-of-words" modeling
• Feature extraction for the entries $a_{i,j}$ of matrix $A$:
  1. Conventional tf-idf statistics
  2. Or, occurrence frequency weighted by negative entropy:
     $a_{i,j} = (1-\varepsilon_i)\,\dfrac{f_{i,j}}{d_j}$, with document length $d_j = \sum_{i=1}^{m} f_{i,j}$
     where $f_{i,j}$ is the occurrence count of term $i$ in document $j$, and $(1-\varepsilon_i)$ is the negative normalized entropy of term $i$:
     $\varepsilon_i = -\dfrac{1}{\log n}\sum_{j=1}^{n}\dfrac{f_{i,j}}{\tau_i}\log\dfrac{f_{i,j}}{\tau_i}$, $\quad \tau_i = \sum_{j=1}^{n} f_{i,j}$, $\quad 0 \le \varepsilon_i \le 1$
     ($\tau_i$: occurrence count of term $i$ in the whole collection)
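A small numpy sketch of this weighting scheme, assuming a hypothetical $m \times n$ matrix F of raw counts $f_{i,j}$:

```python
import numpy as np

# F: hypothetical raw count matrix, f_ij = count of term i in document j
F = np.array([[3., 0., 1.],
              [0., 2., 2.],
              [5., 0., 1.]])
n = F.shape[1]                       # number of documents
d = F.sum(axis=0)                    # document lengths d_j
tau = F.sum(axis=1, keepdims=True)   # collection counts tau_i

P = F / tau                          # f_ij / tau_i
logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
eps = -(P * logP).sum(axis=1) / np.log(n)   # normalized entropy, in [0, 1]

A = (1.0 - eps)[:, None] * F / d[None, :]   # a_ij = (1 - eps_i) f_ij / d_j
print(np.round(A, 3))
```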
LSA: Theoretical Foundation

• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, and $\Sigma^2 = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V = [v_1\,v_2\,\cdots\,v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n\times n}$
  – Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1,\ldots,n$
    • They are the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$:
      $\|Av_i\|^2 = (Av_i)^T(Av_i) = v_i^T A^T A v_i = \lambda_i\,v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
    • For $\lambda_i \ne 0$, $i = 1,\ldots,r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$
LSA: Theoretical Foundation

• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$ ($\subseteq R^m$):
  $(Av_i)\cdot(Av_j) = v_i^T A^T A v_j = \lambda_j\,v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$:
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\quad \lambda_{r+1} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for $\mathrm{Col}\,A$:
    $u_i = \dfrac{Av_i}{\|Av_i\|} = \dfrac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\,\cdots\,u_r]\,\Sigma_{r\times r} = A\,[v_1\,\cdots\,v_r]$
    ($V$ is an orthonormal matrix, known in advance; $U$, of size $m\times r$, is also orthonormal)
  – Extend to an orthonormal basis $\{u_1, \ldots, u_m\}$ of $R^m$:
    $[u_1\,\cdots\,u_m]\,\Sigma_{m\times n} = A\,[v_1\,\cdots\,v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T$
    where $\Sigma = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$ and $VV^T = I_{n\times n}$
  – $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$
LSA: Theoretical Foundation

(Figure: the fundamental subspaces under SVD.)
– $V_1 = [v_1\,\cdots\,v_r]$ spans the row space of $A$ (in $R^n$); for the remaining right-singular vectors $V_2$, $Av_i = 0$
– $U_1 = [u_1\,\cdots\,u_r]$ spans the column space of $A$ (in $R^m$)
– $A = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$, with $U\Sigma = AV$
LSA: Theoretical Foundation

• Additional Explanations
  – Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$:
    $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma$
    • the $i$-th entry of a row of $U\Sigma$ is the projection of the corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$:
    $A^T = V\Sigma^T U^T \;\Rightarrow\; A^T U = V\Sigma^T U^T U = V\Sigma^T = V\Sigma$
    • the $i$-th entry of a row of $V\Sigma$ is the projection of the corresponding row of $A^T$ onto the $i$-th column of $U$
LSA: Theoretical Foundation

• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$):
    • compare two terms → dot product of two rows of $A$ (i.e., an entry in $AA^T$)
    • compare two docs → dot product of two columns of $A$ (i.e., an entry in $A^T A$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$ ($\Sigma'$ stretches or shrinks the axes):
    • compare two terms: $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
      → dot product of two rows of $U'\Sigma'$
    • compare two docs: $A'^T A' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
      → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$
LSA: Fold-in

• Find representations for pseudo-docs
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m\times 1$ query (or doc) vector:
  $\hat{q}_{1\times k} = q_{1\times m}^T\,U_{m\times k}\,\Sigma_{k\times k}^{-1}$
  – The query is represented as the weighted sum of its constituent term (word) vectors; the separate dimensions are differentially weighted, just like a row of $V$ (see Figure A on the next page)
• Cosine measure between the query and doc vectors in the latent semantic space:
  $\mathrm{sim}(\hat{q},\hat{d}) = \cos(\hat{q}\Sigma,\hat{d}\Sigma) = \dfrac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\,\|\hat{d}\Sigma\|}$ ($\hat{q}$, $\hat{d}$: row vectors)
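A minimal sketch of fold-in and the cosine measure, continuing the numpy SVD sketch above (U_k, S_k, Vt_k as computed there):

```python
import numpy as np

def fold_in(q, U_k, S_k):
    # q_hat = q^T U_k S_k^{-1}: a 1 x k pseudo-document, "just like a row of V"
    return q @ U_k @ np.linalg.inv(S_k)

def lsa_sim(q_hat, d_hat, S_k):
    # cosine between the Sigma-scaled query and document vectors
    qs, ds = q_hat @ S_k, d_hat @ S_k
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

# Example: a query containing terms 0 and 2; docs are the rows of V = Vt_k.T
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in(q, U_k, S_k)
scores = [lsa_sim(q_hat, d_hat, S_k) for d_hat in Vt_k.T]
```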
LSA: Theoretical Foundation

• Fold-in a new $1 \times n$ term vector (see Figure B):
  $\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma_{k\times k}^{-1}$

(Figure A: fold-in of a query/document vector; Figure B: fold-in of a term vector.)
LSA: A Simple IR Evaluation

• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels

(Figure: recall-precision curves at 11 standard recall levels evaluated on the TDT-3 SD collection, using word-level indexing terms.)
LSA: Pros and Cons (1/2)

• Pros (advantages)
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
    • Individual terms as the descriptors of documents are replaced by independent "artificial concepts" that can be specified by any one of several terms (or documents) or combinations thereof
  – Good results for high-recall search
    • Takes term co-occurrence into account
LSA: Pros and Cons (2/2)

• Cons (disadvantages)
  – High computational complexity (e.g., SVD decomposition)
  – Exhaustive comparison of a query against all stored documents is needed (cannot make use of inverted files?)
  – LSA offers only a partial solution to polysemy (e.g., bank, bass, ...)
    • Every term is represented as just one point in the latent space (a weighted average of the different meanings of the term)
LSA: Junk E-mail Filtering

• One vector represents the centroid of all e-mails that are of interest to the user, while the other represents the centroid of all e-mails that are not of interest
LSA: Dynamic Language Model Adaptation (1/4)

• Let $w_q$ denote the word about to be predicted, and $H_{q-1}$ the admissible LSA history (context) for this particular word
  – The vector representation of $H_{q-1}$ is expressed by $\tilde{d}_{q-1}$
    • Which can then be projected into the latent semantic space (with the change of notation $S = \Sigma$):
      $\tilde{v}_{q-1} = \tilde{d}_{q-1}^T\,U\,S^{-1}$
• Iteratively update $\tilde{d}_{q-1}$ and $\tilde{v}_{q-1}$ as the decoding evolves; when word $w_i$ is decoded at position $q$ ($n_q$: number of words decoded so far):
  $\tilde{d}_q = \dfrac{n_q - 1}{n_q}\,\tilde{d}_{q-1} + \dfrac{1-\varepsilon_i}{n_q}\,[0 \cdots 1 \cdots 0]^T$ (VSM representation)
  $\tilde{v}_q = \tilde{d}_q^T\,U\,S^{-1} = \dfrac{n_q - 1}{n_q}\,\tilde{v}_{q-1} + \dfrac{1-\varepsilon_i}{n_q}\,u_i$ (LSA representation)
  – or, with exponential decay:
    $\tilde{v}_q = \dfrac{1}{n_q}\left[\lambda\,(n_q-1)\,\tilde{v}_{q-1} + (1-\varepsilon_i)\,u_i\right]$, $\quad 0 < \lambda < 1$
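A one-function sketch of this update rule; the names v_prev, u_i, eps_i, n_q, and lam mirror the symbols above and are otherwise hypothetical:

```python
import numpy as np

def update_history(v_prev, u_i, eps_i, n_q, lam=1.0):
    # v_q = (1/n_q) * [lam * (n_q - 1) * v_{q-1} + (1 - eps_i) * u_i];
    # lam = 1 gives the plain running update, 0 < lam < 1 adds exponential decay
    return (lam * (n_q - 1) * v_prev + (1.0 - eps_i) * u_i) / n_q
```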
LSA: Dynamic Language Model Adaptation (2/4)

• Integration of LSA with N-grams:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \Pr(w_q \mid H_{q-1}^{(n)}, H_{q-1}^{(l)})$
  where $H_{q-1}$ denotes some suitable history for word $w_q$, and the superscripts $(n)$ and $(l)$ refer to the N-gram component ($w_{q-1} w_{q-2} \cdots w_{q-n+1}$, with $n > 1$) and the LSA component ($\tilde{d}_{q-1}$)
• This expression can be rewritten as:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \dfrac{\Pr(w_q, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})}{\sum_{w_i \in V} \Pr(w_i, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})}$
LSA: Dynamic Language Model Adaptation (3/4)

• Integration of LSA with N-grams (cont.):
  $\Pr(w_q, H_{q-1}^{(l)} \mid H_{q-1}^{(n)}) = \Pr(w_q \mid H_{q-1}^{(n)})\,\Pr(H_{q-1}^{(l)} \mid w_q, H_{q-1}^{(n)})$
  $= \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q, w_{q-1} \cdots w_{q-n+1})$
  $\approx \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q)$
  – Assume the probability of the document history given the current word is not affected by the immediate context preceding it
• By Bayes' rule, $\Pr(\tilde{d}_{q-1} \mid w_q) = \dfrac{\Pr(w_q \mid \tilde{d}_{q-1})\,\Pr(\tilde{d}_{q-1})}{\Pr(w_q)}$, so that:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \dfrac{\Pr(w_q \mid w_{q-1} \cdots w_{q-n+1}) \cdot \dfrac{\Pr(w_q \mid \tilde{d}_{q-1})}{\Pr(w_q)}}{\sum_{w_i \in V} \Pr(w_i \mid w_{q-1} \cdots w_{q-n+1}) \cdot \dfrac{\Pr(w_i \mid \tilde{d}_{q-1})}{\Pr(w_i)}}$
LSA: Dynamic Language Model Adaptation (4/4)

• Intuitively, $\Pr(w_q \mid \tilde{d}_{q-1})$ reflects the "relevance" of word $w_q$ to the admissible history, as observed through $\tilde{d}_{q-1}$:
  $\Pr(w_q \mid \tilde{d}_{q-1}) \approx K(u_q, \tilde{v}_{q-1}) = \cos(u_q S^{1/2}, \tilde{v}_{q-1} S^{1/2}) = \dfrac{u_q\,S\,\tilde{v}_{q-1}^T}{\|u_q S^{1/2}\|\,\|\tilde{v}_{q-1} S^{1/2}\|}$
• As such, it will be highest for words whose meaning aligns most closely with the semantic fabric of $\tilde{d}_{q-1}$ (i.e., relevant "content" words), and lowest for words which do not convey any particular information about this fabric (e.g., "function" words like "the")
LSA: Cross-lingual Language Model Adaptation (1/2)

• Assume that a document-aligned (instead of sentence-aligned) Chinese-English bilingual corpus is provided

(Source: W. Kim, S. Khudanpur, "Lexical triggers and latent semantic analysis for cross-lingual language model adaptation," TALIP 3(2), 2004.)
LSA: Cross-lingual Language Model Adaptation (2/2)

• CL-LSA adapted language model, where $d_i^E$ is a relevant English doc of the Mandarin doc $d_i^C$ being transcribed, obtained by CL-IR:
  $P_{\text{Adapt}}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda \cdot P_{\text{CL-LSA-Unigram}}(c_k \mid d_i^E) + (1-\lambda) \cdot P_{\text{BG-Trigram}}(c_k \mid c_{k-1}, c_{k-2})$
  $P_{\text{CL-LSA-Unigram}}(c \mid d_i^E) = \sum_{e} P_T(c \mid e)\,P(e \mid d_i^E)$
  $P_T(c \mid e) \approx \dfrac{\mathrm{sim}(\vec{c},\vec{e})^{\gamma}}{\sum_{c'} \mathrm{sim}(\vec{c'},\vec{e})^{\gamma}}$, $\quad \gamma \gg 1$
Probabilistic Latent Semantic Analysis (PLSA)

• PLSA models the co-occurrence of words and documents and evaluates the relevance in a low-dimensional semantic/topic space
  – Each document is treated as a document model $M_D$:
    $P_{\text{PLSA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$
• PLSA can be viewed as a nonnegative factorization of a "word-document" matrix consisting of probability entries
  – A procedure similar to the SVD performed by its algebraic counterpart, LSA: the $m \times n$ matrix $A$ (entries $\approx P(w_i \mid M_{D_j})$) is factored into an $m \times K$ matrix of topic unigrams $P(w_i \mid T_k)$ and a $K \times n$ matrix $H$ of mixture weights $P(T_k \mid D_j)$
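For concreteness, a minimal EM sketch for fitting the two PLSA factors from a word-document count matrix; this is a generic PLSA trainer, not necessarily the exact recipe used in the cited experiments:

```python
import numpy as np

def plsa_em(C, K, iters=50, seed=0):
    """Fit P(w|T) (m x K) and P(T|D) (K x n) to a count matrix C (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    PwT = rng.random((m, K)); PwT /= PwT.sum(axis=0)   # P(w|T), columns sum to 1
    PTD = rng.random((K, n)); PTD /= PTD.sum(axis=0)   # P(T|D), columns sum to 1
    for _ in range(iters):
        # E-step: posterior P(T_k | w_i, D_j), shape m x n x K
        joint = PwT[:, None, :] * PTD.T[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step: reweight by counts and renormalize
        weighted = C[:, :, None] * post
        PwT = weighted.sum(axis=1); PwT /= PwT.sum(axis=0)
        PTD = weighted.sum(axis=0).T; PTD /= PTD.sum(axis=0)
    return PwT, PTD
```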
PLSA: Information Retrieval (1/3)

• The relevance measure between a query and a document can be expressed by
  $P_{\text{PLSA}}(Q \mid M_D) = \prod_{w_i \in Q}\left[\sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)\right]^{c(w_i,Q)}$
  – The relevance measure is not obtained based on the frequency of a respective query term occurring in a document, but instead based on the frequency of the term and document in the latent topics
  – A query and a document thus may have a high relevance score even if they do not share any terms in common
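A short sketch of this query-likelihood scoring on top of the PwT and PTD factors from the EM sketch above; q_counts is a hypothetical dict mapping query word indices to counts $c(w_i, Q)$:

```python
import numpy as np

def log_plsa_query_likelihood(q_counts, j, PwT, PTD):
    # log P_PLSA(Q | M_{D_j}) = sum_i c(w_i,Q) * log sum_k P(w_i|T_k) P(T_k|D_j)
    log_score = 0.0
    for i, c in q_counts.items():
        p = PwT[i] @ PTD[:, j]
        log_score += c * np.log(p)
    return log_score
```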
PLSA: Information Retrieval (2/3)

• Unsupervised training: the model parameters are trained beforehand using a set of text documents
  – Maximize the log-likelihood of the entire collection $\mathbf{D}$:
    $\log L_{\mathbf{D}} = \sum_{D \in \mathbf{D}} \log P_{\text{PLSA}}(D \mid M_D) = \sum_{D \in \mathbf{D}} \sum_{w_i \in D} c(w_i, D)\,\log P_{\text{PLSA}}(w_i \mid M_D)$
• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training set of query exemplars generated by their relevant documents:
    $\log L = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \log P_{\text{PLSA}}(Q \mid M_D) = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} c(w_i, Q)\,\log P_{\text{PLSA}}(w_i \mid M_D)$
PLSA: Information Retrieval (3/3)

• Example: most probable words from 4 latent topics: aviation, space missions, family love, Hollywood love
PLSA vs. LSA

• Decomposition/Approximation
  – LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrices
  – PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model
• Computational complexity
  – LSA: SVD decomposition
  – PLSA: EM training, which is time-consuming over iterations
  – The model complexity of both LSA and PLSA grows linearly with the number of training documents
    • There is no general way to estimate or predict the vector representation (of LSA) or the model parameters (of PLSA) for a newly observed document
• LSA and PLSA both assume "bag-of-words" representations of documents (how to distinguish "street market" from "market street"?)
PLSA: Dynamic Language Model Adaptation

• The search history can be treated as a pseudo-document $H_{w_i}$ which varies during the speech recognition process:
  $P_{\text{PLSA}}(w_i \mid H_{w_i}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid H_{w_i})$
  – The topic unigrams $P(w_i \mid T_k)$ are kept unchanged
  – The history's probability distribution over the latent topics is gradually updated
  – The topic mixture weights $P(T_k \mid H_{w_i})$ are estimated on the fly
    • This can be time-consuming
PLSA: Document Organization (1/3)

• Each document is viewed as a document model to generate itself
  – Additional transitions between topical mixtures have to do with the topological relationships between topical classes on a 2-D map:
    $P_{\text{MPLSA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(T_k \mid M_D)\left[\sum_{l=1}^{K} P(T_l \mid T_k)\,P(w_i \mid T_l)\right]$
    $E(T_k, T_l) = \dfrac{1}{2\pi\sigma^2}\exp\left(-\dfrac{\mathrm{dist}(T_k,T_l)^2}{2\sigma^2}\right)$
    $P(T_l \mid T_k) = \dfrac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}$

(Figure: a two-dimensional tree structure for organized topics.)
PLSA: Document Organization (2/3)

• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection:
  $\log L_T = \sum_{j=1}^{n}\sum_{i=1}^{V} c(w_i, D_j)\,\log P(w_i \mid D_j)$
• Each topical class can be labeled by words selected using the following criterion:
  $\mathrm{Sig}(w_i, T_k) = \dfrac{\sum_{j=1}^{n} c(w_i, D_j)\,P(T_k \mid D_j)}{\sum_{j=1}^{n} c(w_i, D_j)\left[1 - P(T_k \mid D_j)\right]}$
PLSA: Document Organization (3/3)

• Spoken document retrieval and browsing system developed by NTU (Prof. Lin-shan Lee)
Latent Dirichlet Allocation (LDA) (1/2)

• The basic generative process of LDA closely resembles PLSA; however,
  – In PLSA, the topic mixture $P(T_k \mid D)$ is conditioned on each document ($P(T_k \mid D)$ is fixed but unknown)
  – While in LDA, the topic mixture $P(T_k \mid D)$ is drawn from a Dirichlet distribution, the so-called conjugate prior ($P(T_k \mid D)$ is unknown and follows a probability distribution)
• Process of generating a corpus with LDA:
  1) Pick a multinomial distribution $\varphi_T$ for each topic $T$ from a Dirichlet distribution with parameter $\beta$
  2) Pick a multinomial distribution $\theta_D$ for each document $D$ from a Dirichlet distribution with parameter $\alpha$
  3) Pick a topic $T \in \{1, 2, \ldots, K\}$ from the multinomial distribution $\theta_D$
  4) Pick a word $w$ from the multinomial distribution $\varphi_T$

(Blei et al., "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003)
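The four steps can be simulated directly with numpy's Dirichlet and categorical samplers; the vocabulary size V, topic count K, document length L, and concentration values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, L = 1000, 8, 50
beta, alpha = 0.1 * np.ones(V), 0.5 * np.ones(K)

phi = rng.dirichlet(beta, size=K)    # 1) one word multinomial per topic
theta = rng.dirichlet(alpha)         # 2) a topic multinomial for this document
doc = []
for _ in range(L):
    t = rng.choice(K, p=theta)       # 3) pick a topic from theta_D
    w = rng.choice(V, p=phi[t])      # 4) pick a word from phi_T
    doc.append(w)
```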
Latent Dirichlet Allocation (2/2)

$P_{\text{LDA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$

(Figure: for a three-word vocabulary, each document model is a point $(X, Y, Z) = (P(w_1 \mid M_D),\,P(w_2 \mid M_D),\,P(w_3 \mid M_D))$ on the simplex $X + Y + Z = 1$.)
Word Topic Models (WTM)

• Each word of language is treated as a word topical mixture model for predicting the occurrences of other words:
  $P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
• WTM also can be viewed as a nonnegative factorization, this time of a "word-word" matrix consisting of probability entries
  – Each column encodes the vicinity information of all occurrences of a certain type of word: the $m \times m$ matrix $B$ (entries $\approx P(w_i \mid M_{w_j})$) is factored into an $m \times K$ matrix of topic unigrams $P(w_i \mid T_k)$ and a $K \times m$ matrix $Q$ of mixture weights $P(T_k \mid M_{w_j})$
WTM: Information Retrieval (1/3)

• The relevance measure between a query and a document can be expressed by
  $P_{\text{WTM}}(Q \mid D) = \prod_{w_i \in Q}\left[\sum_{w_j \in D} \alpha_{w_j,D} \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})\right]^{c(w_i,Q)}$
  where $\alpha_{w_j,D}$ weights the contribution of the WTM model of each document word $w_j$
• Unsupervised training
  – The WTM of each word $w_j$ can be trained by concatenating those words occurring within a context window around each occurrence of the word, which are postulated to be relevant to the word:
    $Q_{w_j} = \{Q_{w_j,1}, Q_{w_j,2}, \ldots, Q_{w_j,N}\}$
    $\log L_{\text{WTM}} = \sum_{w_j} \log P_{\text{WTM}}(Q_{w_j} \mid M_{w_j}) = \sum_{w_j} \sum_{w_i \in Q_{w_j}} c(w_i, Q_{w_j})\,\log P_{\text{WTM}}(w_i \mid M_{w_j})$
WTM: Information Retrieval (2/3)

• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training set of query exemplars generated by their relevant documents:
    $\log L_{\text{WTM}} = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \log P_{\text{WTM}}(Q \mid D)$
WTM: Information Retrieval (3/3)

• Formulas for supervised training (EM updates):
  $\hat{P}(w_i \mid T_k) = \dfrac{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q}\ n(w_i, Q)\,P(T_k \mid w_i, D)}{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_{i'} \in Q} n(w_{i'}, Q)\,P(T_k \mid w_{i'}, D)}$
  where
  $P(T_k \mid w_i, D) = \dfrac{\sum_{w_j \in D} \alpha_{w_j,D}\,P(w_i \mid T_k)\,P(T_k \mid M_{w_j})}{\sum_{l=1}^{K} \sum_{w_j \in D} \alpha_{w_j,D}\,P(w_i \mid T_l)\,P(T_l \mid M_{w_j})}$

  $\hat{P}(T_k \mid M_{w_j}) = \dfrac{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} n(w_i, Q)\,P(M_{w_j} \mid w_i, D)\,P(T_k \mid w_i, M_{w_j})}{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} n(w_i, Q)\,P(M_{w_j} \mid w_i, D)}$
  where
  $P(M_{w_j} \mid w_i, D) = \dfrac{\alpha_{w_j,D}\,P(w_i \mid M_{w_j})}{\sum_{w_l \in D} \alpha_{w_l,D}\,P(w_i \mid M_{w_l})}$, with $P(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
  and
  $P(T_k \mid w_i, M_{w_j}) = \dfrac{P(w_i \mid T_k)\,P(T_k \mid M_{w_j})}{\sum_{z=1}^{K} P(w_i \mid T_z)\,P(T_z \mid M_{w_j})}$
WTM: Dynamic Language Model Adaptation (1/2)

• For a decoded word $w_i$, we can again interpret it as a (single-word) query; for each of its search histories, expressed by $H_{w_i} = w_1, w_2, \ldots, w_{i-1}$, we can linearly combine the associated WTM models of the words occurring in $H_{w_i}$ to form a composite WTM model:
  $P_{\text{WTM}}(w_i \mid M_{H_{w_i}}) = \sum_{j=1}^{i-1} \beta_j\,P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{j=1}^{i-1} \beta_j \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
  – $\beta_j$ are nonnegative weighting coefficients, empirically set to decay exponentially as word $w_j$ gets farther apart from $w_i$:
    $\beta_j = \varphi_j \prod_{s=j+1}^{i-1} (1 - \varphi_s)$
  – $\varphi_j$ is set to a fixed value (between 0 and 1) for $j = 2, \ldots, i-1$, and set to 1 for $j = 1$
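A small sketch computing the beta weights from a fixed phi, following the two conventions above ($\varphi_1 = 1$; $\varphi_j = \varphi$ otherwise); the weights sum to 1:

```python
def wtm_history_weights(i, phi=0.5):
    # phis[j-1] holds phi_j for j = 1 .. i-1
    phis = [1.0] + [phi] * (i - 2)
    betas = []
    for j in range(1, i):                  # beta_j = phi_j * prod_{s>j} (1 - phi_s)
        tail = 1.0
        for s in range(j + 1, i):
            tail *= (1.0 - phis[s - 1])
        betas.append(phis[j - 1] * tail)
    return betas

# e.g. wtm_history_weights(4, 0.5) -> [0.25, 0.25, 0.5] (recent words weigh more)
```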
WTM: Dynamic Language Model Adaptation (2/2)

• For our speech recognition test data, it was experimentally observed that the language model access time of WTM was approximately 1/30 of that of PLSA for language model adaptation, as the iteration number of the online EM estimation of $P(T_k \mid H_{w_i})$ for PLSA was set to 5
  $P_{\text{Adapt}}(w_i \mid w_{i-1} w_{i-2} \cdots w_1) = \lambda \cdot P_{\text{WTM}}(w_i \mid M_{H_{w_i}}) + (1-\lambda) \cdot P_{\text{BG}}(w_i \mid w_{i-1} w_{i-2})$
  – BG: background $n$-gram model
Comparison of WTM and PLSA/LDA

• A schematic comparison for the matrix factorizations of PLSA/LDA and WTM
  – PLSA/LDA factor the normalized "word-document" co-occurrence matrix: $A \approx G\,H$, where $G$ (words × topics) holds the mixture components and $H$ (topics × documents) the mixture weights:
    $P_{\text{PLSA/LDA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$
  – WTM factors the normalized "word-word" co-occurrence matrix: $B \approx Q\,Q'$, where $Q$ (words × topics) holds the mixture components and $Q'$ (topics × vicinities of words) the mixture weights:
    $P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
Summary: Three Ways of Developing LM Approaches for IR

(a) Query likelihood
(b) Document likelihood
(c) Model comparison
• Each can be realized with either literal term matching or concept matching
LSA: SVDLIBC

• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
LSA: Exercise (1/4)

• Given a sparse term-document matrix in SVDLIBC's sparse-text format:
  – The first line gives the number of terms (rows), the number of documents (columns), and the number of nonzero entries
  – Then, for each column (document), one line gives the number of nonzero entries in that column, followed by one "row-index value" line per nonzero entry
  – E.g., 4 terms and 3 docs with the matrix

    2.3 0.0 4.2
    0.0 1.3 2.2
    3.8 0.0 0.5
    0.0 0.0 0.0

  are stored as

    4 3 6
    2
    0 2.3
    2 3.8
    1
    1 1.3
    3
    0 4.2
    1 2.2
    2 0.5

  (2 nonzero entries at Col 0: Rows 0 and 2; 1 nonzero entry at Col 1: Row 1; 3 nonzero entries at Col 2: Rows 0, 1, and 2)
  – Each entry can be weighted by a TFxIDF score
• Perform SVD to obtain term and document vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600) of LSA dimensionality
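A helper along these lines can generate the input file; this sketch assumes the format exactly as described above (header line, then per-column counts and "row value" pairs):

```python
import numpy as np

def write_st(A, path):
    """Write a dense term-document matrix in SVDLIBC sparse-text format."""
    rows, cols = A.shape
    nnz = int(np.count_nonzero(A))
    with open(path, "w") as f:
        f.write(f"{rows} {cols} {nnz}\n")          # header: rows cols nonzeros
        for j in range(cols):
            idx = np.nonzero(A[:, j])[0]
            f.write(f"{len(idx)}\n")               # nonzero count for column j
            for i in idx:
                f.write(f"{i} {A[i, j]}\n")        # row index and value

write_st(np.array([[2.3, 0.0, 4.2],
                   [0.0, 1.3, 2.2],
                   [3.8, 0.0, 0.5],
                   [0.0, 0.0, 0.0]]), "Term-Doc-Matrix")
```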
LSA: Exercise (2/4)

• Example: a term-document matrix with 51253 indexing terms and 2265 docs
  – Header line: "51253 2265 21885277" (indexing term no., doc no., nonzero entries), followed per column by the entry count and "row-index weight" pairs (e.g., 508 7.725771, 596 16.213399, 612 13.080868, ...)
• SVD command (IR_svd.bat):
  svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse-text matrix input format
  – -o LSA100: prefix of the output files
  – -d 100: number of reserved eigenvectors (dimensions)
  – Term-Doc-Matrix: name of the sparse matrix input file
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
• LSA100-Ut100 512530.003 0.001 ……..
51253 words
0.002 0.002 …….
word vector (uT): 1x100 • LSA100-Vt• LSA100-S
100
100 22650.021 0.035 ……..
2265 docs
1002686.18829.941559 59
0.012 0.022 …….
54
559.59….
100 eigenvalues
doc vector (vT): 1x100
LSA: Exercise (4/4)
• Fold-in a new mx1 query vector q y
( ) 111ˆ −
×××× Σ= kkkmmT
k Uqq The separate dimensions are differentially weighted( )
Query represented by the weightedsum of it constituent term vectors
are differentially weightedJust like a row of V
• Cosine measure between the query and doc vectors in the latent semantic spacep
( ) Σ=ΣΣ=
dqdqcoinedqsimTˆˆ
)ˆˆ(ˆˆ2( )ΣΣ
=ΣΣ=dq
dqcoinedqsimˆˆ
),(,
55