Latent Semantic Approaches for Information Retrieval and Language Modeling
Berlin Chen, Department of Computer Science & Information Engineering
National Taiwan Normal University
References
• G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R. Harshman, L.A. Streeter, K.E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," ACM SIGIR Conference on R&D in Information Retrieval, 1988
• J.R. Bellegarda, "Latent semantic mapping," IEEE Signal Processing Magazine, September 2005
• J.R. Bellegarda, Latent Semantic Mapping: Principles and Applications. Morgan and Claypool, 2007
• C.X. Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers, 2008
• T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning 42, 2001
• M. Steyvers, T. Griffiths, "Probabilistic topic models," in T.K. Landauer, D.S. McNamara, S. Dennis, W. Kintsch (eds.), Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum, 2007
• B. Chen, "Word topic models for spoken document retrieval and transcription," ACM Transactions on Asian Language Information Processing 8(1), pp. 2:1-2:27, 2009
• B. Chen, "Latent topic modeling of word co-occurrence information for spoken document retrieval," ICASSP 2009
• H.-S. Chiu, B. Chen, "Word topical mixture models for dynamic language model adaptation," ICASSP 2007
• D.M. Blei, A.Y. Ng, M.I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003
• W. Kim, S. Khudanpur, "Lexical triggers and latent semantic analysis for cross-lingual language model adaptation," ACM Transactions on Asian Language Information Processing 3(2), 2004
• D. Gildea, T. Hofmann, "Topic-based language models using EM," Eurospeech 1999
• L.K. Saul, F.C.N. Pereira, "Aggregate and mixed-order Markov models for statistical language processing," EMNLP 1997
Taxonomy of Classic IR Models

• User Task
  – Retrieval: Ad hoc, Filtering
  – Browsing
• Classic Models: Boolean, Vector, Probabilistic
  – Set Theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Analysis (LSA), Neural Networks
  – Probabilistic: Inference Network, Belief Network, Language Model (Unigram, Probabilistic LSA, Latent Dirichlet Allocation, Word Topic Model)
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
Classification of IR Models Along Two Axes

• Matching Strategy
  – Literal term matching
    • E.g., Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)
  – Concept matching
    • E.g., Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Word Topic Model (WTM)
• Learning Capability
  – Heuristic approaches for term weighting, query expansion, document expansion, etc.
    • E.g., Vector Space Model, Latent Semantic Analysis
    • Most approaches are based on linear algebra operations
  – Solid statistical foundations (optimization algorithms)
    • E.g., Unigram or Hidden Markov Model (HMM), Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation (LDA), Word Topic Model (WTM)
    • Most models belong to the language modeling approach
Two Perspectives for IR Models (cont.)

• Literal Term Matching vs. Concept Matching
  – Example query (Mandarin broadcast news): 中國解放軍蘇愷戰機 ("Chinese PLA Sukhoi fighters")
  – Relevant document (an errorful automatic transcript on the topic 中共新一代空軍戰力, "Communist China's new-generation air force strength"): A Hong Kong Sing Tao Daily report quotes military observers as saying that Taiwan will have completely lost air superiority by 2005, because mainland Chinese fighters will surpass Taiwan's in both quantity and performance. The report notes that while importing advanced Russian weapons in large quantities, China is also speeding up the development of indigenous weapon systems; the improved Flying Leopard fighter at the Xi'an aircraft plant is about to be deployed. According to Japanese media, with a cross-strait war liable to break out at any time, Beijing's basic policy is to wage a high-tech local war; the PLA therefore plans to field two hundred more Sukhoi fighters, including the Su-30, before 2004. (Parts of the transcript contain speech recognition errors.)
  – There are usually many ways to express a given concept (an information need), so the literal terms in a user's query may not match those of a relevant document
Latent Semantic Analysis (LSA)

• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM), or Two-Mode Factor Analysis
  – Originally formulated in the context of information retrieval
    • Users tend to retrieve documents on the basis of conceptual content
    • Individual terms (units) provide unreliable evidence about the conceptual topic or meaning of a document (composition)
    • There are many ways to express a given concept
  – LSA attempts to explore some underlying latent semantic structure in the data (documents), which is partially obscured by the randomness of word choices
  – LSA results in a parsimonious description of terms and documents
    • Contextual or positional information for words in documents is discarded (the so-called bag-of-words assumption)
Applications of LSA

• Information Retrieval
• Word/Document/Topic Clustering
• Language Modeling
• Automatic Call Routing
• Language Identification
• Pronunciation Modeling
• Speaker Verification (Prosody Analysis)
• Utterance Verification
• Text/Speech Summarization
• Automatic Image Annotation
• ...
LSA: Schematic Depiction

• Dimension Reduction and Feature Extraction
  – PCA: given an orthonormal basis $\{\varphi_1,\ldots,\varphi_k\}$, a vector $X$ is projected as $y_i = \varphi_i^T X$ and reconstructed as $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$, where the basis minimizes $\|X - \hat{X}\|^2$ for a given $k$
  – SVD (in LSA): the $m \times n$ word-document matrix $A$ is approximated by $A' = U_{m\times k}\,\Sigma_{k\times k}\,V_{k\times n}^T$, with $k \le r \le \min(m,n)$, minimizing $\|A - A'\|_F^2$ for a given $k$; the retained columns of $U$ and $V$ span the latent semantic space
LSA: An Example

– Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction
– Projection of a vector $x$ onto a unit basis vector $\varphi_1$:
  $y_1 = \|x\| \cos\theta_1 = x^T \varphi_1$, where $\|\varphi_1\| = 1$
LSA: Latent Structure Space

• Two alternative frameworks to circumvent vocabulary mismatch
  – Document expansion: map the document's terms through the latent structure model before matching
  – Query expansion: map the query's terms through the latent structure model before matching
  – In either case, literal term matching is replaced by latent semantic structure retrieval
LSA: Another Example (1/2)

(Figure: a word-document matrix built from twelve example document titles.)

LSA: Another Example (2/2)

Query: "human computer interaction"
(one of the query terms is an OOV word with respect to the indexed vocabulary)
LSA: Theoretical Foundation

• Singular Value Decomposition (SVD)
  – The word-document matrix $A_{m\times n}$ (rows: words $w_1,\ldots,w_m$ as units; columns: documents $d_1,\ldots,d_n$ as compositions) is decomposed as
    $A = U_{m\times r}\,\Sigma_r\,V_{r\times n}^T$
  – Row space of $A \subseteq R^n$; column space of $A \subseteq R^m$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r\times r}$, $V^T V = I_{r\times r}$
  – $r \le \min(m,n)$, and $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 \ge \|A'\|_F^2$
  – Keeping only the $k \le r$ largest singular values gives $A' = U'_{m\times k}\,\Sigma_k\,V'^T_{k\times n}$
  – Docs and queries are represented in a $k$-dimensional space; the quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
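For concreteness, here is a minimal numpy sketch of the rank-k approximation (not part of the original slides); the toy matrix reuses the 4-term, 3-document example from the exercise later in this deck.

```python
import numpy as np

# Toy 4-term x 3-doc word-document matrix
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

k = 2                                             # retained dimensions
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

A_prime = U_k @ S_k @ Vt_k                        # best rank-k approximation
print(np.linalg.norm(A - A_prime, "fro"))         # ||A - A'||_F
```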
LSA: Theoretical Foundation

• The "word-document" matrix $A$ captures the co-occurrences between terms (units) and documents (compositions)
  – Contextual or positional information for words in documents is discarded: "bag-of-words" modeling
• Feature extraction for the entries $a_{i,j}$ of matrix $A$:
  1. Conventional tf-idf statistics
  2. Or, occurrence frequency weighted by negative entropy:
     $a_{i,j} = (1-\varepsilon_i)\,\dfrac{f_{i,j}}{d_j}$, with document length $d_j = \sum_{i=1}^{m} f_{i,j}$
     where $f_{i,j}$ is the occurrence count of term $i$ in document $j$, and $(1-\varepsilon_i)$ is the negative normalized entropy of term $i$:
     $\varepsilon_i = -\dfrac{1}{\log n}\sum_{j=1}^{n}\dfrac{f_{i,j}}{\tau_i}\log\dfrac{f_{i,j}}{\tau_i}$, $\quad \tau_i = \sum_{j=1}^{n} f_{i,j}$, $\quad 0 \le \varepsilon_i \le 1$
     ($\tau_i$: occurrence count of term $i$ in the whole collection)
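A small numpy sketch of this weighting scheme, assuming a hypothetical $m \times n$ matrix F of raw counts $f_{i,j}$:

```python
import numpy as np

# F: hypothetical raw count matrix, f_ij = count of term i in document j
F = np.array([[3., 0., 1.],
              [0., 2., 2.],
              [5., 0., 1.]])
n = F.shape[1]                       # number of documents
d = F.sum(axis=0)                    # document lengths d_j
tau = F.sum(axis=1, keepdims=True)   # collection counts tau_i

P = F / tau                          # f_ij / tau_i
logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
eps = -(P * logP).sum(axis=1) / np.log(n)   # normalized entropy, in [0, 1]

A = (1.0 - eps)[:, None] * F / d[None, :]   # a_ij = (1 - eps_i) f_ij / d_j
print(np.round(A, 3))
```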
LSA: Theoretical Foundation

• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, and $\Sigma^2 = \mathrm{diag}(\lambda_1,\ldots,\lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V = [v_1\,v_2\,\cdots\,v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n\times n}$
  – Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1,\ldots,n$
    • They are the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$:
      $\|Av_i\|^2 = (Av_i)^T(Av_i) = v_i^T A^T A v_i = \lambda_i\,v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
    • For $\lambda_i \ne 0$, $i = 1,\ldots,r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$
LSA: Theoretical Foundation

• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$ ($\subseteq R^m$):
  $(Av_i)\cdot(Av_j) = v_i^T A^T A v_j = \lambda_j\,v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$:
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\quad \lambda_{r+1} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for $\mathrm{Col}\,A$:
    $u_i = \dfrac{Av_i}{\|Av_i\|} = \dfrac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\,\cdots\,u_r]\,\Sigma_{r\times r} = A\,[v_1\,\cdots\,v_r]$
    ($V$ is an orthonormal matrix, known in advance; $U$, of size $m\times r$, is also orthonormal)
  – Extend to an orthonormal basis $\{u_1, \ldots, u_m\}$ of $R^m$:
    $[u_1\,\cdots\,u_m]\,\Sigma_{m\times n} = A\,[v_1\,\cdots\,v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T$
    where $\Sigma = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$ and $VV^T = I_{n\times n}$
  – $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$
LSA: Theoretical Foundation

(Figure: the fundamental subspaces under SVD.)
– $V_1 = [v_1\,\cdots\,v_r]$ spans the row space of $A$ (in $R^n$); for the remaining right-singular vectors $V_2$, $Av_i = 0$
– $U_1 = [u_1\,\cdots\,u_r]$ spans the column space of $A$ (in $R^m$)
– $A = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$, with $U\Sigma = AV$
LSA: Theoretical Foundation

• Additional Explanations
  – Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$:
    $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma$
    • the $i$-th entry of a row of $U\Sigma$ is the projection of the corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$:
    $A^T = V\Sigma^T U^T \;\Rightarrow\; A^T U = V\Sigma^T U^T U = V\Sigma^T = V\Sigma$
    • the $i$-th entry of a row of $V\Sigma$ is the projection of the corresponding row of $A^T$ onto the $i$-th column of $U$
LSA: Theoretical Foundation

• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$):
    • compare two terms → dot product of two rows of $A$ (i.e., an entry in $AA^T$)
    • compare two docs → dot product of two columns of $A$ (i.e., an entry in $A^T A$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$ ($\Sigma'$ stretches or shrinks the axes):
    • compare two terms: $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
      → dot product of two rows of $U'\Sigma'$
    • compare two docs: $A'^T A' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
      → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$
LSA: Fold-in

• Find representations for pseudo-docs
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m\times 1$ query (or doc) vector:
  $\hat{q}_{1\times k} = q_{1\times m}^T\,U_{m\times k}\,\Sigma_{k\times k}^{-1}$
  – The query is represented as the weighted sum of its constituent term (word) vectors; the separate dimensions are differentially weighted, just like a row of $V$ (see Figure A on the next page)
• Cosine measure between the query and doc vectors in the latent semantic space:
  $\mathrm{sim}(\hat{q},\hat{d}) = \cos(\hat{q}\Sigma,\hat{d}\Sigma) = \dfrac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\,\|\hat{d}\Sigma\|}$ ($\hat{q}$, $\hat{d}$: row vectors)
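A minimal sketch of fold-in and the cosine measure, continuing the numpy SVD sketch above (U_k, S_k, Vt_k as computed there):

```python
import numpy as np

def fold_in(q, U_k, S_k):
    # q_hat = q^T U_k S_k^{-1}: a 1 x k pseudo-document, "just like a row of V"
    return q @ U_k @ np.linalg.inv(S_k)

def lsa_sim(q_hat, d_hat, S_k):
    # cosine between the Sigma-scaled query and document vectors
    qs, ds = q_hat @ S_k, d_hat @ S_k
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

# Example: a query containing terms 0 and 2; docs are the rows of V = Vt_k.T
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in(q, U_k, S_k)
scores = [lsa_sim(q_hat, d_hat, S_k) for d_hat in Vt_k.T]
```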
LSA: Theoretical Foundation

• Fold-in a new $1 \times n$ term vector (see Figure B):
  $\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma_{k\times k}^{-1}$

(Figure A: fold-in of a query/document vector; Figure B: fold-in of a term vector.)
LSA: A Simple IR Evaluation

• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels

(Figure: recall-precision curves at 11 standard recall levels evaluated on the TDT-3 SD collection, using word-level indexing terms.)
LSA: Pros and Cons (1/2)

• Pros (advantages)
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
    • Individual terms as the descriptors of documents are replaced by independent "artificial concepts" that can be specified by any one of several terms (or documents) or combinations thereof
  – Good results for high-recall search
    • Takes term co-occurrence into account
LSA: Pros and Cons (2/2)

• Cons (disadvantages)
  – High computational complexity (e.g., SVD decomposition)
  – Exhaustive comparison of a query against all stored documents is needed (cannot make use of inverted files?)
  – LSA offers only a partial solution to polysemy (e.g., bank, bass, ...)
    • Every term is represented as just one point in the latent space (a weighted average of the different meanings of the term)
LSA: Junk E-mail Filtering

• One vector represents the centroid of all e-mails that are of interest to the user, while the other represents the centroid of all e-mails that are not of interest
LSA: Dynamic Language Model Adaptation (1/4)

• Let $w_q$ denote the word about to be predicted, and $H_{q-1}$ the admissible LSA history (context) for this particular word
  – The vector representation of $H_{q-1}$ is expressed by $\tilde{d}_{q-1}$
    • Which can then be projected into the latent semantic space (with the change of notation $S = \Sigma$):
      $\tilde{v}_{q-1} = \tilde{d}_{q-1}^T\,U\,S^{-1}$
• Iteratively update $\tilde{d}_{q-1}$ and $\tilde{v}_{q-1}$ as the decoding evolves; when word $w_i$ is decoded at position $q$ ($n_q$: number of words decoded so far):
  $\tilde{d}_q = \dfrac{n_q - 1}{n_q}\,\tilde{d}_{q-1} + \dfrac{1-\varepsilon_i}{n_q}\,[0 \cdots 1 \cdots 0]^T$ (VSM representation)
  $\tilde{v}_q = \tilde{d}_q^T\,U\,S^{-1} = \dfrac{n_q - 1}{n_q}\,\tilde{v}_{q-1} + \dfrac{1-\varepsilon_i}{n_q}\,u_i$ (LSA representation)
  – or, with exponential decay:
    $\tilde{v}_q = \dfrac{1}{n_q}\left[\lambda\,(n_q-1)\,\tilde{v}_{q-1} + (1-\varepsilon_i)\,u_i\right]$, $\quad 0 < \lambda < 1$
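A one-function sketch of this update rule; the names v_prev, u_i, eps_i, n_q, and lam mirror the symbols above and are otherwise hypothetical:

```python
import numpy as np

def update_history(v_prev, u_i, eps_i, n_q, lam=1.0):
    # v_q = (1/n_q) * [lam * (n_q - 1) * v_{q-1} + (1 - eps_i) * u_i];
    # lam = 1 gives the plain running update, 0 < lam < 1 adds exponential decay
    return (lam * (n_q - 1) * v_prev + (1.0 - eps_i) * u_i) / n_q
```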
LSA: Dynamic Language Model Adaptation (2/4)

• Integration of LSA with N-grams:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \Pr(w_q \mid H_{q-1}^{(n)}, H_{q-1}^{(l)})$
  where $H_{q-1}$ denotes some suitable history for word $w_q$, and the superscripts $(n)$ and $(l)$ refer to the N-gram component ($w_{q-1} w_{q-2} \cdots w_{q-n+1}$, with $n > 1$) and the LSA component ($\tilde{d}_{q-1}$)
• This expression can be rewritten as:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \dfrac{\Pr(w_q, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})}{\sum_{w_i \in V} \Pr(w_i, H_{q-1}^{(l)} \mid H_{q-1}^{(n)})}$
LSA: Dynamic Language Model Adaptation (3/4)

• Integration of LSA with N-grams (cont.):
  $\Pr(w_q, H_{q-1}^{(l)} \mid H_{q-1}^{(n)}) = \Pr(w_q \mid H_{q-1}^{(n)})\,\Pr(H_{q-1}^{(l)} \mid w_q, H_{q-1}^{(n)})$
  $= \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q, w_{q-1} \cdots w_{q-n+1})$
  $\approx \Pr(w_q \mid w_{q-1} w_{q-2} \cdots w_{q-n+1}) \cdot \Pr(\tilde{d}_{q-1} \mid w_q)$
  – Assume the probability of the document history given the current word is not affected by the immediate context preceding it
• By Bayes' rule, $\Pr(\tilde{d}_{q-1} \mid w_q) = \dfrac{\Pr(w_q \mid \tilde{d}_{q-1})\,\Pr(\tilde{d}_{q-1})}{\Pr(w_q)}$, so that:
  $\Pr(w_q \mid H_{q-1}^{(n+l)}) = \dfrac{\Pr(w_q \mid w_{q-1} \cdots w_{q-n+1}) \cdot \dfrac{\Pr(w_q \mid \tilde{d}_{q-1})}{\Pr(w_q)}}{\sum_{w_i \in V} \Pr(w_i \mid w_{q-1} \cdots w_{q-n+1}) \cdot \dfrac{\Pr(w_i \mid \tilde{d}_{q-1})}{\Pr(w_i)}}$
LSA: Dynamic Language Model Adaptation (4/4)

• Intuitively, $\Pr(w_q \mid \tilde{d}_{q-1})$ reflects the "relevance" of word $w_q$ to the admissible history, as observed through $\tilde{d}_{q-1}$:
  $\Pr(w_q \mid \tilde{d}_{q-1}) \approx K(u_q, \tilde{v}_{q-1}) = \cos(u_q S^{1/2}, \tilde{v}_{q-1} S^{1/2}) = \dfrac{u_q\,S\,\tilde{v}_{q-1}^T}{\|u_q S^{1/2}\|\,\|\tilde{v}_{q-1} S^{1/2}\|}$
• As such, it will be highest for words whose meaning aligns most closely with the semantic fabric of $\tilde{d}_{q-1}$ (i.e., relevant "content" words), and lowest for words which do not convey any particular information about this fabric (e.g., "function" words like "the")
LSA: Cross-lingual Language Model Adaptation (1/2)

• Assume that a document-aligned (instead of sentence-aligned) Chinese-English bilingual corpus is provided

(Source: W. Kim, S. Khudanpur, "Lexical triggers and latent semantic analysis for cross-lingual language model adaptation," TALIP 3(2), 2004.)
LSA: Cross-lingual Language Model Adaptation (2/2)

• CL-LSA adapted language model, where $d_i^E$ is a relevant English doc of the Mandarin doc $d_i^C$ being transcribed, obtained by CL-IR:
  $P_{\text{Adapt}}(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda \cdot P_{\text{CL-LSA-Unigram}}(c_k \mid d_i^E) + (1-\lambda) \cdot P_{\text{BG-Trigram}}(c_k \mid c_{k-1}, c_{k-2})$
  $P_{\text{CL-LSA-Unigram}}(c \mid d_i^E) = \sum_{e} P_T(c \mid e)\,P(e \mid d_i^E)$
  $P_T(c \mid e) \approx \dfrac{\mathrm{sim}(\vec{c},\vec{e})^{\gamma}}{\sum_{c'} \mathrm{sim}(\vec{c'},\vec{e})^{\gamma}}$, $\quad \gamma \gg 1$
Probabilistic Latent Semantic Analysis (PLSA)

• PLSA models the co-occurrence of words and documents and evaluates the relevance in a low-dimensional semantic/topic space
  – Each document is treated as a document model $M_D$:
    $P_{\text{PLSA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$
• PLSA can be viewed as a nonnegative factorization of a "word-document" matrix consisting of probability entries
  – A procedure similar to the SVD performed by its algebraic counterpart, LSA: the $m \times n$ matrix $A$ (entries $\approx P(w_i \mid M_{D_j})$) is factored into an $m \times K$ matrix of topic unigrams $P(w_i \mid T_k)$ and a $K \times n$ matrix $H$ of mixture weights $P(T_k \mid D_j)$
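For concreteness, a minimal EM sketch for fitting the two PLSA factors from a word-document count matrix; this is a generic PLSA trainer, not necessarily the exact recipe used in the cited experiments:

```python
import numpy as np

def plsa_em(C, K, iters=50, seed=0):
    """Fit P(w|T) (m x K) and P(T|D) (K x n) to a count matrix C (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    PwT = rng.random((m, K)); PwT /= PwT.sum(axis=0)   # P(w|T), columns sum to 1
    PTD = rng.random((K, n)); PTD /= PTD.sum(axis=0)   # P(T|D), columns sum to 1
    for _ in range(iters):
        # E-step: posterior P(T_k | w_i, D_j), shape m x n x K
        joint = PwT[:, None, :] * PTD.T[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step: reweight by counts and renormalize
        weighted = C[:, :, None] * post
        PwT = weighted.sum(axis=1); PwT /= PwT.sum(axis=0)
        PTD = weighted.sum(axis=0).T; PTD /= PTD.sum(axis=0)
    return PwT, PTD
```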
PLSA: Information Retrieval (1/3)

• The relevance measure between a query and a document can be expressed by
  $P_{\text{PLSA}}(Q \mid M_D) = \prod_{w_i \in Q}\left[\sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)\right]^{c(w_i,Q)}$
  – The relevance measure is not obtained based on the frequency of a respective query term occurring in a document, but instead based on the frequency of the term and document in the latent topics
  – A query and a document thus may have a high relevance score even if they do not share any terms in common
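A short sketch of this query-likelihood scoring on top of the PwT and PTD factors from the EM sketch above; q_counts is a hypothetical dict mapping query word indices to counts $c(w_i, Q)$:

```python
import numpy as np

def log_plsa_query_likelihood(q_counts, j, PwT, PTD):
    # log P_PLSA(Q | M_{D_j}) = sum_i c(w_i,Q) * log sum_k P(w_i|T_k) P(T_k|D_j)
    log_score = 0.0
    for i, c in q_counts.items():
        p = PwT[i] @ PTD[:, j]
        log_score += c * np.log(p)
    return log_score
```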
PLSA: Information Retrieval (2/3)

• Unsupervised training: the model parameters are trained beforehand using a set of text documents
  – Maximize the log-likelihood of the entire collection $\mathbf{D}$:
    $\log L_{\mathbf{D}} = \sum_{D \in \mathbf{D}} \log P_{\text{PLSA}}(D \mid M_D) = \sum_{D \in \mathbf{D}} \sum_{w_i \in D} c(w_i, D)\,\log P_{\text{PLSA}}(w_i \mid M_D)$
• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training set of query exemplars generated by their relevant documents:
    $\log L = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \log P_{\text{PLSA}}(Q \mid M_D) = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} c(w_i, Q)\,\log P_{\text{PLSA}}(w_i \mid M_D)$
PLSA: Information Retrieval (3/3)

• Example: most probable words from 4 latent topics: aviation, space missions, family love, Hollywood love
PLSA vs. LSA

• Decomposition/Approximation
  – LSA: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrices
  – PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model
• Computational complexity
  – LSA: SVD decomposition
  – PLSA: EM training, which is time-consuming over iterations
  – The model complexity of both LSA and PLSA grows linearly with the number of training documents
    • There is no general way to estimate or predict the vector representation (of LSA) or the model parameters (of PLSA) for a newly observed document
• LSA and PLSA both assume "bag-of-words" representations of documents (how to distinguish "street market" from "market street"?)
PLSA: Dynamic Language Model Adaptation

• The search history can be treated as a pseudo-document $H_{w_i}$ which varies during the speech recognition process:
  $P_{\text{PLSA}}(w_i \mid H_{w_i}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid H_{w_i})$
  – The topic unigrams $P(w_i \mid T_k)$ are kept unchanged
  – The history's probability distribution over the latent topics is gradually updated
  – The topic mixture weights $P(T_k \mid H_{w_i})$ are estimated on the fly
    • This can be time-consuming
PLSA: Document Organization (1/3)

• Each document is viewed as a document model to generate itself
  – Additional transitions between topical mixtures have to do with the topological relationships between topical classes on a 2-D map:
    $P_{\text{MPLSA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(T_k \mid M_D)\left[\sum_{l=1}^{K} P(T_l \mid T_k)\,P(w_i \mid T_l)\right]$
    $E(T_k, T_l) = \dfrac{1}{2\pi\sigma^2}\exp\left(-\dfrac{\mathrm{dist}(T_k,T_l)^2}{2\sigma^2}\right)$
    $P(T_l \mid T_k) = \dfrac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}$

(Figure: a two-dimensional tree structure for organized topics.)
PLSA: Document Organization (2/3)

• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection:
  $\log L_T = \sum_{j=1}^{n}\sum_{i=1}^{V} c(w_i, D_j)\,\log P(w_i \mid D_j)$
• Each topical class can be labeled by words selected using the following criterion:
  $\mathrm{Sig}(w_i, T_k) = \dfrac{\sum_{j=1}^{n} c(w_i, D_j)\,P(T_k \mid D_j)}{\sum_{j=1}^{n} c(w_i, D_j)\left[1 - P(T_k \mid D_j)\right]}$
PLSA: Document Organization (3/3)

• Spoken document retrieval and browsing system developed by NTU (Prof. Lin-shan Lee)
Latent Dirichlet Allocation (LDA) (1/2)

• The basic generative process of LDA closely resembles PLSA; however,
  – In PLSA, the topic mixture $P(T_k \mid D)$ is conditioned on each document ($P(T_k \mid D)$ is fixed but unknown)
  – While in LDA, the topic mixture $P(T_k \mid D)$ is drawn from a Dirichlet distribution, the so-called conjugate prior ($P(T_k \mid D)$ is unknown and follows a probability distribution)
• Process of generating a corpus with LDA:
  1) Pick a multinomial distribution $\varphi_T$ for each topic $T$ from a Dirichlet distribution with parameter $\beta$
  2) Pick a multinomial distribution $\theta_D$ for each document $D$ from a Dirichlet distribution with parameter $\alpha$
  3) Pick a topic $T \in \{1, 2, \ldots, K\}$ from the multinomial distribution $\theta_D$
  4) Pick a word $w$ from the multinomial distribution $\varphi_T$

(Blei et al., "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003)
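The four steps can be simulated directly with numpy's Dirichlet and categorical samplers; the vocabulary size V, topic count K, document length L, and concentration values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, L = 1000, 8, 50
beta, alpha = 0.1 * np.ones(V), 0.5 * np.ones(K)

phi = rng.dirichlet(beta, size=K)    # 1) one word multinomial per topic
theta = rng.dirichlet(alpha)         # 2) a topic multinomial for this document
doc = []
for _ in range(L):
    t = rng.choice(K, p=theta)       # 3) pick a topic from theta_D
    w = rng.choice(V, p=phi[t])      # 4) pick a word from phi_T
    doc.append(w)
```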
Latent Dirichlet Allocation (2/2)

$P_{\text{LDA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$

(Figure: for a three-word vocabulary, each document model is a point $(X, Y, Z) = (P(w_1 \mid M_D),\,P(w_2 \mid M_D),\,P(w_3 \mid M_D))$ on the simplex $X + Y + Z = 1$.)
Word Topic Models (WTM)

• Each word of language is treated as a word topical mixture model for predicting the occurrences of other words:
  $P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
• WTM also can be viewed as a nonnegative factorization, this time of a "word-word" matrix consisting of probability entries
  – Each column encodes the vicinity information of all occurrences of a certain type of word: the $m \times m$ matrix $B$ (entries $\approx P(w_i \mid M_{w_j})$) is factored into an $m \times K$ matrix of topic unigrams $P(w_i \mid T_k)$ and a $K \times m$ matrix $Q$ of mixture weights $P(T_k \mid M_{w_j})$
WTM: Information Retrieval (1/3)

• The relevance measure between a query and a document can be expressed by
  $P_{\text{WTM}}(Q \mid D) = \prod_{w_i \in Q}\left[\sum_{w_j \in D} \alpha_{w_j,D} \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})\right]^{c(w_i,Q)}$
  where $\alpha_{w_j,D}$ weights the contribution of the WTM model of each document word $w_j$
• Unsupervised training
  – The WTM of each word $w_j$ can be trained by concatenating those words occurring within a context window around each occurrence of the word, which are postulated to be relevant to the word:
    $Q_{w_j} = \{Q_{w_j,1}, Q_{w_j,2}, \ldots, Q_{w_j,N}\}$
    $\log L_{\text{WTM}} = \sum_{w_j} \log P_{\text{WTM}}(Q_{w_j} \mid M_{w_j}) = \sum_{w_j} \sum_{w_i \in Q_{w_j}} c(w_i, Q_{w_j})\,\log P_{\text{WTM}}(w_i \mid M_{w_j})$
WTM: Information Retrieval (2/3)

• Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information
  – Maximize the log-likelihood of the training set of query exemplars generated by their relevant documents:
    $\log L_{\text{WTM}} = \sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \log P_{\text{WTM}}(Q \mid D)$
WTM: Information Retrieval (3/3)

• Formulas for supervised training (EM updates):
  $\hat{P}(w_i \mid T_k) = \dfrac{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q}\ n(w_i, Q)\,P(T_k \mid w_i, D)}{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_{i'} \in Q} n(w_{i'}, Q)\,P(T_k \mid w_{i'}, D)}$
  where
  $P(T_k \mid w_i, D) = \dfrac{\sum_{w_j \in D} \alpha_{w_j,D}\,P(w_i \mid T_k)\,P(T_k \mid M_{w_j})}{\sum_{l=1}^{K} \sum_{w_j \in D} \alpha_{w_j,D}\,P(w_i \mid T_l)\,P(T_l \mid M_{w_j})}$

  $\hat{P}(T_k \mid M_{w_j}) = \dfrac{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} n(w_i, Q)\,P(M_{w_j} \mid w_i, D)\,P(T_k \mid w_i, M_{w_j})}{\sum_{Q \in \text{TrainSet}} \sum_{D\ \text{rel. to}\ Q} \sum_{w_i \in Q} n(w_i, Q)\,P(M_{w_j} \mid w_i, D)}$
  where
  $P(M_{w_j} \mid w_i, D) = \dfrac{\alpha_{w_j,D}\,P(w_i \mid M_{w_j})}{\sum_{w_l \in D} \alpha_{w_l,D}\,P(w_i \mid M_{w_l})}$, with $P(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
  and
  $P(T_k \mid w_i, M_{w_j}) = \dfrac{P(w_i \mid T_k)\,P(T_k \mid M_{w_j})}{\sum_{z=1}^{K} P(w_i \mid T_z)\,P(T_z \mid M_{w_j})}$
WTM: Dynamic Language Model Adaptation (1/2)

• For a decoded word $w_i$, we can again interpret it as a (single-word) query; for each of its search histories, expressed by $H_{w_i} = w_1, w_2, \ldots, w_{i-1}$, we can linearly combine the associated WTM models of the words occurring in $H_{w_i}$ to form a composite WTM model:
  $P_{\text{WTM}}(w_i \mid M_{H_{w_i}}) = \sum_{j=1}^{i-1} \beta_j\,P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{j=1}^{i-1} \beta_j \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
  – $\beta_j$ are nonnegative weighting coefficients, empirically set to decay exponentially as word $w_j$ gets farther apart from $w_i$:
    $\beta_j = \varphi_j \prod_{s=j+1}^{i-1} (1 - \varphi_s)$
  – $\varphi_j$ is set to a fixed value (between 0 and 1) for $j = 2, \ldots, i-1$, and set to 1 for $j = 1$
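A small sketch computing the beta weights from a fixed phi, following the two conventions above ($\varphi_1 = 1$; $\varphi_j = \varphi$ otherwise); the weights sum to 1:

```python
def wtm_history_weights(i, phi=0.5):
    # phis[j-1] holds phi_j for j = 1 .. i-1
    phis = [1.0] + [phi] * (i - 2)
    betas = []
    for j in range(1, i):                  # beta_j = phi_j * prod_{s>j} (1 - phi_s)
        tail = 1.0
        for s in range(j + 1, i):
            tail *= (1.0 - phis[s - 1])
        betas.append(phis[j - 1] * tail)
    return betas

# e.g. wtm_history_weights(4, 0.5) -> [0.25, 0.25, 0.5] (recent words weigh more)
```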
WTM: Dynamic Language Model Adaptation (2/2)

• For our speech recognition test data, it was experimentally observed that the language model access time of WTM was approximately 1/30 of that of PLSA for language model adaptation, as the iteration number of the online EM estimation of $P(T_k \mid H_{w_i})$ for PLSA was set to 5
  $P_{\text{Adapt}}(w_i \mid w_{i-1} w_{i-2} \cdots w_1) = \lambda \cdot P_{\text{WTM}}(w_i \mid M_{H_{w_i}}) + (1-\lambda) \cdot P_{\text{BG}}(w_i \mid w_{i-1} w_{i-2})$
  – BG: background $n$-gram model
Comparison of WTM and PLSA/LDA

• A schematic comparison for the matrix factorizations of PLSA/LDA and WTM
  – PLSA/LDA factor the normalized "word-document" co-occurrence matrix: $A \approx G\,H$, where $G$ (words × topics) holds the mixture components and $H$ (topics × documents) the mixture weights:
    $P_{\text{PLSA/LDA}}(w_i \mid M_D) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_D)$
  – WTM factors the normalized "word-word" co-occurrence matrix: $B \approx Q\,Q'$, where $Q$ (words × topics) holds the mixture components and $Q'$ (topics × vicinities of words) the mixture weights:
    $P_{\text{WTM}}(w_i \mid M_{w_j}) = \sum_{k=1}^{K} P(w_i \mid T_k)\,P(T_k \mid M_{w_j})$
Summary: Three Ways of Developing LM Approaches for IR

(a) Query likelihood
(b) Document likelihood
(c) Model comparison
• Each can be realized with either literal term matching or concept matching
LSA: SVDLIBC

• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
LSA: Exercise (1/4)

• Given a sparse term-document matrix in SVDLIBC's sparse-text format:
  – The first line gives the number of terms (rows), the number of documents (columns), and the number of nonzero entries
  – Then, for each column (document), one line gives the number of nonzero entries in that column, followed by one "row-index value" line per nonzero entry
  – E.g., 4 terms and 3 docs with the matrix

    2.3 0.0 4.2
    0.0 1.3 2.2
    3.8 0.0 0.5
    0.0 0.0 0.0

  are stored as

    4 3 6
    2
    0 2.3
    2 3.8
    1
    1 1.3
    3
    0 4.2
    1 2.2
    2 0.5

  (2 nonzero entries at Col 0: Rows 0 and 2; 1 nonzero entry at Col 1: Row 1; 3 nonzero entries at Col 2: Rows 0, 1, and 2)
  – Each entry can be weighted by a TFxIDF score
• Perform SVD to obtain term and document vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600) of LSA dimensionality
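A helper along these lines can generate the input file; this sketch assumes the format exactly as described above (header line, then per-column counts and "row value" pairs):

```python
import numpy as np

def write_st(A, path):
    """Write a dense term-document matrix in SVDLIBC sparse-text format."""
    rows, cols = A.shape
    nnz = int(np.count_nonzero(A))
    with open(path, "w") as f:
        f.write(f"{rows} {cols} {nnz}\n")          # header: rows cols nonzeros
        for j in range(cols):
            idx = np.nonzero(A[:, j])[0]
            f.write(f"{len(idx)}\n")               # nonzero count for column j
            for i in idx:
                f.write(f"{i} {A[i, j]}\n")        # row index and value

write_st(np.array([[2.3, 0.0, 4.2],
                   [0.0, 1.3, 2.2],
                   [3.8, 0.0, 0.5],
                   [0.0, 0.0, 0.0]]), "Term-Doc-Matrix")
```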
LSA: Exercise (2/4)

• Example: a term-document matrix with 51253 indexing terms and 2265 docs
  – Header line: "51253 2265 21885277" (indexing term no., doc no., nonzero entries), followed per column by the entry count and "row-index weight" pairs (e.g., 508 7.725771, 596 16.213399, 612 13.080868, ...)
• SVD command (IR_svd.bat):
  svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse-text matrix input format
  – -o LSA100: prefix of the output files
  – -d 100: number of reserved eigenvectors (dimensions)
  – Term-Doc-Matrix: name of the sparse matrix input file
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
• LSA100-Ut100 512530.003 0.001 ……..
51253 words
0.002 0.002 …….
word vector (uT): 1x100 • LSA100-Vt• LSA100-S
100
100 22650.021 0.035 ……..
2265 docs
1002686.18829.941559 59
0.012 0.022 …….
54
559.59….
100 eigenvalues
doc vector (vT): 1x100
LSA: Exercise (4/4)
• Fold-in a new mx1 query vector q y
( ) 111ˆ −
×××× Σ= kkkmmT
k Uqq The separate dimensions are differentially weighted( )
Query represented by the weightedsum of it constituent term vectors
are differentially weightedJust like a row of V
• Cosine measure between the query and doc vectors in the latent semantic spacep
( ) Σ=ΣΣ=
dqdqcoinedqsimTˆˆ
)ˆˆ(ˆˆ2( )ΣΣ
=ΣΣ=dq
dqcoinedqsimˆˆ
),(,
55