14.0 Linguistic Processing and Latent Topic Analysis
Latent Semantic Analysis (LSA)
[Figure: a set of latent topics Tk linking the words and the documents]
Latent Semantic Analysis (LSA) - Word-Document Matrix Representation
• Vocabulary V of size M and corpus T of size N
  – V = {w1, w2, ..., wi, ..., wM}, wi: the i-th word, e.g. M = 2×10^4
  – T = {d1, d2, ..., dj, ..., dN}, dj: the j-th document, e.g. N = 10^5
  – cij: number of times wi occurs in dj
    nj: total number of words present in dj
    ti = Σj cij: total number of times wi occurs in T
• Word-Document Matrix W
  W = [wij]
  – each row of W is an N-dim "feature vector" for a word wi with respect to all documents dj
  – each column of W is an M-dim "feature vector" for a document dj with respect to all words wi
  – wij = (1 - εi) cij / nj : word frequencies in documents, but normalized with document length and word entropy
  – εi = -(1/log N) Σj (cij/ti) log(cij/ti) : normalized entropy (indexing power) of wi in T
    0 ≤ εi ≤ 1; εi = 0 if cij = ti for some j and 0 for other j, εi = 1 if cij = ti/N for all j
[Figure: the M×N word-document matrix W with rows w1, w2, ..., wi, ..., wM, columns d1, d2, ..., dj, ..., dN, and entries wij]
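A minimal numpy sketch of this weighting (the function name and the raw count-matrix input are illustrative, not from the slides):

import numpy as np

def word_document_matrix(C):
    """Build W = [wij] from a raw count matrix C (M words x N documents),
    using wij = (1 - eps_i) * cij / nj as defined above."""
    M, N = C.shape
    n_j = C.sum(axis=0, keepdims=True)               # nj: total words in document dj
    t_i = C.sum(axis=1, keepdims=True)               # ti: total occurrences of word wi in T
    p = C / np.maximum(t_i, 1e-12)                   # cij / ti
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(C > 0, p * np.log(p), 0.0)  # terms of the entropy sum
    eps = -plogp.sum(axis=1, keepdims=True) / np.log(N)   # normalized entropy eps_i of wi
    return (1.0 - eps) * C / np.maximum(n_j, 1e-12)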
Latent Semantic Analysis (LSA)
Dimensionality Reduction (1/2)
• W W^T = U S² U^T
  – U = [e1, e2, ..., eM], ei: orthonormal eigenvectors of W W^T, U^T U = I
  – S² = [si²], s1² ≥ s2² ≥ ... ≥ sM²: eigenvalues of W W^T
  – (i, j) element of W W^T: inner product of the i-th and j-th rows of W, "similarity" between wi and wj
  – W W^T = Σi si² ei ei^T
    si²: weights (significance of the "component matrices" ei ei^T)
  – dimensionality reduction: selection of the R largest eigenvalues (R = 800 for example)
    W_{M×N} ≈ Ŵ_{M×N}, Ŵ Ŵ^T = U' S'² U'^T, U'_{M×R} = [e1, e2, ..., eR], S'²_{R×R}
    R "concepts" or "latent semantic concepts"
Dimensionality Reduction (2/2)
• W^T W = V S² V^T
  – V = [e'1, e'2, ..., e'N], e'i: orthonormal eigenvectors of W^T W, V^T V = I
  – S² = [si²], s1² ≥ s2² ≥ ..., si² = 0 for i > min(M, N): eigenvalues of W^T W
  – (i, j) element of W^T W: inner product of the i-th and j-th columns of W, "similarity" between di and dj
  – W^T W = Σi si² e'i e'i^T
    si²: weights (significance of the "component matrices" e'i e'i^T)
  – dimensionality reduction: selection of the R largest eigenvalues
    W_{M×N} ≈ Ŵ_{M×N}, Ŵ^T Ŵ = V' S'² V'^T, V'_{N×R} = [e'1, e'2, ..., e'R]
    R "concepts" or "latent semantic concepts"
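As a quick numerical illustration (a toy check, not from the slides) that the SVD of W yields exactly these two eigen-decompositions:

import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 8))                                   # toy M x N word-document matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)
# columns of U are the eigenvectors ei of W W^T, columns of V (= Vt.T) are the
# eigenvectors e'i of W^T W, and the si^2 are the shared non-zero eigenvalues
assert np.allclose(W @ W.T, U @ np.diag(s**2) @ U.T)     # W W^T = U S^2 U^T
assert np.allclose(W.T @ W, Vt.T @ np.diag(s**2) @ Vt)   # W^T W = V S^2 V^T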
Singular Value Decomposition (SVD)
• Singular Value Decomposition (SVD): W_{M×N} ≈ Ŵ_{M×N} = U_{M×R} S_{R×R} V^T_{R×N}
  – si: singular values, s1 ≥ s2 ≥ ... ≥ sR; U: left singular matrix, V: right singular matrix
• Vectors for word wi: ûi = ui S (a row)
  – a vector with dimensionality N reduced to a vector ûi = ui S with dimensionality R
  – the N-dimensional space defined by the N documents reduced to an R-dimensional space defined by R "concepts"
  – the R row vectors of V^T, or column vectors of V, or eigenvectors {e'1, ..., e'R}, are the R orthonormal basis vectors of the "latent semantic space" with dimensionality R, with which ûi = ui S is represented
  – words with similar "semantic concepts" have "closer" locations in the "latent semantic space"
    • they tend to appear in similar "types" of documents, although not necessarily in exactly the same documents
[Figure: W = [wij] (M×N, rows w1 ... wM, columns d1 ... dN) ≈ U (M×R, with row ui for word wi) × S (R×R diagonal, s1 ... sR) × V^T (R×N, with column vj^T for document dj)]
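A short numpy sketch of this rank-R approximation and of the reduced word/document vectors it yields (function and variable names are illustrative):

import numpy as np

def truncated_lsa(W, R):
    """Rank-R SVD of the M x N word-document matrix W (R = 800 in the slide's example)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) V^T, s in descending order
    U_R, S_R, Vt_R = U[:, :R], np.diag(s[:R]), Vt[:R, :]
    W_hat = U_R @ S_R @ Vt_R                           # best rank-R approximation of W
    word_vecs = U_R @ S_R                              # row i: ui S, the R-dim vector for word wi
    doc_vecs = Vt_R.T @ S_R                            # row j: vj S, the R-dim vector for document dj
    return W_hat, word_vecs, doc_vecs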
Singular Value Decomposition (SVD)
• Vectors for document dj: v̂j = vj S (a row), or v̂j^T = S vj^T (a column)
  – a vector with dimensionality M reduced to a vector v̂j = vj S with dimensionality R
  – the M-dimensional space defined by the M words reduced to an R-dimensional space defined by R "concepts"
  – the R columns of U, or eigenvectors {e1, ..., eR}, are the R orthonormal basis vectors of the "latent semantic space" with dimensionality R, with which v̂j = vj S is represented
  – documents with similar "semantic concepts" have "closer" locations in the "latent semantic space"
    • they tend to include similar "types" of words, although not necessarily exactly the same words
• The association structure between words wi and documents dj is preserved with noisy information deleted, while the dimensionality is reduced to a common set of R "concepts"
[Figure: the same W ≈ U S V^T diagram as above, here highlighting the column vj^T of V^T for document dj]
Example Applications in Linguistic Processing
• Word Clustering
  – example applications: class-based language modeling, information retrieval, etc.
  – words with similar "semantic concepts" have "closer" locations in the "latent semantic space"
    • they tend to appear in similar "types" of documents, although not necessarily in exactly the same documents
  – each component of the reduced word vector ûi = ui S is the "association" of the word with the corresponding "concept"
  – example similarity measure between two words:
• Document Clustering
  – example applications: clustered language modeling, language model adaptation, information retrieval, etc.
  – documents with similar "semantic concepts" have "closer" locations in the "latent semantic space"
    • they tend to include similar "types" of words, although not necessarily exactly the same words
  – each component of the reduced document vector v̂j = vj S is the "association" of the document with the corresponding "concept"
  – example "similarity" measure between two documents:
  sim(wi, wj) = cos(ui S, uj S) = (ui S² uj^T) / (|ui S| |uj S|)
  sim(di, dj) = cos(vi S, vj S) = (vi S² vj^T) / (|vi S| |vj S|)
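A one-function sketch of these two measures, assuming the scaled vectors ui S and vj S are already available (e.g. the word_vecs / doc_vecs rows from the SVD sketch above):

import numpy as np

def latent_cosine(a, b):
    """Cosine similarity in the scaled latent space: pass a = ui S, b = uj S for sim(wi, wj),
    or a = vi S, b = vj S for sim(di, dj)."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))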
LSA for Linguistic Processing
Example Applications in Linguistic Processing
• Information Retrieval
  – "concept matching" vs "lexical matching": relevant documents are associated with similar "concepts", but may not include exactly the same words
  – example approach: treating the query as a new document (by "folding-in"), and evaluating its "similarity" with all possible documents
• Fold-in
  – consider a new document outside of the training corpus T, but with similar language patterns or "concepts"
  – construct a new column dp, p > N, with respect to the M words
  – assuming U and S remain unchanged
    dp = U S vp^T (just as a column in W = U S V^T)
    v̂p = vp S = dp^T U as an R-dim representation of the new document
    (i.e. obtaining the projection of dp on the basis ei of U by inner product)
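A sketch of fold-in under the slide's assumption that U and S stay unchanged (dp is a new M-dim weighted column built over the same M words; names follow the truncated-SVD sketch above):

import numpy as np

def fold_in(d_p, U_R, S_R):
    """Project a new document column dp into the R-dim latent space."""
    v_hat = d_p @ U_R                    # vp S = dp^T U: inner products with the basis vectors ei
    v_p = v_hat @ np.linalg.inv(S_R)     # undo the scaling if the unscaled vp itself is needed
    return v_hat, v_p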
Integration with N-gram Language Models
• Language Modeling for Speech Recognition
  – Prob(wq|dq-1)
    wq: the q-th word in the current document to be recognized (q: sequence index)
    dq-1: the recognized history in the current document
    vq-1 = dq-1^T U: representation of dq-1 by vq-1 (folded-in)
  – Prob(wq|dq-1) can be estimated by uq and vq-1 in the R-dim space
  – integration with N-gram:
    Prob(wq|Hq-1) = Prob(wq|hq-1, dq-1)
    Hq-1: history up to wq-1, hq-1: <wq-n+1, wq-n+2, ..., wq-1>
  – N-gram gives local relationships, while dq-1 gives semantic concepts
  – dq-1 emphasizes more the key content words, while N-gram counts all words similarly, including function words
• vq-1 for dq-1 can be estimated iteratively
  – assuming the q-th word in the current document is wi
    dq = ((q-1)/q) dq-1 + ((1-εi)/q) [0 ... 0 1 0 ... 0]^T (the 1 at the i-th of the M dimensions), updated word by word
    vq = dq^T U
  – vq moves in the R-dim space initially, and eventually settles down somewhere
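A sketch of this word-by-word update (eps_i would be the normalized entropy of word wi from the W-matrix construction; all names are illustrative):

import numpy as np

def update_pseudo_document(d_prev, q, i, eps_i, U_R, M):
    """One step of the recursion above when the q-th recognized word is wi."""
    e_i = np.zeros(M)
    e_i[i] = 1.0                                        # [0 ... 0 1 0 ... 0], the 1 at dimension i
    d_q = ((q - 1) / q) * d_prev + ((1.0 - eps_i) / q) * e_i
    v_q = d_q @ U_R                                     # vq = dq^T U, the R-dim representation
    return d_q, v_q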
Probabilistic Latent Semantic Analysis (PLSA)
• Exactly the same goal as LSA, using a set of latent topics {T1, T2, ..., TK} to construct a new relationship between the documents and terms, but with a probabilistic framework
  P(tj|Di) = Σ_{k=1}^{K} P(tj|Tk) P(Tk|Di)
• Trained with EM by maximizing the total likelihood
  L_T = Σ_{i=1}^{N} Σ_{j=1}^{n} c(tj, Di) log P(tj|Di)
  c(tj, Di): frequency count of term tj in document Di
[Figure: PLSA structure — documents D1, D2, ..., Di, ..., DN connected to terms t1, t2, ..., tj, ..., tn through latent topics T1, T2, ..., Tk, ..., TK, with probabilities P(Tk|Di) and P(tj|Tk)]
Di: documents, Tk: latent topics, tj: terms
Probabilistic Latent Semantic Analysis (PLSA)
[Figure: plate notation of PLSA — for each of the M documents d and each of its N words w, a topic z is drawn with P(z|d) and the word with P(w|z)]
w: word, z: topic, d: document, N: words in document d, M: documents in corpus
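A compact EM sketch for PLSA on a count matrix C with C[i, j] = c(tj, Di) (shapes and names are illustrative):

import numpy as np

def plsa_em(C, K, iters=100, seed=0):
    """EM re-estimation of P(Tk|Di) and P(tj|Tk) to maximize the likelihood above."""
    rng = np.random.default_rng(seed)
    N, n = C.shape                                                 # N documents, n terms
    p_t_T = rng.random((K, n)); p_t_T /= p_t_T.sum(axis=1, keepdims=True)   # P(tj|Tk)
    p_T_D = rng.random((N, K)); p_T_D /= p_T_D.sum(axis=1, keepdims=True)   # P(Tk|Di)
    for _ in range(iters):
        # E-step: posterior P(Tk | Di, tj), shape N x K x n
        joint = p_T_D[:, :, None] * p_t_T[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: expected counts c(tj, Di) * P(Tk | Di, tj), then renormalize
        counts = C[:, None, :] * post
        p_t_T = counts.sum(axis=0); p_t_T /= p_t_T.sum(axis=1, keepdims=True) + 1e-12
        p_T_D = counts.sum(axis=2); p_T_D /= p_T_D.sum(axis=1, keepdims=True) + 1e-12
    return p_T_D, p_t_T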
Latent Dirichlet Allocation (LDA)
• A document is represented as random mixtures of latent topics
• Each topic is characterized by a distribution over words
  – α: Dirichlet prior for θ_m (θ_m: topic distribution for document m)
  – β: Dirichlet prior for φ_k (φ_k: word distribution for topic k)
    (k: topic index, a total of K topics)
  – z_{m,n}: topic of w_{m,n}, drawn from P(z_{m,n} | θ_m)
    (w_{m,n}: n-th word in document m; n: word index, a total of N_m words in document m; m: document index, a total of M documents in the corpus)
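A toy sketch of this generative story (all sizes and hyperparameters below are made up for illustration):

import numpy as np

def lda_generate(M=3, K=2, V=10, N_m=20, alpha=0.5, beta=0.1, seed=0):
    """Sample a tiny corpus from the LDA generative process described above."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)              # phi_k ~ Dir(beta): word distribution per topic
    docs = []
    for m in range(M):
        theta_m = rng.dirichlet([alpha] * K)             # theta_m ~ Dir(alpha): topic distribution of doc m
        doc = []
        for n in range(N_m):
            z_mn = rng.choice(K, p=theta_m)              # z_{m,n} ~ P(z | theta_m): topic of w_{m,n}
            doc.append(int(rng.choice(V, p=phi[z_mn])))  # draw w_{m,n} from topic z_{m,n}
        docs.append(doc)
    return docs, phi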
Gibbs Sampling in general
• To obtain a distribution of a given form with unknown parameters
  1. Initialize the variables x1(0), x2(0), ..., xn(0)
  2. For t = 1, 2, ...
     – Sample x1(t) ~ P(x1 | x2(t-1), ..., xn(t-1)) (take a sample of x1 based on the distribution conditioned on the current values of all the other variables)
     – Sample x2(t) ~ P(x2 | x1(t), x3(t-1), ..., xn(t-1))
     – ...
     – Sample xn(t) ~ P(xn | x1(t), ..., xn-1(t))
• Apply Markov Chain Monte Carlo and sample each variable sequentially, conditioned on the other variables, until the distribution converges, then estimate the parameters based on the converged distribution
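A minimal illustration of the procedure with two dependent binary variables and a known joint table (a toy example, not from the slides):

import numpy as np

joint = np.array([[0.30, 0.20],      # P(x1, x2): rows index x1, columns index x2
                  [0.10, 0.40]])
rng = np.random.default_rng(0)
x1, x2 = 0, 0                        # 1. initialize
samples = []
for t in range(20000):               # 2. sample each variable conditioned on the other
    x1 = rng.choice(2, p=joint[:, x2] / joint[:, x2].sum())   # x1 ~ P(x1 | x2)
    x2 = rng.choice(2, p=joint[x1, :] / joint[x1, :].sum())   # x2 ~ P(x2 | x1)
    samples.append((x1, x2))
# after burn-in the empirical frequency of (1, 1) approaches the true P(1, 1) = 0.40
print(sum(s == (1, 1) for s in samples) / len(samples))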
Gibbs Sampling applied on LDA
[Figure: Doc 1 with words w11, w12, w13, ..., w1n and Doc 2 with words w21, w22, w23, ..., w2n; each word carries a Topic label, all still unknown ("?")]
Sample P(Z, W):
Gibbs Sampling applied on LDA
[Figure: the same two documents, now with a topic assigned to every word]
Sample P(Z, W):
1. Random Initialization
Gibbs Sampling applied on LDA
[Figure: the same two documents; the topic of w11 is erased ("?") before being re-drawn]
Sample P(Z, W):
1. Random Initialization
2. Erase Z11, and draw a new Z11 ~ P(z11 | z12, ..., z_{M,N_M}, w11, w12, ..., w_{M,N_M})
Gibbs Sampling applied on LDA
[Figure: the same two documents; the topic of w12 is erased ("?") before being re-drawn]
Sample P(Z, W):
1. Random Initialization
2. Erase Z11, and draw a new Z11 ~ P(z11 | z12, ..., z_{M,N_M}, w11, w12, ..., w_{M,N_M})
3. Erase Z12, and draw a new Z12 ~ P(z12 | z11, z13, ..., z_{M,N_M}, w11, w12, ..., w_{M,N_M})
Gibbs Sampling applied on LDA
[Figure: the same two documents with a topic assigned to every word]
Sample P(Z, W):
1. Random Initialization
2. Erase Z11, and draw a new Z11 ~ P(z11 | z12, ..., z_{M,N_M}, w11, w12, ..., w_{M,N_M})
3. Erase Z12, and draw a new Z12 ~ P(z12 | z11, z13, ..., z_{M,N_M}, w11, w12, ..., w_{M,N_M})
4. Iteratively update the topic assignment for each word until convergence
5. Compute θ, φ according to the final setting
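These five steps can be written as the usual collapsed Gibbs sampler over counts (a sketch in the spirit of Heinrich's notes cited in the references; docs is a list of lists of word ids, V the vocabulary size, and the hyperparameters are illustrative):

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: resample every z_{m,n} until convergence,
    then read theta and phi off the final counts."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_dk = np.zeros((M, K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
    z = []
    for m, doc in enumerate(docs):                     # 1. random initialization
        z_m = rng.integers(0, K, size=len(doc))
        z.append(z_m)
        for w, k in zip(doc, z_m):
            n_dk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):                             # 2.-4. iterative re-sampling
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]                            # erase the current assignment
                n_dk[m, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())       # draw a new topic from the conditional
                z[m][n] = k
                n_dk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)   # 5. theta
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)       #    and phi
    return theta, phi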
Matrix Factorization (MF) for Recommendation Systems
[Table: a sparse user-movie rating matrix r_ui with users A-G as rows and Movie1-Movie9 as columns; only a few ratings are observed (User A: 3.7, 4.0; User B: 4.0, 4.3; User C: 4.1; User D: 2.3, 2.5; User E: 3.3; User F: 2.9; User G: 2.6, 2.7), the rest are missing]
r_ui: rating, u: user, i: item
Matrix Factorization (MF)
• Mapping both users and items to a joint latent factor space of dimensionality f
  r_ui ≈ pu^T qi (pu: user factor vector, qi: item factor vector)
[Figure: the user-item rating matrix (users 1 ... U by items 1 ... I) approximated as the product of a U×f user-factor matrix and an f×I item-factor matrix]
  (latent factors: e.g. towards male, seriousness, etc.)
Matrix Factorization (MF)
• Objective function: squared error over the observed ratings, typically with regularization (a sketch follows this list)
• Training
  – gradient descent (GD)
  – alternating least squares (ALS): alternately fix the pu's or the qi's and compute the other side as a least-squares problem
• Different from SVD (LSA)
  – SVD assumes missing entries to be zero (a poor assumption)
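A sketch of one standard form of this objective and its SGD training, following the Koren et al. paper in the references below (all names and hyperparameters are illustrative): minimize, over the observed ratings only, the sum of (r_ui - pu^T qi)^2 + lambda (|pu|^2 + |qi|^2).

import numpy as np

def mf_sgd(ratings, n_users, n_items, f=10, lr=0.01, lam=0.1, epochs=50, seed=0):
    """SGD over the observed (u, i, r) triples only; ALS would instead fix the p's
    (or the q's) and solve for the other side as a least-squares problem."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, f))        # user factor vectors pu
    Q = 0.1 * rng.standard_normal((n_items, f))        # item factor vectors qi
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu @ qi                          # r_ui - pu^T qi
            P[u] += lr * (err * qi - lam * pu)         # gradient steps with L2 regularization
            Q[i] += lr * (err * pu - lam * qi)
    return P, Q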
Overfitting Problem
• A good model does not just fit all the training data
  – it needs to cover unseen data well, which may have a distribution slightly different from that of the training data
  – overly complicated models with too many parameters usually lead to overfitting
Extensions of Matrix Factorization (MF)
• Biased MF
  – add a global bias μ (usually = the average rating), a user bias bu, and an item bias bi as parameters
• Non-negative Matrix Factorization (NMF)
  – restrict the value of each component of pu and qi to be non-negative
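For reference, the prediction rule the Biased MF bullets describe is commonly written (following the Koren et al. paper in the references) as r̂_ui = μ + bu + bi + qi^T pu, with the biases learned together with pu and qi under the same regularized objective.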
References
• LSA and PLSA
  – "Exploiting Latent Semantic Information in Statistical Language Modeling", Proceedings of the IEEE, Aug. 2000
  – "Latent Semantic Mapping", IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
  – "Probabilistic Latent Semantic Indexing", ACM Special Interest Group on Information Retrieval (ACM SIGIR), 1999
  – "Probabilistic Latent Semantic Analysis", Proc. of Uncertainty in Artificial Intelligence, 1999
  – "Spoken Document Understanding and Organization", IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
• LDA and Gibbs Sampling
  – Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
  – David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 2003
  – Gregor Heinrich, "Parameter Estimation for Text Analysis", 2005
References
• Matrix Factorization
  – "A Linear Ensemble of Individual and Blended Models for Music Rating Prediction", JMLR W&CP, volume 18, 2011
  – Y. Koren, R. M. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42(8):30-37, 2009
  – "Introduction to Matrix Factorization Methods Collaborative Filtering" (http://www.intelligentmining.com/knowledge/slides/Collaborative.Filtering.Factorization.pdf)
  – GraphLab API: Collaborative Filtering (http://docs.graphlab.org/collaborative_filtering.html)
  – J. Mairal, F. Bach, J. Ponce, G. Sapiro, "Online Learning for Matrix Factorization and Sparse Coding", Journal of Machine Learning Research, 2010