Latent Association Analysis of Document Pairs
Gengxin Miao
University of California, Santa Barbara
Presented at the IBM T.J. Watson Research Center
Hawthorne, NY
December 2, 2011
Gengxin Miao UC Santa Barbara 2
Networked Texts
Texts flow on expert networks
[Figure: an expert network with nodes A through H and texts t1, t2, t3; example label: DB2 logon]
Semantically associated texts: diseases, symptoms, treatments
Interconnected text streams: users, queries, and web pages that belong to the same search task
Semantically Associated Documents
Applications
Software system maintenance: root cause finding, problem prediction
Machine translation
Question answering
Healthcare assistance
Huge Datasets
Beyond a human learner’s capability
Modeling Options
Word-level mapping
Topic-level mapping
Document-level mapping
[Figure: source document set and target document set]
Word-Level Mapping (UAI’09)
Learns a dictionary between the two document sets
Applies to machine translation
Word mappings are typically noisy
Topic-Level Mapping (EMNLP’09)
Assumes the associated documents share the same topic proportions
Works well for translations between languages
Topic simplex of the source document set
Topic simplex of the target document set
Document-Level Mapping (our work)
One-to-many or many-to-one mappings are broken down into one-to-one document pairs
Two documents are associated by their association factor
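The pair decomposition described above can be sketched as follows. This is a minimal illustration, assuming the associations are stored as a mapping from a source document ID to a list of target document IDs; the function name `to_pairs` and the sample IDs are hypothetical and do not come from the talk.

```python
def to_pairs(associations):
    """Flatten one-to-many associations into one-to-one (source, target) pairs."""
    pairs = []
    for source_id, target_ids in associations.items():
        for target_id in target_ids:
            pairs.append((source_id, target_id))
    return pairs


# Hypothetical data: one change linked to two problems (one-to-many),
# and two changes linked to the same problem (many-to-one).
associations = {
    "change-1": ["problem-1", "problem-2"],
    "change-2": ["problem-3"],
    "change-3": ["problem-3"],
}
pairs = to_pairs(associations)
print(pairs)
```

Each resulting pair can then be treated as one training example with its own association factor.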
Latent Association Analysis – Framework
Generative process
  Draw an association factor for each document pair
  Draw topic proportions for both the source and the target document
  Draw the words in each document
Generative Models Ranking Algorithms Experiment
Latent Association Analysis – An Instantiation
Canonical Correlation Analysis (CCA)
  Captures the semantic association in document pairs
Correlated Topic Model (CTM)
  Captures the document and word co-occurrence
The Generative Process
A pair of documents arises from the following process:
  Draw an L-dimensional association factor
  For the source/target document, draw the topic proportions
  For each word in the documents, draw a topic and a word
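The distributions on this slide did not survive extraction, so the following is only an illustrative sketch of a process with this shape, not the authors' exact model. It assumes a standard-normal association factor, topic proportions obtained as a softmax of a linear-Gaussian function of that factor (in the spirit of probabilistic CCA combined with a CTM-style logistic normal), and categorical draws for topics and words. All names (`T`, `mu`, `sigma`, `beta`) and sizes are hypothetical.

```python
import math
import random


def softmax(values):
    """Map real-valued scores to a probability vector."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def draw_topic_proportions(rng, factor, T, mu, sigma):
    """Assumed form: softmax(T @ factor + mu + isotropic Gaussian noise)."""
    scores = []
    for row, offset in zip(T, mu):
        mean = offset + sum(t * y for t, y in zip(row, factor))
        scores.append(rng.gauss(mean, sigma))
    return softmax(scores)


def draw_words(rng, theta, beta, n_words):
    """For each word: draw a topic from theta, then a word from that topic."""
    topics = list(range(len(theta)))
    vocab = list(range(len(beta[0])))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]
        words.append(rng.choices(vocab, weights=beta[z])[0])
    return words


def generate_pair(rng, L, src, tgt, n_words):
    """Generate one (source, target) pair that shares one association factor."""
    factor = [rng.gauss(0.0, 1.0) for _ in range(L)]  # assumed N(0, I) prior
    docs = []
    for side in (src, tgt):
        theta = draw_topic_proportions(rng, factor, side["T"], side["mu"], side["sigma"])
        docs.append(draw_words(rng, theta, side["beta"], n_words))
    return factor, docs[0], docs[1]


rng = random.Random(0)
L, K, V = 2, 3, 5  # hypothetical sizes: factor dimensions, topics, vocabulary


def random_side():
    """Random, purely illustrative parameters for one document set."""
    return {
        "T": [[rng.gauss(0, 1) for _ in range(L)] for _ in range(K)],
        "mu": [0.0] * K,
        "sigma": 0.1,
        "beta": [softmax([rng.gauss(0, 1) for _ in range(V)]) for _ in range(K)],
    }


factor, source_doc, target_doc = generate_pair(rng, L, random_side(), random_side(), n_words=8)
print(source_doc, target_doc)
```

The point of the sketch is the dependency structure: both documents draw their topic proportions from the same factor, which is what couples the pair.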
Problems
Inference
  Given a model M and a document pair
  How do we determine the association factor, topic proportions, and topic assignments that best describe the document pair?
Model fitting
  Given a set of document pairs
  How do we calculate the parameters in M that best describe the entire document pair set?
Inference
Objective function
  Given a model and a document pair
  Calculate the topic assignments and the topic proportions
Posterior distribution is intractable to compute
  The topic assignments and the topic proportions are correlated when conditioned on observations
Variational Inference
Decouple the parameters using a variational distribution Q
Fit the variational parameters to approximate the true posterior distribution
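For reference, the generic identity behind this fitting step can be written as follows. Here $\mathbf{w}$ stands for the observed words of the document pair, $\mathbf{h}$ for all latent variables (association factor, topic proportions, topic assignments), and $Q$ for the variational distribution. These symbols are chosen for this note and are not taken from the slides.

```latex
\log p(\mathbf{w} \mid M)
  = \underbrace{\mathbb{E}_{Q}\big[\log p(\mathbf{w}, \mathbf{h} \mid M)\big]
      - \mathbb{E}_{Q}\big[\log Q(\mathbf{h})\big]}_{\text{lower bound}}
  + \mathrm{KL}\big(Q(\mathbf{h}) \,\|\, p(\mathbf{h} \mid \mathbf{w}, M)\big)
```

Because the KL term is non-negative and the left-hand side does not depend on $Q$, maximizing the lower bound over the variational parameters is the same as minimizing the KL divergence to the true posterior.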
Variational Parameters
Model Fitting
LAA Ranking Methods
Direct Ranking
  Ranking function for a candidate document pair
  Word frequency can distort the probability
Latent Ranking
Two-Step Ranking
Separate Topic Models
  Source document has topic proportion
  Target document has topic proportion
Topic-Level Mapping
  Canonical Correlation Analysis captures the association between the topic proportions
Rank Target Documents
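The CCA step can be stated with the standard textbook objective. Writing $\theta_s$ and $\theta_t$ for the source and target topic proportions (symbols assumed here, since the slide's notation was lost), with covariance matrices $\Sigma_{ss}$ and $\Sigma_{tt}$ and cross-covariance $\Sigma_{st}$, the first pair of canonical directions solves:

```latex
(a^{*}, b^{*}) = \arg\max_{a,\, b}\;
  \frac{a^{\top} \Sigma_{st}\, b}
       {\sqrt{a^{\top} \Sigma_{ss}\, a}\;\sqrt{b^{\top} \Sigma_{tt}\, b}}
```

Later pairs of directions maximize the same ratio subject to being uncorrelated with the earlier ones. How the projected proportions are then scored to rank target documents is not recoverable from the slide.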
Experiments
Datasets
  IT-Change: changes made to an IT environment and the consequent problems
    24,317 document pairs
    20,000 used for training, the rest used for testing
  IT-Solution: IT problems and their solutions
    19,696 document pairs
    15,000 used for training, the rest used for testing
Evaluation
  Randomly select 100 document pairs from the testing dataset
  For each source document, rank the 100 target documents
  Use the rank of the correct target document as the accuracy measurement
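The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, assuming each ranking method produces a score matrix in which row i holds the scores of source document i against every candidate target, the correct target for row i sits in column i, and a higher score means a better match. The function names and the tiny 3x3 matrix are hypothetical.

```python
def rank_of_correct_target(scores, correct_index):
    """Return the 1-based rank of the correct target (higher score is better).

    Ties are resolved optimistically: only strictly higher scores count.
    """
    correct_score = scores[correct_index]
    better = sum(1 for s in scores if s > correct_score)
    return better + 1


def mean_rank(score_matrix):
    """Average rank, assuming the correct target for row i is column i."""
    ranks = [rank_of_correct_target(row, i) for i, row in enumerate(score_matrix)]
    return sum(ranks) / len(ranks)


# Hypothetical scores for 3 source documents against 3 candidate targets.
score_matrix = [
    [0.9, 0.1, 0.3],  # correct target (column 0) is ranked 1st
    [0.8, 0.5, 0.2],  # correct target (column 1) is ranked 2nd
    [0.4, 0.6, 0.5],  # correct target (column 2) is ranked 2nd
]
average = mean_rank(score_matrix)
print(average)
```

In the setting described on the slide, the matrix would be 100 by 100 and a lower average rank indicates a better method.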
Accuracy Analysis
Example
Summary
The LAA framework is capable of modeling two document sets associated by a bipartite graph
One-to-many mappings or many-to-one mappings of documents are taken into consideration
We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
The LAA-latent ranking algorithm ranks the correct target document better than other state-of-the-art algorithms
Acknowledgment
Prof. Louise E. Moser
Prof. Xifeng Yan
Dr. Shu Tao
Dr. Ziyu Guan
Dr. Nikos Anerousis
Q & A?
Thanks!
Unigram Model
$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$
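The unigram model treats a document as N independent draws from a single word distribution, so its likelihood is the product of the individual word probabilities. A minimal sketch, with a hypothetical three-word vocabulary and made-up probabilities:

```python
import math


def unigram_log_likelihood(words, word_probs):
    """log p(w) = sum over n of log p(w_n) under a unigram model."""
    return sum(math.log(word_probs[w]) for w in words)


# Hypothetical word distribution and document.
word_probs = {"db2": 0.5, "logon": 0.25, "error": 0.25}
doc = ["db2", "logon", "db2"]
log_likelihood = unigram_log_likelihood(doc, word_probs)
print(math.exp(log_likelihood))  # 0.5 * 0.25 * 0.5
```

Working in log space avoids numerical underflow for documents of realistic length.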
Mixture of Unigrams
$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
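In a mixture of unigrams, one topic z is drawn per document and every word is then drawn from that topic's word distribution, so the likelihood sums over topics outside the product over words. A minimal sketch with two hypothetical topics and made-up probabilities, using log-sum-exp for numerical stability:

```python
import math


def mixture_log_likelihood(words, topic_priors, topic_word_probs):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z), computed via log-sum-exp."""
    log_terms = []
    for prior, probs in zip(topic_priors, topic_word_probs):
        log_terms.append(math.log(prior) + sum(math.log(probs[w]) for w in words))
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Hypothetical two-topic model over a two-word vocabulary.
topic_priors = [0.5, 0.5]
topic_word_probs = [
    {"db2": 0.8, "fever": 0.2},
    {"db2": 0.2, "fever": 0.8},
]
doc = ["db2", "db2"]
log_likelihood = mixture_log_likelihood(doc, topic_priors, topic_word_probs)
print(math.exp(log_likelihood))  # 0.5 * 0.64 + 0.5 * 0.04
```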
Probabilistic Latent Semantic Indexing
$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$
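pLSI lets each document mix several topics: the probability of a word in document d sums over topics, weighting each topic's word probability by that document's topic weight. A minimal sketch with hypothetical numbers (the function name `plsi_joint` is chosen for this note):

```python
def plsi_joint(doc_prob, topic_given_doc, word_given_topic, word):
    """p(d, w) = p(d) * sum_z p(w | z) * p(z | d)."""
    return doc_prob * sum(
        word_given_topic[z][word] * p_z for z, p_z in enumerate(topic_given_doc)
    )


# Hypothetical parameters for one document and two topics.
doc_prob = 0.5                  # p(d)
topic_given_doc = [0.75, 0.25]  # p(z | d)
word_given_topic = [            # p(w | z)
    {"db2": 0.8, "fever": 0.2},
    {"db2": 0.2, "fever": 0.8},
]
joint = plsi_joint(doc_prob, topic_given_doc, word_given_topic, "db2")
print(joint)  # 0.5 * (0.8 * 0.75 + 0.2 * 0.25)
```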
LDA and CTM
[Figure: two topic simplexes, each with vertices topic 1, topic 2, and topic 3]