Latent Association Analysis of Document Pairs

Gengxin Miao
University of California, Santa Barbara

Presented at the IBM T.J. Watson Research Center

Hawthorne, NY
December 2, 2011


Networked Texts

[Figure: networked-text examples. Diagram labels: nodes A to H; texts t1, t2, t3; "DB2 logon"; Diseases, Symptoms, Treatments]

Texts flow on expert networks

Semantically associated texts

Interconnected text streams

Users

Queries

Web pages

Belong to the same search task

Semantically Associated Documents



Applications

Software system maintenance: root cause finding, problem prediction

Machine translation

Question answering

Healthcare assistance


Huge Datasets

Beyond human learner’s capability


Modeling Options

Word-level mapping

Topic-level mapping

Document-level mapping

Source Document Set

Target Document Set


Word-Level Mapping (UAI’09)

Learns a dictionary between the two document sets

Applies to machine translation

Word mappings are typically noisy


Topic-Level Mapping (EMNLP’09)

Assumes the associated documents share the same topic proportions

Works well for translations between languages

Topic simplex of the source document set

Topic simplex of the target document set


Document-Level Mapping (our work)

One-to-many or many-to-one mappings are broken down into one-to-one document pairs

Two documents are associated by their association factor


Latent Association Analysis – Framework

Generative process
Draw an association factor for each document pair
Draw topic proportions for both the source and the target document
Draw the words in each document

Generative Models Ranking Algorithms Experiment


Latent Association Analysis – An Instantiation

Canonical Correlation Analysis (CCA): captures the semantic association in document pairs

Correlated Topic Model (CTM): captures the document and word co-occurrence



The Generative Process

A pair of documents arises from the following process:
Draw an L-dimensional association factor
For the source/target document, draw the topic proportions
For each word in the documents, draw a topic and a word
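The three steps above can be sketched as a small sampler. The slide's formulas were not preserved, so the distributions here are assumptions: a standard-normal association factor `y`, logistic-normal topic proportions whose mean is `T @ y + mu` (in the spirit of probabilistic CCA combined with CTM), and per-topic word distributions `beta`. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(v):
    """Map a real vector onto the topic simplex."""
    e = np.exp(v - v.max())
    return e / e.sum()

def sample_document(rng, y, T, mu, Sigma, beta, n_words):
    """Sample one document given the shared association factor y (assumed form)."""
    # Logistic-normal topic proportions; the mean is shifted by the association factor.
    eta = rng.multivariate_normal(T @ y + mu, Sigma)
    theta = softmax(eta)
    # Draw one topic per word, then one word from that topic's distribution.
    z = rng.choice(len(theta), size=n_words, p=theta)
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, words

def sample_pair(rng, L, params_s, params_t, n_words=20):
    """Sample a (source, target) document pair that shares one association factor."""
    y = rng.standard_normal(L)  # L-dimensional association factor
    src = sample_document(rng, y, *params_s, n_words)
    tgt = sample_document(rng, y, *params_t, n_words)
    return y, src, tgt

def random_params(rng, L, K, V):
    """Hypothetical random parameters: K topics, vocabulary size V."""
    T = rng.standard_normal((K, L))            # maps association factor to topic space
    mu = np.zeros(K)
    Sigma = 0.1 * np.eye(K)
    beta = rng.dirichlet(np.ones(V), size=K)   # one word distribution per topic
    return T, mu, Sigma, beta
```

The source and target sides may use different numbers of topics and different vocabularies; only the association factor is shared within a pair.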


Problems

Inference
Given a model M and a document pair, how do we determine the association factor, topic proportions, and topic assignments that best describe the document pair?

Model fitting
Given a set of document pairs, how do we calculate the parameters in M that best describe the entire document pair set?


Inference

Objective function
Given a model and a document pair, calculate the topic assignments and the topic proportions

The posterior distribution is intractable to compute
The topic assignments and the topic proportions are correlated when conditioned on the observations


Variational Inference

Decouple the parameters using a variational distribution Q

Fit the variational parameters to approximate the true posterior distribution
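For reference, the standard variational lower bound that this kind of fitting maximizes can be written as follows. The notation is assumed, since the slide's own symbols were not preserved: y is the association factor, η the (unnormalized) topic proportions, z the topic assignments, and w_s, w_t the words of the source and target documents.

```latex
\log p(\mathbf{w}_s, \mathbf{w}_t \mid M)
\;\ge\;
\mathbb{E}_{Q}\!\left[\log p(\mathbf{y}, \boldsymbol{\eta}, \mathbf{z}, \mathbf{w}_s, \mathbf{w}_t \mid M)\right]
- \mathbb{E}_{Q}\!\left[\log Q(\mathbf{y}, \boldsymbol{\eta}, \mathbf{z})\right]
```

The gap between the two sides is the KL divergence from Q to the true posterior, so tightening the bound over the variational parameters moves Q toward the posterior.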



Variational Parameters


Model Fitting


LAA Ranking Methods

Direct Ranking
Ranking function for a candidate document pair

Word frequency can distort the probability

Latent Ranking


Two-Step Ranking

Separate Topic Models
Source document has topic proportion
Target document has topic proportion

Topic-Level Mapping
Canonical Correlation Analysis captures the association between the topic proportions

Rank Target Documents
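A minimal sketch of the two-step idea, assuming the topic proportions have already been estimated by separate topic models and are given as row-wise matrices `X` (source) and `Y` (target). The CCA below is a plain NumPy implementation via whitening and SVD; the small ridge term `reg` and the distance-based score are illustrative choices, not taken from the slides. All names are hypothetical.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def fit_cca(X, Y, k, reg=1e-3):
    """Fit k canonical directions between paired rows of X and Y."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Topic proportions sum to one, so their covariance is singular;
    # the ridge term keeps the whitening step well defined.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return {"mx": mx, "my": my,
            "A": Wx @ U[:, :k], "B": Wy @ Vt.T[:, :k], "corr": s[:k]}

def rank_targets(model, theta_s, candidates):
    """Return candidate indices ordered from best to worst match."""
    u = (theta_s - model["mx"]) @ model["A"]        # source in canonical space
    V = (candidates - model["my"]) @ model["B"]     # targets in canonical space
    # Predict the target's canonical coordinates as corr * u, then rank by distance.
    dist = np.linalg.norm(V - u * model["corr"], axis=1)
    return np.argsort(dist)
```

In use, `fit_cca` would be called once on the training pairs, and `rank_targets` once per test source document against its candidate targets.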


Experiments

Datasets

IT-Change: changes made to an IT environment and the consequent problems
24,317 document pairs
20,000 used for training, the rest used for testing

IT-Solution: IT problems and their solutions
19,696 document pairs
15,000 used for training, the rest used for testing

Evaluation
Randomly select 100 document pairs from the testing dataset
For each source document, rank the 100 target documents
Use the rank of the correct target document as the accuracy measurement
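The evaluation protocol above can be sketched as follows, assuming a square score matrix in which `scores[i, j]` is the model's score for pairing source `i` with target `j` and the correct target for source `i` is target `i`. The function name and the optimistic tie handling are assumptions.

```python
import numpy as np

def rank_of_correct(scores):
    """Return the 1-based rank of the correct target for each source (1 = best).

    Ties are resolved optimistically: only strictly higher scores push the
    correct target down the list.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.diag(scores)[:, None]            # score of the true pair, per row
    return 1 + (scores > correct).sum(axis=1)     # count candidates that beat it

scores = [[0.9, 0.1, 0.3],
          [0.8, 0.2, 0.5],
          [0.1, 0.2, 0.7]]
print(rank_of_correct(scores))  # [1 3 1]
```

With 100 sampled test pairs the matrix would be 100 by 100, and the resulting ranks can be averaged or plotted as a cumulative distribution.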



Accuracy Analysis


Example


Summary

The LAA framework is capable of modeling two document sets associated by a bipartite graph

One-to-many mappings or many-to-one mappings of documents are taken into consideration

We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications

The LAA-latent ranking algorithm ranks the correct target document better than other state-of-the-art algorithms


Acknowledgment

Prof. Louise E. Moser

Prof. Xifeng Yan

Dr. Shu Tao

Dr. Ziyu Guan

Dr. Nikos Anerousis

Q & A?

Thanks!


Unigram Model

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$


Mixture of Unigrams

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$
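As a small numerical illustration of the unigram and mixture-of-unigrams likelihoods, the sketch below evaluates both in log space. It assumes a document is given as a list of word indices; the function names and array layouts are hypothetical.

```python
import numpy as np

def unigram_loglik(words, p_w):
    """log p(w) = sum_n log p(w_n) under a single word distribution p_w."""
    return float(np.log(p_w[words]).sum())

def mixture_loglik(words, p_z, p_w_given_z):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z).

    p_z has shape (K,), p_w_given_z has shape (K, V).
    """
    # Per-topic joint log-probability of the topic and the whole document.
    per_topic = np.log(p_z) + np.log(p_w_given_z[:, words]).sum(axis=1)
    # Log-sum-exp over topics for numerical stability.
    m = per_topic.max()
    return float(m + np.log(np.exp(per_topic - m).sum()))
```

When the mixture has a single topic, or all topics share the same word distribution, the two functions agree, which is a convenient sanity check.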


Probabilistic Latent Semantic Indexing

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$


LDA and CTM

[Figure: two topic simplexes, each with corners labeled topic 1, topic 2, and topic 3]