Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang
Joint work with Jie Tang, Wei Fan and Songcan Chen
Framework of Learning to Rank
Example: Academic Network
Ranking over Web 2.0
Traditional Web: for standard (long) documents, relevance measures such as BM25 and the PageRank score may play a key role
Web 2.0: for shorter, non-standard documents, users' click-through data and comments may be much more important
Heterogeneous transfer ranking
If there is not sufficient supervision in the domain of interest, how can one borrow labeled information from a related but heterogeneous domain to build an accurate model?
Differences from transfer learning
What to transfer
Instance type
What we care about
Feature extraction
Main Challenges
How to formalize the problem in a unified framework? Both the feature distributions and the object types in the source domain and the target domain may differ.
How to transfer the knowledge of heterogeneous objects across domains?
How to preserve the preference relationships between instances across heterogeneous data sources?
Outline Motivation Problem Formulation Transfer Ranking
Basic Idea The proposed algorithm Generalization bound
Experiment Ranking on Homogeneous data Ranking on Heterogeneous data
Conclusion
Problem Formulation
Source domain: an instance space and a rank level set
Target domain: an instance space and a rank level set
The two domains are heterogeneous but related.
Problem Definition: given the training data of the source and target domains, the goal is to learn a ranking function for predicting the rank levels of the test set.
Outline Motivation Problem Formulation Transfer Ranking
Basic Idea The proposed algorithm Generalization bound
Experiment Ranking on Homogeneous data Ranking on Heterogeneous data
Conclusion
Basic Idea
Because the feature distributions, or even the object types, may differ across domains, we resort to finding a common latent space in which the preference relationships of both the source and target domains are preserved.
We can directly use a ranking loss function to evaluate how well the preferences are preserved in that latent space.
We optimize the two ranking loss functions simultaneously to find the best latent space.
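The idea above can be sketched as a single objective: project each domain into a shared latent space and sum the two pairwise ranking losses there. This is an illustrative sketch, not the paper's exact formulation; the per-domain projections `Ws`/`Wt`, the shared ranking vector `w`, and the hinge loss are our assumptions.

```python
import numpy as np

def hinge_rank_loss(scores, pairs):
    # pairs: list of (i, j) meaning item i should rank above item j
    i, j = zip(*pairs)
    return float(np.maximum(0.0, 1.0 - (scores[list(i)] - scores[list(j)])).sum())

def joint_objective(Ws, Wt, w, Xs, pairs_s, Xt, pairs_t, lam=0.1):
    """Sum of the source- and target-domain ranking losses computed in a
    shared k-dimensional latent space, plus an L2 regularizer.
    Ws (ds x k) and Wt (dt x k) map each domain's features into the
    common space; w (k,) is the ranking function shared by both domains."""
    loss_s = hinge_rank_loss(Xs @ Ws @ w, pairs_s)
    loss_t = hinge_rank_loss(Xt @ Wt @ w, pairs_t)
    reg = lam * (np.sum(Ws**2) + np.sum(Wt**2) + np.sum(w**2))
    return loss_s + loss_t + reg
```

Minimizing this jointly over the projections (and `w`) is what "finding the best latent space" amounts to in this sketch.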
The Proposed Algorithm
Given the labeled data in the source domain, we aim to learn a ranking function that satisfies the labeled preference relationships.
A pairwise ranking loss function is defined over these preferences, and the latent space is described by a projection matrix.
The Framework
Ranking SVM
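As a concrete reference point for the Ranking SVM component, here is a minimal linear Ranking SVM trained by subgradient descent on the pairwise hinge loss. It is a sketch under our own choices of regularization constant, learning rate, and iteration count, not the solver used in the paper.

```python
import numpy as np

def rank_svm_fit(X, pairs, C=1.0, lr=0.01, epochs=200):
    """Linear Ranking SVM via subgradient descent on
    sum_{(i,j)} max(0, 1 - w.(x_i - x_j)) + ||w||^2 / (2C),
    where each pair (i, j) means x_i should be ranked above x_j."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = w / C                   # gradient of the L2 term
        for i, j in pairs:
            d = X[i] - X[j]
            if 1.0 - w @ d > 0.0:      # margin violated: push w toward d
                grad -= d
        w -= lr * grad
    return w
```

Scoring new items is then just `X_test @ w`; sorting by the score gives the predicted ranking.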
Generalization Bound
Scalability
Let d be the total number of distinct features in the two domains; then the matrix D is d×d and W is d×2, so the method can be applied to very large-scale data as long as the number of features is not too large.
Complexity
Ranking SVM training has O((n1 + n2)^3) time and O((n1 + n2)^2) space complexity. In our algorithm Tr2SVM, with T the maximal number of iterations, training takes O((2T + 1)(n1 + n2)^3) time and O((n1 + n2)^2) space.
Outline Motivation Problem Formulation Transfer Ranking
Basic Idea The proposed algorithm Generalization bound
Experiment Ranking on Homogeneous data Ranking on Heterogeneous data
Conclusion
Data Set
LETOR 2.0
Three sub-datasets: TREC2003, TREC2004, and OHSUMED, each a collection of query-document pairs.
TREC data: a topic distillation task which aims to find good entry points principally devoted to a given topic.
OHSUMED data: a collection of records from medical journals.
LETOR_TR
Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR.
Data Set (Cont’d)
Data Set (Cont’d)
Experiment Setting
Baselines:
Measures: MAP (mean average precision) and NDCG (normalized discounted cumulative gain)
Three transfer ranking tasks: from S1 to T1, from S2 to T2, and from S3 to T3
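For reference, minimal implementations of the two measures (our own sketch; LETOR's official evaluation scripts may differ in details such as tie handling and cutoffs):

```python
import numpy as np

def average_precision(rels):
    """AP for one query: `rels` is the ranked list of binary relevance labels."""
    rels = np.asarray(rels, dtype=float)
    hits = np.cumsum(rels)
    precisions = hits / (np.arange(len(rels)) + 1)
    return float((precisions * rels).sum() / max(rels.sum(), 1.0))

def ndcg_at_k(rels, k):
    """NDCG@k for one query: `rels` is the ranked list of graded labels.
    Uses gain 2^rel - 1 and a log2 position discount."""
    rels = np.asarray(rels, dtype=float)

    def dcg(r):
        r = r[:k]
        return float(((2.0**r - 1.0) / np.log2(np.arange(2, len(r) + 2))).sum())

    ideal = dcg(np.sort(rels)[::-1])
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

MAP is then the mean of `average_precision` over all queries.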
Why Effective?
Why is transfer ranking effective on the LETOR_TR dataset? Because the features used in ranking already contain relevance information between queries and documents.
Outline Motivation Problem Formulation Transfer Ranking
Basic Idea The proposed algorithm Generalization bound
Experiment Ranking on Homogeneous data Ranking on Heterogeneous data
Conclusion
Data Set
A subset of ArnetMiner: 14,134 authors, 10,716 papers, and 1,434 conferences.
The 8 most frequent queries from the log file: 'information extraction', 'machine learning', 'semantic web', 'natural language processing', 'support vector machine', 'planning', 'intelligent agents', and 'ontology alignment'.
Author collection: for each query, we gathered authors from Libra, Rexa, and ArnetMiner.
Conference collection: for each query, we gathered conferences from Libra and ArnetMiner.
Evaluation
One faculty member and two graduate students judged the relevance between each query and the authors/conferences.
Feature Definition
All features are defined between queries and virtual documents.
Conference: use all the titles of papers published at a conference to form the conference's "document".
Author: use all the titles of papers authored by an expert as the expert's "document".
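The virtual-document construction can be sketched as follows; the function names and the simple term-frequency feature are our illustration, not the paper's actual feature set:

```python
from collections import Counter

def virtual_document(titles):
    """Concatenate an object's paper titles into one lowercase 'document'."""
    return " ".join(titles).lower()

def term_frequency_feature(query, doc):
    """A simple query-document feature: total count of the query's words in the document."""
    counts = Counter(doc.split())
    return sum(counts[w] for w in query.lower().split())
```

The same construction serves authors (titles of their papers) and conferences (titles of papers published there), which is what makes features comparable across the two object types.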
Feature Definition (Cont’d)
Experimental Results
Why Effective?
Why can our approach be effective on the heterogeneous network? Because of the latent dependencies between the objects, common features can still be extracted from those dependencies.
Conclusion
Conclusion (Cont’d)
We formally define the transfer ranking problem and propose a general framework.
We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains, and we derive a generalization bound.
The experimental results on LETOR and a heterogeneous academic network verify the effectiveness of the proposed algorithm.
Future Work
Develop new algorithms under the framework
Reduce the time complexity for online usage
Handle negative transfer: measure the similarity between queries and actively select similar queries
Thanks!
Your Question. Our Passion.