Predictive Modeling with Heterogeneous Sources
Xiaoxiao Shi (1), Qi Liu (2), Wei Fan (3), Qiang Yang (4), Philip S. Yu (1)
(1) University of Illinois at Chicago
(2) Tongji University, China
(3) IBM T. J. Watson Research Center
(4) Hong Kong University of Science and Technology
Why learning with heterogeneous sources?
Standard Supervised Learning
Training (labeled): New York Times → Classifier → Test (unlabeled): New York Times
85.5%
Why heterogeneous sources?
In Reality…
Training (labeled): New York Times → Test (unlabeled): New York Times
Labeled data are insufficient!
47.3%
How to improve the performance?
Why heterogeneous sources?
Labeled data from other sources: Reuters → Target domain test (unlabeled): New York Times
82.6% (vs. 47.3%)
1. Different distributions
2. Different outputs
3. Different feature spaces
Real world examples
• Social Network:
  – Can various bookmarking systems help predict social tags for a new system, given that their outputs (social tags) and data (documents) are different?
  – Wikipedia, ODP, Backflip, Blink, …… → ?
Real world examples
• Applied Sociology:
  – Can the suburban housing price census data help predict the downtown housing prices?

#rooms  #bathrooms  #windows  price
5       2           12        XXX
6       3           11        XXX

#rooms  #bathrooms  #windows  price
2       1           4         XXXXX
4       2           5         XXXXX
Other examples
• Bioinformatics
  – Previous years’ flu data → new swine flu
  – Drug efficacy data against breast cancer → drug data against lung cancer
  – ……
• Intrusion detection
  – Existing types of intrusions → unknown types of intrusions
• Sentiment analysis
  – Review from SDM → Review from KDD
Learning with Heterogeneous Sources
• The paper mainly attacks two sub-problems:
  – Heterogeneous data distributions
    • Clustering-based KL divergence and a corresponding sampling technique
  – Heterogeneous outputs (for regression problems)
    • Unifying outputs by preserving similarity
Learning with Heterogeneous Sources
• General Framework
  [Figure: source data and target data pass through two steps, unifying data distributions and unifying outputs.]
Unifying Data Distributions
• Basic idea:
  – Combine the source and target data and perform clustering.
  – Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence.
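
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's exact procedure: the slides do not name the clustering algorithm or give the KL formula, so the sketch assumes plain k-means and scores each cluster by the KL divergence between its target/source mix and the overall target/source mix. The names `kmeans`, `bernoulli_kl`, `select_clusters`, the threshold, and the toy data are all hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means on tuples of floats; returns one cluster label per point."""
    ordered = sorted(points)
    # Deterministic start (an assumption, for reproducibility):
    # k centroids spread evenly through the sorted points.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); requires 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:  # 0 * log(0) is treated as 0
            total += a * math.log(a / b)
    return total

def select_clusters(source, target, k, threshold):
    """Cluster source+target together; keep clusters whose mix matches the overall mix."""
    points = list(source) + list(target)
    is_target = [False] * len(source) + [True] * len(target)
    labels = kmeans(points, k)
    overall = len(target) / len(points)  # overall fraction of target points
    selected = []
    for c in range(k):
        flags = [t for t, lab in zip(is_target, labels) if lab == c]
        if flags and bernoulli_kl(sum(flags) / len(flags), overall) <= threshold:
            selected.append(c)
    # Sampling step: keep only the source points that fall in selected clusters.
    kept_source = [p for p, lab in zip(source, labels) if lab in selected]
    return selected, kept_source

# Toy data (made up): the source has an extra group near 10 that the target lacks.
source = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,),
          (10.0,), (10.1,), (10.2,), (10.3,)]
target = [(0.1,), (0.3,), (0.5,), (5.1,), (5.3,), (5.5,)]
selected, kept_source = select_clusters(source, target, k=3, threshold=0.1)
print(selected, kept_source)
```

In this toy run the cluster near 10 contains source points only, so its mix diverges from the overall mix and its source points are dropped. The threshold is a tuning knob in this sketch and would need to be chosen per data set.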
Unifying Outputs
• Basic idea:
  – Generate initial outputs according to the regression model.
  – For instances that are similar in the original output space, make their new outputs closer.
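
A minimal Python sketch of this idea follows. It rests on assumptions the slides do not confirm: the regression model is taken to be a one-feature least-squares line fitted on labeled target data, "similar" is measured with a Gaussian kernel on the original outputs, and outputs are pulled together by repeated neighbor averaging. The names `fit_line` and `unify_outputs`, the parameters `sigma` and `alpha`, and all numbers are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def unify_outputs(initial, original, sigma=1.0, alpha=0.5, iters=50):
    """Pull together the new outputs of instances whose ORIGINAL outputs are similar."""
    n = len(initial)
    # Gaussian similarity in the original output space (assumed kernel).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                gap = original[i] - original[j]
                w[i][j] = math.exp(-(gap * gap) / (2 * sigma ** 2))
    new = list(initial)
    for _ in range(iters):
        updated = []
        for i in range(n):
            total = sum(w[i])
            if total > 0:
                neighbor_mean = sum(w[i][j] * new[j] for j in range(n)) / total
            else:
                neighbor_mean = new[i]
            # Blend the regression output with the similarity-weighted neighbors.
            updated.append((1 - alpha) * initial[i] + alpha * neighbor_mean)
        new = updated
    return new

# Toy data (made up). Target outputs live on one scale, source outputs on another.
target_x, target_y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
source_x = [1.0, 1.5, 4.0, 4.2]
source_y = [1.0, 1.0, 5.0, 5.1]   # original source outputs, different scale

a, b = fit_line(target_x, target_y)
initial = [a * x + b for x in source_x]          # initial outputs from the regression model
new_outputs = unify_outputs(initial, source_y, sigma=0.5)
print(initial)
print(new_outputs)
```

Here the first two source instances share the same original output, so their new outputs end up closer to each other than their initial regression outputs were, while the two groups stay well separated. The blend weight `alpha` controls how strongly similarity overrides the regression model in this sketch.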
Conclusions
• Problem: Learning with Heterogeneous Sources
  – Heterogeneous data distributions
  – Heterogeneous outputs
• Solution:
  – Clustering-based KL divergence helps perform sampling
  – Similarity-preserving output generation helps unify outputs