Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji

Predictive Modeling with Heterogeneous Sources

Xiaoxiao Shi1 Qi Liu2 Wei Fan3 Qiang Yang4 Philip S. Yu1

1 University of Illinois at Chicago2 Tongji University, China

3 IBM T. J. Watson Research Center

4 Hong Kong University of Science and Technology

1/18

Why learning with heterogeneous sources?

New York Times

Training (labeled)

Test (unlabeled)

Classifier

New York Times

85.5%

Standard Supervised Learning

2/18New York Times

Training (labeled)

Test (unlabeled)

New York TimesLabeled data are

insufficient!

47.3%

How to improve the

performance?

In Reality…

Why heterogeneous sources?

3/18

Why heterogeneous sources?

Reuters

Labeled data from other sources

Target domaintest (unlabeled)

New York Times

82.6%

1. Different distributions

2. Different outputs

3. Different feature spaces

47.3%

Real world examples

• Social Network:– Can various bookmarking systems help predict social tags for a

new system given that their outputs (social tags) and data (documents) are different?

Wikipedia ODP Backflip Blink

……

?4/18

Real world examples

• Applied Sociology:– Can the suburban housing price census data help predict the

downtown housing prices?

?

#rooms #bathrooms #windows price

5 2 12 XXX

6 3 11 XXX

#rooms #bathrooms #windows price

2 1 4 XXXXX

4 2 5 XXXXX 5/18

Other examples

• Bioinformatics– Previous years’ flu data new swine flu– Drug efficacy data against breast cancer

drug data against lung cancer– ……

• Intrusion detection– Existing types of intrusions unknown

types of intrusions • Sentiment analysis

– Review from SDM Review from KDD

6/18

Learning with Heterogeneous Sources

• The paper mainly attacks two sub-problems:– Heterogeneous data distributions

• Clustering based KL divergence and a corresponding sampling technique

– Heterogeneous outputs (to regression problem)

• Unifying outputs via preserving similarity.

7/18

Learning with Heterogeneous Sources

• General Framework

Unifying data distributions

Unifying outputs

Source data

Target data

Source data Target data

8/18

Unifying Data Distributions

• Basic idea: – Combine the source and target data and

perform clustering.– Select the clusters in which the target and

source data are similarly distributed, evaluated by KL divergence.

9/18

An Example

D T

Combined Data

Adaptive Clustering

10/18

Unifying Outputs

• Basic idea:– Generate initial outputs according to the

regression model– For the instances similar in the original output

space, make their new outputs closer.

11/18

12/18

16 3726.521.25 31.75

Initia

l Ou

tpu

ts

Initia

l Ou

tpu

ts

Mo

dificatio

n Mo

dificatio

n

Experiment

• Bioinformatics data set:

13/18

Experiment

14/18

Experiment

• Applied sociology data set:

15/18

Experiment

16/18

17/18

• Problem: Learning with Heterogeneous Sources:• Heterogeneous data distributions• Heterogeneous outputs

• Solution:• Clustering based KL divergence help perform

sampling• Similarity preserving output generation help

unify outputs

Conclusions

18/18

Thanks!

Documents

Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji