Data Mining and Machine Learning Lab Unsupervised Feature Selection for Linked Social Media Data Jiliang Tang and Huan Liu Computer Science and Engineering Arizona State University August 12-16, 2012 KDD2012


Page 1

Data Mining and Machine Learning Lab

Unsupervised Feature Selection for Linked Social Media Data

Jiliang Tang and Huan Liu

Computer Science and Engineering

Arizona State University

August 12-16, 2012 KDD2012

Page 2

Social Media

• The expansive use of social media generates massive data at an unprecedented rate

- 250 million tweets per day

- 3,000 photos on Flickr per minute

- 153 million blogs posted per year

Page 3

High-dimensional Social Media Data

• Social media data can be high-dimensional

– Photos

– Video streams

– Tweets

• Presenting new challenges

– Massive and noisy data

– Curse of dimensionality

Page 4

Feature Selection

• Feature selection is an effective way of preparing high-dimensional data for efficient data mining.

• What is new for feature selection of social media data?

Page 5

Representation of Linked Data

[Figure: eight linked instances u1, ..., u8 and their features f1, ..., fm, shown together with binary (0/1) matrix representations.]

Page 6

Challenges for Feature Selection

• Unlabeled data

- No explicit definition of feature relevancy

- Without additional constraints, many subsets of features could be equally good

• Linked data

- Not independent and identically distributed

Page 7

Opportunities for Feature Selection

• Social media data provides link information

- Correlation between data instances

• Social media data provides extra constraints

- Enabling us to explore the use of social theories

Page 8

Problem Statement

• Given n linked data instances, their attribute-value representation X, and their link representation R, we want to select a subset of features for these n instances by exploiting both X and R in an unsupervised scenario.
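As a minimal sketch of the two inputs, the snippet below stores features as rows of X (an assumption, chosen to agree with the deck's later use of diag(s)X) and treats links as undirected, so R is a symmetric 0/1 matrix (also an assumption). The helper name `build_link_matrix` and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def build_link_matrix(n, edges):
    """Return an n x n symmetric 0/1 matrix R with R[i, j] = 1 when i and j are linked."""
    R = np.zeros((n, n), dtype=int)
    for i, j in edges:
        R[i, j] = 1
        R[j, i] = 1  # assumption: links are undirected
    return R

# Toy data (hypothetical): m = 3 features as rows, n = 4 instances as columns.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

# Instances 0-1, 1-2 and 2-3 are linked.
R = build_link_matrix(4, [(0, 1), (1, 2), (2, 3)])
```

Both matrices describe the same n instances: column i of X holds the attributes of instance i, and row i of R lists its links.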

Page 9

Supervised and Unsupervised Feature Selection

• A unified view

– Select features that are consistent with some constraints, for either supervised or unsupervised feature selection

– Class labels serve as targets, which act as a constraint

• Two problems for unsupervised feature selection

- What are the targets?

- Where can we find constraints?

Page 10

Our Framework: LUFS

Page 11

The Target for LUFS

Page 12

The Constraints for LUFS

Page 13

Pseudo-class Label

• s is a selection vector

- s(j) = 1 if the j-th feature is selected, s(j) = 0 otherwise

- X = diag(s)X, i.e. the rows of unselected features are zeroed out

• Y is the pseudo-class label indicator matrix

- Y =

- ||Y(:,i) =
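The effect of the selection vector can be illustrated with a few lines of NumPy. This is only a sketch: it assumes X is stored with features as rows (which is what diag(s)X implies), and the function name `apply_selection` and the toy numbers are hypothetical.

```python
import numpy as np

def apply_selection(X, s):
    """Zero out the rows of unselected features; equivalent to np.diag(s) @ X."""
    s = np.asarray(s)
    return s[:, None] * X  # broadcasting avoids building the m x m diagonal matrix

# Toy data (hypothetical): 3 features as rows, 3 instances as columns.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
s = np.array([1, 0, 1])  # keep features 0 and 2, drop feature 1

X_sel = apply_selection(X, s)
```

Multiplying row-wise by s gives the same result as the explicit product with diag(s) while avoiding an m × m matrix when the number of features is large.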

Page 14

Social Dimension for Link Information

• A social dimension captures the group behavior of linked instances

– Instances in different social dimensions are dissimilar

– Instances within a social dimension are similar

• Example:

Page 15

Social Dimension Regularization

• Within, between, and total social dimension scatter matrices,

• Instances are similar within social dimensions while dissimilar between social dimensions.
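The formulas on this slide did not survive, so the following is a hedged sketch using the standard grouped-scatter definitions, with each social dimension treated as one group. Which representation the deck applies them to is not shown in the text, so `Z` is simply a generic d × n matrix with one column per instance; the function name and toy data are hypothetical.

```python
import numpy as np

def scatter_matrices(Z, groups):
    """Z: d x n array (one column per instance); groups: length-n group ids.

    Returns (S_w, S_b, S_t), the within, between and total scatter matrices.
    """
    Z = np.asarray(Z, dtype=float)
    groups = np.asarray(groups)
    mean_all = Z.mean(axis=1, keepdims=True)
    d = Z.shape[0]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for g in np.unique(groups):
        Zg = Z[:, groups == g]
        mean_g = Zg.mean(axis=1, keepdims=True)
        centered = Zg - mean_g
        S_w += centered @ centered.T          # spread inside the group
        diff = mean_g - mean_all
        S_b += Zg.shape[1] * (diff @ diff.T)  # spread of group means, weighted by size
    centered_all = Z - mean_all
    S_t = centered_all @ centered_all.T
    return S_w, S_b, S_t

# Toy data (hypothetical): two tight, well-separated groups.
Z = np.array([[0.0, 0.2, 4.0, 4.2],
              [1.0, 1.1, 3.0, 3.1]])
groups = [0, 0, 1, 1]
S_w, S_b, S_t = scatter_matrices(Z, groups)
```

With these definitions S_t = S_w + S_b, so a grouping in which instances are similar within groups and dissimilar between them shows up as a small trace of S_w relative to S_b.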

Page 16

Constraint from Attribute-Value Data

• Similar instances in terms of their contents are more likely to share similar topics,
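One common way to turn this intuition into a number is a graph-Laplacian smoothness term: build a similarity graph over instance contents and penalize label matrices that differ across strongly similar pairs. The sketch below is an assumption, not a transcription of the slide: the RBF similarity, the `sigma` parameter, the unnormalized Laplacian and both function names are illustrative choices. Y is taken to be c × n, matching the Y(:,i) indexing used earlier in the deck.

```python
import numpy as np

def rbf_laplacian(X, sigma=1.0):
    """X: m x n (features as rows). Returns the n x n unnormalized Laplacian L = D - S."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    dist2 = np.maximum(dist2, 0.0)  # guard against tiny negative round-off
    S = np.exp(-dist2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)        # no self-similarity
    return np.diag(S.sum(axis=1)) - S

def smoothness(Y, L):
    """Tr(Y L Y^T) for Y of shape c x n; small when similar instances get similar labels."""
    return float(np.trace(Y @ L @ Y.T))

# Toy data (hypothetical): one feature, two tight clusters of two instances each.
X = np.array([[0.0, 0.1, 5.0, 5.1]])
L = rbf_laplacian(X)

Y_good = np.array([[1.0, 1.0, 0.0, 0.0],   # labels follow the content clusters
                   [0.0, 0.0, 1.0, 1.0]])
Y_bad = np.array([[1.0, 0.0, 1.0, 0.0],    # labels split each cluster
                  [0.0, 1.0, 0.0, 1.0]])
```

Here `smoothness(Y_good, L)` is much smaller than `smoothness(Y_bad, L)`, which is the behavior the constraint asks for.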

Page 17

An Optimization Problem for LUFS

Page 18

The Optimization Problem for LUFS

Page 19

The Optimization Problem for LUFS

Page 20

The Optimization Problem for LUFS

Page 21

LUFS after Two Relaxations

• Spectral relaxation on Y

- Social dimension regularization: $\min_Y \; \mathrm{Tr}(YY^T) - \mathrm{Tr}(YFF^TY^T)$

• W = diag(s)W, and adding an $\ell_{2,1}$-norm on W
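To make the 2,1-norm concrete: it sums the Euclidean norms of the rows of W, so penalizing it pushes entire rows toward zero. Since diag(s)W scales row j of W by s(j), each row corresponds to one feature. The sketch below computes the norm and ranks features by row norm; ranking by row norm is the usual convention in 2,1-norm based selection and is an assumption here, as are the function names and the toy W.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return float(np.linalg.norm(W, axis=1).sum())

def rank_features(W):
    """Feature indices sorted by decreasing row norm of W."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(-row_norms, kind="stable")

# Toy W (hypothetical): 3 features as rows, 2 pseudo-classes as columns.
W = np.array([[3.0, 4.0],   # row norm 5
              [0.0, 0.0],   # row norm 0: this feature is effectively dropped
              [1.0, 0.0]])  # row norm 1

total = l21_norm(W)
order = rank_features(W)
```

For this toy W the norm is 5 + 0 + 1 = 6, and the feature whose row is entirely zero ranks last.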

Page 22

Evaluating LUFS

• Datasets and experiment setting

• How does LUFS perform compared to state-of-the-art baseline methods?

• Why does LUFS work?

Page 23

Evaluating LUFS

• Datasets and experiment setting

• How does LUFS perform compared to state-of-the-art baseline methods?

• Why does LUFS work?

Page 24

Data and Characteristics

• BlogCatalog

• Flickr

Page 26

Experiment Settings

• Metrics

- Clustering: Accuracy and NMI

- K-Means

• Baseline methods

- UDFS

- SPEC

- Laplacian Score
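The two clustering metrics can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code behind the slides: accuracy is taken as the best agreement over one-to-one mappings from clusters to classes (found by brute force, so only suitable for a handful of clusters), and NMI is normalized by the geometric mean of the two entropies, which is one of several variants in use. Function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

def clustering_accuracy(true_labels, cluster_ids):
    """Best fraction of matches over one-to-one maps from clusters to classes.

    Brute force; assumes few clusters and no more clusters than classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(true_labels, cluster_ids):
    """Mutual information divided by sqrt(H(true) * H(cluster))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    n_t = Counter(true_labels)
    n_c = Counter(cluster_ids)
    mi = sum((nij / n) * log(nij * n / (n_t[t] * n_c[c]))
             for (t, c), nij in joint.items())
    denom = sqrt(entropy(true_labels) * entropy(cluster_ids))
    return mi / denom if denom > 0 else 0.0

# Toy check (hypothetical labels): same partition, cluster ids permuted.
truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]
acc = clustering_accuracy(truth, pred)
score = nmi(truth, pred)
```

In an experiment of this kind, `pred` would be the K-Means assignment obtained on the selected features. For larger numbers of clusters, the brute-force search would normally be replaced by an assignment solver such as the Hungarian algorithm.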

Page 27

Evaluating LUFS

• Datasets and experiment setting

• How does LUFS perform compared to state-of-the-art baseline methods?

• Why does LUFS work?

Page 28

Results on Flickr

Page 29

Results on Flickr

Page 30

Results on BlogCatalog

Page 31

Evaluating LUFS

• Datasets and experiment setting

• How does LUFS perform compared to state-of-the-art baseline methods?

• Why does LUFS work?

Page 32

Probing Further: Why Social Dimensions Work

[Figure: link information feeds two procedures: social dimension extraction, which produces social dimensions, and random assignment, which produces random groups.]

Page 33

Results on Flickr

Page 34

Future Work

• Further exploration of link information

• Noisy and incomplete social media data

• Other sources: multi-view sources

• The strength of social ties (strong and weak ties mixed)

Page 35

More Information?

http://www.public.asu.edu/~huanliu/projects/NSF12/

Page 36

Questions

Acknowledgments: This work is sponsored, in part, by the National Science Foundation via a grant (#0812551). Comments and suggestions from DMML members and reviewers are greatly appreciated.