Unsupervised Streaming Feature Selection in Social Media
Jundong Li (1), Xia Hu (2), Jiliang Tang (3) and Huan Liu (1)
(1) Arizona State University, (2) Texas A&M University, (3) Yahoo! Labs
Arizona State University Data Mining and Machine Learning Lab, CIKM 2015


Outline

• Background and Motivation

• Problem Statement

• Proposed USFS Framework

• Experimental Results

• Conclusions and Future Work


Social Media

• Rapid growth of social media provides a platform for people to perform online social activities

• Massive amounts of high-dimensional data are user generated and quickly disseminated

• It is desirable to reduce the dimensionality of social media data because of the curse of dimensionality


Feature Selection

• Feature selection is an effective way to prepare high-dimensional data: it selects a subset of relevant features for a compact and accurate representation


Feature Selection in Social Media

• Traditional feature selection assumes that all features are static and known in advance

• Features in social media are usually generated dynamically in a streaming fashion
– Twitter produces more than 500 million tweets every day, and a large number of slang words (features) are continuously generated by users
– In disaster relief, topics (features) like "Chile Earthquake" emerge and quickly become hot


Streaming Feature Selection

• It is more appealing to perform streaming feature selection, which captures relevant features in a timely manner


Challenges, Opportunities and Target

• Challenges
– Label information is costly
– Data are not i.i.d.

• Opportunities
– Link information is abundant and may be helpful

• Target
– Propose an unsupervised streaming feature selection algorithm for social media data

No existing unsupervised streaming feature selection algorithms!


Problem Statement

• Given n linked instances, let the adjacency matrix M denote their link information. Assume that features arrive dynamically, one at a time; at time step t, each instance is associated with a set of streaming features X(t) = {f1, f2, …, ft}

• We want to select a subset of relevant features at each time step, effectively and efficiently, by using the link information M and the content information X(t)


Illustration

[Figure: streaming features arriving at time steps t, t+1, …, t+i. At each step the Selected Feature Set is updated by two tests: "Accept the new feature?" and "Reject existing feature?"]
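The two questions in the illustration can be read as a loop over arriving features. The sketch below is only a skeleton under stated assumptions: `stream_select`, `accept_new` and `keep_existing` are hypothetical names, and the toy criteria passed in at the bottom are placeholders, not the USFS tests.

```python
from typing import Callable, Iterable, List


def stream_select(
    feature_ids: Iterable[int],
    accept_new: Callable[[int, List[int]], bool],
    keep_existing: Callable[[int, List[int]], bool],
) -> List[int]:
    """Maintain a selected feature set while features arrive one at a time."""
    selected: List[int] = []
    for f in feature_ids:  # feature f arrives at this time step
        # "Accept the new feature?"
        if not accept_new(f, selected):
            continue
        selected.append(f)
        # "Reject existing feature?": re-test every currently selected feature
        selected = [g for g in selected if keep_existing(g, selected)]
    return selected


# Placeholder criteria (assumptions for illustration only):
# accept even feature ids, and keep only the three most recent features.
result = stream_select(
    range(10),
    accept_new=lambda f, sel: f % 2 == 0,
    keep_existing=lambda g, sel: g in sel[-3:],
)
print(result)
```

Passing the two tests as callables keeps the loop independent of how relevance is actually measured.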


Modeling Link Information

• Social media users connect for a variety of reasons: they may be movie fans, sports enthusiasts, colleagues, etc.

• Users with similar hidden factors are similar

• Hidden factors are helpful to steer unsupervised streaming feature selection

• We use the mixed membership stochastic blockmodel (MMSB) [Blei+NIPS2009] to extract hidden social factors from link information


Modeling Link Information (cont'd)

• At time step t:

• Hidden social factors serve as regression targets

• The L1-norm can be used for feature selection
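To see why an L1 penalty selects features when a hidden factor is the regression target, here is a minimal sketch. It assumes a plain lasso objective, 0.5 * ||X w - pi||^2 + alpha * ||w||_1, for a single hidden-factor column pi, solved by coordinate descent. The slide's own formula is not reproduced here; the function names and toy data are hypothetical.

```python
from typing import List


def soft_threshold(z: float, alpha: float) -> float:
    """Shrink z toward zero by alpha; values within [-alpha, alpha] become 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0


def lasso_cd(X: List[List[float]], y: List[float], alpha: float,
             n_iter: int = 50) -> List[float]:
    """Coordinate descent for 0.5 * ||X w - y||^2 + alpha * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            sq = 0.0   # squared norm of feature j
            for i in range(n):
                others = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - others)
                sq += X[i][j] ** 2
            w[j] = soft_threshold(rho, alpha) / sq if sq > 0 else 0.0
    return w


# Toy data (assumption): 4 users, 2 features, one hidden-factor column as target.
# Feature 0 tracks the hidden factor; feature 1 is uncorrelated with it.
X = [[1.0, 0.1], [1.0, -0.1], [0.0, 0.1], [0.0, -0.1]]
pi = [1.0, 1.0, 0.0, 0.0]
w = lasso_cd(X, pi, alpha=0.1)
print(w)
```

The uninformative feature ends up with a weight of exactly zero, which is the sense in which the L1-norm performs selection.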


Modeling Content Information

• If two users are similar in the original feature space, they should also be similar in the selected feature space
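One common way to encode this kind of similarity preservation is a graph Laplacian penalty. The slide does not show its formula, so treat this as an assumed formulation rather than the authors' exact term. With a symmetric user-similarity matrix S and a one-dimensional projection z of the users, the penalty 0.5 * sum_ij S_ij (z_i - z_j)^2 equals z^T (D - S) z, where D is the diagonal degree matrix. All names and values below are hypothetical.

```python
from typing import List


def pairwise_penalty(S: List[List[float]], z: List[float]) -> float:
    """0.5 * sum_ij S[i][j] * (z[i] - z[j])^2: large when similar users are far apart."""
    n = len(z)
    return 0.5 * sum(S[i][j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))


def laplacian_quadratic(S: List[List[float]], z: List[float]) -> float:
    """z^T (D - S) z with D = diag(row sums of S); equals the pairwise form for symmetric S."""
    n = len(z)
    degree = [sum(row) for row in S]
    diag_part = sum(degree[i] * z[i] ** 2 for i in range(n))
    cross_part = sum(S[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return diag_part - cross_part


# Toy similarity (assumption): users 0 and 1 are similar, user 2 is unrelated.
S = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
z_close = [1.0, 1.1, 5.0]  # similar users stay close after projection
z_far = [1.0, 4.0, 5.0]    # similar users are pulled apart

print(pairwise_penalty(S, z_close), pairwise_penalty(S, z_far))
```

A projection that keeps users 0 and 1 close gets a much smaller penalty than one that separates them.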


Optimization Formulation at Time t

• By combining network information and content information

• Decompose into a set of sub-problems


Testing New Feature

• At time step t+1 when the new feature arrives:

• The objective function is reduced if the reduction in the first, third, and fourth terms outweighs the increase in the second term

• Therefore, the condition to accept the new feature is
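The trade-off above can be illustrated with a gradient test in the style of stagewise (grafting-like) methods. The sketch is simplified under loud assumptions: it keeps only a squared-loss term and an L1 penalty with weight alpha, and omits the other smooth terms the slide refers to. A small weight w on the new feature changes the loss by roughly grad * w and the penalty by alpha * |w|, so a net decrease is possible only if |grad| > alpha. The function name and data are hypothetical.

```python
from typing import List


def accept_new_feature(f_new: List[float], residual: List[float],
                       alpha: float) -> bool:
    """Accept when the loss gradient at weight 0 can outweigh the L1 penalty."""
    # Gradient of 0.5 * ||residual - w * f_new||^2 with respect to w, at w = 0.
    grad = -sum(x * r for x, r in zip(f_new, residual))
    return abs(grad) > alpha


# Toy residual of the current model on 4 users (assumption).
residual = [0.5, 0.5, -0.5, -0.5]
f_informative = [1.0, 1.0, 0.0, 0.0]    # correlated with the residual
f_uninformative = [1.0, -1.0, 1.0, -1.0]  # orthogonal to the residual

print(accept_new_feature(f_informative, residual, alpha=0.3))
print(accept_new_feature(f_uninformative, residual, alpha=0.3))
```

The test needs only one pass over the new feature, which is why this kind of check suits a streaming setting.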


Testing Existing Features

• Test existing features when a new feature is added

• When a new feature is accepted, we optimize the following w.r.t. the current variables, which forces some feature coefficients to be zero

• This is a convex optimization problem; we solve it with Broyden-Fletcher-Goldfarb-Shanno (BFGS)


Feature Selection by USFS

• If the new feature is accepted, we obtain a sparse coefficient matrix by solving all sub-problems

• For each feature j, if any of its k corresponding feature weights is nonzero, the feature is included in the final model. The feature score is defined as

• Features are ranked in a descending order by their feature scores
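The ranking step can be sketched as follows. The score formula itself is missing from the extracted slide, so the code assumes a score equal to the largest absolute weight among a feature's k coefficients. That choice, the function name and the toy matrix are all assumptions for illustration.

```python
from typing import List


def rank_features(W: List[List[float]]) -> List[int]:
    """W[j] holds the k weights of feature j; return kept features, best first."""
    # Assumed score: largest absolute weight across the k sub-problems.
    scores = {j: max(abs(w) for w in row) for j, row in enumerate(W)}
    # A feature is kept if any of its k weights is nonzero.
    kept = [j for j, s in scores.items() if s > 0.0]
    return sorted(kept, key=lambda j: scores[j], reverse=True)


# Toy sparse coefficient matrix (assumption): 4 features, k = 2 hidden factors.
W = [[0.0, 0.2],
     [0.0, 0.0],
     [-0.7, 0.1],
     [0.3, 0.0]]
ranking = rank_features(W)
print(ranking)
```

Feature 1 has all-zero weights and is dropped; the others are ordered by their scores.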


Questions to Investigate

• Q1: What is the quality of the features selected by the USFS framework?

• Q2: How efficient is the proposed USFS framework?


Datasets

• BlogCatalog (social blog directory)

• Flickr (image sharing website)

• Assume features arrive in a random order; take {20%, 30%, …, 90%, 100%} of all features
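The streaming protocol can be simulated from a static dataset roughly as follows. This is a sketch only: the seed, the feature count and the helper name `arrival_checkpoints` are assumptions, and the slides do not say how the random order was generated.

```python
import random
from typing import List, Tuple


def arrival_checkpoints(n_features: int, percents: List[int],
                        seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return a random feature arrival order and the cut-off counts per percentage."""
    order = list(range(n_features))
    random.Random(seed).shuffle(order)  # features arrive in a random order
    cuts = [n_features * p // 100 for p in percents]
    return order, cuts


percents = list(range(20, 101, 10))  # 20%, 30%, ..., 100%
order, cuts = arrival_checkpoints(50, percents)

# Features that have arrived by each checkpoint.
arrived = [order[:c] for c in cuts]
print(cuts)
```

Each prefix of `order` plays the role of the features seen so far at that checkpoint.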


Experimental Settings

• Evaluation
– Clustering: K-means
– Metrics: Accuracy and NMI

• Baseline batch-mode methods
– Laplacian Score [He et al. NIPS 2005]
– SPEC [Zhao and Liu. ICML 2007]
– NDFS [Li et al. AAAI 2012]
– LUFS [Tang and Liu, KDD 2012]
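For reference, NMI between a clustering and the ground-truth classes can be computed as below. The slides do not state which normalization is used; this sketch assumes the geometric-mean variant, I(A;B) / sqrt(H(A) * H(B)). The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Normalized mutual information with sqrt(H(a) * H(b)) normalization."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    denom = sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0


truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]  # same partition, cluster ids swapped
unrelated = [0, 1, 0, 1]            # independent of the truth

print(nmi(truth, same_up_to_renaming), nmi(truth, unrelated))
```

NMI is invariant to renaming cluster ids, which is why it suits K-means output where the ids are arbitrary.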


Performance on Flickr


Performance on BlogCatalog


Cumulative Running Time

• In BlogCatalog, USFS is 7x, 20x, 29x, 76x faster

• In Flickr, USFS is 5x, 11x, 20x, 75x faster


Conclusion

• Goals:
– Perform unsupervised streaming feature selection for social media data

• Solutions:
– Leverage link information as constraints
– Stagewise algorithm for streaming features

• Results:
– Achieve better feature selection performance in terms of clustering
– Reduce running time compared with batch-mode methods


Future Work

• In this work, we assume that the link information is relatively stable compared with the dynamic content information; we will investigate streaming feature selection in dynamic networks

• Streaming features come from different sources; we will investigate how to fuse heterogeneous feature sources for streaming feature selection


Acknowledgement: This material is, in part, supported by National Science Foundation (NSF) under grant number IIS-1217466. Comments and suggestions from DMML members and reviewers are greatly appreciated.

Questions