Data Mining and Machine Learning Lab
Feature Selection with Linked Data
in Social Media
Jiliang Tang and Huan Liu
Computer Science and Engineering
Arizona State University
April 26-28, 2012 SDM2012
Social Media
• The explosion of social media generates massive
data at an unprecedented rate
- 200 million Tweets per day
- 3,000 photos in Flickr per minute
- 153 million blogs posted per year
Social Media Data
• Massive and high-dimensional social media data
poses challenges to data mining tasks
- Scalability
- Curse of dimensionality
• Feature selection is an effective way to prepare
large-scale, high-dimensional data for effective
data mining
Feature Selection
• Traditional feature selection algorithms
work with "flat" data (attribute-value data)
- Independent and Identically Distributed (i.i.d.)
• Social media data differs from attribute-
value data
- Inherently linked
An Example of Social Media Data
[Figure: a network of four users (𝑢1 to 𝑢4) and eight posts (𝑝1 to 𝑝8); successive builds highlight the users, the posts, the user-post relations, and the user-user following relations]
Representation for Attribute Value Data
[Figure: data matrix with posts 𝑝1, …, 𝑝8 as rows and features 𝑓1, …, 𝑓𝑚 and labels 𝑐1, …, 𝑐𝑘 as columns; successive builds highlight the posts, the features, and the labels]
Representation for Social Media Data
[Figure: the post matrix (posts 𝑝1, …, 𝑝8; features 𝑓1, …, 𝑓𝑚; labels 𝑐1, …, 𝑐𝑘) extended with its social context: the user-post relations and a binary user-user relation matrix over 𝑢1, …, 𝑢4; successive builds highlight the user-post relations, the user-user relations, and the social context as a whole]
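The representation above can be sketched with small NumPy arrays. This is a minimal illustration, not the authors' code: the user-post assignment follows the example figure as far as it can be read (𝑢1: 𝑝1, 𝑝2; 𝑢2: 𝑝3, 𝑝4, 𝑝5; 𝑢3: 𝑝6, 𝑝7; 𝑢4: 𝑝8), while the following edges, the feature values, the labels, and the sizes m and k are made-up placeholders.

```python
import numpy as np

n_users, n_posts, m, k = 4, 8, 5, 2  # m features and k classes are illustrative sizes

# User-post relations P (n_users x n_posts): P[i, j] = 1 if user i wrote post j.
authors = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}  # read off the example figure
P = np.zeros((n_users, n_posts), dtype=int)
for user, posts in authors.items():
    P[user, posts] = 1

# User-user relations S (n_users x n_users): S[i, j] = 1 if user i follows user j.
# These edges are hypothetical; the exact edges on the slide are not recoverable.
follows = [(0, 3), (2, 3), (1, 0), (3, 1)]
S = np.zeros((n_users, n_users), dtype=int)
for i, j in follows:
    S[i, j] = 1

# Attribute-value part: features F (m x n_posts) and one-hot labels Y (n_posts x k).
rng = np.random.default_rng(0)
F = rng.random((m, n_posts))          # placeholder feature values
labels = rng.integers(0, k, n_posts)  # placeholder class labels
Y = np.eye(k, dtype=int)[labels]

print(P.sum(axis=0))  # number of authors per post
```

Keeping P and S as separate matrices next to F and Y mirrors the slide: the attribute-value part is unchanged, and the social context is additional side information.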
Problem Statement
• Given labeled data X with its label indicator matrix Y, the
whole dataset F, and its social context (user-user
following relationships S and user-post relationships P), we
aim to select the K most relevant features from the m features
of F by exploiting the social context S and P.
Two Fundamental Problems
• Relation extraction
- What distinctive relations can be
extracted from linked data?
• Mathematical representation
- How can these relations be used in the feature
selection formulation?
[Figure: the example network of users 𝑢1 to 𝑢4 and posts 𝑝1 to 𝑝8]
Relation Extraction
• coPost
- A user can have multiple posts
• coFollowing
- Two users follow a third user
• coFollowed
- Two users are followed by a third user
• Following
- A user follows another user
[Figure: one panel per relation, each showing the users and posts of the example network that take part in it]
Post-Post relations
• What do these relations suggest for posts?
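One way to read the four relations off the relation matrices is sketched below. It assumes a hypothetical convention (S[i, j] = 1 when user i follows user j, P[i, j] = 1 when user i wrote post j) and made-up following edges. Matrix products are a standard way to count shared neighbors; this is not necessarily the construction used in LinkedFS.

```python
import numpy as np

# Hypothetical toy network: 4 users, 8 posts.
P = np.array([[1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0, 0, 0, 1]])
S = np.zeros((4, 4), dtype=int)
S[0, 3] = S[2, 3] = 1   # u1 and u3 both follow u4 (assumed edges)
S[1, 0] = S[1, 3] = 1   # u2 follows u1 and u4 (assumed edges)

def off_diagonal(M):
    """Binarize a count matrix and drop self-pairs."""
    M = (M > 0).astype(int)
    np.fill_diagonal(M, 0)
    return M

co_following = off_diagonal(S @ S.T)  # users i, j follow at least one common user
co_followed = off_diagonal(S.T @ S)   # users i, j share at least one common follower
following = S                         # direct follow links

def lift_to_posts(user_rel):
    """Post-post relation: pairs of posts whose authors are related by user_rel."""
    return off_diagonal(P.T @ user_rel @ P)

co_post = off_diagonal(P.T @ P)       # pairs of posts written by the same user
```

For example, `lift_to_posts(co_following)` links every post of 𝑢1 to every post of 𝑢3 in this toy network, which is the kind of post-post relation the hypotheses on the following slides reason about.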
Social Correlation Theories
• Homophily
- People with similar interests are more likely to be
linked
• Social influence
- People that are linked are more likely to have
similar interests
CoPost Hypothesis
• CoPost Hypothesis
- Posts by the same user are more likely to be of
similar topics
[Figure: user 𝑢2 with posts 𝑝3, 𝑝4, 𝑝5]
CoFollowing Hypothesis
• CoFollowing Hypothesis
- If two users follow the same user, their
posts are likely to be of similar topics
[Figure: users 𝑢1, 𝑢3, 𝑢4 with posts 𝑝1, 𝑝2, 𝑝6, 𝑝7, 𝑝8]
CoFollowed Hypothesis
• CoFollowed Hypothesis
- If two users are followed by the same user, their
posts are likely to be of similar topics
[Figure: users 𝑢1, 𝑢2, 𝑢4 with posts 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝8]
Following Hypothesis
• Following Hypothesis
- If one user follows another, their posts are more
likely to be similar in terms of topics
[Figure: users 𝑢1, 𝑢2 with posts 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5]
Modeling CoFollowing Relation
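• Users' topic interests
$$\hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^T f_i}{|F_k|}$$
• Two co-following users have similar topics of interest
$$\min_W \; \|X^T W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2$$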
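To make the objective concrete, the sketch below evaluates it on random data. Several things are assumptions for illustration: F_k is read as the set of posts written by user u_k, N_u as the set of users who follow u, and the trade-off weights `alpha` and `beta`, all sizes, and the following edges are made up. Data matrices are stored with features as rows, matching the product X^T W in the formula.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, k, n_posts = 6, 3, 8            # features, classes, posts (illustrative sizes)
F = rng.random((m, n_posts))       # all posts, one column per post
X, Y = F[:, :5], np.eye(k)[rng.integers(0, k, 5)]  # labeled subset and its labels
W = rng.standard_normal((m, k))    # feature weights, one row per feature

posts_of = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}   # F_k for each user
followers_of = {3: [0, 1, 2], 0: [1]}                      # N_u (assumed edges)

def topic_interest(user):
    """T_hat(u_k): average of W^T f_i over the posts of user u_k."""
    return (W.T @ F[:, posts_of[user]]).mean(axis=1)

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def co_following_penalty():
    """Squared distance between topic interests of every co-following pair."""
    total = 0.0
    for followers in followers_of.values():
        for ui, uj in combinations(followers, 2):
            total += np.sum((topic_interest(ui) - topic_interest(uj)) ** 2)
    return total

alpha, beta = 0.1, 0.1             # hypothetical trade-off weights
objective = (np.linalg.norm(X.T @ W - Y, "fro") ** 2
             + alpha * l21_norm(W)
             + beta * co_following_penalty())
print(objective)
```

The row-wise norm in `l21_norm` is what drives whole rows of W toward zero, so a feature is either kept for all classes or dropped for all of them.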
A Reformulation of CoFollowing Relation
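• It is equivalent to
$$\min_W \; \mathrm{Tr}(W^T B W - 2 E W) + \alpha \|W\|_{2,1}$$
where
$$B = X X^T + \beta F H L_{FI} H^T F^T, \qquad E = Y^T X^T$$
$$H(i,j) = \begin{cases} \frac{1}{|F_j|} & \text{if } u_j \text{ is the author of } p_i \\ 0 & \text{otherwise} \end{cases}$$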
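The equivalence can be checked numerically. The sketch below assumes an unnormalized Laplacian L = D - A built from a co-following count matrix A, a made-up network, and random data. With that choice the pairwise penalty summed over unordered co-following pairs equals Tr(W^T F H L H^T F^T W), and the fitting term equals Tr(W^T X X^T W - 2 Y^T X^T W) plus the constant Tr(Y^T Y), which does not depend on W and can be dropped from the minimization.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n_posts, n_users, n_labeled = 6, 3, 8, 4, 5   # illustrative sizes
F = rng.random((m, n_posts))
X, Y = F[:, :n_labeled], np.eye(k)[rng.integers(0, k, n_labeled)]
W = rng.standard_normal((m, k))

author = np.array([0, 0, 1, 1, 1, 2, 2, 3])         # author of each post
H = np.zeros((n_posts, n_users))
H[np.arange(n_posts), author] = 1.0
H /= H.sum(axis=0)                                   # H(i, j) = 1 / |F_j|

S = np.zeros((n_users, n_users))                     # S[i, j] = 1: i follows j
S[0, 3] = S[1, 3] = S[2, 3] = S[1, 0] = 1            # assumed edges
A = S @ S.T                                          # common-followee counts
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A                       # unnormalized Laplacian

# Direct form: sum over unordered co-following pairs.
T = W.T @ F @ H                                      # column j is T_hat(u_j)
direct = sum(A[i, j] * np.sum((T[:, i] - T[:, j]) ** 2)
             for i in range(n_users) for j in range(i + 1, n_users))
trace_form = np.trace(W.T @ F @ H @ L @ H.T @ F.T @ W)

beta = 0.1                                           # hypothetical weight
B = X @ X.T + beta * F @ H @ L @ H.T @ F.T
E = Y.T @ X.T
lhs = np.linalg.norm(X.T @ W - Y, "fro") ** 2 + beta * direct
rhs = np.trace(W.T @ B @ W - 2 * E @ W) + np.trace(Y.T @ Y)
print(np.isclose(direct, trace_form), np.isclose(lhs, rhs))
```

Summing over ordered pairs instead would double the penalty; such constant factors can be absorbed into the weight on the link term.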
A Unique Problem for LinkedFS
• The LinkedFS framework is designed to solve
the following optimization problem
$$\min_W \; \mathrm{Tr}(W^T B W - 2 E W) + \alpha \|W\|_{2,1}$$
LinkedFS
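A common way to minimize an objective of this form is iterative reweighting. The gradient of ||W||_{2,1} can be written as 2DW, where D is diagonal with D_ii = 1/(2||w_i||_2) and w_i is the i-th row of W, so setting the gradient of the whole objective to zero gives W = (B + αD)^{-1} E^T for symmetric B; D and W are then updated in turn. The sketch below follows that generic scheme and is not taken from the slides: the names `solve_l21` and `select_features`, the smoothing constant `eps`, the iteration count, the ranking of features by row norm, and the random test problem are all assumptions.

```python
import numpy as np

def objective(W, B, E, alpha):
    """Tr(W^T B W - 2 E W) + alpha * ||W||_{2,1}."""
    return np.trace(W.T @ B @ W - 2 * E @ W) + alpha * np.linalg.norm(W, axis=1).sum()

def solve_l21(B, E, alpha, n_iter=50, eps=1e-12):
    """Iteratively reweighted solver (hypothetical helper; B symmetric PSD)."""
    m = B.shape[0]
    d = np.full(m, 0.5)                       # diagonal of D, start from D = I / 2
    history = []
    for _ in range(n_iter):
        W = np.linalg.solve(B + alpha * np.diag(d), E.T)
        d = 0.5 / np.sqrt(np.sum(W ** 2, axis=1) + eps)
        history.append(objective(W, B, E, alpha))
    return W, history

def select_features(W, K):
    """Indices of the K features whose rows of W have the largest norm."""
    return np.argsort(-np.linalg.norm(W, axis=1))[:K]

rng = np.random.default_rng(0)
m, k, n = 20, 3, 40
X = rng.standard_normal((m, n))
Y = np.eye(k)[rng.integers(0, k, n)]
B, E = X @ X.T, Y.T @ X.T                     # link term omitted in this toy problem
W, history = solve_l21(B, E, alpha=1.0)
top = select_features(W, K=5)
```

Adding the link term only changes how B is built; the solver itself sees the same (B, E) interface either way.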
Datasets
• BlogCatalog
- Undirected following
http://dmml.asu.edu/users/xufei/datasets.html
• Digg
- Directed Following
http://www.public.asu.edu/~ylin56/kdd09sup.html
Data Characteristics
Experiment Setting
• Metric
- Classification accuracy
- Classifier: LibSVM
• Baseline methods
- t-test (TT)
- Information Gain (IG)
- Fisher Score (FS)
- Joint ℓ2,1-norms (RFS)
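For reference, a filter baseline such as the Fisher score rates each feature independently from labeled data and ignores links. The sketch below uses the common textbook definition (between-class scatter divided by within-class scatter, per feature); the exact variant used in the experiments is not stated on the slide, and the toy data are made up. Note that here X is stored as samples by features.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score of each feature; X is (n_samples, n_features), y holds class ids."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)   # small constant avoids division by zero

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 4))
X[:, 2] += 3.0 * y                      # only feature 2 depends on the class
scores = fisher_score(X, y)
ranking = np.argsort(-scores)           # best feature first
```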
Training and Testing
• Testing (50%) and training (50%)
• Subsample 5%, 25%, and 50% of the training
data to construct another three training sets
• Numbers of selected features
- (50, 100, 200, 300)
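The protocol above could be set up as follows. The sketch assumes uniform random sampling without replacement and a single random split; neither detail is stated on the slide, and `make_splits` is a hypothetical helper.

```python
import numpy as np

def make_splits(n_samples, fractions=(0.05, 0.25, 0.50), seed=0):
    """Return test indices and a dict of training index sets keyed by fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    half = n_samples // 2
    test_idx, train_idx = order[:half], order[half:]
    train_sets = {1.0: train_idx}                      # the full training half
    for frac in fractions:
        size = max(1, int(round(frac * len(train_idx))))
        train_sets[frac] = rng.choice(train_idx, size=size, replace=False)
    return test_idx, train_sets

test_idx, train_sets = make_splits(1000)
feature_counts = (50, 100, 200, 300)
# One experiment per (training set, number of selected features) pair.
runs = [(frac, K) for frac in train_sets for K in feature_counts]
```

Fixing the test half once and varying only the training subset keeps accuracies comparable across training-set sizes.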
Results on Digg
Performance Improvement
Conclusions
• Investigate a new problem of feature selection for
social media data
• Provide a way to capture link information guided
by social correlation theories
• Propose an effective framework, LinkedFS, for
social media feature selection
Future Work
• Sophisticated ways to exploit social context
• Lack of label information (unsupervised)
• Noisy and incomplete social media data
• The strength of social ties (strong and weak ties
mixed)
Acknowledgments
This work is, in part, sponsored by the National Science
Foundation via a grant (#0812551). Comments and
suggestions from DMML members and reviewers are
greatly appreciated.
Questions