EVENT IDENTIFICATION IN SOCIAL MEDIA
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
Social Media Sites Host Many “Event” Documents
Photo-sharing: Flickr
Video-sharing: YouTube
Social networking: Facebook
“Event” = something that occurs at a certain time in a certain place [Yang et al. ’99]
  Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade
  Smaller events, without traditional news coverage: local food drive, street fair
  …
Example: social media documents for the “All Points West” festival, Liberty State Park, New Jersey, 8/8/08
Identifying Events and Associated Social Media Documents
Applications
  Event search and browsing
  Local search
  …
General approach: group similar documents via clustering
  Each cluster corresponds to one event and its associated social media documents
Event Identification: Challenges
Uneven data quality
  Missing, short, uninformative text
  … but revealing structured context available: tags, date/time, geo-coordinates
Scalability
Dynamic data stream of event information
Unknown number of events
  The number of clusters is a required input for many clustering algorithms
  Difficult to estimate
Clustering Social Media Documents
Social media document representation
Social media document similarity
Social media document clustering
  Clustering task: definition
  Ensemble algorithm: combining multiple clustering results
Preliminary evaluation
Social Media Document Representation
Document features: Title, Description, Tags, Date/Time, Location, All-Text
Social Media Document Similarity
Text: tf-idf weights, cosine similarity
Time: proximity in minutes
Location: geo-coordinate proximity
[Figure: document representations Title, Description, Tags, All-Text, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity]
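The three similarity types named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the idf smoothing, the one-year time window (`MINUTES_PER_YEAR`), the `max_km` cutoff, and all function names are assumptions chosen so that every score falls in [0, 1].

```python
import math
from collections import Counter

MINUTES_PER_YEAR = 365 * 24 * 60  # assumed normalization window for time


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse tf-idf dictionaries."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # "+ 1.0" keeps terms that occur in every document from vanishing.
        vectors.append({t: tf[t] * (math.log(n / df[t]) + 1.0) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def time_similarity(minutes_a, minutes_b, window=MINUTES_PER_YEAR):
    """Proximity in minutes, scaled to [0, 1]; 0 beyond the window."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def location_similarity(a, b, max_km=100.0):
    """Geo-coordinate proximity, scaled to [0, 1]; max_km is hypothetical."""
    return max(0.0, 1.0 - haversine_km(a, b) / max_km)
```

Timestamps are assumed to be expressed as minutes since a common epoch, and locations as (latitude, longitude) pairs in degrees.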
Social Media Document Clustering Framework
Social media documents → Document feature representation → Event clusters
Clustering: Ensemble Algorithm
Clusterers: Ctitle, Ctags, Ctime
Weights: Wtitle, Wtags, Wtime
Consensus function f(C,W): combine ensemble similarities
Ensemble clustering solution
Learned in a training step
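The Experimental Setup slide describes the consensus function as a weighted average of the clusterers' binary predictions. A minimal sketch of f(C,W) under that description follows; the function name, the dictionary layout, and the weight values are hypothetical and only illustrate the arithmetic.

```python
def consensus(predictions, weights):
    """Weighted average of binary same-event votes.

    predictions: clusterer name -> 1 if that clusterer puts the two
                 documents in the same cluster, else 0.
    weights:     clusterer name -> non-negative weight.
    """
    total = sum(weights[name] for name in predictions)
    if total == 0:
        return 0.0
    return sum(weights[name] * vote for name, vote in predictions.items()) / total


# Illustrative weights only; the deck learns them in a training step.
example_weights = {"title": 0.6, "tags": 0.9, "time": 0.5}
example_votes = {"title": 0, "tags": 1, "time": 1}
score = consensus(example_votes, example_weights)
```

Here `score` is (0.9 + 0.5) / 2.0, so the two higher-weighted clusterers outvote the title clusterer.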
Clustering: Measuring Quality
Homogeneous clusters
Complete clusters
Metric: Normalized Mutual Information (NMI)
  Shared information between clustering solution and “ground truth”
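As a sketch of how NMI can be computed from two label assignments: several normalizations are in use, and this one divides the mutual information by the mean of the two entropies, which is an assumption here rather than something the slide specifies. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(predicted, truth):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    count_p, count_t = Counter(predicted), Counter(truth)
    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_p[p] * count_t[t]))
        for (p, t), c in joint.items()
    )
    denom = (entropy(predicted) + entropy(truth)) / 2
    # Both assignments are a single cluster: treat them as identical.
    return mutual_info / denom if denom > 0 else 1.0
```

The score does not depend on how clusters are named, so a solution that matches the ground truth up to relabeling still scores 1.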
Experimental Setup
Data: >270K Flickr photos
  Event labels from Yahoo!’s “Upcoming” event database
  Split into 3 parts for training/validation/testing
Clusterers: single pass algorithm with centroid similarity
Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
Consensus function: weighted average of clusterers’ binary predictions
Final prediction step: single pass clustering algorithm
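A single pass algorithm with centroid similarity can be sketched as below. This is a generic version under stated assumptions, not the authors' code: documents are dense numeric vectors, similarity is cosine, the centroid is a running mean, and `threshold` is a hypothetical parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass_cluster(vectors, threshold):
    """Assign each vector, in arrival order, to the most similar centroid.

    A new cluster is opened when no centroid reaches the threshold.
    Returns one cluster index per input vector.
    """
    centroids, sizes, assignments = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(vec))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
        else:
            n = sizes[best]
            # Update the running mean of the chosen cluster.
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            assignments.append(best)
    return assignments
```

Because each document is seen once and the number of clusters is never fixed in advance, this style of algorithm suits a growing stream with an unknown number of events. The result depends on arrival order and on the threshold.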
Preliminary Evaluation Results
Individual clusterer performance
  Highest NMI: Tags, All-Text
  Lowest NMI: Description, Title
Ensemble performance, compared against all individual clusterers
  Highest overall performance in terms of NMI
  More homogeneous clusters: each event is spread over fewer clusters
Details in paper
Future Work: Alternative Choices
Document similarity metric
  Ensemble approach
    Weight assignment
    Choice of clusterers
  Train a classifier to predict document similarity
    Features correspond to similarity scores
      All-text, title, tags, time, location, etc.
      Numeric values in [0,1]
    State-of-the-art classifiers: SVM, Logistic Regression, …
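The classifier-based alternative can be illustrated with a small logistic regression trained by batch gradient descent on pairs of documents. Everything concrete here is an assumption for illustration: the three feature columns (text, time, location similarity), the toy training pairs, the learning rate, and the function names. It is a sketch of the idea, not a proposed implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on log loss."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def same_event_probability(w, b, x):
    """Predicted probability that a document pair shares an event."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Toy pairs: (text, time, location) similarity scores in [0, 1].
# Label 1 means the two documents belong to the same event.
pairs = [[0.9, 0.95, 0.9], [0.7, 0.9, 0.8], [0.8, 0.99, 0.95],
         [0.1, 0.2, 0.1], [0.3, 0.1, 0.2], [0.2, 0.4, 0.05]]
same = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(pairs, same)
```

The learned probability can then play the role of the pairwise document similarity inside the clustering step, replacing the hand-weighted consensus score.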
Future Work: Alternative Choices
Final clustering step
  Apply graph partitioning algorithms
  Requires estimating the number of clusters
Evaluation metrics: beyond NMI
Datasets
  Flickr
  LastFM, YouTube
Exploit social network connections
Conclusions
Identified events and their corresponding social media documents
  Proposed a clustering solution
  Leveraged different representations of social media documents
  Employed various social media similarity metrics
  Developed a weighted ensemble clustering approach
Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs