Microblog (Twitter) mining
yutao
What is Twitter?
• A tweet is limited to 140 characters
• A hashtag (#) is placed before relevant keywords in a tweet
• RT means to “re-tweet”, or forward, a tweet
• An @ reference refers to a user’s screen name
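The conventions above can be pulled out of raw tweet text with a few regular expressions. This is a minimal sketch, not an official Twitter parser: the function name `parse_tweet` and the patterns are assumptions made for illustration.

```python
import re

# Illustrative patterns (assumptions, not Twitter's official grammar).
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT\s+@(\w+)", re.IGNORECASE)


def parse_tweet(text):
    """Extract hashtags, @ references and the RT marker from one tweet."""
    rt = RETWEET.match(text.strip())
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "retweet_of": rt.group(1) if rt else None,  # screen name being re-tweeted
        "within_limit": len(text) <= 140,           # 140-character limit
    }


info = parse_tweet("RT @kermit: just ate a cookie #yummy")
print(info)
```

Real tweets contain edge cases (URLs with "#" fragments, e-mail addresses) that these simple patterns do not handle.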
Why is it different?
• Very short in length
• Written in informal style
• Social
What is Twitter, a social network or a news media? (WWW 2010)
• Following is mostly not reciprocated (not so “social”)
• Users talk about timely topics
• A few users reach a large audience directly
• Most users can reach a large audience quickly by word-of-mouth
Early analysis
Analysis 1: Take the people out
• Krishnamurthy et al. (2008)
• Users were classified by follower/following counts (numbers and ratios)
• Means and mechanisms of their engagement: Web (61.7%), mobile/text (7.5%), software (22.4%)
Analysis 2: Content Category
Four meta-categories:
• Daily chatter
• Conversations
• Information / URL sharing
• News reporting
Analysis 3: measuring user influence
• Indegree, retweets and mentions
• Strong correlation between retweets and mentions
• Most connected != most influential
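One way to compare the three influence measures is to rank users by each and compute a rank correlation. The sketch below implements Spearman's rank correlation in plain Python. The user counts are made-up illustrative numbers, not data from the study, and the slides do not say which correlation statistic was used.

```python
def ranks(values):
    """Average ranks (1 = largest value); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical counts for five users (illustration only).
indegree = [1000, 500, 50, 20, 10]
retweets = [5, 40, 30, 2, 1]
mentions = [4, 50, 35, 3, 0]
print(spearman(retweets, mentions), spearman(indegree, retweets))
```

In this toy data the most-followed user is not the most retweeted, which is the "most connected != most influential" pattern in miniature.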
User influence
How to detect spam?
• Classification
• Content attributes: hashtags, trending topics, replies, mentions, http links
• User behavior attributes: age of user account
• Graph-based attributes
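The attribute groups above could be turned into a feature vector for a classifier along these lines. This is a sketch under assumptions: the function name, the exact features, and the use of a follower/followee ratio as the graph-based attribute are illustrative choices, not taken from the slides.

```python
from datetime import date


def spam_features(tweet_text, account_created, today, followers, followees):
    """Build a small feature dict for one tweet and its author (illustrative)."""
    words = tweet_text.split()
    n = max(len(words), 1)  # avoid division by zero on empty tweets
    return {
        # Content attributes
        "hashtags_per_word": sum(w.startswith("#") for w in words) / n,
        "mentions_per_word": sum(w.startswith("@") for w in words) / n,
        "has_link": int("http" in tweet_text),
        # User behavior attribute
        "account_age_days": (today - account_created).days,
        # Graph-based attribute (assumed: follower/followee ratio)
        "follower_followee_ratio": followers / max(followees, 1),
    }


features = spam_features(
    "#win free #ipad http://x.example @you",
    account_created=date(2010, 1, 1),
    today=date(2010, 1, 11),
    followers=5,
    followees=500,
)
print(features)
```

Feature dicts like this would then be fed, together with spam/non-spam labels, to any standard supervised classifier.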
Sentiment analysis
• Supervised classification
• Training data come from Twitter itself, instead of being human-labeled
• Happy emoticons: “:-)”, “:)”, “=)”, “:D” etc.
• Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc.
• Objective: newspapers and magazines such as “NY Times”
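The labeling trick above (using emoticons as noisy labels) can be sketched as follows. The function names are hypothetical, and dropping tweets with mixed emoticons is an assumption about how conflicts might be handled.

```python
HAPPY = (":-)", ":)", "=)", ":D")
SAD = (":-(", ":(", "=(", ";(")


def emoticon_label(text):
    """Return 'positive', 'negative', or None (no emoticon or mixed signals)."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy == has_sad:
        return None
    return "positive" if has_happy else "negative"


def build_training_set(tweets):
    """Label tweets by emoticon, then strip the emoticons from the text
    so a classifier cannot simply memorize them."""
    examples = []
    for text in tweets:
        label = emoticon_label(text)
        if label is None:
            continue
        cleaned = text
        for e in HAPPY + SAD:
            cleaned = cleaned.replace(e, "")
        examples.append((" ".join(cleaned.split()), label))
    return examples


print(build_training_set(["great day :)", "missed the bus :(", "plain tweet"]))
```

The resulting (text, label) pairs can train any supervised classifier; objective examples would come separately from news accounts.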
Trend detection
• Bursty keyword detection
• Bursty keyword grouping
• Context extraction (such as PCA, SVD)
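The first step, bursty keyword detection, can be illustrated with a simple frequency-ratio test against past time windows. This is a sketch only: the thresholds `ratio` and `min_count` and the add-one smoothing are assumptions, and the slides do not specify the detection rule actually used.

```python
from collections import Counter


def bursty_keywords(current_tweets, history_windows, ratio=3.0, min_count=3):
    """Flag words whose count in the current window is far above their
    average count in earlier windows (each word counted once per tweet)."""
    current = Counter(w for t in current_tweets for w in set(t.lower().split()))
    past = Counter()
    for window in history_windows:
        past.update(w for t in window for w in set(t.lower().split()))
    n_hist = max(len(history_windows), 1)
    bursts = []
    for word, count in current.items():
        expected = past[word] / n_hist
        # +1 smoothing so unseen words do not need an infinite ratio
        if count >= min_count and count >= ratio * (expected + 1):
            bursts.append(word)
    return sorted(bursts)


now = ["earthquake now", "big earthquake", "earthquake felt here", "lunch now"]
before = [["lunch now", "lunch time"], ["lunch now"]]
print(bursty_keywords(now, before))
```

Grouping and context extraction would then operate on the flagged keywords, for example by clustering words that co-occur in the same tweets.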
Twitter search (WSDM 2011)
The largest difference
• Twitter search orders results by time
• Search engines order results by relevance
• Social
• Time
Recommendation
Recommending content from information streams
• The filtering problem:
  – “I get 1000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?”
• The discovery problem:
  – “There are millions of URLs posted daily on Twitter. Am I missing something important there outside my own Twitter stream?”
Recommending content from information streams
• Recency of content: items are only interesting within a short time after being published
  – Always a “cold start” situation
• Explicit interaction among users
  – Users explicitly interact by subscribing or sharing
• User-generated content
  – People are content producers as well as consumers
URL Sources
• Considering all URLs was impossible
• FoF: URLs from followees-of-followees
• Popular: URLs that are popular across whole
Topic relevance scores
• Topic profile of URLs
  – Use term vectors as profiles
  – Built from tweets that have mentioned the URL
• Topic profile of users
  – Self-topic: content profile based on what I post
  – Followee-topic: content profile based on what my followees post
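A topic relevance score between a URL profile and a user profile can be sketched as the cosine similarity of their term vectors. The raw term counts used here are an assumption for brevity. A weighted scheme such as TF-IDF would be a natural refinement.

```python
import math
from collections import Counter


def term_vector(texts):
    """Bag-of-words term vector built from a list of tweets."""
    return Counter(w for t in texts for w in t.lower().split())


def cosine(a, b):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# URL profile: built from tweets that mentioned the URL (made-up examples).
url_profile = term_vector(["new python release", "python release notes"])
# Self-topic profile: what I post. Followee-topic profile: what my followees post.
self_topic = term_vector(["i love python"])
followee_topic = term_vector(["cooking pasta tonight"])

print(cosine(url_profile, self_topic), cosine(url_profile, followee_topic))
```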
Social network scores
• “Popular vote” among my followees-of-followees
  – People “vote” for a URL by tweeting it
  – Votes are weighted using social network structure
  – URLs with more votes in total are assigned a higher score
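The voting idea can be sketched as below. The slides do not say how votes are weighted, so the weighting here (a voter counts once for each of my followees who follows them) is an assumption chosen for illustration. The function name and data layout are also hypothetical.

```python
from collections import defaultdict


def vote_scores(me, follows, tweeted_urls):
    """Score URLs by weighted votes from my followees-of-followees.

    follows: dict mapping user -> set of users they follow
    tweeted_urls: dict mapping user -> set of URLs they tweeted
    """
    # Assumed weight: number of my followees who follow the voter.
    weight = defaultdict(int)
    for followee in follows.get(me, set()):
        for fof in follows.get(followee, set()):
            if fof != me:
                weight[fof] += 1
    scores = defaultdict(int)
    for voter, w in weight.items():
        for url in tweeted_urls.get(voter, set()):
            scores[url] += w  # each tweet of the URL is one weighted vote
    return dict(scores)


follows = {"me": {"a", "b"}, "a": {"x", "y"}, "b": {"x"}}
tweeted = {"x": {"u1"}, "y": {"u1", "u2"}}
print(vote_scores("me", follows, tweeted))
```

Here "x" is followed by two of my followees and "y" by one, so a URL tweeted by both collects three votes.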
Recommending Twitter users to follow
• Social graph
• Profile user
  – User himself
  – Followers
  – Followees
Microblog summarization
The phrase reinforcement algorithm
• Looking for the most commonly occurring phrases
  – Users tend to use similar words when describing a particular topic
  – RT
Hybrid TF-IDF summarization
• TF: the document is the entire collection of posts
• IDF: the document is a single post
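The two definitions of "document" above can be combined as in the sketch below: term frequency is counted over the whole collection, while document frequency is counted per post. The length normalization with `min_length` is an assumption added so that long posts do not win automatically. The slides do not give the exact normalization.

```python
import math
from collections import Counter


def hybrid_tfidf_scores(posts, min_length=5):
    """Score each post by the summed hybrid TF-IDF weight of its words."""
    tokenized = [p.lower().split() for p in posts]
    all_words = [w for toks in tokenized for w in toks]
    tf = Counter(all_words)  # TF: the entire collection is one document
    df = Counter(w for toks in tokenized for w in set(toks))  # IDF: one post = one document
    total = len(all_words)
    n_posts = len(posts)
    scores = []
    for toks in tokenized:
        weight = sum((tf[w] / total) * math.log2(n_posts / df[w]) for w in toks)
        scores.append(weight / max(min_length, len(toks)))  # assumed normalization
    return scores


def summarize(posts):
    """Return the highest-scoring post as a one-post summary (needs >= 1 post)."""
    scores = hybrid_tfidf_scores(posts)
    return posts[max(range(len(posts)), key=scores.__getitem__)]


print(summarize(["obama wins election", "obama wins", "lunch time"]))
```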
Topic model
Content modeling on Twitter
• Surface word features: tf.idf cosine similarity, etc.
• Deeper natural language processing: parsing, parts of speech, coreference, etc.
• Example tweet (THE_REAL_SHAQ): “dats yur mom not me lol”
Content modeling on Twitter
• Surface word features: tf.idf cosine similarity, etc.
• Topic models, dimensionality reduction: Latent Dirichlet Allocation (LDA), LSA, etc.
• Supervised classification: Naïve Bayes, SVM, etc.
• #hashtags, emoticons, questions, etc.
• Best model in ranking experiments: Labeled LDA
Content modeling with Labeled LDA
• Discover unlabeled topics: parameter K=200 latent topic dimensions
  – obama president american america says country russia pope island
  – I’m going go out gonna see im tonight sleep tomorrow about am night
• Model common labels: 500 - 1000 dimensions for hashtags, emoticons, etc.
  – Smile :) : good day morning thanks have happy hope birthday
  – Smile :) : can’t wait see one yay!!! cant tomorrow got !! next christmas
  – #jobs: featured manager sales engineer yahoo location senior
Content modeling with Labeled LDA
Per-word topic/label assignments for example posts:
• new muppetblog political commentary link: 4 1 1 1
• @kermit heyy wanna catch a movie: 2 2 2 3 3
• just ate a cookie #yummy: 5 5 #yummy #yummy
Histogram as signature for set of posts
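The histogram signature can be computed by counting how often each topic or label is assigned across a set of posts. A minimal sketch, assuming the per-word assignments are already available from a trained model (the numbers reuse the slide's example):

```python
from collections import Counter


def topic_signature(assignments):
    """Normalized histogram of topic/label assignments over a set of posts."""
    counts = Counter(label for post in assignments for label in post)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


# Per-word assignments for the three example posts.
posts = [
    [4, 1, 1, 1],
    [2, 2, 2, 3, 3],
    [5, 5, "#yummy", "#yummy"],
]
print(topic_signature(posts))
```

Two users, or a user and a candidate post, can then be compared by comparing their signatures, for example with cosine similarity.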
Twitter content by category
• Substance: 27%
• Status: 12%
• Style: 38%
• Social: 23%
Example topics (top words):
– can make help if someone tell_me them anyone use makes any sense trying explain
– obama president american america says country russia pope island failed honduras
– haha lol :) funny :p omg hahaha yeah too yes thats ha wow cool lmao though kinda
– am still doing sleep so going tired bed awake supposed hell asleep early sleeping sleepy
– night sleep bed going off tomorrow bye tonight goodnight all im time now nite
– iphone new phone app mobile apple ipod blackberry touch pro store apps free android an
– up what's hit pick whats hey set twitter sign give catch when show first wats make
– im get dont gonna shit gotta wanna cuz damn ur make cant say cause bout ill mad tired
Characterizing Microblogs with Topic Models
Outline
• Modeling Twitter content with topic models
• Characterizing, recommending and filtering
Characterizing users
TwitterRank: Finding Topic-sensitive Influential Twitterers
• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent her interests
  – Twitterer’s content = aggregated tweets
• Twitterers with “following” relationships are more similar, in terms of the topics they are interested in, than those without
Topic-specific TwitterRank
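A topic-specific rank can be sketched as a PageRank-style random walk over the "following" graph, where a follower passes more influence to followees who tweet more and who share its interest in the topic. This is a simplified illustration, not the paper's exact formulation: the function name, the damping value `gamma=0.85`, the fixed iteration count, and the similarity `1 - |share_i - share_j|` are assumptions, and `topic_share` stands in for per-user topic weights obtained from LDA.

```python
def twitterrank(follows, tweet_counts, topic_share, gamma=0.85, iters=50):
    """Topic-specific influence scores for one topic.

    follows: dict user -> list of users they follow
    tweet_counts: dict user -> number of tweets published
    topic_share: dict user -> weight of the topic in the user's content
                 (assumed to sum to a positive value over all users)
    """
    users = sorted(tweet_counts)
    z = sum(topic_share[u] for u in users)
    teleport = {u: topic_share[u] / z for u in users}

    # Transition weight from follower i to followee j.
    trans = {}
    for i in users:
        friends = follows.get(i, [])
        total = sum(tweet_counts[j] for j in friends)
        row = {}
        for j in friends:
            sim = 1.0 - abs(topic_share[i] - topic_share[j])
            row[j] = (tweet_counts[j] / total) * sim if total else 0.0
        trans[i] = row

    rank = dict(teleport)
    for _ in range(iters):
        new = {u: (1 - gamma) * teleport[u] for u in users}
        for i in users:
            for j, p in trans[i].items():
                new[j] += gamma * p * rank[i]  # influence flows follower -> followee
        rank = new
    return rank


follows = {"a": ["c"], "b": ["c"], "c": []}
tweet_counts = {"a": 10, "b": 10, "c": 100}
topic_share = {"a": 0.5, "b": 0.5, "c": 0.5}
print(twitterrank(follows, tweet_counts, topic_share))
```

Because the similarity factor scales the transition weights, the scores are not guaranteed to sum to 1. Only their relative order is meaningful in this sketch.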
Interesting applications
• Personalized and automatic social summarization of events in video
• Twitter can predict the stock market
• Predicting elections with Twitter
• Earthquake detection (time, location)
Thanks
Many pictures and slides come from the internet