Social-Network Analysis Using Topic Models
Web Event Topic Analysis by Topic Feature Clustering and Extended LDA Model
RMBI4310/COMP4332 Big Data Mining: Presentation
CHAN Chung Yin 07547265
Topic Modeling
• Statistical methods that analyze the words of the original texts to:
– Discover the themes that run through them
– Learn how those themes are connected to each other
– Track how they change over time
Latent Dirichlet Allocation (LDA)
• The simplest and most common topic model
• Idea:
– Represents documents as mixtures of topics
– Given a collection of words, finds a set of topics that is likely to have generated the collection
• Example sentences:
– I like to eat broccoli and bananas.
– I ate a banana and spinach smoothie for breakfast.
– Chinchillas and kittens are cute.
– My sister adopted a kitten yesterday.
– Look at this cute hamster munching on a piece of broccoli.
• LDA output:
– Sentences 1 and 2: 100% Topic A
– Sentences 3 and 4: 100% Topic B
– Sentence 5: 60% Topic A, 40% Topic B
– Topic A (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
– Topic B (cute animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
Latent Dirichlet Allocation (LDA)
• Learning:
– Go through each document d and randomly assign each word w in the document to one of the K topics t
– Then go through each word w in each document d:
• Calculate, for each topic t:
– p(topic t | document d) = the proportion of words in document d currently assigned to topic t
– p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from word w
• Reassign w a new topic, sampled with probability proportional to p(topic t | document d) × p(word w | topic t)
• This assumes that all topic assignments except the current word's are correct
– Repeat the step above until a steady state is reached
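The learning loop above is collapsed Gibbs sampling. A minimal, self-contained Python sketch (toy implementation; the smoothing constants `alpha` and `beta` and all variable names are my own, not from the slides):

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA, following the steps above.

    docs: list of documents, each a list of word tokens.
    Returns the final topic assignment for every word position.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    ndk = [defaultdict(int) for _ in docs]       # words in doc d assigned to topic k
    nkw = [defaultdict(int) for _ in range(K)]   # times word w is assigned to topic k
    nk = [0] * K                                 # total assignments to topic k
    # random initialisation: assign every word to one of the K topics
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment (assume all others are correct)
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # weight each topic by p(topic t | doc d) * p(word w | topic t)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z
```

Running it on tokenized versions of the five example sentences with K=2 yields one topic label per word position.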
Applying LDA to Social Network Analysis
• Background:
– The "follow relationship" among users often looks unorganized and chaotic
– Follow relationships are created haphazardly by each individual user, not controlled by a central entity
• Purpose: provide more structure to the follow relationship
– By "grouping" users based on their topic interests
– By "labeling" each follow relationship with the identified topic group
Applying LDA to Social Network Analysis
• Represent E in matrix form by putting F in the rows and G in the columns
• Symbols:
– u: a Twitter user
– U: the set of all u
– f: a follower
– F: the set of all f
– g: a followed user
– G: the set of all g
– z: a topic (interest)
– Z: the set of all z
– e(f, g): a follow edge from f to g
– e'(f, g): a follow edge from g to f
– e(f): the set of all outgoing edges from f
– e'(g): the set of all incoming edges to g
– E: the set of all e(f, g), ∀f ∈ F, ∀g ∈ G
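The matrix form of E described above can be sketched as a binary follower-by-followed matrix. The edge list here is a made-up toy example, not the paper's data:

```python
# Build the binary |F| x |G| matrix for E: rows are followers f,
# columns are followed users g, and entry (f, g) is 1 iff e(f, g) exists.
edges = [("alice", "barackobama"), ("alice", "zappos"),
         ("bob", "barackobama")]          # hypothetical follow edges
F = sorted({f for f, _ in edges})         # followers (rows)
G = sorted({g for _, g in edges})         # followed users (columns)
M = [[0] * len(G) for _ in F]
for f, g in edges:
    M[F.index(f)][G.index(g)] = 1
```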
Applying LDA to Social Network Analysis
• Similar to LDA (LDA concept → edge generative model concept):
– Topic → interest (z)
– Document → follower (f)
– Word → followed user (g)
Applying LDA to Social Network Analysis
• Different from LDA (LDA vs. edge generative model):
– Multinomial distribution vs. multivariate hypergeometric distribution
– Documents in the rows and words in the columns vs. rows and columns drawn from the same entity (users)
– Integer entries vs. binary entries
– Smaller magnitude vs. larger magnitude (|F| × |G|)
– Power-law distribution vs. no power-law distribution
Handling Popular Users
• Alternative 1: Setting Asymmetric Priors
– To avoid over-fitting, Dirichlet priors 𝞪 and 𝞫 constrain P(z|f) and P(g|z), respectively
– In standard LDA, each element of 𝞪 and 𝞫 is assumed to have the same value
– 𝞫: distribution of followed users per interest
– Instead, we set each 𝞫 prior value proportional to the number of incoming edges of each followed user
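A sketch of building such an asymmetric 𝞫 prior from in-degrees (the in-degree numbers and the base value 0.01 are illustrative assumptions, not values from the paper):

```python
# Asymmetric beta prior: one value per followed user g, proportional to
# the number of incoming edges |e'(g)|, while keeping the same total
# prior mass as a symmetric prior of 0.01 per user.
in_degree = {"barackobama": 7410, "zappos": 3, "techcrunch": 900}  # toy counts
total = sum(in_degree.values())
beta_mass = 0.01 * len(in_degree)
beta = {g: beta_mass * d / total for g, d in in_degree.items()}
```

Popular users thus receive a larger prior value, so their high assignment counts are expected by the model rather than treated as topical signal.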
Handling Popular Users
• Alternative 2: Hierarchical LDA (HLDA)
– Generates a topic hierarchy; more frequent topics are located at higher levels
– Bottom-level topics are expected to be more specific
– Allows an infinite number of topics
– Topics are not independent
Handling Popular Users
• Alternative 3: Two-Step Labeling
– Topic establishment step
• Run LDA after removing popular users from the dataset, similar to how stop words are removed before applying LDA to a document corpus
– Labeling step
• Apply LDA only to the popular users in the dataset
Handling Popular Users
• Alternative 4: Threshold Noise Filtering
– Even the smallest number of times a popular user is assigned to a topic group can outnumber the largest number of times a non-popular user is assigned to that group
– Set a cut-off value to determine whether to label a user with each topic
– Filter out less relevant topics by keeping only the top-K topics for each user
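The filtering rule can be sketched in a few lines (the cut-off 0.05 matches the C value used in the experiments; the top-K default here is an illustrative assumption):

```python
def filter_topics(user_topic_probs, cutoff=0.05, top_k=3):
    """Keep only topics whose probability clears the cut-off,
    then keep at most the top-K of those for the user."""
    kept = [(t, p) for t, p in user_topic_probs.items() if p >= cutoff]
    kept.sort(key=lambda tp: tp[1], reverse=True)
    return [t for t, _ in kept[:top_k]]
```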
Experiments
• Twitter dataset, collected between October 2009 and January 2010
• Sampled 10 million edges
• Dataset statistics:
– |E|: 10,000,000
– |U|: 2,430,237
– |F|: 14,015
– |G|: 2,427,373
– max(|e(f)|): 142,669 (f: zappos)
– max(|e'(g)|): 7,410 (g: barackobama)
Experiments
• 100 topic groups
• Boundary value V = 100
– Normal user group: |e'(g)| ≤ V
– Popular user group: |e'(g)| > V
• Experimental cases:
– base: LDA over the whole dataset
– non-popular: LDA over the normal user group dataset
– beta: LDA with asymmetric priors
– hlda-2lv: HLDA with 2 levels
– hlda-3lv: HLDA with 3 levels
– 2step: two-step labeling
– filter-base: threshold noise filtering after base (C = 0.05)
– filter-2step: threshold noise filtering after 2step (C = 0.05)
Experiments
• Measurement 1: Perplexity
– Measures how well the trained model deals with unobserved test data
– A perplexity of k means that the model is, on average, as surprised as if it had to guess between k equiprobable choices at each step
– The smaller the perplexity, the better the model
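The definition above translates directly into code: perplexity is the exponential of the negative mean per-token log-likelihood.

```python
import math

def perplexity(log_likelihoods):
    """Perplexity = exp(-mean log-likelihood per token).
    A value of k means the model is as uncertain, on average,
    as a uniform guess among k equiprobable choices."""
    return math.exp(-sum(log_likelihoods) / len(log_likelihoods))
```

For instance, a model that assigns every token probability 1/4 has perplexity exactly 4.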
Experiments
• Measurement 2: Quality
– Determined by how users perceive the quality of the identified topic groups from each approach
– Conducted a survey with a total of 14 participants
Experiments
• Measurement 3: Qualitative Comparison
– Examine the top ten users in each topic group
– Going over the users' bios shows that many of the users in a group share the same interest
Experiments
• Measurement 3: Qualitative Comparison
– base: the topic group "cycle" from base shows the popular-user issue
Experiments
• Measurement 3: Qualitative Comparison
– base:
• Barack Obama appears in 15 of the 100 topic groups produced by base
• Among the 15 groups, only one is related to his specialty, politics
• Standard LDA suffers from noise from popular users if applied directly to our social graph dataset
Experiments
• Measurement 3: Qualitative Comparison
– non-popular: the topic group "mobile gadget blog" from non-popular
Experiments
• Measurement 3: Qualitative Comparison
– 2step: the topic group corresponding to "mobile gadget blog"
• Many users in this group are very popular tech media accounts
Experiments
• Measurement 3: Qualitative Comparison
– filter-2step: the topic group corresponding to "mobile gadget blog"
• Many users in this group are very popular tech media accounts
• cnnbrk and breakingnews are removed
Conclusion
• Two-step labeling with threshold noise filtering is very effective in handling the popular-user issue
• It shows a 1.64x improvement in the quality of the produced topic groups, compared to the standard LDA model
Web Event Topic Analysis
• Analyze the common topic of many different web events
• Cluster events that belong to the same topic
• Choose suitable topic terms for the clusters
Web Event Topic Analysis
• Dimension LDA (DLDA)
– Represents an event as a multi-dimensional vector:
– {agent, activity, object, time, location, cause, purpose, manner}
• K-means clustering
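The k-means step can be sketched as follows. This is a generic toy implementation on plain numeric vectors; in the paper the points would be embeddings of the eight event dimensions listed above, and the deterministic initialisation here is my own simplification:

```python
def kmeans(points, k, iters=20):
    """Plain k-means with deterministic initialisation (first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # assign every point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute centroids as cluster means; keep empty clusters in place
        new_centroids = []
        for j, cl in enumerate(clusters):
            if cl:
                new_centroids.append([sum(col) / len(cl) for col in zip(*cl)])
            else:
                new_centroids.append(centroids[j])
        centroids = new_centroids
    return clusters
```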
Dimension LDA (DLDA)
• Perform LDA analysis
• Add a parameter vector for the weighting of each dimension
• Select some dimensions as topic feature dimensions
Topic Feature Clustering
• Cluster the content of topic feature dimensions to analyze the common topic for events
Topic Feature Words Selection
• Use the DLDA model to select topic feature words
• Find the words within the same cluster group
• Maximize the probability of a topic given its dimension and word
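The selection rule above amounts to an argmax per dimension. A sketch with made-up probabilities (the table of p(topic | dimension, word) values is illustrative; in the paper it comes from the trained DLDA model):

```python
def select_feature_words(p_topic_given_dim_word, topic):
    """For each dimension, pick the word maximising p(topic | dimension, word).
    p_topic_given_dim_word maps (dimension, word) -> {topic: probability}."""
    best = {}
    for (dim, word), dist in p_topic_given_dim_word.items():
        p = dist.get(topic, 0.0)
        if dim not in best or p > best[dim][1]:
            best[dim] = (word, p)
    return {dim: word for dim, (word, _) in best.items()}
```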
Topic Terms Generating
• Merge two words together
• Replace existing keywords from a dictionary
• => candidate topic terms
• Finally, compute the probability distribution over the candidate topic terms