MOTIVATIONS BEHIND SENDING MESSAGES IN MDT (METHODS, DERIVATION & RESULTS)
Anup Sawant, SONIC @ Northwestern University
Acknowledgement
¨ This analysis wouldn’t have been possible without the support of SONIC @ Northwestern University and DELTA Lab @ Georgia Tech in developing the project ‘My Dream Team’.
¨ Other project members: ¤ Dr. Noshir Contractor, Northwestern University. ¤ Dr. Leslie DeChurch, Georgia Tech. ¤ Dr. Edward Palazzolo, Northwestern University. ¤ Harshad Gado, Northwestern University. ¤ Amy Wax, Georgia Tech. ¤ Samuel Posnock, Georgia Tech. ¤ Peter Seely, Georgia Tech.
Overview
¨ Description ¤ Objective ¤ Corpus details
¨ Methodology ¤ Different approaches ¤ Text Parsing ¤ Forming vectors ¤ Measuring similarity ¤ K-Means ¤ Choosing the optimum K ¤ Distribution & Density ¤ Topic Indicators
¨ Results ¤ What can we infer ? ¤ Supporting inference ¤ Gratitude & Request ¤ Need, Liking, Invite & Praise ¤ Collaboration & Grouping
¨ Conclusion
Project Description
¨ Objective ¤ To perform textual analysis and find the motivations behind message interaction in the process of forming teams. ¤ To support some of the known reasons behind team-formation ties through mathematically derived topical patterns hidden in unstructured text data.
¨ Corpus details
¤ ‘My Dream Team’ run 19th – 23rd Feb, 2014. ¤ # Students participated : 214 ¤ # Text messages exchanged : 353 ¤ # Unique words/terms in entire corpus : 619
Methodology/Different Approaches
¨ Problem statement : “Given a text corpus X = {x1, x2, x3, …}, where xi is a document/message, find the topics/ideas (in our context, primary motivations) that represent individual clusters within X.”
¨ Possible Approaches : ¤ Latent Semantic Analysis (mostly used in IR for indexing) ¤ Latent Dirichlet Allocation (probabilistic topic modeling) ¤ Document Clustering (we go by this)
¨ Problems : ¤ Real-world textual data is always “dirty” when it comes to text parsing. ¤ Performance and accuracy can depend on a rich vocabulary for grammatical parsing.
Methodology/Text Parsing
“I enjoy learning and growing while also getting out a little of my competitive spirit.”
↓ Lowercase
“i enjoy learning and growing while also getting out a little of my competitive spirit.”
↓ Lemmatize: remove punctuation, split into words & convert each word to its root
[i, enjoy, learn, and, grow, while, also, get, out, a, little, of, my, competitive, spirit]
↓ Remove stopwords
[i, enjoy, learn, grow, out, little, my, competitive, spirit]
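The three parsing steps above can be sketched in Python. The stopword set and the suffix-stripping “lemmatizer” below are stand-ins (a real run would use a full stopword list and a dictionary-based lemmatizer such as WordNet’s), so this shows the shape of the pipeline, not the project’s actual parser.

```python
import re

# Illustrative stopword subset -- an assumption, not the project's list.
STOPWORDS = {"i", "and", "a", "of", "my", "while", "also", "get", "out"}

def lemmatize(word):
    # Toy suffix stripper standing in for a real lemmatizer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:  # getting -> gett -> get
                stem = stem[:-1]
            return stem
    return word

def parse(message):
    text = message.lower()                             # 1. lowercase
    tokens = re.findall(r"[a-z]+", text)               # 2. strip punctuation, split
    tokens = [lemmatize(t) for t in tokens]            # 3. reduce to root form
    return [t for t in tokens if t not in STOPWORDS]   # 4. drop stopwords
```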
Methodology/Forming Vectors
¨ Bag of words : Collect all unique words from the corpus to form the vocabulary.
¨ Document-Term index : For each word in the vocabulary, count its frequency in each document/message of the corpus. (example below)
Term/Word   Message-1   Message-2   Message-3   Message-4
hey         1           1           1           0
team        1           1           0           1
join        1           0           1           1
work        1           0           1           0
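Building the vocabulary and the document-term counts from already-parsed messages can be sketched as follows (function names are illustrative, not from the project code):

```python
from collections import Counter

def document_term_matrix(parsed_messages):
    """Build a sorted vocabulary and a term-frequency vector per message."""
    vocab = sorted({w for msg in parsed_messages for w in msg})
    matrix = []
    for msg in parsed_messages:
        counts = Counter(msg)                      # word -> frequency in this message
        matrix.append([counts[w] for w in vocab])  # one row per message
    return vocab, matrix
```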
Methodology/Measuring Similarity
¨ Cosine Similarity : Performs better than the Euclidean measure for this task. Because it compares direction rather than magnitude, similarity is retained regardless of vector length. Intuitively, documents/messages dealing with the same topic/domain remain close in vector space irrespective of their message length.
[Figure: two message vectors, Message 1 and Message 2, drawn from the origin (0,0); the Euclidean distance between their endpoints is large, while the angle Θ between them is small.]
Example : Although the Euclidean measure shows quite a bit of distance between the two vectors, the cosine measure indicates they point in nearly the same direction. cos 0° = 1 indicates similar vectors; cos 90° = 0 indicates dissimilar vectors.
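Both measures are short enough to write out directly. In the pair of vectors below, the second is the first scaled up fourfold, mimicking a long and a short message on the same topic; cosine similarity stays at 1 while the Euclidean distance grows.

```python
import math

def cosine_similarity(x, y):
    # Similarity of direction: 1.0 for parallel vectors, 0.0 for orthogonal.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    # Straight-line distance between the vector endpoints.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two toy "messages" on the same topic, one four times longer than the other.
short_msg = [1, 1, 0]
long_msg = [4, 4, 0]
```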
Methodology/K-Means
¨ The K-means algorithm automatically clusters similar data examples together. It is an iterative procedure that starts by guessing initial centroids, then refines the guess by repeatedly assigning examples to their closest centroid and recomputing each centroid from the examples assigned to it. (Image source: Wikipedia)
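The assign/recompute loop can be sketched in a few lines. This toy version uses squared Euclidean distance for brevity; for text vectors one would typically substitute 1 − cosine similarity. It is an illustration, not the project’s implementation.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    # Guess initial centroids by sampling k distinct points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, clusters
```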
Methodology/Choosing the optimum K
To find the optimum number of clusters/means for the message corpus, we use the ‘Elbow Curve’ technique: plot the cost function J against the number of means tried, where
J = (1/m) Σi ‖xi − μc(i)‖²
m = number of messages, xi = message vector, c(i) = index of the centroid that vector xi is assigned to, and μc(i) = the corresponding cluster centroid vector to which xi belongs.
[Figure: elbow curve of J against the number of means tried.]
Thus, the optimum number of clusters for the MDT message corpus is 3 (the point where the graph bends like an elbow).
Methodology/Distribution & Density
[Pie chart, message distribution among clusters: Cluster 1 has 122 messages, Cluster 2 has 124, Cluster 3 has 107.]
The pie chart gives the number of messages distributed among the 3 clusters of the MDT text-message corpus.
[Bar chart, cluster density: # words per cluster (0–1800) against # messages per cluster (105–125) for Clusters 1–3.]
The density graph gives an idea of how dense each cluster is with words or terms. Note: Cluster 1 is bigger (has more messages) than Cluster 3, but the number of words that make up Cluster 1 is far less than that of Cluster 3. This is an important clue that Cluster 1 is most likely made up of short messages.
Methodology/Topic indicators
On a broader scale we already know that all the messages deal with the topic ‘Team Formation’. We are hunting for the hidden motivations at a sub-topic scale. The segmented pyramid shows some of the top-ranking words by frequency in each cluster, with ‘Team Formation’ as the core topic. The words most common to all clusters, which reflect the broad ‘Team Formation’ topic, sit at the core of the pyramid. We treat the top words as strong indicators of hidden motivations.
Cluster-2 # messages : 124 # words : 1013
Cluster-3 # messages : 107 # words : 1601
Cluster-1 # messages : 122 # words : 387
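Ranking each cluster’s words by frequency, as the pyramid does, is a short counting exercise (a sketch; the input is assumed to be one cluster’s parsed messages):

```python
from collections import Counter

def top_words(cluster_messages, n=5):
    """Rank words by frequency within one cluster's parsed messages."""
    counts = Counter(w for msg in cluster_messages for w in msg)
    return [w for w, _ in counts.most_common(n)]
```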
Results/What can we infer ?
Cluster-1: Short messages indicating gratitude or a request to be accepted as a member, through words like ‘thanks’ and ‘accept’. Low-frequency words in this cluster are mostly slang or non-dictionary words. Cluster-2: Messages that mostly refer to the recipient’s qualities, so words like ‘cool’ and ‘like’ stand out among the top words. These messages probably also express the sender’s need to add one or more members to the team. Cluster-3: Messages indicating working together, grouping, and mostly collaboration, with high-frequency words like ‘group’, ‘work’ and ‘together’.
Results/Supporting inference
¨ Though top words provide a strong indication of the probable topics in a cluster, word frequency alone isn’t enough to support our assumed topics.
¨ A better support for our inference is a mathematical analysis of the co-occurrence of each cluster’s top words with the words ‘team’ and ‘you’, which make up the core topic of ‘Team Formation’.
Results/Probability for coherence
¨ Example : “Ana (a word) is in the mall (topic) given her best friends Harry and Brian (‘team’ & ‘you’) are in the mall (topic).”
¨ In other words, a word would define a topic only if it co-occurs with other supporting words to reflect coherence necessary to define that topic.
¨ P(topic) = P(w | X), where w = a word and X = the core words of the topic ‘Team Formation’.
¨ We calculate P(topic) across all clusters for a given word.
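One plausible reading of P(w | X) is the fraction of messages containing all the core words that also contain w. The sketch below implements that estimate; the slides’ exact computation may differ, so treat this as an assumption.

```python
def cooccurrence_probability(messages, word, core=("team", "you")):
    """Estimate P(word | core words): among messages containing every
    core word, the fraction that also contain `word`."""
    with_core = [m for m in messages if all(c in m for c in core)]
    if not with_core:
        return 0.0
    return sum(word in m for m in with_core) / len(with_core)
```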
Results/Gratitude & Request by Cluster 1
[Bar chart: topic probability via conditional probability of the words ‘thanks’ and ‘accept’ given the core Team Formation words ‘team’ and ‘you’, plotted (0–0.045) for Clusters 1, 2 and 3.]
Results/Need, Liking, Invite & Praise by Cluster 2
[Bar chart: topic probability via conditional probability of the words ‘need’, ‘one’, ‘like’, ‘join’ and ‘cool’ given the core Team Formation words ‘team’ and ‘you’, plotted (0–0.7) for Clusters 1, 2 and 3.]
Results/Collaboration & Grouping by Cluster 3
[Bar chart: topic probability via conditional probability of the words ‘work’, ‘group’ and ‘together’ given the core Team Formation words ‘team’ and ‘you’, plotted (0–0.3) for Clusters 1, 2 and 3.]
Conclusion
¨ Document clustering, with probabilistic support for topical inference over the MDT message corpus, has helped expose the following motivations behind sending messages. Users interacted when: ¤ There was a need for one or more people to complete the team. ¤ They liked someone’s profile. ¤ They wanted to invite someone to join their team. ¤ They wanted to praise someone for a good profile or looks. ¤ They wanted to group/merge their incomplete teams. ¤ They wanted to collaborate/work with someone. ¤ They wanted to express gratitude. ¤ They wanted to earnestly request someone to join.