conor-duke
TOPIC MODELLING FOR HUMANS - SUBJECTIVE MODELLING OF ONLINE DISCUSSION
Agenda
Explanation of Topic Modelling
Application using Gensim
Sample Results
08 Apr 2015
Explanation of Topic Modelling
A BRIEF INTRODUCTION TO THE SEMANTIC WEB
What is Semantic Analysis? Why is it Important?
• Discover topics in large groups of documents
• Use these labels to understand the body of text and documents more effectively
Some use cases:
• Consumer insight
• Recommender systems
• Social media monitoring
What is Topic Modelling?
Grouping documents based on the probability of words occurring in each document
Source: http://people.cs.umass.edu/~wallach/talks/priors.pdf
Transforming raw data into insight for a particular audience is not about algorithms alone.
Data → [The Gap] → Insight
Good data science makes 'The Gap' as small as possible.
Finding the most suitable application of topic modelling for 'discussion' is critical.

Topic modelling approaches:
• Semantic: subject-matter corpus, general corpus
• Statistical: word probability, paragraph structure, word distance
• A mixture of all?

Analysing political debate discourse has the following issues:
• Few 'training' texts
• Highly variable sentence length
• Distinct word distributions

Statistical word probability has readily available implementations and can resolve these challenges.
What is Gensim?
Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim aims at processing raw, unstructured digital texts (“plain text”).
• Offers more precise modelling options than 'topicmodels' in R or MALLET
• Wider function set
• Somewhat complex to optimise
• Dependencies: NumPy and SciPy
Radim Řehůřek
Application using Gensim
HOW TO USE GENSIM TO UNDERSTAND LARGE VOLUMES OF TEXT EFFECTIVELY
Preparing the data
• No data set is ever ready to operate on 'out-of-the-box'
• Challenges included:
  • Character encoding
  • Multiple fields in a column
  • Timestamps
What is a Text Corpus and a ‘Bag-of-Words’?
Bag-of-words (BOW) converts each response into an unordered set of single words.
This method:
• does not parse sentences,
• does not care about word order, and
• does not "understand" grammar or syntax.
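As a minimal sketch of the idea (plain Python token counting, not Gensim's API; the sample sentence is invented):

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count tokens.
    Word order and grammar are deliberately discarded."""
    return Counter(text.lower().split())

# Invented forum-style response:
doc = "My phone cannot connect my phone shows no signal"
bow = bag_of_words(doc)
# 'my' and 'phone' each occur twice; their original positions are lost.
```

Gensim's `Dictionary.doc2bow` does the same thing but maps each token to an integer id first.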
The optimum number of topics can be selected by choosing the model with the smallest measure of chaos / entropy, i.e. the least disorder in the topics.

Candidate measures:
• Harmonic mean: "sum of lowest average probability" for each topic distribution
• AIC: balances the harmonic mean against model complexity
• Entropy: least amount of disorder in the topics

Using Kullback–Leibler divergence we can spot local minima and pick the optimum number based on how many topics we want to 'name'. Local minima provide a chance to explore the trade-off between granularity and consistency.
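A minimal sketch of this selection idea: compute a divergence score per candidate topic count, then pick the local minima. The scores below are invented for illustration; in practice each would come from evaluating a fitted model.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists summing to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical divergence score per candidate number of topics:
scores = {5: 0.9, 10: 0.4, 15: 0.6, 20: 0.3, 25: 0.5}

def local_minima(scores):
    """Topic counts whose score is lower than both neighbours."""
    ks = sorted(scores)
    return [k for i, k in enumerate(ks)
            if 0 < i < len(ks) - 1
            and scores[k] < scores[ks[i - 1]]
            and scores[k] < scores[ks[i + 1]]]

candidates = local_minima(scores)  # each is a granularity/consistency trade-off point
```

Any of the returned counts is defensible; the final choice depends on how many topics you are prepared to name.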
Latent Dirichlet Allocation
LDA repeatedly examines the probability of the words in each response and establishes 'common sets' of words (topics).
The words associated with each topic can be extracted. Each comment is assigned to a single topic.
lda.print_topic extracts the words in each topic.
Taking the argmax over each comment's topic distribution (e.g. with NumPy) gives the most likely topic for each comment.
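A minimal sketch of the assignment step, with invented topic distributions (pure Python standing in for NumPy's argmax):

```python
def most_likely_topic(topic_dist):
    """Index of the highest-probability topic (what an argmax would return)."""
    return max(range(len(topic_dist)), key=topic_dist.__getitem__)

# Hypothetical per-comment topic distributions from an LDA model
# (each list is P(topic | comment) over three topics):
comment_dists = [
    [0.1, 0.7, 0.2],   # mostly topic 1
    [0.6, 0.1, 0.3],   # mostly topic 0
]
assignments = [most_likely_topic(d) for d in comment_dists]  # → [1, 0]
```

Assigning each comment to its single most probable topic discards the rest of the distribution, which is a deliberate simplification for labelling.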
Sample Results
INTERROGATING A COMMUNITY FORUM DISCUSSION
How to use it? There are 7 key stages to model topics effectively:
1. Collate text / get data
2. Create corpus
3. Create 'bag of words' / feature review
4. Find the optimum number of topics
5. Establish keyword groupings and review
6. Name topics
7. Visualise and deliver insight
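The first three stages can be sketched in plain Python. The posts below are invented for illustration; the real pipeline would hand the resulting corpus to an LDA implementation such as Gensim's `LdaModel`.

```python
from collections import Counter

# Stage 1: collate text (toy forum posts, invented for illustration)
posts = [
    "sim card not working in new phone",
    "phone lost signal after sim swap",
    "how do I transfer my number",
]

# Stage 2: build the corpus vocabulary (word -> integer id)
vocab = {w: i for i, w in enumerate(sorted({w for p in posts for w in p.split()}))}

# Stage 3: bag-of-words — each post becomes sorted (word_id, count) pairs
def doc2bow(post):
    counts = Counter(post.split())
    return sorted((vocab[w], c) for w, c in counts.items())

corpus = [doc2bow(p) for p in posts]
# Stages 4-7 (optimum topic count, keyword groupings, naming, visualisation)
# operate on `corpus` once it is fed into a topic model.
```

This mirrors the shape of Gensim's own `Dictionary` / `doc2bow` pipeline without depending on the library.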
Sample set: 11.3K posts to a Telco help forum
Corpus: 5,000 questions, 3,000 users, 3 years of data.
Data features: Kudos, Device, Thread size, User age, Views, Maximum user posts.
Classifying users helps identify admins versus ordinary users.
We then use a regression forest to identify which post features drive 'Views'.
Removing the 'Admin' outlier, 'Kudos' seems to be the driving feature.
(Feature importance, in order: Kudos, Response number, User age, Thread size.)
Checking the optimum topic number across the different user segments ensures our grouping assumptions are reasonable.
We examine the structure of the corpus, the response count and length of posts in each topic (T1-T18), to validate our model.
Word probability distributions, corpus and domain knowledge allow topics to be named.

Topic  Topic Name                 Word tokens and probabilities
11     Internet setting           internet 30%, setting 27%, data 31%, phone 12%, work 5%
12     Number Transfer            number 36%, 48 35%, sim 14%, support 6%, old 3%
13     General new account query  phone 26%, sim 29%, go 24%, solution 9%, solved 6%
14     Roaming                    text 23%, roaming 25%, call 22%, send 8%, eu 5%
15     General chat               im 13%, like 12%, think 10%, good 4%, dont 2%
16     Referral Bonus             press 27%, key 32%, navi 31%, highlight 15%, select 9%
17     Network Issues             network 12%, phone 11%, problem 13%, im 5%, internet 3%
18     Blackberry Problems        problem 11%, blackberry 11%, mine 13%, get 5%, thanks 2%
Posts get 'views' for any number of reasons; we need to identify which topics are important.
Using a random forest to predict 'Views', only a handful of topics drive views:
Internet setting (11), Number Transfer (12), General new account query (13), Referral Bonus (16), Network Issues (17), Blackberry Problems (18).
This suggests these topics get 'repeat' visits. These are NOT the most 'viewed' topics, but the ones people refer back to.
We then compare the key topics' posts over time to understand the patterns.
Using 'Named Entity Recognition' (NER), topic modelling can be used to understand how consumers interact with brands.
Brand mentions occur in only 2% of the entire corpus, making any assignment of topics trivial.
Conclusion
THINGS TO THINK ABOUT
The second generation of 'listening' tools will be less metric-driven and more qualitative.
Context is Key
Blind application of complex modelling will yield results that deliver incorrect classifications.
The final deliverable and key features must be defined before embarking on the analysis.
There is an infinite amount of data; harvesting it is the key.
Appendix
GENSIM
Comparison of LDA implementations
The 'number of topics' is the key parameter, but a few other parameters are also important:
• Learning rate (decay): used to 'bootstrap' small bodies of text
• 'Passes' of the Bayesian sampling function can also affect the model
• Priors matter: a function of document count and length
• Gensim in Python currently has the most extensive set of parameters; however, topicmodels in R has some good visualisation examples
• 'Online' LDA implementations are crucial for 'social listening' on evolving political commentary

'Honourable mention' implementations:
• Vowpal Wabbit: machine learning
• MALLET: focus on text modelling
• Stanford: a great resource
The Model still needs to be visualised
Again we use Kullback-Leibler divergence to map the topics against each other. Each word has a measure of saliency.
Saliency is a compromise between a word's overall frequency and its distinctiveness; a word's distinctiveness is a measure of that word's distribution over topics.
By visualising the word distributions in each topic we understand them better
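The saliency measure described above can be sketched as frequency weighted by distinctiveness, where distinctiveness is the KL divergence of the word's topic distribution from the overall topic distribution. All probabilities below are invented for illustration.

```python
import math

def distinctiveness(p_topic_given_word, p_topic):
    """KL divergence of a word's topic distribution from the corpus-wide one."""
    return sum(ptw * math.log(ptw / pt)
               for ptw, pt in zip(p_topic_given_word, p_topic) if ptw > 0)

def saliency(p_word, p_topic_given_word, p_topic):
    """Overall word frequency weighted by its distinctiveness."""
    return p_word * distinctiveness(p_topic_given_word, p_topic)

# Invented numbers: 'roaming' is rare but concentrated in one topic;
# 'phone' is common but spread evenly, so it carries no topic signal.
p_topic = [0.5, 0.5]
s_roaming = saliency(0.02, [0.95, 0.05], p_topic)
s_phone = saliency(0.20, [0.50, 0.50], p_topic)
```

A visualisation ranks words by saliency, which is why an evenly spread but frequent word like 'phone' contributes little to distinguishing topics.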
Why Priors Matter!
Careful thinking about priors can yield new insights
– e.g., priors and stop-word handling are related
For LDA the choice of prior is surprisingly important:
– Asymmetric prior for document-specific topic distributions
– Symmetric prior for topic-specific word distributions
Almost all work on LDA uses symmetric Dirichlet priors:
– Two scalar concentration parameters: α and β
– Concentration parameters are usually set heuristically
– Some recent work infers optimal concentration parameter values from data (Asuncion et al., 2009)
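The symmetric-versus-asymmetric distinction can be illustrated by sampling from a Dirichlet via normalised Gamma draws (standard library only; the concentration values are invented):

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw from Dirichlet(alphas) by normalising independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
# Symmetric prior: one scalar concentration repeated for every topic,
# so no topic is favoured a priori.
symmetric = dirichlet_sample([0.1] * 5, rng)
# Asymmetric prior: some topics are expected to be more common than others.
asymmetric = dirichlet_sample([2.0, 0.5, 0.1, 0.1, 0.1], rng)
```

With a small symmetric concentration (0.1), each sampled document concentrates its mass on a few topics; the asymmetric prior additionally biases which topics those tend to be.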