
www.citihub.com

Using Big Data Techniques to Enhance Customer Analytics

Social Media Monitoring


    CONTENTS

Executive Summary

Highlights

Introduction

Data Analysis Techniques

• Supervised and Unsupervised Methods

• Measuring Similarity

• Pre-processing

• Entropy

• Information Gain

Example Use Case

Reference Stacks

• Analysis of Streaming Data

• Analysis of Natural Language Data Logs

Conclusion

Contact Us

Executive Summary

Financial institutions have always been adept at managing data. After all, data lies at the heart of every financial decision - whether it’s processing a transaction, approving a loan, picking a stock or mutual fund investment, or devising a savings plan. Yet the proliferation of technology has meant the variety of data sources and the velocity with which data is generated have exploded.

From a retail perspective, the uptake of social media and other communication channels has spawned rich and diverse sets of client information, which are creating new opportunities for client profiling and sentiment analysis. Moreover, the ‘internet-of-things’ threatens to add further momentum to the variety, volume and velocity of data created, while also providing valuable real-time insights into everything from individuals’ power consumption and shopping patterns to their driving behaviour.

Ultimately, the age of big data has the potential to significantly change our understanding of complex mechanisms made up of many individual components. Typically, most models—be they financial, sociological or operational—have either been based on top-down or bottom-up perspectives.

Looking at something from the top-down allows analysts to focus on the bigger picture, but tends to rely on assumptions about the details. Bottom-up analysis may be based on more precise empirical evidence of the details, but makes assumptions in extrapolating results from a sample set to a broader population.

Big data techniques have the potential to harness the best of both worlds by being able to gather, mine and model granular data across extremely large sample sets—providing a top-down view that has been aggregated from a very large set of empirical data points. However, the ability to extract data from new sources, derive meaningful insights from that content and integrate it with existing data warehouse technologies is by no means simple, and most financial institutions are still in the early stages of adoption and experimentation.

This paper outlines some key techniques that can be used for big data analysis and highlights a few potential use cases focused on integrating social media content into retail financial services analytics. It then goes on to define some open source reference stacks to support big data analysis without significant investment in software licenses or hardware procurement.

    Highlights

    • Big data technologies encompass a range of capabilities, including the ability to store and analyse very large datasets; process unstructured content such as natural language; and render intuitive visualisations.

• There is a range of open source solutions within each of those areas that can be used to drive powerful analytics without significant investment.

• Tools like Hadoop and Amazon’s Elastic MapReduce, Apache Hive, Apache Pig and Apache Mahout are ideal for large dataset storage and analysis.

• For natural language processing, Apache Lucene, Apache OpenNLP and Stanford NLP each provide unique capabilities to pre-process, model and analyse natural language content.

• Being able to visualise large datasets is another key aspect of big data analytics, and a range of open source tools are available for that purpose, including Carrot2 Workbench for unsupervised clustering; Gephi, NodeXL, yEd, Graphviz and TikZ for relational analytics; and CartoDB for geo-spatial and time series visualisations.

Introduction

The retail financial services industry is facing a period of unprecedented change, with new competitors, disruptive technologies and regulatory pressures all contributing to a challenging operating environment. In spite of these threats, most incumbents still have a significant advantage in their favour. With an established customer base and the potential to develop a better understanding of those customers using new data sources and analytical techniques, retail financial institutions can benefit significantly from evolving big data technologies.

In order to do so, they need to focus efforts on delivering clear business benefits—such as improving sales and marketing efforts, customer service and reducing churn—and architect solutions designed to achieve those goals.

The cost of software licensing need not be prohibitive given that there are many powerful open source technologies available. However, assembling the right skill sets and developing the right data architecture will take time.

Ideally, teams charged with big data analysis should operate in an agile development mode, being able to spin up new environments and test out ideas quickly, so as to speed up the learning curve with regard to new sources of unstructured data from social media and other channels.

Those data sets can then be used to supplement existing customer analytics to build richer customer profiles. With a more accurate profile of each customer, obtained by analysing richer data from a variety of sources, financial institutions will be able to better guide and personalise their customer interactions.

Ultimately, the goal is to marry traditional financial metrics—determining factors such as the profitability and creditworthiness of a particular client—with insights into individual customer behaviour, which can influence their propensity to buy certain products or the risk of churn.

    Data Analysis Techniques

This section explores some of the techniques that can be applied to analysing big data sets. Although this is not an exhaustive list, it provides an overview of how data analysts can derive insights from large volumes of unstructured data.

    Supervised and Unsupervised Methods

    When looking to process and analyse large volumes of unstructured data, most of the available techniques can be grouped into two broad categories: supervised and unsupervised methods. Essentially, these methods relate to the level of training that you need to provide to your analytical algorithms.

    Unsupervised algorithms will mine large datasets in a completely automated fashion. Using statistical methods, such as k-means clustering, these algorithms can look to group similar data items together without any manual intervention. This type of technique may often be used to quickly analyse very large data sets to identify basic patterns that can then be explored in more detail using supervised methods.
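
As an illustration only (the example and feature values are invented, not taken from this paper), the following sketch shows what an unsupervised k-means clustering pass over a small set of customer feature vectors might look like, here using Python and scikit-learn:

# Minimal, illustrative k-means clustering sketch (assumes scikit-learn is installed).
# The customer feature vectors below are invented purely for demonstration.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [monthly spend, products held, logins per month]
features = np.array([
    [1200.0, 2, 14],
    [ 300.0, 1,  2],
    [1100.0, 3, 20],
    [ 250.0, 1,  1],
    [ 900.0, 2, 11],
])

# Ask for two clusters with no labels and no manual intervention (unsupervised).
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)
print(labels)                  # cluster assignment per customer, e.g. [0 1 0 1 0]
print(model.cluster_centers_)  # the 'average' customer in each cluster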

Supervised methods involve some level of hand-holding and manual training of algorithmic processes. Rather than asking an algorithm to automatically mine a dataset and identify its own patterns, supervised methods use examples that have been labelled by data analysts to infer a model or function. That model is then applied to a dataset it has not yet seen, to test how accurately it categorises different objects. Data analysts can then identify and correct errors and exceptions (where the algorithm has incorrectly classified a data item) and fine-tune the algorithm accordingly, thereby continuing to increase its accuracy.
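
To illustrate this train / apply-to-unseen-data / fine-tune loop (the example is not from this paper and the labelled sentences are invented), the sketch below trains a tiny naive Bayes classifier on hand-labelled feedback and then applies it to a post it has not seen:

# Illustrative supervised classification: train on hand-labelled examples,
# then apply the model to text it has not yet seen. Training data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "great service and very helpful staff",
    "love the new mobile app",
    "terrible support, still waiting for a refund",
    "worst branch experience ever",
]
train_labels = ["positive", "positive", "negative", "negative"]

vectoriser = CountVectorizer()
X_train = vectoriser.fit_transform(train_texts)    # bag-of-words features

model = MultinomialNB().fit(X_train, train_labels)

unseen = ["the staff were helpful and the app is great"]
print(model.predict(vectoriser.transform(unseen)))  # expected: ['positive']

Misclassified items surfaced by this step are exactly the errors and exceptions that analysts would feed back into the next training round.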

These techniques are most pertinent when there are complex patterns residing in a set of data that automated methods may not uncover or understand. For example, natural language processing will typically have to be supervised. That is because unsupervised algorithms would struggle to understand the complexity of grammatical constructs and sentiment expressed in natural language. That complexity is even more difficult to decipher when grammatical rules are not followed.

For example, content scraped from social media channels, such as Facebook and Twitter, can be laden with colloquialisms, slang and mis-spellings—all of which contribute to making the content much more difficult to analyse.

    Parts of Speech Tagging: Perhaps the most sophisticated means of analysing natural language is to tag and identify the context of each word. However, tagging words in this manner is not simple, given that a single word can have multiple meanings. For example, the word ‘well’ could be a noun to describe a hole from which water is gathered; it could be a verb as in ‘you could see tears well in her eyes’; it could be an adverb as in ‘I performed well in my test’; it could be an interjection ‘well, the thing is…’; or it could be an adjective ‘I’m well, how are you?’. Algorithms that perform parts of speech tagging will therefore need to be supervised closely with errors in automated processes flagged to help improve levels of accuracy.
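
The reference stacks later in this paper use Java toolkits such as Apache OpenNLP and Stanford NLP for this step. Purely as an illustrative stand-in (NLTK is not part of the paper's stack, and the exact tags returned depend on the tagger model), the sketch below tags three of the different uses of 'well' discussed above:

# Illustrative part-of-speech tagging using NLTK as a stand-in for the Java
# toolkits (OpenNLP, Stanford NLP) referenced later in this paper.
import nltk

# One-off model downloads; resource names can vary between NLTK versions.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentences = [
    "The village well ran dry.",        # noun
    "I performed well in my test.",     # adverb
    "Well, the thing is complicated.",  # interjection
]

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    # Each token is paired with a Penn Treebank tag, e.g. ('well', 'NN') or ('well', 'RB').
    print(nltk.pos_tag(tokens))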

Measuring Similarity

One way to analyse large sets of data is to spot similarities between objects and cluster similar objects together. A couple of approaches exist to help. Euclidean distance is well suited to identifying similarities by using a statistical approach to compare a range of features that each object possesses. For example, if you were comparing the physical attributes of people, you might use measurements for chest, waist, hips, arm and leg length as features. Starting with a large sample of people’s actual measurements, you could then start to cluster features to derive a model for different sizes of clothing that provides a ‘best fit’ for the different categories (small, medium, large, extra large etc.).

Equally, techniques such as the Jaccard coefficient can help determine similarities by measuring the shared characteristics of two different objects as a fraction of the total characteristics that those two objects possess. A Jaccard coefficient of one would therefore represent two identical objects. This technique can be particularly useful for processing natural language content—such as Facebook posts, online chat logs or transcribed call centre conversations—as each object can be turned into a ‘bag of words’ and compared with other such objects to see how many words they have in common.

However, before this kind of analysis can be carried out, pre-processing is required to determine which sets of words will be compared. For example, let’s say we were using the Jaccard coefficient to analyse two sentences: a) your service can not be tolerated; and b) your service can not be bettered. Without any pre-processing, the Jaccard coefficient would see these sentences as highly similar, with a score of 0.714 (5/7) calculated by taking the number of words that appear in both sentences (five), divided by the total number of distinct words across both sentences (seven). Yet we can see that the two phrases share very little in common other than their sentence structure and are actually diametrically opposed in their sentiment.
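
A short illustrative sketch (not from this paper) of both similarity measures described above: Euclidean distance over invented body measurements, and the Jaccard coefficient reproducing the 0.714 score for the two example sentences.

# Illustrative similarity measures: Euclidean distance and the Jaccard coefficient.
from math import dist  # Euclidean distance, available in Python 3.8+

# Invented measurements: [chest, waist, hips, arm, leg] in cm.
person_a = [96, 81, 99, 62, 81]
person_b = [98, 83, 100, 63, 82]
print(round(dist(person_a, person_b), 2))   # small distance -> similar clothing size

def jaccard(a: set, b: set) -> float:
    """Shared items as a fraction of all distinct items across both sets."""
    return len(a & b) / len(a | b)

bag_a = set("your service can not be tolerated".split())
bag_b = set("your service can not be bettered".split())
print(round(jaccard(bag_a, bag_b), 3))      # 0.714, despite opposite sentiment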

    Pre-processing

    Pre-processing is an essential stage of analysing large unstructured datasets. Content like natural language will need a significant amount of pre-processing before statistical algorithms can be applied to derive further insights.

There are many techniques that can be used, including the following (a short illustrative sketch follows the list):

    Stop word removal: This technique helps simplify natural language for the purpose of analysis. In the previous example comparing the sentences ‘your service can not be tolerated’ with ‘your service can not be bettered’, the removal of words like ‘your’ and ‘be’ help place more emphasis on the difference in meaning, rather than the similarity of sentence structure.

    Stemming: This technique removes conjugations and endings that add little to the meaning of a word – helping to identify similarities between words that largely carry the same meaning. For example, the words ‘argue, argues, arguing, argued, argument and argumentative’ may each be seen as unique using a Jaccard coefficient model. But by stemming those words to ‘argu’ an algorithm could identify them as similar in that they all carry a slightly negative sentiment.

    Domain specific replacement: This technique helps to identify certain strings of characters as belonging to a certain domain. For example, by replacing ‘162.32.54.16’ and ‘71.198.32.114’ with ‘IP address’, an algorithm can analyse the two as being equivalent objects even though they do not share similar characters.

Bag of words: Turning any body of natural language text into a ‘bag of words’ makes the data intrinsically easier to process in an algorithmic fashion, helping to identify not only the different words used but also repeat occurrences of the same word.

N-grams: Treating text simply as a ‘bag of words’ does not capture the relational context of those words. An n-grams approach breaks bags of words into associated pairs, triplets and so on. For example, by using bi-grams to classify the phrases in the previous example, you would identify that the key difference between them lies in the bi-grams ‘not bettered’ and ‘not tolerated’.
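
To make these techniques concrete, the following self-contained sketch (not taken from this paper) applies stop word removal, a deliberately crude suffix-stripping stemmer, a domain-specific replacement for IP addresses, and bi-gram extraction to the example sentences; a real pipeline would use the NLP toolkits listed in the use case table below.

# Illustrative pre-processing sketch: stop word removal, crude suffix-stripping
# 'stemming', domain-specific replacement and bi-grams. The word lists are invented.
import re

STOP_WORDS = {"your", "be", "can", "the", "a", "an", "is"}

def preprocess(text: str) -> list[str]:
    # Domain-specific replacement: collapse IP addresses into a single token.
    text = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "IP_ADDRESS", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Very crude stemming: strip a few common English suffixes.
    return [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]

def bigrams(tokens: list[str]) -> list[tuple[str, str]]:
    return list(zip(tokens, tokens[1:]))

a = preprocess("your service can not be tolerated")
b = preprocess("your service can not be bettered")
print(a, b)        # ['service', 'not', 'tolerat'] ['service', 'not', 'better']
print(bigrams(a))  # [('service', 'not'), ('not', 'tolerat')]
print(preprocess("failed login from 162.32.54.16"))  # ['fail', 'login', 'from', 'IP_ADDRESS']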

Entropy

Entropy is a measure of how well a particular data set can be split into groups or clusters. When you are looking to mine a large amount of data that shares similar characteristics, it is more difficult for an algorithm to accurately categorise and classify the differences between data items. For example, say you were looking to mine information about retail banking customers to create a model that would predict which of those customers may default on their loans. The entropy of the target measure—‘default’ or ‘not default’—is likely to be very low because the vast majority of customers fall into the ‘not default’ category. As a result, your predictive model is likely to be skewed – tending to assign new customers into the ‘not default’ category and not being successful at predicting the few that will default. Analysing data with higher entropy values can therefore help to create models that are more accurate in their predictions.

Information Gain

Information gain is a technique that can be used to increase the entropy of a particular data set by splitting it according to different characteristics. Using the previous example, although the entropy of the entire universe of your customer base may be low (i.e. only 1% of all customers default, with 99% in the ‘not default’ category), there may be ways to filter the data to increase its entropy. For example, by only looking at customers with low credit ratings and high debt-to-salary ratios, you may end up with a more balanced sample set (30% ‘default’ versus 70% ‘not default’). By focusing on this subset of your data universe, you can create a model that more accurately classifies customers that are likely to default.
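
As a worked illustration (using the 1%/99% and 30%/70% splits quoted above; the code itself is not from this paper), the sketch below computes the Shannon entropy of the full customer base and of the filtered subset, showing how much more balanced the filtered sample is:

# Worked entropy example using the default / not-default proportions quoted above.
from math import log2

def entropy(proportions):
    """Shannon entropy in bits for class proportions that sum to 1."""
    return -sum(p * log2(p) for p in proportions if p > 0)

whole_book = [0.01, 0.99]   # 1% default vs 99% not default
filtered   = [0.30, 0.70]   # low credit rating, high debt-to-salary subset

print(round(entropy(whole_book), 3))   # ~0.081 bits -- highly skewed, low entropy
print(round(entropy(filtered), 3))     # ~0.881 bits -- far more balanced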


    Example Use Case

The use cases for applying big data techniques within the financial services industry are broad. For the purposes of this paper, we have decided to focus on the retail market, specifically the potential to improve sales, marketing and customer retention by complementing existing sources of client data with content gleaned from social media and other sources of natural language content.

When looking at retail sales, marketing and customer retention efforts, there are many potential data sources that can be analysed, both internal and external. They include traditional metrics that financial institutions are accustomed to collecting and storing—such as earnings, expenditures, debt levels, credit ratings and net worth—all of which can be used to build profiles of different customer groups and determine whether those groups exhibit different characteristics (such as loyalty, probability of default etc.).

Similarly, account and individual transaction data can be mined to detect changing circumstances and potentially even life events, such as getting married, starting a family, being promoted or entering retirement. The same data can then be modelled to validate which of those events trigger an increased likelihood of purchasing specific financial products (a personal loan to fund your wedding, a car loan to buy a bigger car to accommodate your new baby etc.) to help target marketing efforts.

Traditional sources of financial data can then be supplemented with external data gleaned from social media channels to help enrich and validate customer profiles. For example, data obtained via social media could help to corroborate whether a purchase made from a store selling baby products means your customer is expecting a baby, or simply buying a present for a friend.

Equally, big data techniques can also be applied to processing a growing raft of internally collected data. For example, call centre conversations could be transcribed and online channel interactions recorded to provide richer insights into how to improve customer service and reduce churn.

Use Case: Retail Marketing, Sales and Customer Retention

Potential Data Sources: Facebook pages and posts; Twitter feeds; transcribed call logs; online help logs; CRM data; transaction logs; Internet of Things

Tools:

Large Dataset Analysis Tools
• Hadoop (Amazon EMR): Distributed processing framework for large data sets
• Apache Hive: Provides a SQL-like interface for producing sequences of MapReduce jobs
• Apache Pig: High-level language for expressing data analysis and producing sequences of MapReduce jobs
• Apache Mahout: Scalable machine learning libraries; integrates with Hadoop

Natural Language Processing (NLP) Tools
• Apache Lucene: Natural language search engine – also provides key NLP functions such as stemming and stop word removal
• Apache OpenNLP: Machine learning based toolkit for natural language processing – tokenisation, sentence segmentation, parts of speech tagging
• Stanford NLP: NLP tools, including Chinese language processing

Visualisation Tools
• Carrot2 Workbench: Open source unsupervised clustering tool for smaller data sets
• Gephi, NodeXL, yEd, Graphviz, TikZ: Various tools for creating (relationship) graphs
• CartoDB: Geospatial visualisation tool, including time series visualisation

Goals: New insights into customer base; targeted marketing campaigns to improve cross- and up-sell; enhanced / personalised customer service; improved customer retention rates


    Reference Stacks

Big data technology solutions tend to share similar capabilities. Typically they need the ability to capture data from a variety of sources; transform, cleanse and process the data on the fly; store the results; and enable further analysis and visualisation in a more structured way. However, reference architectures for specific big data use cases will differ depending on the type of data being analysed. This section outlines a couple of example reference technology stacks that can be used to carry out big data analytics. Both of the solutions described consist entirely of open source components and run on commodity cloud hardware, and are therefore extremely cost effective with no up-front license fees or infrastructure costs.


    Analysis of Streaming Data (Twitter, Online media aggregators etc.)

    The first of our reference stacks can be used to capture and analyse large volumes of streaming data. For sources such as Twitter, the first step would be to write an interface to Twitter’s REST API to capture the required data - provided via JSON - and store it in a real-time platform for pre-processing. The same would apply for aggregated social media content from providers like Moreover Technologies, which provides its data via an XML feed.

Irrespective of the source or format, the streaming data typically needs to be cached and pre-processed. In this case, we use Amazon Kinesis to store and transform the real-time data, helping to simplify further analysis by adding structure to the content, such as fields for ‘time’, ‘GPS coordinates’, ‘tweet contents’ etc. We then use Amazon’s Elastic MapReduce (EMR) and S3 storage services to store the data in a Hadoop cluster. Once stored, the data can be queried using engines such as Apache Hive or Spark for basic analysis.
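
A minimal sketch of what the ingestion step might look like, assuming AWS credentials are already configured and a Kinesis stream exists; the stream name and record fields below are invented for illustration and are not Twitter's actual schema.

# Illustrative producer: push one structured tweet record into a Kinesis stream.
# Assumes boto3 is installed and a stream named 'tweets' already exists.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {
    "time": "2014-10-15T09:30:00Z",
    "gps": [40.7128, -74.0060],
    "tweet": "Worried about ebola screening at the airport",
}

kinesis.put_record(
    StreamName="tweets",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["time"],
)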

However, for more complex sentiment-based analytics, some natural language processing is needed. Apache Mahout, a machine learning library, allows us to build a model that classifies words carrying positive and negative sentiment, or to search for specific terms. The output of the model is then visualised and interpreted using a variety of tools, including spreadsheets, the map visualisation software CartoDB and the graph visualisation tool Gephi.
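
The actual stack trains this classifier with Apache Mahout (a Java library). As a much-simplified, invented stand-in, the sketch below scores tweets against a hand-written word list simply to show the shape of the output that feeds the CartoDB and Gephi visualisations:

# Greatly simplified sentiment scoring stand-in for the Mahout classification model.
# The word lists are invented; a real model would be trained on labelled examples.
POSITIVE = {"relieved", "recovered", "hope", "safe"}
NEGATIVE = {"worried", "outbreak", "fear", "quarantine"}

def score(tweet_text: str) -> int:
    words = set(tweet_text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

tweets = [
    "Worried about the ebola outbreak spreading",
    "Relieved that the last patient has recovered",
]
for t in tweets:
    s = score(t)
    label = "positive" if s > 0 else "negative" if s < 0 else "neutral"
    print(label, "|", t)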

Figure 1: Tweets mentioning the word ‘ebola’ during the height of the outbreak in October 2014. The time series geospatial visualisation uses CartoDB. (The accompanying diagram shows the streaming stack: Kinesis, EMR, S3 and the classification model.)


Analysis of Natural Language Data Logs (Facebook Page Data etc.)

    For analysis of natural language data logs that do not need to be processed in real-time, the technology stack may look a little different.

For the purpose of this example, we chose to look at a retail firm based in China, which operates in the Leisure and Entertainment industry and has an active Facebook page. The messages posted on the page were all in Chinese and demonstrated high entropy values, as there was an even mix of positive and negative feedback. Extracting the content was relatively simple and involved programmatically sourcing data from Facebook and uploading it to an AWS S3 file store. The velocity of updates did not necessitate real-time or streaming analysis techniques.

Once stored, the raw Facebook data could immediately be visualised using the Carrot2 Workbench. However, for more complex analysis, the data was loaded into a Hadoop cluster using Amazon EMR. For basic pre-processing of the posts, we used Stanford NLP to enable our classification model to recognise the Chinese-language text. We then applied a classification model that scored words based on their sentiment ratings and classified posts into one of four categories: positive, negative, neutral and noise.
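
As an illustrative sketch only (the actual pipeline used Stanford NLP's Chinese models and a trained classifier), the following shows the general shape of scoring posts into the four categories; the sentiment ratings and the 'noise' rule are invented, and the posts are assumed to have already been segmented into words upstream:

# Simplified stand-in for the four-way post classification step. The sentiment
# ratings and thresholds are invented; word segmentation happens upstream.
SENTIMENT = {"great": 2, "friendly": 1, "slow": -1, "rude": -2}

def classify(tokens: list[str]) -> str:
    scored = [SENTIMENT[t] for t in tokens if t in SENTIMENT]
    if not scored:
        return "noise" if len(tokens) < 3 else "neutral"
    total = sum(scored)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(classify(["the", "staff", "were", "friendly", "and", "great"]))  # positive
print(classify(["slow", "rude", "service"]))                           # negative
print(classify(["ok"]))                                                # noise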

The classification model was iterated to improve its accuracy until it was able to accurately categorise 80 percent of all posts. The results were then loaded into the Carrot2 Workbench for visualisation, making it possible to see at a glance the subjects of both negative and positive comments about the retail service provider.


Figure 2: Analysis of natural language content posted on the Facebook page of a Chinese retail services firm in the Leisure industry, with output from the classification model visualised via Carrot2.


    Conclusion

    Big data analysis requires unique skill sets. In order to build up those skills within an organisation, Citihub recommends the creation of a centre-of-excellence for data analysis. Such a team would help identify use cases and specify business goals for the technologies and techniques in question—whether that be improving sales, enhancing customer service or reducing churn. By focusing directly on tangible business goals, the centre-of-excellence will also help detect existing data quality issues and capability gaps that need to be filled to achieve desired goals.

    Further investment can then help fill those capability gaps. As firms enhance their data science skills—including the ability to capture, store and process high volumes of data; perform complex real-time and event-driven analysis of that data; process unstructured natural language data using machine learning techniques; and finally, correlate all new and existing sources of data to build richer and more insightful customer profiles—they will be better placed to achieve their business goals.

Most financial institutions are still at a relatively early stage on their data analysis maturity curve. However, that should not deter them. The reference stacks above demonstrate that complex data analysis does not require a large budget for licensing new technologies. In fact, one of the benefits of using big data techniques is that they can offer highly scalable architectures, made up of commodity hardware (easily spun up by cloud service providers) and open source software that is free to license.


[Figure: Data analysis maturity curve – investment in data scale; unstructured data sources; real-time & event-driven analysis; cross-source correlation; deliver early ROI.]


    About Citihub Consulting

    Citihub Consulting is a global, independent IT advisory firm with deep domain expertise across every layer of the technology stack – from business applications and data platforms down to core infrastructure. From IT strategy, architecture and solution development, through to cost optimisation, risk assessment and implementation – our trusted experts deliver the right results for your business.

For us, consultancy is personal. We have a relentless commitment to great execution, integrity and client success. We aim to redefine perceptions of our industry, and our commitment to delivering the right results for our clients has never changed, even as the business has grown consistently over the last decades.

    2014/15 clients include 7 of the top 10 investment banks and 2 of the top 5 hedge funds.

    For more information, please visit www.citihub.com

    Contact Us

EMEA: Richard Hamstead, [email protected]

1 Canada Square, London E14 5AB, +44 207 536 5801

AMERICAS: Keith Maitland, [email protected]

757 3rd Avenue, 20th Floor, New York, NY 10017, +1 212 878 8840

ASIA PACIFIC: Chris Allison, [email protected]

12th Floor, ICC, West Kowloon, Hong Kong, +852 8108 2777

Bellerivestrasse 201, CH-8034 Zurich, +41 44 386 6080

The Dineen Building, 140 Yonge Street, Suite 200, Toronto, Ontario M5C 1X6, +1 437 886 8390

137 Market Street, Level 5, Office 505, Singapore 048943, +65 3152 2777

    Author: Ian Tivey

Contributors: Chris Allison, Mark Wong, Paul Jones, Ilya Finkelshteyn

© 2015 Citihub Consulting. All rights reserved.