Getting started with and realising ROI on Text Analytics
Ryan Stuart Founder & CTO
Who am I?
• Software Engineer (previously?)
• Founder & CTO of Kapiche
• Have worked in the Text Analytics industry since 2008.
• Interests: Distributed Computing, Database Design, Machine Learning.
@rstuart85 / @Kapiche Official
rstuart85 / Kapiche
Who are you?
Raise your hand if you are a...
• Engineer / Developer / Technical;
• Data Scientist;
• Academic;
• Market Researcher;
• Statistician;
• Someone with “analyst” or “risk” in your job title; or
• Other.
Overview
• Overview of Text Analytics
  – What is it?
  – Different Types
• Who are Kapiche?
• Solving Business Problems with Text Analytics
  – Automation
  – Enterprise Search
  – Voice of the Customer (with demo)
  – Machine Learning
• Resources
Overview of Text Analytics
What is Text Analytics?
“…the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.”
• Consider a big customer survey with two questions:
  – How likely are you to recommend Microsoft to your family, friends or colleagues? (0-10)
  – Why did you give us that score?
• You get 10,000 responses to your survey. Now what?
• Maybe add more structure to the survey?
• Maybe send it offshore to be understood?
• Enter Text Analytics.
Text Analytics performs some form of dimensionality reduction, producing a lower-dimensional representation of the data that serves the task of analysis.
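To make the idea of dimensionality reduction concrete, here is a minimal sketch that boils a pile of free-text responses down to a handful of term counts. The stopword list, the sample responses and the function names are all made up for illustration; real tools use far richer reductions than counting terms.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "to", "and", "of", "it", "i", "but", "too"}

def tokenize(text):
    """Lowercase a response and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def top_terms(responses, k=3):
    """Reduce many free-text responses to the k terms that appear in the most responses."""
    doc_freq = Counter()
    for response in responses:
        # Count each term once per response, in a stable order.
        doc_freq.update(sorted(set(tokenize(response))))
    return doc_freq.most_common(k)

# Made-up survey answers standing in for the 10,000 real ones.
responses = [
    "The support was slow and the price is too high",
    "Great product but support was slow",
    "Price is high, support is slow",
    "I love the product",
]
summary = top_terms(responses, k=3)
print(summary)
```

Four sentences become three (term, count) pairs. That is the whole trick: fewer dimensions, easier analysis.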
Types of Text Analytics?
• Entity Extraction (NER):
  – Mark up text with entity tags: Person, Organisation, Time etc.
  – Used to improve processing/routing of text.
• Classification:
  – The process of labelling a piece of text with one of a fixed set of labels.
  – Sentiment Analysis and Categorisation are both examples of classification.
• Topic Modeling:
  – Identifying high-level constructs (topics or ideas) present in the text.
  – Some approaches treat topics as abstract constructs useful for specific tasks (e.g. “more like this” search). Others use them as a mechanism for understanding data.
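As a rough picture of what "marking up text with entity tags" means, the sketch below tags entities using a lookup list and one regular expression. The gazetteer, the time pattern and the function name are assumptions for illustration; production NER systems generally rely on trained statistical models rather than lookups.

```python
import re

# Hypothetical gazetteer: a real NER system would use a trained model instead.
GAZETTEER = {
    "Person": ["Ryan Stuart"],
    "Organisation": ["Kapiche", "Microsoft"],
}
# Toy pattern for clock times such as "3pm" or "10:30 am".
TIME_PATTERN = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)

def tag_entities(text):
    """Return (entity_text, tag) pairs found in text, in order of appearance."""
    found = []
    for tag, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), tag))
    for match in TIME_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "Time"))
    # Sort by position so the output follows the reading order of the text.
    return [(entity, tag) for _, entity, tag in sorted(found)]

entities = tag_entities("Ryan Stuart from Kapiche will call Microsoft at 3pm")
print(entities)
```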
Who are Kapiche?
What does Kapiche do?
• Take away all the marketing lingo and Kapiche does automatic Topic Modeling.
• Not the abstract variety. The understandable variety.
• The goal is to understand large amounts of data quickly.
• But what is a topic, and how are topics identified?
What is a Topic?
• Remember, most text analytics is just noise reduction.
• Kapiche uses a purely mathematical approach to determine which terms from a text corpus have high entropy.
• This is done by combining the influence of a term with its frequency.
• Once these nodes of information have been identified, we begin to build topics around them.
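The slides do not spell out how Kapiche measures entropy or influence, so the following is not Kapiche's algorithm. It is only a sketch of one textbook reading: the Shannon entropy of how a term's occurrences are spread across documents, reported next to the term's raw frequency. The sample documents and the function name are made up.

```python
import math
from collections import Counter

def term_entropy(term, documents):
    """Shannon entropy (in bits) of how a term's occurrences are spread over documents."""
    counts = [doc.count(term) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# Made-up, pre-tokenised documents.
documents = [
    ["snacks", "flight", "late"],
    ["flight", "staff", "friendly"],
    ["flight", "snacks", "snacks"],
    ["staff", "flight"],
]

# Raw corpus frequency of every term, shown alongside its entropy.
frequency = Counter(term for doc in documents for term in doc)
for term in sorted(frequency):
    print(term, frequency[term], round(term_entropy(term, documents), 3))
```

Under this reading, a term spread evenly over every document ("flight") scores highest, and a term confined to a single document ("late") scores zero. How such a score is then combined with frequency is left open here, because the slides do not say.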
Understand the Data using Topics
Understanding the Topic Model helps us understand the data.
Solving Business Problems with Text Analytics
Automation (prediction?)
• Text Analytics can help automate a range of business processes.
• NER and Classification can be used to:
  – Assign support tickets to the right person (routing)
  – Determine if an email is spam
  – Automatically tag new documents in a database
  – Detect fraud
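Ticket routing is a good example of classification at work. The sketch below is a tiny multinomial Naive Bayes classifier with add-one smoothing, written from scratch so that it has no dependencies. The class name, the queue labels and the training tickets are all hypothetical, and a real deployment would train on far more data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesRouter:
    """Tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of training tickets
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocabulary.update(words)

    def route(self, text):
        """Return the label with the highest log-probability for the text."""
        words = text.lower().split()
        total_tickets = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(self.label_counts[label] / total_tickets)
            label_total = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (label_total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training tickets and queue names.
router = NaiveBayesRouter()
router.train("cannot log in to my account password reset", "accounts")
router.train("forgot password and locked out of account", "accounts")
router.train("invoice shows the wrong amount charged", "billing")
router.train("refund for duplicate charge on invoice", "billing")

assigned = router.route("please reset my password")
print(assigned)
```

The same shape of model covers the spam and tagging bullets too: only the labels and the training data change.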
Enterprise Search
• Using a combination of Topic Modeling and Classification / NER, it’s possible to come up with a bunch of different approaches to search.
• NER can be used for “semantic search”.
• Abstract Topic Modeling (the type where the topics are abstract constructs) is great for More Like This.
• Concrete Topic Modeling is great for understanding the search results and finding what you are looking for (quick demo).
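A More Like This search ranks documents by how close they sit to a query document in some vector space. The sketch below uses raw word-count vectors and cosine similarity in place of topic vectors, purely to keep it self-contained; a topic model would swap in its own per-document vectors. The corpus and function names are made up.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(query_doc, corpus, k=1):
    """Return the k corpus documents most similar to query_doc."""
    query = vectorize(query_doc)
    scored = [(cosine_similarity(query, vectorize(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

# Made-up corpus of customer comments.
corpus = [
    "flight delayed by two hours",
    "friendly cabin crew and great snacks",
    "lost luggage after a delayed flight",
]
similar = more_like_this("my flight was delayed", corpus, k=2)
print(similar)
```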
Voice of the Customer
• Perhaps the most powerful tool in sales and marketing is knowing what your customers think about your brand / product / business.
• It has always been possible to just ask them of course, but what do you do with the responses? Read them all?
• Actually, that is the exact approach most companies take. They develop complicated coding frameworks and offshore it all.
• Obviously, that is a seriously flawed (human bias?) and expensive approach. So much so that surveys are tailored to be easier to extract knowledge from.
Sentiment Analysis for VotC
• Sentiment Analysis is usually how people get started. It has problems though.
“Gee, I really love the complimentary snacks on Virgin Airlines!”
• Sentiment analysis is traditionally just a classification problem using machine learning.
• It generally requires a new model for each data domain.
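To see why the snacks sentence is awkward, consider the simplest possible sentiment scorer: add up word polarities from a lexicon. The lexicon and function name below are hypothetical, and this is a stand-in for illustration rather than the machine-learned classifiers described above.

```python
import re

# Hypothetical, tiny sentiment lexicon for illustration only.
LEXICON = {"love": 1, "great": 1, "friendly": 1, "hate": -1, "terrible": -1, "late": -1}

def lexicon_sentiment(text):
    """Label text by summing the polarity of each known word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = lexicon_sentiment("Gee, I really love the complimentary snacks on Virgin Airlines!")
print(label)
```

The scorer sees "love" and answers "positive" whether or not the speaker was being sarcastic. Nothing at the word level separates the two readings, and a lexicon tuned for airline feedback would need rebuilding for another domain.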
Topic Modeling for VotC
• Companies like Kapiche (and Luminoso for example) are trying to make it easy to understand your customer.
• The approach is generally based around some degree of automated insight extraction.
• In the case of Kapiche, we are trying to reduce the noise to significantly decrease the time to understand customers.
• This technology doesn’t replace the analyst! It does reduce the amount of expertise needed, though.
Demo!
Future of VotC
• The current best practice for survey design, which is a bunch of structured multiple-choice questions, is flawed.
• It’s built around the idea that automating the extraction of insights from text is hard.
• These complex surveys also result in low engagement rates.
• Technology like this has the ability to change how we design customer surveys.
• I propose simple surveys with only 2 questions.
• Also consider how we are extracting value from social media, call centre data, etc.
Machine Learning
• In Machine Learning terms, another name for dimensionality reduction is feature extraction.
• Combining features extracted using Text Analytics techniques with structured data to build a classifier has lots and lots of uses.
  – News reports and stock price changes?
  – Book content and customer review scores?
  – Movie scripts and critic ratings?
• The traditional approach here has been Bag of Words.
• New methods like Word2Vec and GloVe are emerging that don’t discard the structure of the text.
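A short sketch of what Bag of Words keeps and what it throws away, plus how a text feature vector can sit next to a structured value in a single row for a classifier. The function names and the review score of 7.5 are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring the order in which they appear."""
    return Counter(text.lower().split())

def to_feature_row(text, vocabulary, structured):
    """Concatenate word counts (in vocabulary order) with structured values."""
    bag = bag_of_words(text)
    return [bag[term] for term in vocabulary] + list(structured)

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # the two sentences are indistinguishable once word order is gone

vocabulary = sorted(set(a) | set(b))
# 7.5 is a hypothetical structured value, e.g. a review score.
row = to_feature_row("dog bites man", vocabulary, structured=[7.5])
print(vocabulary, row)
```

Two sentences with opposite meanings collapse into the same vector, which is the structure that order-blind features discard.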
Resources
• Word2Vec - https://en.wikipedia.org/wiki/Word2vec
• GloVe - http://nlp.stanford.edu/projects/glove/
• Sentiment Analysis - https://blog.monkeylearn.com/sentiment-analysis-apis-benchmark/
• Kapiche for Research - https://research.kapiche.com
• Gensim - https://radimrehurek.com/gensim/index.html
• NLTK - http://www.nltk.org/