qubole
Slide deck from a hands-on workshop. It covers the following:
1. Learn what Sentiment Analysis is and how it can be used
2. Perform pre-processing and post-processing of textual data using Hive
3. Use the n-gram language model built into Hive to perform sentiment analysis
4. Learn how to use Hive extensibility to plug in other language models
Sentiment Analysis using Hive: Secrets From the Pros
We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!
Assess your level and learn new stuff
This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts).
News Cycle for “Mortgage” 2008-09
Mortgage: Crisis, Foreclosures, Fraud
[Chart: series Crisis, Foreclosure, and Fraud, each with a linear trend line; x-axis dates, y-axis 0 to 90]
Table: MemeTracker
• # of records: 90M / partition
• Partitions: Month
• Columns: URL, Timestamp, Array of Memes, Links
• 36GB of JSON Data
AGENDA
This Webinar provides tips on doing basic sentiment analysis on large data sets using Hive:
• Overview of Sentiment Analysis (SA)
• Hive UDFs useful for SA
• Demo, Guided Tutorial
• Developing advanced, custom SA Engines
Sentiment Analysis: Applications
Gather Customer Feedback
• Direct: call center logs, emails, chat logs
• Indirect: social media, forums, review websites
Sentiment Analysis
• Over time, geography
• By customer, market segments
Use for Decision Making
• Product / service decisions
• Customer support
• Marketing: messaging, offers
• Customer retention, upsell
Sentiment Analysis: How to operationalize a Sentiment Analysis App
1. Crawl, Scrape, API calls, collect
2. Create “Documents”
3. Pre-process Data
4. Apply Language Model, Extract Sentiment
5. Integrate with Marketing Automation, CRM, CCA, etc. (OLTP)
6. Improve Product, Better CS, Targeted Offers
Pre- and Post-Processing: Hive Built-In Functions
Goal | Input Data | Output Data | Use this Hive UDF
Tokenization | (“Hello There! How are you?”) | ((“Hello”, “There”), (“How”, “are”, “you”)) | sentences
Column (array) to rows | [1, 2, 3] | 1, 2, 3 (one value per row) | explode
Navigating documents, extracting fields | {"store": {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"[email protected]", "owner":"amy"} | {"weight":8,"type":"apple"} | get_json_object(src_json.json, '$.store.fruit[0]')
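To make the three rows of the table concrete, here is a minimal Python sketch of what each built-in does to its input. It is only an emulation for illustration, not Hive's implementation: the function bodies are simplified, `get_json_path` is a hypothetical helper that takes a list of keys instead of a `$.`-style path, and the sample record omits the email field.

```python
import json
import re


def sentences(text):
    """Rough stand-in for Hive's sentences(): split into sentences, then words."""
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"[A-Za-z0-9']+", p) for p in parts if p.strip()]


def explode(values):
    """Stand-in for Hive's explode(): yield one row per array element."""
    for value in values:
        yield value


def get_json_path(doc, keys):
    """Hypothetical helper: walk a JSON document along a list of keys/indexes."""
    node = json.loads(doc)
    for key in keys:
        node = node[key]
    return node


# Same shape as the JSON in the table (email field omitted).
record = (
    '{"store": {"fruit": [{"weight": 8, "type": "apple"}, '
    '{"weight": 9, "type": "pear"}], '
    '"bicycle": {"price": 19.95, "color": "red"}}, "owner": "amy"}'
)

tokens = sentences("Hello There! How are you?")
rows = list(explode([1, 2, 3]))
first_fruit = get_json_path(record, ["store", "fruit", 0])

print(tokens)
print(rows)
print(first_fruit)
```

Note that the fruit array sits under the "store" key, so the lookup has to pass through "store" before indexing into "fruit".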
N-Gram Language Models
Q: What is a language model?
A: A mathematical model that assigns a probability to a sequence of m words
Q: What is an “n-gram” model?
A: A probabilistic language model for predicting the next word in a sequence of words
Q: What is an n-gram?
A: A contiguous sequence of n items from a given sequence of text
E.g.: “Mary had a little lamb”
Bi-grams: “Mary had”, “had a”, “a little”, “little lamb”
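The definitions above can be sketched in a few lines of Python. The first function extracts n-grams from a word list; the second estimates the probability of a next word from bigram counts (a plain maximum-likelihood estimate with no smoothing, chosen here only for illustration). The second corpus sentence is made up for the example.

```python
from collections import Counter


def ngrams(words, n):
    """Return every contiguous run of n words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def bigram_probability(corpus, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in corpus:
        for first, second in ngrams(sentence, 2):
            pair_counts[(first, second)] += 1
            prev_counts[first] += 1
    if prev_counts[prev] == 0:
        return 0.0  # prev never seen as the first word of a bigram
    return pair_counts[(prev, nxt)] / prev_counts[prev]


words = "Mary had a little lamb".split()
bigrams = ngrams(words, 2)

# Second sentence is invented so that "a" has two possible continuations.
corpus = [words, "Mary had a small dog".split()]
p_little = bigram_probability(corpus, "a", "little")

print(bigrams)
print(p_little)
```

In this toy corpus "a" is followed once by "little" and once by "small", so the estimate for "little" after "a" is one half.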
N-Gram Language Model: Hive Built-In Functions
Goal | Input Data | Output Data | Use this Hive UDF
Find important topics using a stop word list, trending topics | Collection of sentences | k most frequently occurring n-grams | ngrams
Extract intelligence around certain keywords, pre-compute search look-aheads | Collection of sentences | k most frequently occurring n-grams around a “context” word, e.g. “Government shutdown” | context_ngrams
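As a rough illustration of what the two functions in the table compute, the Python sketch below counts the k most frequent n-grams, and the k most frequent words that fill the blanks around a context. It is an exact in-memory count written for clarity; Hive's own ngrams and context_ngrams are aggregate functions that may return estimates on large data. The function names and the three sample sentences are assumptions made for this example.

```python
from collections import Counter


def top_ngrams(sentence_list, n, k):
    """Return the k most frequent n-grams across all sentences."""
    counts = Counter()
    for words in sentence_list:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)


def top_context_ngrams(sentence_list, context, k):
    """Return the k most frequent fillings of the None slots in context.

    context is a list such as ["mortgage", None]: fixed words must match,
    None marks a position whose word should be collected.
    """
    n = len(context)
    counts = Counter()
    for words in sentence_list:
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if all(c is None or c == w for c, w in zip(context, window)):
                filling = tuple(w for c, w in zip(context, window) if c is None)
                counts[filling] += 1
    return counts.most_common(k)


# Invented sample sentences, already lower-cased and tokenized.
docs = [
    "the mortgage crisis deepens".split(),
    "mortgage crisis hits banks".split(),
    "mortgage fraud case opens".split(),
]

top_bigram = top_ngrams(docs, 2, 1)
after_mortgage = top_context_ngrams(docs, ["mortgage", None], 2)

print(top_bigram)
print(after_mortgage)
```

In Hive the equivalent calls would take the output of sentences() as their first argument, for example context_ngrams(sentences(lower(text)), array("mortgage", null), 10), where the column name text is hypothetical.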
Dataset used: Meme Tracker
How MemeTracker.org creates the dataset
• 90 Million sources
• 900K news stories / day
• Track 17M memes
Table: MemeTracker
• # of records: 90M / partition
• Partitions: Month
• Columns: URL, Timestamp, Array of Memes, Links
• 6GB of Data / month
Crawl, Scrape
Create Documents
Extract “Memes”
Analyze Sentiment on “Mortgage”
By Tracking How Memes spread, using Hive
What is a Meme? “Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc
Prepare Data
Apply language model, Extract sentiment
Demo
Hive’s Extensibility Framework
• There are many UDFs built into Hive
• For more advanced users, Hive allows many ways to extend the language
– SerDes
– UDFs, UDAFs, and UDTFs
– Hive Streaming
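Of the extension points listed above, Hive Streaming is the easiest to sketch: Hive pipes rows to an external script as tab-separated lines on standard input and reads tab-separated lines back. The Python below is a minimal sketch of such a script, assuming each input row carries a URL and a text field. The lexicon, its weights, and the column layout are all hypothetical, and summing word scores is a deliberately simple stand-in for a real language model.

```python
# Hypothetical toy lexicon: negative words score -1, positive words +1.
LEXICON = {"crisis": -1, "fraud": -1, "foreclosure": -1, "recovery": 1, "relief": 1}


def score(text):
    """Sum the lexicon scores of the words in text (unknown words count 0)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def transform(lines):
    """Turn 'url<TAB>text' input lines into 'url<TAB>score' output lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows instead of failing the whole job
        url, text = fields[0], fields[1]
        yield f"{url}\t{score(text)}"


if __name__ == "__main__":
    # A sample is used so the sketch runs on its own. In a real streaming
    # script, import sys and pass sys.stdin to transform() instead.
    sample = ["u1\tmortgage crisis and fraud\n", "u2\tmortgage relief\n"]
    for out in transform(sample):
        print(out)
```

With the script saved under a hypothetical name such as score.py and registered with ADD FILE, a query of the form SELECT TRANSFORM (url, meme) USING 'python score.py' AS (url, score) FROM a MemeTracker-style table would run it over every row.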
How to access this Tutorial
• Create a free Qubole Account (www.qubole.com)
• Log in, click on “Analyze”, and look for the “Tutorials” tab at the top of the page
Summary
• Pre and post processing
– Use Hive
• Language Models
– Use pre-existing language models codified as Hive UDFs such as ngrams and context_ngrams
– UDFs: Build your own language model in Java using the Hive UDF framework
– Hive Streaming: Plug in your existing language models or 3rd party libraries
• Visualization
– Use a spreadsheet / BI reporting tool
THANK YOU
What’s Included?
Managed Cluster | Built-In Connectors | Friendly User-Interface | Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud
• Auto-Scaling Cluster. Full Life-cycle Management
• +12 Connectors to Applications and Data Sources
• 14-Day Free Trial (free account available)
• 24/7 Customer Support
www.qubole.com/try