How to make lean Big Data with six tools from Google
Nikolay Novozhilov ([email protected])
June 2014
One slide about Bubbly
Overview: Leading mobile social media & messaging service across Asia
Users: 40M+ users just over three years since launch of Bubbly
Team: Silicon Valley veterans with top engineers from around the world
Investors: Sequoia Capital, SingTel, JAFCO & Comcast
Offices: Singapore (HQ) + Mumbai, Manila, Jakarta, Tokyo, Hanoi & Bangkok
Board / Advisors:
• Tony Bates, CEO Skype / CSO Microsoft
• Jeff Karras, MD SingTel Innov8
• Dave Williams, former CTO O2, AT&T, and Telefonica
• Jimmy Iovine, Chairman, Interscope Records (Judge on American Idol)
• Gaurav Garg, Sequoia Capital US
• Nikki Han, President, SM Entertainment (Korea)
• Mohit Bhatnagar, Sequoia India
What do we want from Data Analytics?
Build a dashboard with key metrics
Dive deep into user behavior and A/B testing
Monitor availability and performance
Produce reports for external users
Etc…
Everybody needs the same things
What did we do?
We tried many things to satisfy our needs
and found the solution that is optimal for us:
Fast to build and cheap
Flexible and with a lot of functionality
Able to deal with Big Data – we log 60 million events a day
In this presentation we show how it’s done
Why we didn’t use Mixpanel
Not enough configurability
Once you really care about your data, standard charts are not enough!
Mixpanel export APIs don’t solve all issues
What about extra features beyond data mining?
Use results inside your product
Send monitoring alerts the way you want
Give limited access to 3rd parties
It costs a lot! People often send only a sample of their data to Mixpanel.
But what if you need the full data dumped in one place?
There are tons of other cloud solutions that might do some of these tricks, but I don’t trust “small projects”
Why we didn’t use Hadoop
It is too complicated
Hadoop needs server infrastructure
Even with a hosted Hadoop solution there is a lot to set up
Batch processing – Hadoop does not respond to your queries interactively. It kills you when you do:
Ad hoc and trial-and-error data analysis
Mistakes in scripts
…I mean – you do it every day!
Hadoop doesn’t give you visualization, monitoring, etc… You still have to build them.
Why we didn’t use MySQL
We have too much data for MySQL
Still need to host it, build all functionality, etc…
Already enough reasons!
What did we do instead?
Google BigQuery: Store all possible events from users
Google Spreadsheets: Query and transform data
Google Charts: Interactive visualization
Google Drive / Google Sites: Host the Dashboard
Google Analytics: Look after Dashboard users
Why BigQuery?
Solution hosted by Google – ready to use today!
Much cheaper than hosting your own applications on AWS.
Established API – easy to add logging to your code.
Web UI for queries
Our trick to make it “schemaless”:
For every upload, check the current schema in BigQuery
Compare it with the schema of the current upload
If the upload has extra fields, add these fields using the BigQuery API
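The comparison step above can be sketched in a few lines of Python. This is only an illustration, not our production code: the schema is assumed to be a list of `{"name", "type"}` entries, `infer_type` and `missing_fields` are hypothetical helper names, and the actual call that sends the merged schema to BigQuery is left out.

```python
def infer_type(value):
    """Map a Python value to a BigQuery column type name (simplified)."""
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def missing_fields(current_schema, rows):
    """Return schema entries for fields that appear in rows but not in the table."""
    known = {field["name"] for field in current_schema}
    extra = {}
    for row in rows:
        for name, value in row.items():
            # Skip nulls: they carry no type information.
            if name not in known and name not in extra and value is not None:
                extra[name] = {"name": name, "type": infer_type(value), "mode": "NULLABLE"}
    return list(extra.values())


# Hypothetical table schema and upload batch, for illustration only.
current = [{"name": "user_id", "type": "INTEGER"}, {"name": "event", "type": "STRING"}]
upload = [
    {"user_id": 1, "event": "login"},
    {"user_id": 2, "event": "share", "channel": "sms", "duration": 3.5},
]
new_fields = missing_fields(current, upload)
# The merged schema is what would be sent to BigQuery before loading the batch.
merged_schema = current + new_fields
```

New fields are added as NULLABLE so that older rows, which do not have them, stay valid.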
Why Google Spreadsheets?
Nothing is better for analytics than spreadsheets!!!
But why not MS Excel? Several reasons:
Easy to query data from BigQuery (Tutorial from Google)
Cloud hosted solution with cron-like scheduler for scripts
Cross platform solution (Excel VBA scripts fail on Mac)
Security – you can give read-only rights to some users
Already has email functionality for alerts and much more…
How to use Google Spreadsheets?
Example - link!
The goal was to make it usable for SQL-only people (no coding)
How it works
Our Google Apps Script is triggered periodically
It scans all sheets for the value “SQL” in cell A1
If it finds “SQL”, cell A2 contains the SQL query, which is pushed to BigQuery
Results are populated below the query on the same sheet
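The scanning logic can be sketched as follows. The real script runs as Google Apps Script inside the spreadsheet; this Python version only illustrates the control flow. Here a workbook is assumed to be a dict of sheet name to rows of cells, and `run_query` is a hypothetical stand-in for the BigQuery call.

```python
def refresh_sheets(sheets, run_query):
    """Re-run the query of every sheet marked with "SQL" in A1.

    sheets    -- dict: sheet name -> list of rows (each row a list of cells)
    run_query -- callable taking a SQL string, returning (header, rows)
    Returns the names of the sheets that were refreshed.
    """
    refreshed = []
    for name, cells in list(sheets.items()):
        # A1 is cells[0][0]; sheets without the marker are left untouched.
        if not cells or not cells[0] or cells[0][0] != "SQL":
            continue
        # A2 is cells[1][0] and holds the query text.
        if len(cells) < 2 or not cells[1] or not cells[1][0]:
            continue
        header, rows = run_query(cells[1][0])
        # Keep the marker and the query, replace everything below with results.
        sheets[name] = [cells[0], cells[1], header] + rows
        refreshed.append(name)
    return refreshed


def fake_run_query(sql):
    """Stand-in for BigQuery so the sketch runs offline (made-up result)."""
    return ["day", "events"], [["2014-06-01", 123]]


workbook = {
    "Daily": [["SQL"], ["SELECT day, COUNT(*) AS events FROM events GROUP BY day"], ["stale"]],
    "Notes": [["Just some text"]],
}
refreshed = refresh_sheets(workbook, fake_run_query)
```

With this layout an analyst only edits cell A2; the trigger does the rest.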
Why Google Charts?
Big visualization library, free, done by Google
Integrated with Google Spreadsheets (Google Tutorial)
Interactive controls – business people can explore data too!
Example - link
Why Google Sites / Google Drive?
Easy to manage access to data for all users (including 3rd parties)
Dropbox gives you only “full-access”
Google Drive has many roles: “owner”, “can edit”, “read only”
After using BigQuery, Spreadsheets and Charts from Google – why not everything else?
Google Drive – hosts HTML files with Charts. It has a good desktop client, so it is easy to manage charts
Google Sites has WYSIWYG site builder
Why Google Analytics?
The Dashboard is a product in itself. In our case it has about 30 users.
You need data from users to improve your product
You need an analytics tool for it!
I use Google Analytics to watch how users visit my Dashboard on Google Sites
… and punish the ones who are not using it ;)
What about costs?
In the whole solution only BigQuery costs money!
We never paid more than $200 per month
Real costs come from the time and effort to develop and support. Our solution is smart but lean:
The whole project was done by one analyst/developer
1 month from idea to first live version
Best practice to optimize costs of BigQuery
BigQuery performs full-table scans
In most queries you care only about recent events
If you store all data in one table, over time you scan a lot of data for nothing, resulting in:
Higher costs
Slower queries
We rotate event tables monthly, creating tables inside one dataset (like events_2014Jan, events_2014Feb, …)
Google Apps Script is ideal for monthly rotation
For queries that require historical data we use meta-SQL that is parsed by the Google Spreadsheets script:
• “FROMDATASET dataset” – query all tables in dataset
• “FROMLAST table” – query “table” and “table_2014Jan” (table from last month)
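The two meta-SQL keywords can be expanded with simple text substitution. The sketch below is an illustration only, not the script from the example site. It assumes the BigQuery SQL dialect in which tables are written as `[dataset.table]` and a comma-separated list in FROM acts as a union. The function names and the `tables_by_dataset` lookup (normally filled from a table-listing call) are hypothetical.

```python
import datetime
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def last_month_suffix(today):
    """Suffix of last month's rotated table, e.g. "2014May" for a date in June 2014."""
    last_day_of_prev = today.replace(day=1) - datetime.timedelta(days=1)
    return "%d%s" % (last_day_of_prev.year, MONTHS[last_day_of_prev.month - 1])


def expand_meta_sql(sql, tables_by_dataset, today):
    """Rewrite FROMDATASET / FROMLAST into plain FROM clauses."""

    def from_dataset(match):
        dataset = match.group(1)
        tables = tables_by_dataset[dataset]
        return "FROM " + ", ".join("[%s.%s]" % (dataset, t) for t in tables)

    def from_last(match):
        table = match.group(1)
        return "FROM [%s], [%s_%s]" % (table, table, last_month_suffix(today))

    sql = re.sub(r"\bFROMDATASET\s+(\w+)", from_dataset, sql)
    sql = re.sub(r"\bFROMLAST\s+([\w.]+)", from_last, sql)
    return sql


today = datetime.date(2014, 6, 15)
recent = expand_meta_sql("SELECT COUNT(*) FROMLAST logs.events", {}, today)
# recent == "SELECT COUNT(*) FROM [logs.events], [logs.events_2014May]"
history = expand_meta_sql(
    "SELECT COUNT(*) FROMDATASET logs",
    {"logs": ["events_2014Jan", "events_2014Feb"]},
    today,
)
# history == "SELECT COUNT(*) FROM [logs.events_2014Jan], [logs.events_2014Feb]"
```

Month names come from a fixed list rather than the system locale, so the suffix always matches the table names created by the rotation.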
Example dashboard
Check out this page for an example dashboard with all the working source code:
https://sites.google.com/site/leanbigdatawith6tools/