API analytics with BigQuery, by Javier Ramirez from teowaki


DESCRIPTION

At https://teowaki.com we have a system for API usage analytics, with Redis as a fast intermediate store and Google BigQuery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in just a few seconds, and we can find usage patterns that wouldn’t be obvious otherwise. In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.


javier ramirez @supercoco9

API Analytics with Redis, BigQuery, and Apps Script

a two-people start-up

a different league...

...or maybe not

moral of the story

you can do big, if you know how

Set a distance.

Set an expiration time.

Bye bye noise.


REST API (Ruby on Rails) +

Web on top (AngularJS)


data that’s an order of magnitude greater than data you’re accustomed to


Doug Laney, VP Research, Business Analytics and Performance Management at Gartner

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Ed Dumbill, program chair for the O’Reilly Strata Conference


big data is doing a full scan of 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds


Javier Ramirez, impressionable teowaki founder

1. non-intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time


Twitter, Stack Overflow, Pinterest, Booking.com, World of Warcraft, YouPorn, HipChat, Snapchat


ntopng, LogStash


Non-intrusive metrics

Capture data really fast.

Then process the data in the background.


Gzip to AWS S3/Glacier

or Google Cloud Storage


tools we considered:

Hadoop
Cassandra
Hadoop + Voldemort + Kafka
HBase
…
Amazon Redshift

but...

hard to set up and monitor

not interactive enough

expensive cluster


Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery


Based on “Dremel”

Specifically designed for interactive queries over petabytes of real-time data


loading data

You just send the data in text (or JSON) format


SQL


select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

specific extensions for analytics


within, flatten, nest

stddev

top, first, last, nth

variance

var_pop, var_samp

covar_pop, covar_samp

quantiles

correlations

Things you always wanted to try but were too scared to
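
For example, two quick sketches of these extensions against our stats table (sketches only; the latency column is an assumption made for illustration):

select top(uri, 10), count(*) from stats;

select quantiles(latency, 5) from stats;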


select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387 rows. Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

columnar storage


highly distributed execution using a tree


[web console screenshot]


country-segmented traffic
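
Something along these lines would produce that report (a sketch; the country column is an assumption about our stats schema):

select country, count(*) requests
from stats
group by country
order by requests desc;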


window functions
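
For instance (a sketch; the user and created_at fields on stats are assumptions), ranking each user's requests by recency with a window function:

select user, uri,
  row_number() over (partition by user order by created_at desc) recency
from stats;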


our most active user

new users per month
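
Hedged sketches of the two reports above (the user and created_at columns are assumptions about our schema):

our most active user:

select user, count(*) requests
from stats
group by user
order by requests desc
limit 1;

new users per month:

select strftime_utc_usec(parse_utc_usec(created_at), '%Y-%m') signup_month, count(*) new_users
from users
group by signup_month
order by signup_month;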


10 requests we should be caching
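
Roughly something like this (a sketch; it assumes cache-worthy traffic is repeated GET requests):

select uri, count(*) hits
from stats
where method = 'GET'
group by uri
order by hits desc
limit 10;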


5 most created resources

select uri, count(*) total from stats where method = 'POST' group by uri;


...but

/users/javier/shouts
/users/rgo/shouts
/teams/javier-community/links
/teams/nosqlmatters-cgn/links
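
One hedged workaround is to collapse each URI down to its resource type with REGEXP_EXTRACT before grouping (a sketch, not necessarily the exact query from the talk):

select regexp_extract(uri, '/([^/]+)$') resource, count(*) total
from stats
where method = 'POST'
group by resource
order by total desc
limit 5;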


5 most created resources

SELECT repository_name, repository_language, repository_description,
       COUNT(repository_name) AS cnt, repository_url
FROM github.timeline
WHERE type = "WatchEvent"
  AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
  AND repository_url IN (
    SELECT repository_url
    FROM github.timeline
    WHERE type = "CreateEvent"
      AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
      AND repository_fork = "false"
      AND payload_ref_type = "repository"
    GROUP BY repository_url
  )
GROUP BY repository_name, repository_language, repository_description, repository_url
HAVING cnt >= 5
ORDER BY cnt DESC
LIMIT 25

NO

Automation with Apps Script

Read from BigQuery

Create a spreadsheet on Drive

E-mail it every day as a PDF


BigQuery pricing

$26 per stored TB
1,000,000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB
1 full scan = 160 MB
1 count = 0 MB
1 full scan over 1 column = 5.4 MB
100 GB => $0.05 / month (£0.03)


£0.054307 / month*

per 1MM rows

*the 1st 1TB every month is free of charge
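
As a rough sanity check on the storage figure above (assuming, per the processing numbers, that 1MM of our rows weigh about 160 MB): 160 MB is roughly 0.00016 TB, and 0.00016 TB × $26/TB ≈ $0.00416 per month, which matches the figure quoted.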


1. non-intrusive metrics
2. keep the history
3. avoid vendor lock-in
4. interactive queries
5. cheap
6. extra ball: real time



Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks! תודה (thank you)

Javier Ramírez @supercoco9
