Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. The problem is that big data usually means many servers, a complex setup, intensive monitoring and a steep learning curve. All those things cost money. If you don’t have the money, you miss out on all the fun. In my talk I show you how you can use Google BigQuery to manage big data from your application using a hosted solution. And you can start with less than $1 per month.
@supercoco9 #devoxxBigQuery
Big Data with Google BigQuery
Javier Ramirez @supercoco9 https://teowaki.com
Managing Big Data with BigQuery
Javier Ramirez
•Writing software since 1996
•Web dev. since 1999 (C++, Java, PHP, Ruby, JS...)
•Founder of https://teowaki.com
•Google Developer Expert on the Cloud Platform
BIG DATA
BIG SERVERS
BIG DEVOPS
BIG MONEY
Big data is cool but...
•hard to set up and monitor
•expensive cluster
•not interactive enough
Big data is doing a full scan of 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
Based on “Dremel”
Specifically designed for interactive queries over petabytes of real-time data
Your only worries
•Load data
•Query the dataset
Loading data
You just send the data in text (or JSON) format
Up to 100K inserts per second in streaming mode
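As a rough sketch of what "sending the data in JSON format" can look like, the Python snippet below serializes rows as newline-delimited JSON, one object per line, which is the JSON layout BigQuery load jobs accept. The function name `to_ndjson` and the sample fields are hypothetical, and the actual upload call is deliberately left out.

```python
import json

def to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)

# Hypothetical web-server-log-like rows, for illustration only.
sample_rows = [
    {"path": "/", "status": 200, "user": "alice"},
    {"path": "/login", "status": 302, "user": "bob"},
]

payload = to_ndjson(sample_rows)
print(payload)
```

Each line of `payload` is an independent JSON object, so a file in this shape can be split or appended to without re-parsing the whole thing.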
It's just SQL
select name from users order by date;
select count(*) from users;
select max(date) from users;
select user, sum(total) from orders group by user;
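To try variants of the queries above without a BigQuery account, here is a minimal sketch that runs them against an in-memory SQLite database from Python. SQLite is only a stand-in here: the table layouts and sample rows are invented, and BigQuery's own dialect differs in other places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented schema and rows, just enough to run the four queries.
conn.executescript("""
    CREATE TABLE users (name TEXT, date TEXT);
    CREATE TABLE orders (user TEXT, total REAL);
    INSERT INTO users VALUES ('alice', '2014-01-02'), ('bob', '2014-01-01');
    INSERT INTO orders VALUES ('alice', 10.0), ('alice', 5.0), ('bob', 7.5);
""")

names = [r[0] for r in conn.execute("select name from users order by date")]
user_count = conn.execute("select count(*) from users").fetchone()[0]
latest = conn.execute("select max(date) from users").fetchone()[0]
totals = dict(conn.execute("select user, sum(total) from orders group by user"))

print(names, user_count, latest, totals)
```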
Subselects and joins out of the box
SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
    RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
  WHERE Actor1Name IS NOT null
    AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
http://gdeltproject.org/data.html#googlebigquery
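The query above counts events per year and actor pair, ranks the pairs within each year, and keeps the top one. A plain-Python sketch of that logic, on made-up data rather than real GDELT rows, might look like this. The function name `top_pair_per_year` is hypothetical, and the country-code filters and the `HAVING Count > 100` threshold are left out to keep it short.

```python
from collections import Counter

# Made-up (year, actor1, actor2) events; not real GDELT data.
events = [
    (1979, "A", "B"), (1979, "B", "A"), (1979, "A", "C"),
    (1980, "C", "D"), (1980, "C", "D"), (1980, "A", "B"),
]

def top_pair_per_year(rows):
    """Count events per (year, unordered pair) and keep the top pair of each year."""
    counts = Counter()
    for year, a, b in rows:
        # Normalize the pair order, as the two subselects with < and > do.
        pair = (a, b) if a < b else (b, a)
        counts[(year, pair)] += 1
    best = {}
    for (year, pair), n in counts.items():
        # Rough equivalent of keeping rank = 1 within each year.
        # Unlike RANK(), ties keep only the first pair seen.
        if year not in best or n > best[year][1]:
            best[year] = (pair, n)
    return dict(sorted(best.items()))

result = top_pair_per_year(events)
print(result)
```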
Specific extensions for analytics
within, flatten, nest
stddev
top, first, last, nth
variance
var_pop, var_samp
covar_pop, covar_samp
quantiles
correlations
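The `_pop` and `_samp` suffixes in the list above follow the usual statistical distinction between population and sample estimators. As an illustration of that distinction only (this is Python's `statistics` module, not BigQuery's implementation), the sketch below computes both variances on a small invented data set.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample, mean is 5

var_pop = statistics.pvariance(values)   # population variance: divides by n
var_samp = statistics.variance(values)   # sample variance: divides by n - 1
stddev_pop = statistics.pstdev(values)   # square root of the population variance

print(var_pop, var_samp, stddev_pop)
```

The two variances differ only in the divisor, so the gap between them shrinks as the number of rows grows.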
Things you always wanted to try but were too scared to
select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
Result: 223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
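One detail of the pattern in that query is worth seeing in isolation: `[0-9]*` permits zero digits. The sketch below uses Python's `re` module as a stand-in, under the assumption that `REGEXP_MATCH` behaves like a search anywhere in the string rather than a full match. The sample titles are invented.

```python
import re

titles = ["Madrid", "1984", "Apollo 11", ""]  # invented titles

# "[0-9]*" allows zero digits, so a search succeeds on every title.
star_matches = [t for t in titles if re.search(r"[0-9]*", t) is not None]
# "[0-9]+" requires at least one digit somewhere in the title.
plus_matches = [t for t in titles if re.search(r"[0-9]+", t) is not None]

print(star_matches)
print(plus_matches)
```

Under that assumption, the regexp is evaluated on every row but the `wp_namespace = 0` condition is what narrows the count, which still makes the query a fair stress test of a full scan.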
columnar storage
https://cookbook.experiencesaphana.com/crm/what-is-crm-on-hana/technology-innovation/row-vs-column-based/
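A toy sketch of why columnar storage matters for analytics: a query that needs a single column only has to read that column, instead of walking every field of every record. The rows, field names and the character-count "size" measure below are all invented for illustration and say nothing about BigQuery's actual on-disk format.

```python
# Row-oriented layout: each record keeps all its fields together.
rows = [
    {"title": "Madrid", "namespace": 0, "body": "a" * 1000},
    {"title": "Talk:Madrid", "namespace": 1, "body": "b" * 2000},
    {"title": "Paris", "namespace": 0, "body": "c" * 1500},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def size(value):
    """Rough size in characters of a stored value (an illustrative measure only)."""
    return len(str(value))

# Counting rows with namespace 0 reads one column in the columnar layout...
columnar_scan = sum(size(v) for v in columns["namespace"])
# ...but touches every field of every record in the row layout.
row_scan = sum(size(v) for row in rows for v in row.values())
count_ns0 = sum(1 for v in columns["namespace"] if v == 0)

print(count_ns0, columnar_scan, row_scan)
```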
highly distributed execution using a tree
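The idea of tree-shaped execution can be sketched in a few lines: leaf workers each scan their own shard and return a partial result, and intermediate nodes combine partial results on the way up to the root. This is a sequential toy under invented names (`leaf_count`, `tree_count`, `fanout`), not a description of how Dremel or BigQuery actually schedule work.

```python
def leaf_count(shard, predicate):
    """A leaf scans only its own shard and returns a partial count."""
    return sum(1 for row in shard if predicate(row))

def tree_count(shards, predicate, fanout=2):
    """Combine partial counts level by level until a single root value remains."""
    partials = [leaf_count(shard, predicate) for shard in shards]
    while len(partials) > 1:
        # Each intermediate node sums the results of up to `fanout` children.
        partials = [sum(partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0] if partials else 0

# Invented data split across five shards.
shards = [[1, 12, 7], [30, 2], [8, 9, 10], [11], [3, 40]]
total = tree_count(shards, lambda value: value >= 10)
print(total)
```

In a real system the leaves would run in parallel on different machines; the point here is only that each level handles small partial results instead of the raw rows.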
web console screenshot
country segmented traffic
window functions
our most active user
Worldwide events in the last 36 years
SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
    RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
  WHERE Actor1Name IS NOT null
    AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
http://gdeltproject.org/data.html#googlebigquery
SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url
FROM github.timeline
WHERE type="WatchEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
AND repository_url IN (
  SELECT repository_url
  FROM github.timeline
  WHERE type="CreateEvent"
  AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
  AND repository_fork = "false"
  AND payload_ref_type = "repository"
  GROUP BY repository_url
)
GROUP BY repository_name, repository_language, repository_description, repository_url
HAVING cnt >= 5
ORDER BY cnt DESC
LIMIT 25
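The `#{yesterday}` placeholder in the query above is string interpolation filled in by the calling application. As a hedged sketch of how that value could be produced, the Python snippet below builds the cutoff timestamp, assuming a `YYYY-MM-DD` date format (the slides do not show the format used). The helper `yesterday_cutoff` and the shortened query template are hypothetical.

```python
from datetime import date, timedelta

def yesterday_cutoff(today):
    """Return yesterday at 20:00:00 as 'YYYY-MM-DD HH:MM:SS' (assumed format)."""
    yesterday = today - timedelta(days=1)
    return f"{yesterday.isoformat()} 20:00:00"

# Shortened, hypothetical version of the query, with one placeholder.
QUERY_TEMPLATE = (
    'SELECT COUNT(*) FROM github.timeline '
    'WHERE type="WatchEvent" '
    'AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("{cutoff}")'
)

query = QUERY_TEMPLATE.format(cutoff=yesterday_cutoff(date(2014, 4, 17)))
print(query)
```

Passing the date in as an argument, rather than calling `date.today()` inside the helper, keeps the function easy to test.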
Automation with Apps Script
●Read from BigQuery
●Create a spreadsheet on Drive
●E-mail it every day as a PDF
https://developers.google.com/apps-script/
BigQuery pricing
$26 per stored TB
•1,000,000 rows => $0.00416 / month (£0.00243 / month)
$5 per processed TB
•1 full scan = 160 MB
•1 count = 0 MB
•1 full scan over 1 column = 5.4 MB
•100 GB => $0.05 / month (£0.03)
Apps Script is free
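The per-TB prices on this slide turn into per-table costs with simple proportional arithmetic. The sketch below reproduces the storage figure for 1,000,000 rows (160 MB) and prices a single-column scan, assuming decimal units (1 TB = 10^12 bytes), which is what the slide's storage number implies. The function names are hypothetical and the free tier is ignored.

```python
BYTES_PER_TB = 10 ** 12  # decimal terabyte: an assumption that matches the slide's arithmetic
MB = 10 ** 6

def monthly_storage_cost(stored_bytes, usd_per_tb=26.0):
    """Monthly storage cost in USD at a flat per-TB price."""
    return stored_bytes / BYTES_PER_TB * usd_per_tb

def query_cost(processed_bytes, usd_per_tb=5.0):
    """Cost in USD of one query, priced by bytes processed."""
    return processed_bytes / BYTES_PER_TB * usd_per_tb

storage = monthly_storage_cost(160 * MB)   # 1,000,000 rows at about 160 MB
one_column_scan = query_cost(5.4 * MB)     # full scan over a single column
full_scan = query_cost(160 * MB)           # full scan over every column

print(f"storage: ${storage:.5f} / month")
print(f"one-column scan: ${one_column_scan:.6f}")
print(f"full scan: ${full_scan:.4f}")
```

Since queries are billed by bytes processed, selecting only the columns a query needs is the most direct way to keep the bill down.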
Price per month
£0.054307 / month* per 1MM rows**
*the 1st 1TB every month is free of charge
**assuming your rows have web server log-like info
THANKS!
Javier Ramirez @supercoco9 https://teowaki.com
Related links at:
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks / Creative Commons
•Presentation Template — Guillaume LaForge
•The Queen — A prestigious heritage with some inspiration from The Sex Pistols and funny Devoxxians
•Girl with a Balloon — Banksy
•Tube — Michael Keen