39
@supercoco9 #devoxxBigQuery Big Data with Google BigQuery Javier Ramirez @supercoco9 https://teowaki.com

Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Embed Size (px)

DESCRIPTION

Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don’t have the money, you are losing all the fun. In my talk I show you how you can use Google BigQuery to manage big data from your application using a hosted solution. And you can start with less than $1 per month.

Citation preview

Page 1: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#devoxxBigQuery

Big Data with Google BigQuery

Javier Ramirez@supercoco9https://teowaki.com

Page 2: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Managing Big Data with BigQuery

Javier Ramirez

•Writing software since 1996

•Web dev. since 1999 (C++, JAVA, PHP, Ruby, JS...)

•Founder of https://teowaki.com

•Google Developer Expert on the Cloud Platform

Page 3: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

DAT

A

DAT

A

Page 4: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

SERVER

S

SERVER

S

Page 5: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

DEV

OPS

DEV

OPS

Page 6: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle@supercoco9#DevoxxBigquery

BIG

BIG

MONEY

MONEY

Page 7: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

bigdata is cool but...

hard to set up and monitor

expensive cluster

not interactive enough

Page 8: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

Page 9: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

Page 10: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Based on “Dremel”

Specifically designed for interactive queries over

petabytes of real-time data

Page 11: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Your only worries

•Load data

•Query the dataset

Page 12: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

loading data.

You just send the data in

text (or JSON) format

Page 13: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

up to 100K inserts per second

in stream mode

Page 14: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

It's just SQL

select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

Page 15: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Subselect and joins out of the box

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM

(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),

WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

http://gdeltproject.org/data.html#googlebigquery

Page 16: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

specific extensions for analytics

withinflattennest

stddev

topfirstlastnth

variance

var_popvar_samp

covar_popcovar_samp

quantiles

correlations

Page 17: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Things you always wanted to try but were too scared to

select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387 Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

Page 18: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

columnar storage

https://cookbook.experiencesaphana.com/crm/what-is-crm-on-hana/technology-innovation/row-vs-column-based/

Page 19: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

highly distributed execution using a tree

Page 20: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

web console screenshot

Page 21: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

country segmented traffic

Page 22: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigqueryjavier ramirez @supercoco9 https://teowaki.com

window functions

Page 23: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

our most active user

Page 24: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Worldwide events in the last 36 years

SELECT Year, Actor1Name, Actor2Name, Count FROM (SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rankFROM

(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode), (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),

WHERE Actor1Name IS NOT nullAND Actor2Name IS NOT nullGROUP EACH BY 1, 2, 3HAVING Count > 100)

WHERE rank=1ORDER BY Year

http://gdeltproject.org/data.html#googlebigquery

Page 25: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 26: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 27: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (

SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday}

20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url

)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25

Page 28: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Page 29: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 30: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 31: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Automation with Apps Script

●Read from BigQuery

●Create a spreadsheet on Drive

●E-mail it everyday as a PDF

https://developers.google.com/apps-script/

Page 32: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 33: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 34: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki
Page 35: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

bigquery pricing

$26 per stored TB1000000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB1 full scan = 160 MB

1 count = 0 MB1 full scan over 1 column = 5.4 MB100 GB => $0.05 / month £0.03

AppsScripts is for free

Page 36: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

£0.054307 / month*

per 1MM rows

*the 1st 1TB every month is free of charge**assumming your rows have web server logs-like info

price per month

Page 37: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

ig

Page 38: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@YourTwitterHandle#DVXFR14{session hashtag} @supercoco9#devoxxBigquery

THAN

KS!

Javier Ramirez@supercoco9https://teowaki.com

Related links at:

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Page 39: Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

@supercoco9#DevoxxBigquery

Thanks / Creative Commons

•Presentation Template — Guillaume LaForge

•The Queen — A prestigious heritage with some inspiration from The Sex Pistols and funny Devoxxians

•Girl with a Balloon — Banksy

•Tube — Michael Keen