Batch and Stream processing with SQL: 2013/11/07, Cloudera World Tokyo 2013, TAGOMORI Satoshi (@tagomoris), LINE Corp.

Page 1: Batch and Stream processing with SQL

Batch and Stream processing with SQL

2013/11/07 Cloudera World Tokyo 2013

TAGOMORI Satoshi (@tagomoris), LINE Corp.

Thursday, November 7, 2013

Page 2: Batch and Stream processing with SQL

SQL, the Universe, and the Answer to Everything

SELECT 42 FROM anywhere

Page 3: Batch and Stream processing with SQL

TAGOMORI Satoshi (@tagomoris), LINE Corp.

Hadoop, Fluentd, Norikra, ...

Page 4: Batch and Stream processing with SQL

Page 5: Batch and Stream processing with SQL

Page 6: Batch and Stream processing with SQL

Page 7: Batch and Stream processing with SQL

Data Collecting, Aggregation, Analytics, Visualization

Page 8: Batch and Stream processing with SQL

Do you like SQL?

Page 9: Batch and Stream processing with SQL

How to write M/R (or a Storm app, or ...)

Java (or Scala, Clojure, JRuby, ...)

Hadoop Streaming

Pig

Hive, Impala (SQL!)

Page 10: Batch and Stream processing with SQL

Our log traffic

Daily

2.1+ TB (non compressed)

6.8+ Billion lines / day

Peak time

150,000+ lines / sec

380+ Mbps
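
As a rough cross-check of these figures, the daily totals can be converted into average rates and compared with the stated peak. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a uniform 86,400-second day; neither assumption is stated on the slide, and the function name is purely illustrative.

```python
# Daily totals from the slide (lower bounds, since the slide says "+").
DAILY_BYTES = 2.1e12          # 2.1 TB, ASSUMING 1 TB = 10**12 bytes
DAILY_LINES = 6.8e9           # 6.8 billion lines
PEAK_LINES_PER_SEC = 150_000  # peak figures from the slide
PEAK_MBPS = 380

SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(daily_bytes, daily_lines):
    """Return (lines per second, megabits per second) averaged over one day."""
    lines_per_sec = daily_lines / SECONDS_PER_DAY
    mbps = daily_bytes * 8 / SECONDS_PER_DAY / 1e6
    return lines_per_sec, mbps


avg_lines, avg_mbps = average_rates(DAILY_BYTES, DAILY_LINES)
print(f"average: {avg_lines:,.0f} lines/sec, {avg_mbps:.0f} Mbps")
print(f"peak/average: {PEAK_LINES_PER_SEC / avg_lines:.1f}x lines, "
      f"{PEAK_MBPS / avg_mbps:.1f}x bandwidth")
```

Under these assumptions the average works out to roughly 79,000 lines/sec and 194 Mbps, so the stated peak is about twice the daily average.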

Page 11: Batch and Stream processing with SQL

Our Hadoop cluster

CDH 4.2.0

Master Nodes: 3 (NameNode HA + QJM)

NameNode, JournalNode, JobTracker

Slave Nodes: 20

Page 12: Batch and Stream processing with SQL

What we want to do

COUNT PV, UU and others (daily/realtime)

COUNT Service metrics (daily/hourly/realtime)

FIND Surprising Errors [4xx,5xx] (immediately)

CHECK Response Times (immediately)

SEARCH Logs when in trouble (hourly/immediately)

VISUALIZE/NOTIFY App Status (realtime)

Page 13: Batch and Stream processing with SQL

Batches and Streams

Hadoop is for batches

High performance batch is important

HDFS has good performance

Stream log writing and calculations are also VERY VERY IMPORTANT

Hybrid System: Stream processing + Batch

Page 14: Batch and Stream processing with SQL

System Overview

[Diagram. Components: Web Servers, Fluentd Cluster, Archive Storage (scribed), Fluentd Watchers, Graph Tools, Notifications (IRC), Hadoop Cluster (HDFS, MR), webhdfs, Huahin Manager, hiveserver, Shib, ShibUI, Norikra. Paths are labeled STREAM, BATCH, and SCHEDULED BATCH.]

Page 15: Batch and Stream processing with SQL

Data analytics players

Storages, Hadoop Cluster

Visualization Tools

ADMINISTRATOR

Raw Log Formats, Application Logs

Data Sizes, Data Semantics

PROGRAMMER

SERVICE DIRECTOR, SALES

Whatever Metrics They Want

BOARD MEMBER

........

Page 16: Batch and Stream processing with SQL

WE NEED A QUERY LANGUAGE THAT THEY ALL CAN RUN AND UNDERSTAND!!!!!!!!!!

Page 17: Batch and Stream processing with SQL

SQL: Hive

Page 18: Batch and Stream processing with SQL

SQL: Hive

Page 19: Batch and Stream processing with SQL

Hive

SQL: w/o compile, w/o deployment

HiveServer: w/o server login

Shib: Select only

Page 20: Batch and Stream processing with SQL

Page 21: Batch and Stream processing with SQL

Hive:

Simplify versioning problems

Hive 0.10 of CDH 4.2.0

Upgrade CDH for only Hive version

Page 22: Batch and Stream processing with SQL

Hive: Pros

Many Scheduled Queries, Metrics, On-Demand Queries

Page 23: Batch and Stream processing with SQL

Hive: Cons

Too Many Scheduled Queries for short time windows

Page 24: Batch and Stream processing with SQL

Stream processing

Queries for fixed windows

every 1 hour, 10 minutes, 1 minute, ...
latest 10 events, ...
all events

Once a query is registered, it runs forever

Results appear automatically

NO MORE STORAGES

Page 25: Batch and Stream processing with SQL

Stream processing

And

SQL

Page 26: Batch and Stream processing with SQL

Norikra: Schema-less Stream Processing with SQL

Page 27: Batch and Stream processing with SQL

Norikra (1):

Schema-less event stream:
Add/Remove data fields whenever you want

SQL:
No more restarts to add/remove queries
w/ JOINs, w/ SubQueries, w/ UDF

Truly Complex events:
Nested Hash/Array, accessible directly from SQL

Page 28: Batch and Stream processing with SQL

Norikra (2):

Open source software:
Licensed under GPLv2
Based on Esper
UDF plugins from rubygems.org

Ultra-fast bootstrap & small start:
3 mins to install/start
1 server

Page 29: Batch and Stream processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

Page 30: Batch and Stream processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Meguro"}

{"name":"tagomoris","age":34}

Page 31: Batch and Stream processing with SQL

Norikra Queries: (1)

SELECT name, age
FROM events

{"name":"tagomoris", "address":"Tokyo", "corp":"LINE", "current":"Meguro"}

nothing
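
The behavior shown on these two slides can be mimicked in a few lines of Python: a query that names fields produces output only for events that carry all of them. This only illustrates the observable behavior and is not how Norikra is implemented; `select_fields` is a hypothetical helper.

```python
def select_fields(event, fields):
    """Project `fields` out of `event`; return None if any field is missing."""
    if not all(name in event for name in fields):
        return None  # the query produces nothing for this event
    return {name: event[name] for name in fields}


full = {"name": "tagomoris", "age": 34, "address": "Tokyo",
        "corp": "LINE", "current": "Meguro"}
# Same event without the "age" field, as on the second slide.
no_age = {key: value for key, value in full.items() if key != "age"}

print(select_fields(full, ["name", "age"]))    # {'name': 'tagomoris', 'age': 34}
print(select_fields(no_age, ["name", "age"]))  # None
```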

Page 32: Batch and Stream processing with SQL

Norikra Queries: (2)

SELECT name, age
FROM events
WHERE current="Meguro"

{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Meguro"}

{"name":"tagomoris","age":34}

Page 33: Batch and Stream processing with SQL

Norikra Queries: (2)

SELECT name, age
FROM events
WHERE current="Meguro"

{"name":"frsyuki", "age":25, "address":"MountainView", "corp":"TD", "current":"BayArea"}

nothing

Page 34: Batch and Stream processing with SQL

Norikra Queries: (3)

SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)

GROUP BY age

Page 35: Batch and Stream processing with SQL

Norikra Queries: (3)

SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Meguro"}

{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...

every 5 mins
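
To make the windowed GROUP BY above concrete, here is a minimal offline sketch in Python that buckets timestamped events into fixed 5-minute windows and counts events per age. It is an illustration only: it runs over a finished list rather than a live stream, it aligns windows to multiples of the window size (an assumption; the engine may start its windows differently), and the sample events and the name `count_by_age` are made up.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60


def count_by_age(timed_events, window_seconds=WINDOW_SECONDS):
    """Group (timestamp, event) pairs into fixed windows and count per age.

    Returns {window_start: [{"age": ..., "cnt": ...}, ...]}.
    """
    windows = defaultdict(Counter)
    for ts, event in timed_events:
        # ASSUMPTION: windows are aligned to multiples of the window size.
        window_start = ts - ts % window_seconds
        windows[window_start][event["age"]] += 1
    return {
        start: [{"age": age, "cnt": cnt} for age, cnt in counts.most_common()]
        for start, counts in sorted(windows.items())
    }


# Made-up sample events: (seconds since start, event).
events = [
    (10, {"name": "a", "age": 34}),
    (70, {"name": "b", "age": 34}),
    (200, {"name": "c", "age": 33}),
    (290, {"name": "d", "age": 34}),
    (310, {"name": "e", "age": 33}),  # falls into the next 5-minute window
]
for start, rows in count_by_age(events).items():
    print(start, rows)
```

Each window yields one batch of rows, which mirrors the "every 5 mins" output on the slide: the first window here produces counts of 3 for age 34 and 1 for age 33.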

Page 36: Batch and Stream processing with SQL

Norikra Queries: (4)

SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

{"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Meguro"}

{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
every 5 mins

SELECT max(age) as max
FROM events.win:time_batch(5 mins)

{"max":51}

Page 37: Batch and Stream processing with SQL

Norikra Queries: (5)

SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age

{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Meguro", "speaker":true, "attend":[true,true,false, ...]}

Page 38: Batch and Stream processing with SQL

Norikra Queries: (5)

SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age

{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Meguro", "speaker":true, "attend":[true,true,false, ...]}

Page 39: Batch and Stream processing with SQL

Norikra Queries: (5)

SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
WHERE current="Meguro" AND attend.$0 AND attend.$1
GROUP BY user.age

{"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Meguro", "speaker":true, "attend":[true,true,false, ...]}
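
The nested field references in this query (`user.age`, `attend.$0`) can be read as paths into the event. The Python sketch below shows one way such a path could be resolved; `lookup` is a hypothetical helper written for illustration, not Norikra's implementation, and the sample event simply omits the elided tail of the `attend` array.

```python
def lookup(event, path):
    """Resolve a dotted path such as "user.age" or "attend.$0".

    A "$N" segment indexes into a list; any other segment is a dict key.
    Returns None when the path does not exist in the event.
    """
    value = event
    for segment in path.split("."):
        try:
            if segment.startswith("$"):
                value = value[int(segment[1:])]
            else:
                value = value[segment]
        except (KeyError, IndexError, TypeError, ValueError):
            return None
    return value


event = {
    "name": "tagomoris",
    "user": {"age": 34, "corp": "LINE", "address": "Tokyo"},
    "current": "Meguro",
    "speaker": True,
    "attend": [True, True, False],  # elided tail of the slide's array omitted
}

# Equivalent of: WHERE current="Meguro" AND attend.$0 AND attend.$1
matches = (lookup(event, "current") == "Meguro"
           and lookup(event, "attend.$0")
           and lookup(event, "attend.$1"))
print(lookup(event, "user.age"), bool(matches))  # 34 True
```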

Page 40: Batch and Stream processing with SQL

Before: Hive (EVERY HOUR!)

SELECT
  yyyymmdd, hh, campaign_id, region, lang,
  count(*) AS click,
  count(distinct member_id) AS uu
FROM (
  SELECT
    yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice'
    AND yyyymmdd='20131101' AND hh='00'
    AND get_json_object(log, '$.type') = 'click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang

Page 41: Batch and Stream processing with SQL

After: Norikra

SELECT
  campaign.id AS campaign_id,
  member.region AS region,
  count(*) AS click,
  count(distinct member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region
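
For readers who want to see what this aggregation computes, the following Python sketch reproduces the click and unique-user (uu) counts for a single window of events. It is a plain in-memory illustration with made-up sample data, not a stand-in for either engine; `clicks_and_uu` is a hypothetical name.

```python
from collections import defaultdict


def clicks_and_uu(events):
    """Count clicks and distinct member ids per (campaign.id, member.region)."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for event in events:
        if event.get("type") != "click":
            continue  # WHERE type="click"
        key = (event["campaign"]["id"], event["member"]["region"])
        clicks[key] += 1
        members[key].add(event["member"]["id"])
    return [
        {"campaign_id": cid, "region": region,
         "click": clicks[(cid, region)], "uu": len(members[(cid, region)])}
        for cid, region in sorted(clicks)
    ]


# Made-up sample events standing in for one hourly window.
window = [
    {"type": "click", "campaign": {"id": 1}, "member": {"id": "u1", "region": "JP"}},
    {"type": "click", "campaign": {"id": 1}, "member": {"id": "u1", "region": "JP"}},
    {"type": "click", "campaign": {"id": 1}, "member": {"id": "u2", "region": "JP"}},
    {"type": "view", "campaign": {"id": 1}, "member": {"id": "u3", "region": "JP"}},
    {"type": "click", "campaign": {"id": 2}, "member": {"id": "u1", "region": "TW"}},
]
for row in clicks_and_uu(window):
    print(row)
```

The repeated click by `u1` raises `click` but not `uu`, which is the distinction `count(*)` versus `count(distinct member.id)` expresses in the query.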

Page 42: Batch and Stream processing with SQL

Norikra: Current Status

v0.1.0: Released on 2013/11/01

by tagomoris

http://norikra.github.io/

Documents: under development

Just started to use in production

Page 43: Batch and Stream processing with SQL

SQL Queries: for batches, for streams

Page 44: Batch and Stream processing with SQL

Planning and development: hiring across a wide range of roles

• Data marketing
• Database engineers
• BI planning and development
• etc.

Page 45: Batch and Stream processing with SQL

MUSIC, SHOPPING, COOKING, MOVIE, GAME, TRAVEL, NEWS, MOM&KIDS, SPORTS, BOOK, GIRLS

Variety, Volume, Velocity

Page 46: Batch and Stream processing with SQL

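Planning and development: hiring across a wide range of roles

Please apply via our corporate site!

Data analysis and analytics: scaling up and strengthening the team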

Page 47: Batch and Stream processing with SQL

See Also:

Log analysis system with Hadoop in livedoor 2013
http://www.slideshare.net/tagomoris/log-analysis-with-hadoop-in-livedoor-2013

Norikra
http://norikra.github.io/
https://github.com/norikra

Shib
https://github.com/tagomoris/shib

Fluentd
http://fluentd.org/
https://github.com/fluent/fluentd
