68
Druid for real-time analysis Yann Esposito 7 Avril 2016

Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid for real-time analysis

Yann Esposito

7 Avril 2016

Page 2: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid the Sales Pitch

Page 3: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Intro

Page 4: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Experience

▶ Real Time Social Media Analytics

Page 5: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Real Time?

▶ Ingestion Latency: seconds▶ Query Latency: seconds

Page 6: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Demand

▶ Twitter: 20k msg/s, 1msg = 10ko during 24h▶ Facebook public: 1000 to 2000 msg/s

continuously▶ Low Latency

Page 7: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Reality

▶ Twitter: 400 msg/s continuously, burst to 1500▶ Facebook: 1000 to 2000 msg/s

Page 8: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Origin (PHP)

Page 9: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

1st Refactoring (Node.js)

Page 10: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Return of Experience

Page 11: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Return of Experience

Page 12: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

2nd Refactoring

Page 13: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

2nd Refactoring (FTW!)

Page 14: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

2nd Refactoring return ofexperience

Page 15: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Demo

Page 16: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Pre Considerations

Page 17: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Discovered vs Invented

Try to conceptualize a s.t.

▶ Ingest Events▶ Real-Time Queries▶ Scalable▶ Highly Available

Analytics: timeseries, alerting system, top N, etc…

Page 18: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

In the End

Druid concepts are always emerging naturally

Page 19: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid

Page 20: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Who?

MetamarketsPowered by Druid

▶ Alibaba, Cisco, Criteo, eBay, Hulu, Netflix,Paypal…

Page 21: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Goal

Druid is an open source store designed forreal-time exploratory analytics on largedata sets.

hosted dashboard that would allow usersto arbitrarily explore and visualize eventstreams.

Page 22: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Concepts

▶ Column-oriented storage layout▶ distributed, shared-nothing architecture▶ advanced indexing structure

Page 23: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Key Features

▶ Sub-second OLAP Queries▶ Real-time Streaming Ingestion▶ Power Analytic Applications▶ Cost Effective▶ High Available▶ Scalable

Page 24: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Right for me?

▶ require fast aggregations▶ exploratory analytics▶ analysis in real-time▶ lots of data (trillions of events, petabytes of

data)▶ no single point of failure

Page 25: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

High Level Architecture

Page 27: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Index / Immutability

Druid indexes data to create mostly immutableviews.

Page 28: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Storage

Store data in custom column format highlyoptimized for aggregation & filter.

Page 29: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Specialized Nodes

▶ A Druid cluster is composed of various type ofnodes

▶ Each designed to do a small set of things verywell

▶ Nodes don’t need to be deployed on individualhardware

▶ Many node types can be colocated inproduction

Page 30: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid vs X

Page 31: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Elasticsearch

▶ resource requirement much higher for ingestion& aggregation

▶ No data summarization (100x in real worlddata)

Page 32: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Key/Value Stores(HBase/Cassandra/OpenTSDB)

▶ Must Pre-compute Result▶ Exponential storage▶ Hours of pre-processing time

▶ Use the dimensions as key (like in OpenTSDB)▶ No filter index other than range▶ Hard for complex predicates

Page 33: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Spark

▶ Druid can be used to accelerate OLAP queriesin Spark

▶ Druid focuses on the latencies to ingest andserve queries

▶ Too long for end user to arbitrarily explore data

Page 34: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

SQL-on-Hadoop (Impala/Drill/SparkSQL/Presto)

▶ Queries: more data transfer between nodes▶ Data Ingestion: bottleneck by backing store▶ Query Flexibility: more flexible (full joins)

Page 35: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Data

Page 36: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Concepts

▶ Timestamp column: query centered on timeaxis

▶ Dimension columns: strings (used to filter orto group)

▶ Metric columns: used for aggregations(count, sum, mean, etc…)

Page 37: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Indexing

▶ Immutable snapshots of data▶ data structure highly optimized for analytic

queries▶ Each column is stored separately▶ Indexes data on a per shard (segment) level

Page 38: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Loading

▶ Real-Time▶ Batch

Page 39: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Querying

▶ JSON over HTTP▶ Single Table Operations, no joins.

Page 40: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Segments

▶ Per time interval▶ skip segments when querying

▶ Immutable▶ Cache friendly▶ No locking

▶ Versioned▶ No locking▶ Read-write concurrency

Page 41: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Roll-up

Page 42: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Exampletimestamp page ... added deleted2011-01-01T00:01:35Z Cthulhu 10 652011-01-01T00:03:63Z Cthulhu 15 622011-01-01T01:04:51Z Cthulhu 32 452011-01-01T01:01:00Z Azatoth 17 872011-01-01T01:02:00Z Azatoth 43 992011-01-01T02:03:00Z Azatoth 12 53

timestamp page ... nb added deleted2011-01-01T00:00:00Z Cthulhu 2 25 1272011-01-01T01:00:00Z Cthulhu 1 32 452011-01-01T01:00:00Z Azatoth 2 60 1862011-01-01T02:00:00Z Azatoth 1 12 53

Page 43: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

as SQL

GROUP BY timestamp, page, nb, added, deleted:: nb = COUNT(1), added = SUM(added), deleted = SUM(deleted)

In practice can dramatically reduce the size (up tox100)

Page 44: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Segments

Page 45: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

ShardingsampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0

timestamp page ... nb added deleted2011-01-01T01:00:00Z Cthulhu 1 20 452011-01-01T01:00:00Z Azatoth 1 30 106

sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0

timestamp page ... nb added deleted2011-01-01T01:00:00Z Cthulhu 1 12 452011-01-01T01:00:00Z Azatoth 2 30 80

Page 46: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Core Data Structure

▶ dictionary▶ a bitmap for each value▶ a list of the columns values encoded using the

dictionary

Page 47: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Example

dictionary: { "Cthulhu": 0, "Azatoth": 1 }

column data: [0, 0, 1, 1]

bitmaps (one for each value of the column):value="Cthulhu": [1,1,0,0]value="Azatoth": [0,0,1,1]

Page 48: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Example (multiple matches)

dictionary: { "Cthulhu": 0, "Azatoth": 1 }

column data: [0, [0,1], 1, 1]

bitmaps (one for each value of the column):value="Cthulhu": [1,1,0,0]value="Azatoth": [0,1,1,1]

Page 49: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Real-time ingestion

▶ Via Real-Time Node and Firehose▶ No redundancy or HA, thus not recommended

▶ Via Indexing Service and Tranquility API▶ Core API▶ Integration with Streaming Frameworks▶ HTTP Server▶ Kafka Consumer

Page 50: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Batch Ingestion

▶ File based (HDFS, S3, …)

Page 51: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Real-time Ingestion

Task 1: [ Interval ][ Window ]Task 2: [ ]----------------------------------------------------->

time

Page 52: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Querying

Page 53: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Query types

▶ Group by: group by multiple dimensions▶ Top N: like grouping by a single dimension▶ Timeseries: without grouping over dimensions▶ Search: Dimensions lookup▶ Time Boundary: Find available data timeframe▶ Metadata queries

Page 54: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Example(s)

{"queryType": "groupBy","dataSource": "druidtest","granularity": "all","dimensions": [],"aggregations": [

{"type": "count", "name": "rows"},{"type": "longSum", "name": "imps", "fieldName": "impressions"},{"type": "doubleSum", "name": "wp", "fieldName": "wp"}

],"intervals": ["2010-01-01T00:00/2020-01-01T00"]}

Page 55: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Result

[ {"version" : "v1","timestamp" : "2010-01-01T00:00:00.000Z","event" : {

"imps" : 5,"wp" : 15000.0,"rows" : 5

}} ]

Page 56: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Caching

▶ Historical node level▶ By segment

▶ Broker Level▶ By segment and query▶ groupBy is disabled on purpose!

▶ By default: local caching

Page 57: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid Components

Page 58: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid

▶ Real-time Nodes▶ Historical Nodes▶ Broker Nodes▶ Coordinator▶ For indexing:

▶ Overlord▶ Middle Manager

Page 59: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Also

▶ Deep Storage (S3, HDFS, …)▶ Metadata Storage (SQL)▶ Load Balancer▶ Cache

Page 60: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Coordinator

▶ Real-time Nodes (pull data, index it)▶ Historical Nodes (keep old segments)▶ Broker Nodes (route queries to RT & Hist.

nodes, merge)▶ Coordinator (manage segemnts)▶ For indexing:

▶ Overlord (distribute task to the middle manager)▶ Middle Manager (execute tasks via Peons)

Page 61: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

When not to choose Druid

Page 62: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Graphite (metrics)

Page 63: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Pivot (exploring data)

Page 64: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Caravel

Page 65: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Conclusions

Page 66: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Precompute your time series?

Page 67: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Don’t reinvent it

▶ need a user facing API▶ need time series on many dimensions▶ need real-time▶ big volume of data

Page 68: Druid for real-time analysis - GitHub PagesKey Features Sub-second OLAP Queries Real-time Streaming Ingestion Power Analytic Applications Cost Effective High Available Scalable Right

Druid way is the right way!

1. Push in kafka2. Add the right dimensions3. Push in druid4. ???5. Profit!