openTSDB - Metrics for a distributed world Oliver Hankeln / gutefrage.net @mydalon

NETWAYS · PDF file

Page 1: openTSDB - Metrics for a distributed world

Oliver Hankeln / gutefrage.net

@mydalon

Page 2: Who am I?

Senior Engineer - Data and Infrastructure at gutefrage.net GmbH

Previously worked in software development

DevOps advocate

Page 3: Who is Gutefrage.net?

Germany's biggest Q&A platform

#1 German site (mobile), about 5M unique users

#3 German site (desktop), about 17M unique users

> 4 million page impressions (PI) per day

Part of the Holtzbrinck group

Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)

Page 4: What you will get

Why we chose openTSDB

What is openTSDB?

How does openTSDB store the data?

Our experiences

Some advice

Page 5: Why we chose openTSDB

Page 6: We were looking at some options

                  Munin   Graphite   openTSDB   Ganglia
Scales well       no      sort of    yes        yes
Keeps all data    no      no         yes        no
Creating metrics  easy    easy       easy       easy

Page 7: We have a winner!

openTSDB is the only candidate that both scales well and keeps all data.

Bingo!

Page 8: Separation of concerns

$ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep

Page 9: The ecosystem

The app feeds metrics in via RabbitMQ

We base Icinga checks on the metrics

We are evaluating Etsy's Skyline for anomaly detection

We deploy sensors via Chef

Page 10: openTSDB

Written at StumbleUpon, but open source

Uses HBase (which is based on HDFS) as storage

Distributed system (multiple TSDs)

Page 11: The big picture

[Diagram: HBase, several TSDs, UI, API, tcollector. Annotation: "This is really a cluster"]

Page 12: Putting data into openTSDB

$ telnet tsd01.acme.com 4242

put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
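The same put line can also be produced from code. The following is a minimal sketch, assuming the line format shown above (metric, timestamp, value, then tag=value pairs); the helper names `format_put` and `send_lines` are made up for this illustration, and the host and port are simply the ones from the telnet example.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line of the text protocol shown on the slide."""
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}"


def send_lines(host, port, lines):
    """Open a TCP connection to a TSD and write each line, newline-terminated."""
    with socket.create_connection((host, port), timeout=5) as conn:
        for line in lines:
            conn.sendall((line + "\n").encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2, {"host": "db01.acme.com"})
print(line)
# send_lines("tsd01.acme.com", 4242, [line])  # uncomment with a reachable TSD
```

The network call is left commented out so the sketch runs without a TSD; only the formatting step is exercised.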

Page 13: It gets even better

tcollector is a Python script that runs your collectors

It handles the network connection and starts your collectors at set intervals

It does basic process management

It adds the host tag and deduplicates data points
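To make "adds host tag, does deduplication" more concrete, here is a small sketch of what such a step could look like. It is an illustration under assumptions, not tcollector's actual code: the class `Deduplicator`, the function `with_host_tag`, and the `max_silence` parameter are hypothetical names, and the rule (skip a point whose value has not changed, but resend after a while) is one plausible form of deduplication.

```python
class Deduplicator:
    """Drop repeated values per time series; illustrative only, not tcollector's code."""

    def __init__(self, max_silence=600):
        self.max_silence = max_silence  # hypothetical: resend at least this often (seconds)
        self.last = {}                  # series key -> (timestamp, value) last forwarded

    def should_send(self, metric, timestamp, value, tags):
        key = (metric, tuple(sorted(tags.items())))
        previous = self.last.get(key)
        if previous is not None:
            prev_ts, prev_value = previous
            if value == prev_value and timestamp - prev_ts < self.max_silence:
                return False  # unchanged and recently sent: skip it
        self.last[key] = (timestamp, value)
        return True


def with_host_tag(tags, hostname):
    """Return a copy of tags with a host tag added unless one is already set."""
    merged = dict(tags)
    merged.setdefault("host", hostname)
    return merged


dedup = Deduplicator(max_silence=600)
tags = with_host_tag({}, "db01.acme.com")
decisions = [dedup.should_send("proc.load.avg5min", ts, v, tags)
             for ts, v in [(0, 1.0), (15, 1.0), (30, 2.0), (45, 2.0), (700, 2.0)]]
print(decisions)  # [True, False, True, False, True]
```

Keeping this logic in the collector framework rather than in each collector is what lets individual collectors stay as small as a few lines.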

Page 14: A simple tcollector script

#!/usr/bin/php
<?php
# Cast a die
$die = rand(1,6);
echo "roll.a.d6 " . time() . " " . $die . "\n";
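Since tcollector itself is written in Python, the same collector could be written in Python as well. This is a rough equivalent of the PHP script, assuming the same output format of metric name, Unix timestamp, and value on one line; the function name `roll_line` and its parameters are made up for this sketch.

```python
import random
import time


def roll_line(now=None, rng=random):
    """Return one 'metric timestamp value' line, like the PHP script above."""
    timestamp = int(time.time() if now is None else now)
    die = rng.randint(1, 6)  # inclusive on both ends, like PHP's rand(1,6)
    return f"roll.a.d6 {timestamp} {die}"


if __name__ == "__main__":
    # A collector only has to print its data points to standard output.
    print(roll_line(), flush=True)
```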

Page 15: What was that HDFS again?

HDFS is a distributed filesystem suitable for petabytes of data on thousands of machines

Runs on commodity hardware

Takes care of redundancy

Used by e.g. Facebook, Spotify, eBay, ...

Page 16: Okay... and HBase?

HBase is a NoSQL database / data store on top of HDFS

Modeled after Google's BigTable

Built for big tables (billions of rows, millions of columns)

Automatic sharding by row key

Page 17: How openTSDB stores the data

Page 18: Keys are key!

Data is sharded across regions based on the row key

You query data based on the row key

You can query row key ranges (e.g. A...D)

So: think about key design
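The two ideas on this slide, sharding by key range and querying key ranges, can be sketched with a sorted list and binary search. This is a toy model, not HBase's API: the region boundaries, server names, and sample keys are invented for the illustration.

```python
import bisect

# Hypothetical region boundaries: each region holds keys >= its start key
# and < the next region's start key.
REGION_STARTS = ["", "H", "P"]
REGION_SERVERS = ["server-a", "server-b", "server-c"]


def region_for(row_key):
    """Find the server whose region covers a row key."""
    index = bisect.bisect_right(REGION_STARTS, row_key) - 1
    return REGION_SERVERS[index]


def scan(sorted_keys, start, stop):
    """Return all keys with start <= key < stop from a sorted key list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(["Apple", "Banana", "Cherry", "Date", "Kiwi", "Quince"])
print(region_for("Cherry"), region_for("Kiwi"), region_for("Quince"))
print(scan(keys, "A", "D"))  # ['Apple', 'Banana', 'Cherry']
```

Because rows are kept in key order, a range query only touches the regions whose key ranges overlap the requested range, which is why the layout of the key matters so much.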

Page 23: Take 1

Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17
1382536472, 6    24
1382536472, 8    12
1382536473, 5    134
1382536473, 6    10
1382536473, 8    99
1382536474, 5    12
1382536474, 6    42

[Figure: Server A, Server B]

Page 24: Solution: Swap timestamp and metric id

Row key format: metric id, timestamp

Row key          Value
5, 1382536472    17
6, 1382536472    24
8, 1382536472    12
5, 1382536473    134
6, 1382536473    10
8, 1382536473    99
5, 1382536474    12
6, 1382536474    42

[Figure: Server A, Server B]
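Why the swap helps can be shown with a toy simulation. With timestamp-first keys as in Take 1, all current writes sort next to each other and end up in the same key range; with the metric id first, they spread across ranges. The sketch below assumes two regions with a single, arbitrarily chosen split point per key layout; `region_of` and the split points are hypothetical.

```python
from collections import Counter


def region_of(key, split_points):
    """Range partitioning: index of the first split point greater than key."""
    for index, split in enumerate(split_points):
        if key < split:
            return index
    return len(split_points)


# New writes: three metrics (ids 5, 6, 8), one value per second for a minute.
writes = [(ts, metric) for ts in range(1382536472, 1382536532) for metric in (5, 6, 8)]

# Take 1: key = (timestamp, metric id); the split lies at some past timestamp.
take1 = Counter(region_of((ts, m), [(1382536000, 0)]) for ts, m in writes)

# Swapped: key = (metric id, timestamp); the split lies between metric ids.
swapped = Counter(region_of((m, ts), [(6, 0)]) for ts, m in writes)

print(dict(take1))    # {1: 180}: every write lands in the same region
print(dict(swapped))  # {0: 60, 1: 120}: writes are spread over both regions
```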

Page 26: Take 2

Metric ID first, then timestamp

Searching through many rows is slower than searching through fewer rows. (Obviously)

So: Put multiple data points into one row

Page 30: Take 2 continued

Row key: 5, 1382608800
  Cell name:    +23   +35   +94   +142
  Data point:   17    1     23    42

Row key: 5, 1382612400
  Cell name:    +13   +25   +88   +89
  Data point:   3     44    12    2
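The layout above can be reproduced in a few lines: the row key carries the timestamp rounded down to the hour, and each cell name is the offset in seconds from that base. This is a sketch of the idea using plain dictionaries, not openTSDB's storage code; `split_timestamp` and `pack` are names chosen for the illustration, and the input timestamps are reconstructed from the base times and offsets on the slide.

```python
from collections import defaultdict

ROW_SPAN = 3600  # seconds per row, matching the one-hour rows on the slides


def split_timestamp(timestamp):
    """Return (row base time, offset within the row) for a timestamp."""
    offset = timestamp % ROW_SPAN
    return timestamp - offset, offset


def pack(metric_id, points):
    """Group (timestamp, value) pairs into rows keyed by (metric id, base time)."""
    rows = defaultdict(dict)
    for timestamp, value in points:
        base, offset = split_timestamp(timestamp)
        rows[(metric_id, base)][offset] = value
    return dict(rows)


points = [(1382608823, 17), (1382608835, 1), (1382608894, 23), (1382608942, 42),
          (1382612413, 3), (1382612425, 44), (1382612488, 12), (1382612489, 2)]
for row_key, cells in pack(5, points).items():
    print(row_key, cells)
```

Eight data points end up in two rows instead of eight, so a query over a time range has far fewer rows to walk through.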

Page 31: Where are the tags stored?

They are put at the end of the row key

Both tag names and tag values are represented by IDs

Page 32: The Row Key

3 Bytes - metric ID

4 Bytes - timestamp (rounded down to the hour)

3 Bytes - tag ID

3 Bytes - tag value ID

Total: 7 Bytes + 6 Bytes * Number of tags
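The byte layout can be checked with a short sketch that builds such a key. The field widths come from the slide; the big-endian byte order, the sorting of tag pairs, and the example IDs are assumptions made for this illustration, and `uid` and `row_key` are hypothetical helper names.

```python
import struct


def uid(value):
    """Encode an ID as 3 bytes (big-endian is an assumption here)."""
    return value.to_bytes(3, "big")


def row_key(metric_id, timestamp, tag_ids):
    """Build metric ID + hour-aligned timestamp + (tag ID, tag value ID) pairs."""
    base_time = timestamp - timestamp % 3600
    key = uid(metric_id) + struct.pack(">I", base_time)  # 3 + 4 bytes
    for tag_id, tag_value_id in sorted(tag_ids):          # 6 bytes per tag
        key += uid(tag_id) + uid(tag_value_id)
    return key


key = row_key(5, 1382608823, [(1, 7), (2, 3)])
print(len(key), key.hex())  # length is 7 + 6 * 2 = 19 bytes
```

Every additional tag makes each row key six bytes longer, which is one reason to keep the number of tags per data point small.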

Page 33: Let's look at some graphs

Page 34: Our experiences

Page 35: What works well

We store about 200M data points in several thousand time series with no issues

tcollector decouples measurement from storage

Creating new metrics is really easy

Page 36: Challenges

The UI is seriously lacking

No annotation support out of the box

Only 1s time resolution (and only one value per second per time series)

Page 37: Salvation is coming

OpenTSDB 2 is around the corner

millisecond precision

annotations and meta data

improved API

Page 38: Friendly advice

Pick a naming scheme and stick to it

Use tags wisely (not more than 6 or 7 tags per data point)

Use tcollector

Wait for openTSDB 2 ;-)

Page 39: Questions?

Please contact me:

[email protected]

@mydalon

I'll upload the slides and tweet about it