openTSDB - Metrics for a distributed world
Oliver Hankeln / gutefrage.net, @mydalon
Wednesday, 30 October 2013

DESCRIPTION

These are the slides for my talk on openTSDB at IPC13/WTC13 in Munich. openTSDB is the software that we at gutefrage.net use to store about 200 million data points per day in several thousand time series. I will talk about how openTSDB stores the data so that it can be queried efficiently afterwards. Some cultural issues and some myths are also covered.

Page 1: openTSDB - Metrics for a distributed world

openTSDB - Metrics for a distributed world

Oliver Hankeln / gutefrage.net, @mydalon

Wednesday, 30 October 2013

Page 2: openTSDB - Metrics for a distributed world

Who am I?

Senior Engineer - Data and Infrastructure at gutefrage.net GmbH

Previously worked in software development

DevOps advocate

Page 3: openTSDB - Metrics for a distributed world

Who is Gutefrage.net?

Germany's biggest Q&A platform

#1 German site (mobile), about 5M unique users

#3 German site (desktop), about 17M unique users

> 4 million page impressions/day

Part of the Holtzbrinck group

Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)

Page 4: openTSDB - Metrics for a distributed world

What you will get

Why we chose openTSDB

What is openTSDB?

How does openTSDB store the data?

Our experiences

Some advice

Page 5: openTSDB - Metrics for a distributed world

Why we chose openTSDB

Page 6: openTSDB - Metrics for a distributed world

We were looking at some options

                    Munin    Graphite    openTSDB    Ganglia
Scales well         no       sort of     yes         yes
Keeps all data      no       no          yes         no
Creating metrics    easy     easy        easy        easy

Page 7: openTSDB - Metrics for a distributed world

We have a winner!

                    Munin    Graphite    openTSDB    Ganglia
Scales well         no       sort of     yes         yes
Keeps all data      no       no          yes         no
Creating metrics    easy     easy        easy        easy

Bingo!

Page 8: openTSDB - Metrics for a distributed world

Separation of concerns

Page 9: openTSDB - Metrics for a distributed world

Separation of concerns

UI was not important for our decision

Alerting is not what we are looking for in our time series database

$ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep

Page 10: openTSDB - Metrics for a distributed world

The ecosystem

App feeds metrics in via RabbitMQ

We base Icinga checks on the metrics

We are evaluating Skyline and Oculus by Etsy for anomaly detection

We deploy sensors via Chef

Page 11: openTSDB - Metrics for a distributed world

openTSDB

Written by Benoît Sigoure at StumbleUpon

Open source (get it from GitHub)

Uses HBase (which runs on top of HDFS) for storage

Distributed system (multiple TSDs)

Page 12: openTSDB - Metrics for a distributed world

The big picture

[Diagram: tcollector, the UI and the API talk to several TSDs, which all store their data in HBase. Annotation on HBase: "This is really a cluster"]

Page 13: openTSDB - Metrics for a distributed world

Putting data into openTSDB

$ telnet tsd01.acme.com 4242
put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
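
The same line can be sent from a script instead of an interactive telnet session. A minimal Python sketch, assuming a TSD is reachable at the host and port used on the slide. `format_put` and `send_put` are hypothetical helper names chosen for this example, not part of openTSDB.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line in the plain-text 'put' format shown on the slide."""
    tag_part = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_part}\n"


def send_put(host, port, line):
    """Open a TCP connection to a TSD and send a single put line."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2,
                  {"host": "db01.acme.com"})
# send_put("tsd01.acme.com", 4242, line)  # needs a reachable TSD
```

Formatting is kept separate from sending so the line can be checked without a running TSD.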

Page 14: openTSDB - Metrics for a distributed world

It gets even better

tcollector is a Python script that runs your collectors

handles network connection, starts your collectors at set intervals

does basic process management

adds host tag, does deduplication

Page 15: openTSDB - Metrics for a distributed world

A simple tcollector script

#!/usr/bin/php
<?php

# Cast a die
$die = rand(1, 6);

# Print one line: metric name, Unix timestamp, value
echo "roll.a.d6 " . time() . " " . $die . "\n";
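
For comparison, here is the same collector as a Python sketch. Like the PHP version, it assumes that tcollector reads lines of the form `metric timestamp value` from the collector's standard output. `roll_line` is a name chosen for this example.

```python
#!/usr/bin/env python
import random
import time


def roll_line(now=None):
    """Return one 'metric timestamp value' line for a six-sided die roll."""
    timestamp = int(time.time() if now is None else now)
    return f"roll.a.d6 {timestamp} {random.randint(1, 6)}"


if __name__ == "__main__":
    # tcollector is assumed to pick this line up from stdout
    print(roll_line())
```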

Page 16: openTSDB - Metrics for a distributed world

What was that HDFS again?

HDFS is a distributed filesystem suitable for Petabytes of data on thousands of machines.

Runs on commodity hardware

Takes care of redundancy

Used by Facebook, Spotify, eBay and others

Page 17: openTSDB - Metrics for a distributed world

Okay... and HBase?

HBase is a NoSQL database / data store on top of HDFS

Modeled after Google's Bigtable

Built for big tables (billions of rows, millions of columns)

Automatic sharding by row key

Page 18: openTSDB - Metrics for a distributed world

How openTSDB stores the data

Page 19: openTSDB - Metrics for a distributed world

Keys are key!

Data is sharded across regions based on its row key

You query data based on the row key

You can query row key ranges (say e.g. A...D)

So: think about key design
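
Why key ranges matter can be illustrated with a plain sorted list standing in for a table. This is only a sketch of the idea, not HBase code. `range_scan` is a hypothetical helper, and the exclusive stop key mirrors how an HBase scan treats its stop row.

```python
from bisect import bisect_left


def range_scan(sorted_rows, start_key, stop_key):
    """Return rows with start_key <= key < stop_key from a key-sorted list."""
    keys = [key for key, _ in sorted_rows]
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return sorted_rows[lo:hi]


rows = sorted([("A1", 1), ("B7", 2), ("C3", 3), ("D0", 4), ("E9", 5)])
hits = range_scan(rows, "A", "D")
```

Because the rows are sorted by key, the scan touches only one contiguous slice. Data that is queried together should therefore share a key prefix.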

Page 20: openTSDB - Metrics for a distributed world

Take 1

Row key format: timestamp, metric id

Page 21: openTSDB - Metrics for a distributed world

Take 1

Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17

Server A

Server B

Page 22: openTSDB - Metrics for a distributed world

Take 1

Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17
1382536472, 6    24

Server A

Server B

Page 23: openTSDB - Metrics for a distributed world

Take 1

Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17
1382536472, 6    24
1382536472, 8    12
1382536473, 5    134
1382536473, 6    10
1382536473, 8    99

Server A

Server B

Page 24: openTSDB - Metrics for a distributed world

Take 1

Row key format: timestamp, metric id

Row key          Value
1382536472, 5    17
1382536472, 6    24
1382536472, 8    12
1382536473, 5    134
1382536473, 6    10
1382536473, 8    99
1382536474, 5    12
1382536474, 6    42

Server A

Server B

Page 25: openTSDB - Metrics for a distributed world

Solution: Swap timestamp and metric id

Row key format: metric id, timestamp

Row key          Value
5, 1382536472    17
6, 1382536472    24
8, 1382536472    12
5, 1382536473    134
6, 1382536473    10
8, 1382536473    99
5, 1382536474    12
6, 1382536474    42

Server A

Server B
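
The slides do not spell out why the timestamp-first key is a problem. A common explanation for range-sharded stores is that time-ordered keys make every new write sort after all existing data, so one server takes all the writes. The toy simulation below sketches that effect. The two regions, their split points and the helper names are assumptions made for this example.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_points):
    """Index of the region whose key range contains `key`."""
    return bisect_right(split_points, key)


def write_distribution(keys, split_points):
    """Count how many writes each region receives."""
    return Counter(region_for(k, split_points) for k in keys)


metrics = [5, 6, 8]
new_timestamps = [1382536475, 1382536476, 1382536477]

# Take 1: (timestamp, metric id), split point chosen from older data
ts_first = [(ts, m) for ts in new_timestamps for m in metrics]
ts_dist = write_distribution(ts_first, [(1382536473, 0)])

# Solution: (metric id, timestamp), split point between metric ids
metric_first = [(m, ts) for ts in new_timestamps for m in metrics]
metric_dist = write_distribution(metric_first, [(6, 0)])
```

In this toy setup all nine timestamp-first writes land in the last region, while the metric-first writes are spread over both regions.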

Page 26: openTSDB - Metrics for a distributed world

Solution: Swap timestamp and metric id

Row key format: metric id, timestamp

Row key          Value
5, 1382536472    17
6, 1382536472    24
8, 1382536472    12
5, 1382536473    134
6, 1382536473    10
8, 1382536473    99
5, 1382536474    12
6, 1382536474    42

Server A

Server B

Page 27: openTSDB - Metrics for a distributed world

Take 2

Metric ID first, then timestamp

Searching through many rows is slower than searching through fewer rows. (Obviously)

So: Put multiple data points into one row

Page 28: openTSDB - Metrics for a distributed world

Take 2 continued

5, 1382608800    +23   +35   +94   +142
                 17    1     23    42

5, 1382612400    +13   +25   +88   +89
                 3     44    12    2

Page 29: openTSDB - Metrics for a distributed world

Take 2 continued

5, 1382608800    +23   +35   +94   +142
                 17    1     23    42

5, 1382612400    +13   +25   +88   +89
                 3     44    12    2

Row key

Page 30: openTSDB - Metrics for a distributed world

Take 2 continued

5, 1382608800    +23   +35   +94   +142
                 17    1     23    42

5, 1382612400    +13   +25   +88   +89
                 3     44    12    2

Row key

Cell Name

Page 31: openTSDB - Metrics for a distributed world

Take 2 continued

Row key          Cell name     +23   +35   +94   +142
5, 1382608800    Data point    17    1     23    42

Row key          Cell name     +13   +25   +88   +89
5, 1382612400    Data point    3     44    12    2
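
The grouping on this slide can be sketched in a few lines of Python: each timestamp is split into a base time rounded down to the hour (part of the row key) and an offset in seconds (the cell name). The function names are chosen for this example, and the sample points are the ones from the slide.

```python
from collections import defaultdict

ROW_SPAN = 3600  # seconds covered by one row (one hour)


def split_timestamp(timestamp):
    """Split a Unix timestamp into (base hour, offset within that hour)."""
    base = timestamp - timestamp % ROW_SPAN
    return base, timestamp - base


def group_into_rows(metric_id, points):
    """Map (metric id, base hour) -> {offset: value}."""
    rows = defaultdict(dict)
    for timestamp, value in points:
        base, offset = split_timestamp(timestamp)
        rows[(metric_id, base)][offset] = value
    return dict(rows)


points = [(1382608823, 17), (1382608835, 1), (1382608894, 23),
          (1382608942, 42), (1382612413, 3), (1382612425, 44)]
rows = group_into_rows(5, points)
```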

Page 32: openTSDB - Metrics for a distributed world

Where are the tags stored?

They are put at the end of the row key

Both tag names and tag values are represented by IDs

Page 33: openTSDB - Metrics for a distributed world

The Row Key

3 Bytes - metric ID

4 Bytes - timestamp (rounded down to the hour)

3 Bytes - tag name ID (per tag)

3 Bytes - tag value ID (per tag)

Total: 7 Bytes + 6 Bytes * Number of tags
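
The layout above can be sketched as a small function that concatenates the fields. The IDs used here are made up. Big-endian encoding and sorting the tag pairs before appending them are assumptions of this sketch rather than details taken from the slides.

```python
def build_row_key(metric_id, timestamp, tag_ids):
    """Concatenate metric ID, hour-aligned timestamp and tag ID pairs."""
    base_hour = timestamp - timestamp % 3600
    key = metric_id.to_bytes(3, "big") + base_hour.to_bytes(4, "big")
    # Each tag contributes a 3-byte name ID and a 3-byte value ID
    for tag_name_id, tag_value_id in sorted(tag_ids):
        key += tag_name_id.to_bytes(3, "big") + tag_value_id.to_bytes(3, "big")
    return key


# Hypothetical IDs: metric 5, two tags (name ID, value ID)
key = build_row_key(5, 1382536472, [(1, 42), (2, 7)])
```

With two tags the key is 7 + 6 * 2 = 19 bytes long, matching the formula on the slide.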

Page 34: openTSDB - Metrics for a distributed world

Let's look at some graphs

Page 35: openTSDB - Metrics for a distributed world

Busting some Myths

Page 36: openTSDB - Metrics for a distributed world

Myth: Keeping Data is expensive

Gartner put the price of enterprise SSDs at $1/GB in 2013

A data point gets compressed to 2-3 Bytes

A metric that you measure every second then costs about 18.9 ct of disk space per year.

Usually it is even cheaper
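
The slide does not show how the 18.9 ct figure is derived. One reading that reproduces it is 6 bytes on disk per data point, for example 2 bytes stored three times. That replication factor is an assumption of this sketch, not something the slides state.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
PRICE_PER_GB_USD = 1.0          # from the slide
BYTES_PER_POINT_ON_DISK = 6     # assumption: e.g. 2 bytes x 3 copies


def yearly_cost_cents(bytes_per_point):
    """Disk cost in cents for one data point per second over one year."""
    gigabytes = SECONDS_PER_YEAR * bytes_per_point / 1e9
    return gigabytes * PRICE_PER_GB_USD * 100


cost = yearly_cost_cents(BYTES_PER_POINT_ON_DISK)
```

Without any replication, 2 to 3 bytes per point would come to roughly 6 to 10 ct per year.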

Page 37: openTSDB - Metrics for a distributed world

If your work costs $50 per hour and it takes you only one minute to think about and configure your RRD compaction setting, you could have collected that metric on a second-by-second basis for 4.4 YEARS instead.
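
The 4.4 years follow from dividing the cost of one minute of work by the yearly storage cost of about 18.9 ct quoted on the previous slide. The variable names are chosen for this worked example.

```python
HOURLY_RATE_USD = 50.0           # from the slide
MINUTES_SPENT = 1                # from the slide
YEARLY_STORAGE_COST_USD = 0.189  # 18.9 ct per metric per year

labour_cost_usd = HOURLY_RATE_USD * MINUTES_SPENT / 60
years_of_storage = labour_cost_usd / YEARLY_STORAGE_COST_USD
```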

Page 38: openTSDB - Metrics for a distributed world

Myth: the number of metrics is too limited

Don't confuse Graphite metric count with openTSDB metric count.

3 Bytes of metric ID = 16.7M possibilities

3 Bytes of tag value ID = 16.7M possibilities

=> at least 280 trillion metrics (Graphite counting)
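
The numbers can be checked directly: a 3-byte ID allows 2^24 values, and combining every metric ID with every value of a single tag gives 2^48 distinct series. The variable names are chosen for this example.

```python
UID_BYTES = 3

# Number of distinct IDs that fit into 3 bytes
ids_per_uid = 2 ** (8 * UID_BYTES)

# Every metric ID combined with every value of one tag
metric_and_value_combinations = ids_per_uid ** 2
```

This counts only one tag per series, which is why the slide says "at least".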

Page 39: openTSDB - Metrics for a distributed world

Cultural issues

Page 40: openTSDB - Metrics for a distributed world

Tools shape culture shapes tools

It is time for a new monitoring culture!

Embrace machine learning!

Monitor everything in your organisation!

Throw off the shackles of fixed intervals!

Come, join the revolution!

Page 41: openTSDB - Metrics for a distributed world

Our experiences

Page 42: openTSDB - Metrics for a distributed world

What works well

We store about 200M data points per day in several thousand time series with no issues

tcollector decouples measurement from storage

Creating new metrics is really easy

You are free to choose your rhythm

Page 43: openTSDB - Metrics for a distributed world

Challenges

The UI is seriously lacking

No annotation support out of the box

No metadata for time series

Only 1s time resolution (and only 1 value/s/time series)

Page 44: openTSDB - Metrics for a distributed world

Salvation is coming

OpenTSDB 2 is around the corner

millisecond precision

annotations and metadata

improved API

improved UI

Page 45: openTSDB - Metrics for a distributed world

Friendly advice

Pick a naming scheme and stick to it

Use tags wisely (not more than 6 or 7 tags per data point)

Use tcollector

wait for openTSDB 2 ;-)

Page 46: openTSDB - Metrics for a distributed world

Questions?

Please contact me:

[email protected]

@mydalon

I'll upload the slides and tweet about it
