
@chronosphereio

Querying millions to billions of metrics with M3DB’s index

FOSDEM 2020


@roskilli

Previously M3 tech lead at Uber, creator of M3DB.

CTO at Chronosphere.

Member of OpenMetrics.


Monitoring: what is a metric?

Schema for data you would like to collect and aggregate

Name

● http_requests

Dimensions/Labels

● endpoint (e.g. /api/search)
● status_code (e.g. 500)
● deploy_version_git_sha (e.g. 25149a04c)
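For concreteness, this is roughly what emitting such a metric looks like from a Go service using the Prometheus Go client (client_golang); the label values are just the example values above, and this sketch is an illustration rather than anything from the talk:

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// http_requests with the three dimensions from the slide.
var httpRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{Name: "http_requests"},
	[]string{"endpoint", "status_code", "deploy_version_git_sha"},
)

func main() {
	// Each distinct combination of label values becomes its own timeseries.
	httpRequests.WithLabelValues("/api/search", "500", "25149a04c").Inc()
}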


Problem

1. Increasing number of regions, containers, k8s pods, and tracking of deployed versions (cardinality!)

2. Metrics can have an arbitrary number of dimensions

3. Building a compound index is expensive


Adding more metrics at organizations

1. We have monitoring, it's awesome, and developers are mostly happy with standardized metrics.

2. Developers put custom metrics on everything and I'm deploying tons of applications on something like Kubernetes; things are OK!

3. Things are way too on fire, we can't manage this many things anymore, can everyone just stop please?


Timeseries

Timeseries from lots of hosts and container pods

ID   Timeseries
1    __name__=cpu_seconds_total, pod=foo-123abc
8    __name__=memory_memfree, pod=foo-123abc
33   __name__=cpu_seconds_total, pod=foo-456def
44   __name__=memory_memfree, pod=foo-456def
45   __name__=cpu_seconds_total, pod=bar-768ghe
58   __name__=memory_memfree, pod=bar-768ghe
…    millions, and if you are unfortunate, billions


Aggregate metric cpu_seconds_total

Same table of timeseries from lots of hosts and container pods as above; this query matches every cpu_seconds_total row regardless of pod (IDs 1, 33, 45).


cpu_seconds_total and pod=foo-(.+)

Same table again; now only the cpu_seconds_total rows whose pod matches foo-(.+) are selected (IDs 1 and 33).


Need high flexibility and speed

1. Any arbitrary set of dimensions/labels can be specified for filtering

2. Ideally lookup speed is sub-linear in the number of timeseries


Timeseries column lookup

1. Secondary lookup using prefix ordered table

Labels                          Timeseries ID (fingerprint)
__name__=cpu, pod=foo-123abc    1

2. Secondary inverted index

Label      Label value   Timeseries IDs
__name__   cpu           1, 2, 3
pod        foo-123abc    1
           foo-456def    2
           bar-123abc    3

Either way, the ID then points at the timeseries data columns:

ID   Timeseries                      Column key/value pairs
1    __name__=cpu, pod=foo-123abc    {t=..., v=...} ➡
2    __name__=cpu, pod=foo-456def    {t=..., v=...} ➡
3    __name__=cpu, pod=bar-123abc    {t=..., v=...} ➡
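A minimal in-memory sketch of the secondary inverted index above, using plain Go maps; this is only an illustration of the shape of the lookup, not the actual Prometheus or M3 structures:

package main

import "fmt"

// invertedIndex maps label name -> label value -> timeseries IDs.
type invertedIndex map[string]map[string][]int

func (idx invertedIndex) add(id int, labels map[string]string) {
	for name, value := range labels {
		if idx[name] == nil {
			idx[name] = map[string][]int{}
		}
		idx[name][value] = append(idx[name][value], id)
	}
}

// lookup returns the IDs whose label has exactly the given value.
func (idx invertedIndex) lookup(name, value string) []int {
	return idx[name][value]
}

func main() {
	idx := invertedIndex{}
	idx.add(1, map[string]string{"__name__": "cpu", "pod": "foo-123abc"})
	idx.add(2, map[string]string{"__name__": "cpu", "pod": "foo-456def"})
	idx.add(3, map[string]string{"__name__": "cpu", "pod": "bar-123abc"})

	fmt.Println(idx.lookup("__name__", "cpu"))   // [1 2 3]
	fmt.Println(idx.lookup("pod", "foo-123abc")) // [1]
}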


Ways to keep timeseries index/data

1. Index and data live separately
Lookup and returning timeseries data happen in separate processes, typically with a network request between the two operations.

2. Index and data live together
Lookup sits next to the timeseries data; data is sent back directly once it matches the index query.


v1

M3 storage evolution (pre-open release, 2015)

[Architecture diagram: queries 1. fetch the index from Elasticsearch (>100 servers), then 2. fetch data from Cassandra (>1,000 servers); an "already indexed" cache, a heavy read cache and a recently-read cache sit in front of the stores.]


v1

M3 storage evolution (pre-open release, 2016)

[Same architecture a year later: Elasticsearch index (>100 servers) plus Cassandra data (>1,000 servers), with the same caching tiers in front.]


v2

M3 storage evolution (pre-open release, 2016)

[Architecture diagram: M3DB (data on disk with LRU caches) replaces Cassandra for data; Elasticsearch still serves the index, with the "already indexed" and heavy-read caches in front.]

With M3DB: 7x fewer servers than with Cassandra, while increasing replication from RF=2 to RF=3.


v2

M3 storage evolution (pre-open release, 2018)

[Same architecture: M3DB (data on disk with LRU caches) for data, Elasticsearch for the index, with caching tiers still in front.]


v4

M3 storage evolution (open release, 2018)

All read/write caches for data/index now live in the M3DB nodes.

[Architecture diagram: queries go directly to M3DB, which keeps both data and index on disk with LRU caches.]


Inverted index w/ Prometheus
https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md

Label      Label value   Timeseries IDs (postings)
__name__   cpu_seconds   1, 33, 45
__name__   mem_free      8, 44, 58
pod        foo-123abc    1, 8
pod        foo-456def    33, 44
pod        bar-123abc    45, 58


Inverted index w/ Prometheus
https://github.com/prometheus/prometheus/blob/master/tsdb/docs/format/index.md

Label      Label value   TS IDs (postings)
__name__   cpu_seconds   1, 33, 45
__name__   mem_free      8, 44, 58
pod        foo-123abc    1, 8
pod        foo-456def    33, 44
pod        bar-123abc    45, 58

ID   Timeseries
1    __name__=cpu_seconds, pod=foo-123abc
8    __name__=mem_free, pod=foo-123abc
33   __name__=cpu_seconds, pod=foo-456def
44   __name__=mem_free, pod=foo-456def
45   __name__=cpu_seconds, pod=bar-123abc
58   __name__=mem_free, pod=bar-123abc
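As a worked example (mine, not from the slides), evaluating a selector like cpu_seconds{pod=~"foo-.*"} against the tables above means taking the union of the postings for the matching pod values ({1, 8} and {33, 44}) and intersecting it with the postings for __name__=cpu_seconds ({1, 33, 45}). A minimal sketch over plain sorted ID slices, in the spirit of merging postings lists:

package main

import "fmt"

// intersect merges two ascending postings lists, keeping common IDs.
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	// Postings from the table above.
	cpuSeconds := []int{1, 33, 45}              // __name__=cpu_seconds
	fooPods := []int{1, 8, 33, 44}              // union of pod=foo-123abc and pod=foo-456def
	fmt.Println(intersect(cpuSeconds, fooPods)) // [1 33]
}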


Inverted index w/ Prometheus: Labels (name and distinct values entries)


Inverted index w/ Prometheus: Postings/Timeseries IDs


Inverted index w/ Prometheus: Matching label values
https://github.com/prometheus/prometheus/blob/38d32e06862f6b72700f67043ce574508b5697f0/tsdb/querier.go#L417-L451

vals, err := ix.LabelValues(m.Name)
...
var res []string
for _, val := range vals {
	if m.Matches(val) {
		res = append(res, val)
	}
}
...
return ix.Postings(m.Name, res...) // Merges postings/timeseries IDs together


Inverted index w/ M3

1. Inverted index more similar to Elasticsearch & Apache Lucene.

2. Instead of storing distinct label values directly alongside their associated postings, stores the distinct label values in an FST (Finite State Transducer).

3. Instead of storing postings/timeseries IDs as plain integer sets (one after another), stores them as Roaring Bitmaps (compressed bitmaps) for fast intersection (across thousands of sets).
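A minimal sketch of the compressed-bitmap side, written against the RoaringBitmap/roaring Go library (M3DB uses its own roaring implementation, so treat this purely as an illustration of fast postings intersection, not M3's code):

package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Postings stored as compressed bitmaps instead of plain integer sets.
	cpuSeconds := roaring.BitmapOf(1, 33, 45) // __name__=cpu_seconds
	fooPods := roaring.BitmapOf(1, 8, 33, 44) // pod=~"foo-.*" (union of matching values)

	// Intersection is a fast bitmap AND rather than a merge of sorted ID lists.
	matched := roaring.And(cpuSeconds, fooPods)
	fmt.Println(matched.ToArray()) // [1 33]
}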


What is an FST?

Like a compressed trie.

Good overview and some examples at https://blog.burntsushi.net/transducers/

Searching a data set of Wikipedia titles is more than 10x faster than grep.

This matters when you have billions of metrics, e.g. Uber with 11 billion metrics.
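For a concrete feel, here is a tiny sketch of building and querying an FST of label values with the blevesearch/vellum library (M3DB uses a vellum fork for its FST segments; the keys and values below are just the example pods mapped to made-up postings offsets, not M3's actual on-disk layout):

package main

import (
	"bytes"
	"fmt"

	"github.com/blevesearch/vellum"
)

func main() {
	var buf bytes.Buffer
	b, _ := vellum.New(&buf, nil)

	// Keys (label values) must be inserted in sorted order; each maps to a
	// uint64, e.g. an offset to that value's postings list.
	b.Insert([]byte("bar-123abc"), 300)
	b.Insert([]byte("foo-123abc"), 100)
	b.Insert([]byte("foo-456def"), 200)
	b.Close()

	fst, _ := vellum.Load(buf.Bytes())
	offset, ok, _ := fst.Get([]byte("foo-123abc"))
	fmt.Println(offset, ok) // 100 true
}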


https://github.com/chronosphereiox/high_cardinality_microbenchmark

Disclaimer: this only tests one part of much bigger systems, mainly to support architectural choices, not to measure real-world performance.

Demo


Thank you to M3 contributors:…@chronosphere.io, …@uber.com, …@aiven.io, …@cloudera.com, …@linkedin.com and many other great individuals!

Learn more (release 0.15.0 coming soon):

● Slack https://bit.ly/m3slack
● Mailing list https://groups.google.com/forum/#!forum/m3db
● GitHub https://github.com/m3db/m3
● Documentation https://m3db.io
● Chronosphere contact@chronosphere.io

Thank you, questions? Come say hi

@chronosphereio