OpenTSDB for monitoring @ Criteo


Nathaniel Braun

Thursday, April 28th, 2016


Hitch hiker’s guide to this presentation

• Overview of Hadoop @ Criteo
• Our experimental cluster
• Rationale for OpenTSDB
• Stabilizing & scaling OpenTSDB
• OpenTSDB to the rescue in practice

Overview of Hadoop @ Criteo


Overview of Hadoop @ Criteo

Criteo’s 8 Hadoop clusters – running CDH Community Edition:

• Amsterdam AM5 – PROD
• Paris PA4 – PROD / PREPROD
• Paris PA3 – PREPROD / EXP
• Sunnyvale SV6 – PROD NA
• Tokyo TY5 – PROD AS
• Hong Kong HK5 – PROD CN

Overview of Hadoop @ Criteo – Production AM5

AM5: main production cluster

• In use since 2011
• Running CDH3 initially, CDH4 currently
• 1 118 DataNodes
• 13 400+ compute cores
• 39 PB of raw disk storage
• 105 TB of RAM capacity
• 40 TB of data imported every day, mostly through HTTPFS
• 100 000+ jobs run daily

Overview of Hadoop @ Criteo – Production PA4

PA4: comparable to AM5, with fewer machines

• Migration done in Q4 2015 – H1 2016
• Running CDH5
• 650+ DataNodes
• 15 600+ compute cores
• 54 PB of raw disk storage
• 143 TB of RAM capacity
• Huawei servers (AM5 is HP-based)

Overview of Hadoop @ Criteo – Production local clusters

Criteo has 3 local production Hadoop clusters:

• Sunnyvale (SV6): 20 nodes
• Tokyo (TY5): 35 nodes
• Hong Kong (HK5): 20 nodes

Overview of Hadoop @ Criteo – Preproduction clusters

Criteo has 3 preproduction Hadoop clusters:

• Preprod PA3: 54 nodes, running CDH4
• Preprod PA4: 42 nodes, running CDH5
• Experimental: 53 nodes, running CDH5

Overview of Hadoop @ Criteo – Usage

Types of jobs running on our clusters:

• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)
• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning
• Scalding jobs for analytics
• Hive queries for Business Intelligence
• Spark jobs on CDH5

Overview of Hadoop @ Criteo – Special considerations

• Kerberos for security
• High availability on NameNodes and the ResourceManager (CDH5 only)
• Infrastructure installed & maintained with Chef

Overview of Hadoop @ Criteo

How can we monitor this complex infrastructure and the services running on top of it?

Our experimental cluster


Our experimental cluster – Purpose

• Useful for testing infrastructure changes without impacting users (no SLA)
• Test environment for new technologies
• HBase
  o Natural joins
  o OpenTSDB for metrology & monitoring
  o hRaven for detailed job data (not used anymore)
• Spark, now in production @ PA4

Our experimental cluster – HBase features

• Based on the Google BigTable paper
• Integrated with the Hadoop stack
• Stores data in rows sorted by row key
• Uses regions as ordered sets of rows
• Regions sharded by row key bounds
• Regions managed by region servers, collocated with DataNodes (data is stored on HDFS)
• Oversized regions split into two regions
• Values stored in columns, with no fixed schema as in an RDBMS
• Columns grouped into column families

Our experimental cluster – HBase architecture

Row key     | CF0: user                           | CF1: event
(user UID)  | C0: IP | C2: browser | C3: e-mail   | C0: time | C1: type | C2: web site
------------+--------+-------------+--------------+----------+----------+-------------
AAA         | value  | Firefox     | NULL         | …        | Click    | Client #0
BBB         | value  | Chrome      | NULL         | …        | Click    | Client #0
CCC         | value  | Chrome      | ccc@mail.com | …        | Display  | Client #1
DDD         | value  | IE          | NULL         | …        | Sales    | Client #2
EEE         | value  | IE          | NULL         | …        | Display  | Client #0
FFF         | value  | IE          | NULL         | …        | Display  | Client #3
∙∙∙         | ∙∙∙    | ∙∙∙         | ∙∙∙          | ∙∙∙      | ∙∙∙      | ∙∙∙
XXX         | value  | Firefox     | NULL         | …        | Sales    | Client #4
YYY         | value  | Chrome      | NULL         | …        | Bid      | Client #5
ZZZ         | value  | Opera       | zzz@mail.com | …        | Click    | Client #5

Contiguous row-key ranges form regions (R0, R1, …, R5 in the original diagram), and each region is served by a region server (RS1, RS2, …).

Our experimental cluster – HBase @ Criteo

HBase on the experimental cluster:

• 50 region servers
• 44 000+ regions
• ~90 000 requests / second from OpenTSDB

Rationale for OpenTSDB


Rationale for using OpenTSDB – Infrastructure monitoring

Metrics to monitor:

• CPU load
• Processes & threads
• RAM available/reserved
• Free/used disk space
• Network statistics
• Sockets open/closed
• Open connections with their statuses
• Network traffic

Rationale for using OpenTSDB – Service monitoring

Services to monitor, per layer:

• YARN: NodeManagers, ResourceManagers
• HDFS: DataNodes, NameNodes, JournalNodes
• ZooKeeper, Kerberos
• HBase
• Kafka, Storm

Huge diversity of services!

Rationale for using OpenTSDB – Scale

• Diversity
  • Many types of nodes & services
  • Must be simple to extend with new metrics
• Scale
  • > 2 500 servers
  • ~ 90 000 requests / second
• Storage
  • Keep fine-grained resolution (down to the minute, at least)
  • Long-term storage for analysis & investigation

Rationale for using OpenTSDB – Solution

• Suits the problem well: “Hadoop for monitoring Hadoop”
• Designed for time series: HBase schema optimized for time series queries
• Scalable and resilient, thanks to HBase
• Easily extensible: writing a data collector is easy
• Simple to query

Rationale for using OpenTSDB – Easy to query

require 'net/http'
require 'uri'
require 'json'

uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
http = Net::HTTP.start(uri.hostname, uri.port)
http.read_timeout = 300

params = {
  'start'   => '2016/04/21-10:00:00',
  'end'     => '2016/04/21-12:00:00',
  'queries' => [{
    'aggregator' => 'min',
    'downsample' => '5m-min',
    'metric'     => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
    'tags'       => {
      'cluster' => 'ams',
      'host'    => 'rm.hpc.criteo.prod'
    }
  }]
}

request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
request.body = params.to_json
response = http.request(request)
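Not shown on the slide, but as a rough sketch of what comes back: /api/query returns a JSON array of series objects, each carrying the metric name, its tags and a dps map of timestamp → value, so the response above can be consumed roughly like this (which series come back of course depends on the query):

results = JSON.parse(response.body)
results.each do |series|
  # 'dps' maps UNIX timestamps (as strings) to the aggregated values
  series['dps'].sort_by { |ts, _| ts.to_i }.each do |ts, value|
    puts "#{series['metric']} #{ts} #{value} #{series['tags']}"
  end
end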

Rationale for using OpenTSDB – Practical UI

The built-in UI lets you build a chart by picking a metric, a time range, tag keys/values, and an aggregator.

Rationale for using OpenTSDB – Design

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors
• Some TSDs are used for writing, others for reading, while tcollectors collect metrics
• TSDs are stateless
• TSDs use asyncHBase to scale
• Quiz: what are the advantages?

1. Clients never interact with HBase directly
2. Simple protocol → easy to use & extend (see the sketch below)
3. No state, no synchronization → great scalability
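To make the “simple protocol” point concrete: a TSD accepts plain-text put lines on its TCP port, one data point per line. A minimal sketch in Ruby, assuming a write TSD reachable under a hypothetical host name (Criteo’s actual write endpoints may be named differently):

require 'socket'

# Push one data point to a write TSD over its plain-text interface.
# Line format: put <metric> <unix_timestamp> <value> <tagk=tagv> ...
tsd = TCPSocket.new('0.wtsd.hpc.criteo.preprod', 4242)   # hypothetical host
tsd.puts "put proc.loadavg.15min #{Time.now.to_i} 0.42 host=#{Socket.gethostname}"
tsd.close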

Rationale for using OpenTSDB – Metrics

• A metric data point consists of:
  • a metric name
  • a UNIX timestamp
  • a value (64-bit integer or single-precision floating point)
  • tags (key-value pairs) specific to that metric instance
• Tags are useful for aggregations on time series

Example data point (a minimal collector emitting such lines is sketched after this slide):
proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Chart shown: 15-minute load average with the count aggregator (a proxy for machine count)
• Quiz: what is the chart below?
  (Chart labels: proc.loadavg.15min, then proc.loadavg.15min with cluster=*)
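A tcollector collector is just an executable that periodically prints lines in exactly the format above (metric, timestamp, value, tags) to stdout; tcollector forwards them to a TSD. A minimal sketch, assuming a Linux host and an illustrative 15-second interval:

#!/usr/bin/env ruby
# Minimal tcollector-style collector: emit the 15-minute load average
# as "metric timestamp value tag=value" lines on stdout.
require 'socket'

loop do
  load15 = File.read('/proc/loadavg').split[2]   # third field = 15-min load
  puts "proc.loadavg.15min #{Time.now.to_i} #{load15} host=#{Socket.gethostname}"
  $stdout.flush                                  # tcollector reads line by line
  sleep 15
end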

Rationale for using OpenTSDB – HBase schema

• A single data table (split into regions), named tsdb
• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
  • timestamp is rounded down to the hour
  • This schema groups data from the same metric & time bucket close together (HBase sorts rows by row key)
  • Assumption: queries filter first on time range, then metric, then tags, in that order of preference
  • Tag keys are sorted lexicographically
  • Tags should be limited in number, because they are part of the row key; usually fewer than 5 tags
• Values are stored in columns
  • Column name: 2 or 4 bytes. For 2 bytes:
    • Encodes an offset of up to 3 600 seconds within the hour → 2^12 = 4096 → 12 bits
    • 4 bits left for format/type flags
• Other tables hold metadata and name ↔ ID mappings
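As a rough illustration of that layout (3-byte UIDs as in the default OpenTSDB configuration; the UID values themselves are made up for the example):

# Sketch: pack an OpenTSDB row key and a 2-byte column qualifier.
metric_uid = 0x000001        # hypothetical 3-byte UID assigned to the metric
tagk_uid   = 0x000002        # hypothetical UID for the tag key (e.g. "host")
tagv_uid   = 0x00001a        # hypothetical UID for the tag value

timestamp = 1461781436                       # data point's UNIX timestamp
base_time = timestamp - (timestamp % 3600)   # rounded down to the hour
offset    = timestamp - base_time            # seconds within the hour (< 3600)

to3 = ->(uid) { [uid].pack('N')[1, 3] }      # low 3 bytes, big-endian

row_key   = to3.call(metric_uid) + [base_time].pack('N') +
            to3.call(tagk_uid) + to3.call(tagv_uid)
qualifier = [(offset << 4) | 0x7].pack('n')  # 12-bit offset + 4 flag bits (8-byte int)

puts row_key.unpack1('H*')     # hex dump of the 13-byte row key
puts qualifier.unpack1('H*')   # hex dump of the 2-byte qualifier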

Rationale for using OpenTSDB – HBase schema

(Figure: hexadecimal representation of a row key with two tags, and sorted row keys for the same metric, UID 000001.)

Note: row key size varies across rows, because of tags

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

• 367 513 metrics
• 30 tag keys (!)
• 86 194 tag values

Stabilizing & scaling OpenTSDB

Scaling OpenTSDB

OpenTSDB was hard to scale at first. What problem can you see?

• Answer: we’re missing data points

Scaling OpenTSDB – Lessons learned

• Analyze all the layers of the system
• Logs are your friends
• Change parameters one by one, not all at once
• Measure, change, deploy, measure. Rinse, repeat

Scaling OpenTSDB – Nifty trick

Varnish & OpenResty save the day:

• OpenResty: rewrites POST queries into GETs
• Varnish: caches responses + load-balances
• RTSD: read OpenTSDB instances
(several instances of each tier)
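Why the POST → GET rewrite helps: Varnish keys its cache on the request URL, and OpenTSDB also accepts queries as GETs with everything encoded in the query string (m=<aggregator>:[<downsample>:]<metric>{tags}). A sketch of the equivalent cacheable GET for the query from the earlier slide, assuming the same illustrative host name:

require 'net/http'
require 'uri'
require 'erb'

# Same query as the POST example, expressed as a cache-friendly GET.
m = 'min:5m-min:hadoop.resourcemanager.queuemetrics.root.AllocatedMB{cluster=ams,host=rm.hpc.criteo.prod}'
uri = URI("http://0.rtsd.hpc.criteo.preprod:4242/api/query" \
          "?start=2016/04/21-10:00:00&end=2016/04/21-12:00:00&m=#{ERB::Util.url_encode(m)}")
response = Net::HTTP.get_response(uri)   # identical GETs can now be served from Varnish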

OpenTSDB to the rescue in practice

OpenTSDB to the rescue in practice – Easier to use than logs

(Chart: hadoop.namenode.fsnamesystem.tag.HAState)

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing has changed
• Would be impossible to see with daily aggregation
• Trivia: we fixed the tcollector to get that metric

OpenTSDB to the rescue in practice – Investigation

(Chart: hadoop.nodemanager.direct.TotalCapacity)

• Huge memory capacity spike
• Node not reporting points
• Another huge spike
• No data

OpenTSDB to the rescue in practice – Superimpose charts

(Charts: hadoop.nodemanager.direct.TotalCapacity superimposed with hadoop.nodemanager.jvmmetrics.GcTimeMillis)

• Service restart – configuration change
• Service restart – OOM

Log extract: “NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)”

OpenTSDB to the rescue in practice – Hiccups

(Charts: hadoop.nodemanager.direct.TotalCapacity and hadoop.nodemanager.jvmmetrics.GcTimeMillis)

• OpenTSDB problem – not node-specific
• Node probably dead

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystem.BlocksTotal)

• File deletion (seen twice)
• File creation

OpenTSDB to the rescue in practice – NameNode rescue

(Charts: hadoop.namenode.fsnamesystem.BlocksTotal and hadoop.namenode.fsnamesystem.FilesTotal)

• Slope
• Be careful about the scale!

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes)

Quiz: what is this pattern?

• Answer: NameNode checkpoint
• Note: done at regular intervals
• Trivia: never do a failover during a checkpoint!

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes)

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!
• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old


Criteo ♥ BigData

- Very accessible: only 50 euros, which will be given to charity

- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …

https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587


Criteo is hiring!

http://labs.criteo.com/
