OpenTSDB for monitoring @ Criteo

Page 1: OpenTSDB for monitoring @ Criteo

Nathaniel Braun

Thursday, April 28th, 2016

OpenTSDB for

monitoring @ Criteo


Page 2: OpenTSDB for monitoring @ Criteo

2 | Copyright © 2016 Criteo

•Overview of Hadoop @ Criteo

•Our experimental cluster

•Rationale for OpenTSDB

•Stabilizing & scaling OpenTSDB

•OpenTSDB to the rescue in practice

Hitchhiker’s guide to this presentation

Page 3: OpenTSDB for monitoring @ Criteo

Overview of Hadoop @ Criteo


Page 4: OpenTSDB for monitoring @ Criteo

4 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo

Tokyo TY5 – PROD AS

Sunnyvale SV6 – PROD NA

Hong Kong HK5 – PROD CN

Paris PA4 – PROD / PREPROD

Paris PA3 – PREPROD / EXP

Amsterdam AM5 – PROD

Criteo’s 8 Hadoop clusters – running CDH Community Edition

Page 5: OpenTSDB for monitoring @ Criteo

5 | Copyright © 2016 Criteo

AM5: main production cluster

• In use since 2011

• Running CDH3 initially, CDH4 currently

• 1118 DataNodes

• 13 400+ compute cores

• 39 PB of raw disk storage

• 105 TB of RAM capacity

• 40 TB of data imported every day, mostly through HTTPFS

• 100 000+ jobs run daily

Overview of Hadoop @ Criteo – Production AM5

Page 6: OpenTSDB for monitoring @ Criteo

6 | Copyright © 2016 Criteo

PA4: comparable to AM5, with fewer machines

• Migration done in Q4 2015 – H1 2016

• Running CDH5

• 650+ DataNodes

• 15 600+ compute cores

• 54 PB of raw disk storage

• 143 TB of RAM capacity

• Huawei servers (AM5 is HP-based)

Overview of Hadoop @ Criteo – Production PA4

Page 7: OpenTSDB for monitoring @ Criteo

7 | Copyright © 2016 Criteo

Criteo has 3 local production Hadoop clusters

• Sunnyvale (SV6): 20 nodes

• Tokyo (TY5): 35 nodes

• Hong Kong (HK5): 20 nodes

Overview of Hadoop @ Criteo – Production local clusters

Page 8: OpenTSDB for monitoring @ Criteo

8 | Copyright © 2016 Criteo

Criteo has 3 preproduction Hadoop clusters

• Preprod PA3: 54 nodes, running CDH4

• Preprod PA4: 42 nodes, running CDH5

• Experimental: 53 nodes, running CDH5

Overview of Hadoop @ Criteo – Preproduction clusters

Page 9: OpenTSDB for monitoring @ Criteo

9 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo – Usage

Types of jobs running on our clusters

• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)

• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning

• Scalding jobs for analytics

• Hive queries for Business Intelligence

• Spark jobs on CDH5

Page 10: OpenTSDB for monitoring @ Criteo

10 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo – Special consideration

• Kerberos for security

• High-availability on NameNodes and ResourceManager (CDH5 only)

• Infrastructure installed & maintained with Chef

Page 11: OpenTSDB for monitoring @ Criteo

11 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo

How can we monitor this complex

infrastructure and services running on top

of it?

Page 12: OpenTSDB for monitoring @ Criteo

Our experimental cluster


Page 13: OpenTSDB for monitoring @ Criteo

13 | Copyright © 2016 Criteo

• Useful for testing infrastructure changes without impacting users (no SLA)

• Test environment for new technologies

• HBase

o Natural joins

o OpenTSDB for metrology & monitoring

o hRaven for detailed job data (not used anymore)

• Spark, now in production @ PA4

Our experimental cluster – Purpose

Page 14: OpenTSDB for monitoring @ Criteo

14 | Copyright © 2016 Criteo

• Based on Google BigTable paper

• Integrated with the Hadoop stack

• Stores data in rows sorted by row key

• Uses regions as an ordered set of rows

• Regions sharded by row key bounds

• Regions managed by Region servers, collocated with DataNodes (data is stored on HDFS)

• Oversize regions split into two regions

• Values stored in columns, with no fixed schema (unlike an RDBMS)

• Columns grouped in column families

Our experimental cluster – HBase features

Page 15: OpenTSDB for monitoring @ Criteo

15 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Row key      CF0: user                                CF1: event
(user UID)   C0: IP   C2: browser   C3: e-mail        C0: time   C1: type   C2: web site
AAA          value    Firefox       NULL              …          Click      Client #0
BBB          value    Chrome        NULL              …          Click      Client #0
CCC          value    Chrome        [email protected]  …          Display    Client #1
DDD          value    IE            NULL              …          Sales      Client #2
EEE          value    IE            NULL              …          Display    Client #0
FFF          value    IE            NULL              …          Display    Client #3
∙∙∙          ∙∙∙      ∙∙∙           ∙∙∙               ∙∙∙        ∙∙∙        ∙∙∙
XXX          value    Firefox       NULL              …          Sales      Client #4
YYY          value    Chrome        NULL              …          Bid        Client #5
ZZZ          value    Opera         [email protected]  …          Click      Client #5

Page 16: OpenTSDB for monitoring @ Criteo

16 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Same user/event table as on the previous slide, with the sorted rows split into regions R0, R1, …, R5 (sharded by row key bounds).

Page 17: OpenTSDB for monitoring @ Criteo

17 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Same table again, with regions R0, R1, …, R5 assigned to region servers RS1 and RS2 (each region server hosts several regions).

Page 18: OpenTSDB for monitoring @ Criteo

18 | Copyright © 2016 Criteo

HBase on the experimental cluster

• 50 region servers

• 44 000+ regions

• ~90 000 requests / second from OpenTSDB

Our experimental cluster – HBase @ Criteo

Page 19: OpenTSDB for monitoring @ Criteo

Rationale for OpenTSDB


Page 20: OpenTSDB for monitoring @ Criteo

20 | Copyright © 2016 Criteo

Metrics to monitor:

• CPU load

• Processes & threads

• RAM available/reserved

• Free/used disk space

• Network statistics

• Sockets open/closed

• Open connections with their statuses

• Network traffic

Rationale for using OpenTSDB – Infrastructure monitoring

Page 21: OpenTSDB for monitoring @ Criteo

21 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Service monitoring

YARN: NodeManagers, ResourceManagers

HDFS: DataNodes, NameNodes, JournalNodes

ZooKeeper, Kerberos, HBase, Kafka, Storm

Page 22: OpenTSDB for monitoring @ Criteo

22 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Service monitoring

YARN: NodeManagers, ResourceManagers

HDFS: DataNodes, NameNodes, JournalNodes

ZooKeeper, Kerberos, HBase, Kafka, Storm

Huge diversity of services!

Page 23: OpenTSDB for monitoring @ Criteo

23 | Copyright © 2016 Criteo

• Diversity

• Many types of nodes & services

• Must be extensible simply to add new metrics

• Scale

• > 2 500 servers

• ~ 90 000 requests / second

• Storage

• Keep fine-grained resolution (down to the minute, at least)

• Long-term storage for analysis & investigation

Rationale for using OpenTSDB – Scale

Page 24: OpenTSDB for monitoring @ Criteo

24 | Copyright © 2016 Criteo

• Suits the problem well: “Hadoop for monitoring Hadoop”

• Designed for time series: HBase schema optimized for time series queries

• Scalable and resilient, thanks to HBase

• Easily extensible: writing a data collector is easy

• Simple to query

Rationale for using OpenTSDB – Solution
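Writing a collector really is that simple: tcollector runs every script in its collectors directory and forwards each stdout line to a TSD. A minimal sketch of the line format (the helper name and sample values are ours):

```ruby
# tcollector line protocol: "<metric> <unix_timestamp> <value> <tag1=v1> ..."
# Helper name and sample values are illustrative.
def tsd_line(metric, value, tags, ts = Time.now.to_i)
  tag_str = tags.sort.map { |k, v| "#{k}=#{v}" }.join(' ')
  "#{metric} #{ts} #{value} #{tag_str}"
end

# A collector script emits such lines on stdout in a loop, flushing after
# each data point so tcollector picks them up immediately:
puts tsd_line('proc.loadavg.15min', 1.5, { 'host' => 'node01' }, 1461781436)
$stdout.flush
```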

Page 25: OpenTSDB for monitoring @ Criteo

25 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Easy to query

# Query OpenTSDB's HTTP API: POST /api/query with a JSON body.
require 'net/http'
require 'json'
require 'uri'

uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
http = Net::HTTP.start(uri.hostname, uri.port)
http.read_timeout = 300

params = {
  'start'   => '2016/04/21-10:00:00',
  'end'     => '2016/04/21-12:00:00',
  'queries' => [{                       # an array of sub-queries
    'aggregator' => 'min',
    'downsample' => '5m-min',
    'metric'     => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
    'tags'       => {
      'cluster' => 'ams',
      'host'    => 'rm.hpc.criteo.prod'
    }
  }]
}

request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
request.body = params.to_json
response = http.request(request)

Page 26: OpenTSDB for monitoring @ Criteo

26 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Page 27: OpenTSDB for monitoring @ Criteo

27 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Metric

Page 28: OpenTSDB for monitoring @ Criteo

28 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Page 29: OpenTSDB for monitoring @ Criteo

29 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Tag keys/values

Page 30: OpenTSDB for monitoring @ Criteo

30 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Tag keys/valuesAggregator

Page 31: OpenTSDB for monitoring @ Criteo

31 | Copyright © 2016 Criteo

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors

• Some TSDs are used for writing, others for reading, while tcollectors collect metrics

• TSDs are stateless

• TSDs use asyncHBase to scale

• Quiz: what are the advantages?

Rationale for using OpenTSDB – Design

Page 32: OpenTSDB for monitoring @ Criteo

32 | Copyright © 2016 Criteo

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors

• Some TSDs are used for writing, others for reading, while tcollectors collect metrics

• TSDs are stateless

• TSDs use asyncHBase to scale

• Quiz: what are the advantages?

Rationale for using OpenTSDB – Design

1. Clients never interact

with HBase directly

2. Simple protocol → easy

to use & extend

3. No state, no

synchronization → great

scalability

Page 33: OpenTSDB for monitoring @ Criteo

33 | Copyright © 2016 Criteo

• A metric data point consists of:

• a metric name

• a UNIX timestamp

• a value (64-bit integer or single-precision floating point)

• tags (key-value pairs) specific to that time series

• Tags are useful for aggregating time series

proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Charts: 15-minute load average with the count aggregator (a proxy for machine count)

• Quiz: what is the chart below?

Rationale for using OpenTSDB – Metrics

proc.loadavg.15min
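The same line format, prefixed with put, is what a TSD accepts on its plain TCP socket (4242 is OpenTSDB's default port). A sketch; the write-TSD hostname below is made up:

```ruby
require 'socket'

# Build the telnet-style command a TSD accepts for one data point.
def put_command(metric, timestamp, value, tags)
  tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  "put #{metric} #{timestamp} #{value} #{tag_str}"
end

# Send a data point to a write TSD over its socket.
def send_datapoint(host, *args)
  TCPSocket.open(host, 4242) { |sock| sock.puts put_command(*args) }
end

# send_datapoint('0.wtsd.hpc.criteo.preprod', 'proc.loadavg.15min',
#                1461781436, 15, 'host' => '0.namenode.hpc.criteo.prod')
```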

Page 34: OpenTSDB for monitoring @ Criteo

34 | Copyright © 2016 Criteo

• A metric data point consists of:

• a metric name

• a UNIX timestamp

• a value (64-bit integer or single-precision floating point)

• tags (key-value pairs) specific to that time series

• Tags are useful for aggregating time series

proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Charts: 15-minute load average with the count aggregator (a proxy for machine count)

• Quiz: what is the chart below?

Rationale for using OpenTSDB – Metrics

proc.loadavg.15min

proc.loadavg.15min cluster=*

Page 35: OpenTSDB for monitoring @ Criteo

35 | Copyright © 2016 Criteo

• A single data table (split into regions), named tsdb

• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]

• timestamp is rounded down to the hour

• This schema helps group data from the same metric & time bucket close together (HBase sorts rows based on the row key)

• Assumption: query first on time range, then metric, then tags, in that order of preference

• Tag keys are sorted lexicographically

• Tags should be limited, because they are part of the row key: usually fewer than 5 tags

• Values are stored in columns

• Column qualifier: 2 or 4 bytes. For 2 bytes:

• Encodes the offset within the hour, up to 3 600 seconds → 2^12 = 4096 → 12 bits

• 4 bits left for format/type flags

• Other tables, for metadata and name ↔ ID mappings

Rationale for using OpenTSDB – HBase schema
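As a sketch, the row key and the 2-byte qualifier can be assembled like this (3-byte UIDs are OpenTSDB's default width; all UID values and helper names here are ours):

```ruby
# Row key: <metric_uid><base_timestamp><tagk1><tagv1>... with 3-byte UIDs
# and a 4-byte base timestamp rounded down to the hour.
def row_key(metric_uid, timestamp, tags)
  base_ts = timestamp - (timestamp % 3600)           # hourly time bucket
  key = [metric_uid].pack('N')[1, 3]                 # 3-byte metric UID
  key << [base_ts].pack('N')                         # 4-byte base timestamp
  tags.sort.each do |tagk_uid, tagv_uid|             # tag keys sorted
    key << [tagk_uid].pack('N')[1, 3] << [tagv_uid].pack('N')[1, 3]
  end
  key
end

# 2-byte column qualifier: 12-bit offset within the hour + 4 format/type bits.
def qualifier(offset_seconds, flags)
  [(offset_seconds << 4) | flags].pack('n')
end
```

With one tag this yields a 13-byte key (3 + 4 + 3 + 3): keys for the same metric and hour sort next to each other, which is exactly what makes range scans over a time window cheap.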

Page 36: OpenTSDB for monitoring @ Criteo

36 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – HBase schema

Hexadecimal representation of a row key, with two tags

Sorted row keys for the same metric (UID 000001)

Note: row key size varies across rows, because of tags

Page 37: OpenTSDB for monitoring @ Criteo

37 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

Page 38: OpenTSDB for monitoring @ Criteo

38 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

Page 39: OpenTSDB for monitoring @ Criteo

39 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

367 513 metrics

30 tag keys (!)

86 194 tag values

Page 40: OpenTSDB for monitoring @ Criteo

Stabilizing & scaling OpenTSDB

Page 41: OpenTSDB for monitoring @ Criteo

41 | Copyright © 2016 Criteo

OpenTSDB was hard to scale at first. What problem can you see?

Scaling OpenTSDB

Page 42: OpenTSDB for monitoring @ Criteo

42 | Copyright © 2016 Criteo

OpenTSDB was hard to scale at first. What problem can you see?

Scaling OpenTSDB

We’re missing data points

Page 43: OpenTSDB for monitoring @ Criteo

43 | Copyright © 2016 Criteo

• Analyze all the layers of the system

• Logs are your friends

• Change parameters one by one, not all at once

• Measure, change, deploy, measure. Rinse, repeat

Scaling OpenTSDB – Lessons learned

Page 44: OpenTSDB for monitoring @ Criteo

44 | Copyright © 2016 Criteo

Varnish & OpenResty save the day

Scaling OpenTSDB – Nifty trick

OpenResty (POST → GET) → Varnish (cache + LB) → read TSDs (RTSD)

The layer is replicated three times: each OpenResty instance rewrites POST queries into GETs, and each Varnish instance caches responses and load-balances across the read TSDs.
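Why rewrite POST into GET? Varnish caches GET requests keyed by URL, but clients hit the query API with JSON POST bodies. OpenResty does the rewrite in Lua; here is a Ruby sketch of the same canonicalization, using OpenTSDB's GET syntax m=<aggregator>:<downsample>:<metric>{tags} (the helper name is ours):

```ruby
require 'json'
require 'uri'

# Turn a JSON POST body for /api/query into an equivalent, canonical GET
# URL, so a cache like Varnish can use the URL as its cache key.
def post_to_get(json_body)
  q = JSON.parse(json_body)
  parts = ["start=#{URI.encode_www_form_component(q['start'])}",
           "end=#{URI.encode_www_form_component(q['end'])}"]
  q['queries'].each do |query|
    m = "#{query['aggregator']}:#{query['downsample']}:#{query['metric']}"
    tags = (query['tags'] || {}).sort.map { |k, v| "#{k}=#{v}" }.join(',')
    m += "{#{tags}}" unless tags.empty?    # tags sorted for a stable cache key
    parts << "m=#{URI.encode_www_form_component(m)}"
  end
  "/api/query?#{parts.join('&')}"
end
```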

Page 45: OpenTSDB for monitoring @ Criteo

45 | Copyright © 2016 Criteo

Varnish & OpenResty save the day

Scaling OpenTSDB – Nifty trick

OpenResty (POST → GET) → Varnish (cache + LB) → read TSDs (RTSD)

Page 46: OpenTSDB for monitoring @ Criteo

OpenTSDB to the rescue in practice

Page 47: OpenTSDB for monitoring @ Criteo

47 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

hadoop.namenode.fsnamesystem.tag.HAState

Page 48: OpenTSDB for monitoring @ Criteo

48 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

hadoop.namenode.fsnamesystem.tag.HAState

Page 49: OpenTSDB for monitoring @ Criteo

49 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

hadoop.namenode.fsnamesystem.tag.HAState

Page 50: OpenTSDB for monitoring @ Criteo

50 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

• Would be impossible to see with daily aggregation

hadoop.namenode.fsnamesystem.tag.HAState

Page 51: OpenTSDB for monitoring @ Criteo

51 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

• Would be impossible to see with daily aggregation

• Trivia: we fixed the tcollector to get that metric

hadoop.namenode.fsnamesystem.tag.HAState

Page 52: OpenTSDB for monitoring @ Criteo

52 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Page 53: OpenTSDB for monitoring @ Criteo

53 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Page 54: OpenTSDB for monitoring @ Criteo

54 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Page 55: OpenTSDB for monitoring @ Criteo

55 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Another huge spike

Page 56: OpenTSDB for monitoring @ Criteo

56 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Another huge spike

No data

Page 57: OpenTSDB for monitoring @ Criteo

57 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Page 58: OpenTSDB for monitoring @ Criteo

58 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Page 59: OpenTSDB for monitoring @ Criteo

59 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Service restart – OOM

Page 60: OpenTSDB for monitoring @ Criteo

60 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Service restart – OOM

Log extract: NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)

Page 61: OpenTSDB for monitoring @ Criteo

61 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Page 62: OpenTSDB for monitoring @ Criteo

62 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

OpenTSDB problem – not node-specific

Page 63: OpenTSDB for monitoring @ Criteo

63 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

OpenTSDB problem – not node-specific

Node probably dead

Page 64: OpenTSDB for monitoring @ Criteo

64 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystem.BlocksTotal

Page 65: OpenTSDB for monitoring @ Criteo

65 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

File deletion

File deletion

hadoop.namenode.fsnamesystem.BlocksTotal

Page 66: OpenTSDB for monitoring @ Criteo

66 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

File deletion

File deletion

File creation

hadoop.namenode.fsnamesystem.BlocksTotal

Page 67: OpenTSDB for monitoring @ Criteo

67 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Page 68: OpenTSDB for monitoring @ Criteo

68 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

Slope

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Page 69: OpenTSDB for monitoring @ Criteo

69 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

Slope

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Be careful about the scale!

Page 70: OpenTSDB for monitoring @ Criteo

70 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 71: OpenTSDB for monitoring @ Criteo

71 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

Page 72: OpenTSDB for monitoring @ Criteo

72 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

Page 73: OpenTSDB for monitoring @ Criteo

73 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

• Note: done at regular intervals

Page 74: OpenTSDB for monitoring @ Criteo

74 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

• Note: done at regular intervals

• Trivia: never do a failover during a checkpoint!

Page 75: OpenTSDB for monitoring @ Criteo

75 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 76: OpenTSDB for monitoring @ Criteo

76 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 77: OpenTSDB for monitoring @ Criteo

77 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

Page 78: OpenTSDB for monitoring @ Criteo

78 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!

Page 79: OpenTSDB for monitoring @ Criteo

79 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!

• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old

Page 80: OpenTSDB for monitoring @ Criteo

80 | Copyright © 2016 Criteo

Criteo ♥ BigData

- Very accessible: only 50 euros, which will be given to charity

- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …

https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587

Page 81: OpenTSDB for monitoring @ Criteo

81 | Copyright © 2016 Criteo

Criteo is hiring!

http://labs.criteo.com/
