OpenTSDB for monitoring @ Criteo


Nathaniel Braun

Thursday, April 28th, 2016


Hitch hiker’s guide to this presentation

• Overview of Hadoop @ Criteo
• Our experimental cluster
• Rationale for OpenTSDB
• Stabilizing & scaling OpenTSDB
• OpenTSDB to the rescue in practice

Overview of Hadoop @ Criteo


Overview of Hadoop @ Criteo

Criteo’s 8 Hadoop clusters – running CDH Community Edition:

• Amsterdam AM5 – PROD
• Paris PA4 – PROD / PREPROD
• Paris PA3 – PREPROD / EXP
• Sunnyvale SV6 – PROD NA
• Tokyo TY5 – PROD AS
• Hong Kong HK5 – PROD CN

Overview of Hadoop @ Criteo – Production AM5

AM5: main production cluster

• In use since 2011
• Running CDH3 initially, CDH4 currently
• 1 118 DataNodes
• 13 400+ compute cores
• 39 PB of raw disk storage
• 105 TB of RAM capacity
• 40 TB of data imported every day, mostly through HTTPFS
• 100 000+ jobs run daily

Overview of Hadoop @ Criteo – Production PA4

PA4: comparable to AM5, with fewer machines

• Migration done in Q4 2015 – H1 2016
• Running CDH5
• 650+ DataNodes
• 15 600+ compute cores
• 54 PB of raw disk storage
• 143 TB of RAM capacity
• Huawei servers (AM5 is HP-based)

Overview of Hadoop @ Criteo – Production local clusters

Criteo has 3 local production Hadoop clusters:

• Sunnyvale (SV6): 20 nodes
• Tokyo (TY5): 35 nodes
• Hong Kong (HK5): 20 nodes

Overview of Hadoop @ Criteo – Preproduction clusters

Criteo has 3 preproduction Hadoop clusters:

• Preprod PA3: 54 nodes, running CDH4
• Preprod PA4: 42 nodes, running CDH5
• Experimental: 53 nodes, running CDH5

Overview of Hadoop @ Criteo – Usage

Types of jobs running on our clusters:

• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)
• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning
• Scalding jobs for analytics
• Hive queries for Business Intelligence
• Spark jobs on CDH5

Overview of Hadoop @ Criteo – Special considerations

• Kerberos for security
• High availability on NameNodes and the ResourceManager (CDH5 only)
• Infrastructure installed & maintained with Chef

Overview of Hadoop @ Criteo

How can we monitor this complex infrastructure and the services running on top of it?

Our experimental cluster


Our experimental cluster – Purpose

• Useful for testing infrastructure changes without impacting users (no SLA)
• Test environment for new technologies
• HBase
  o Natural joins
  o OpenTSDB for metrology & monitoring
  o hRaven for detailed job data (not used anymore)
• Spark, now in production @ PA4

Our experimental cluster – HBase features

• Based on the Google BigTable paper
• Integrated with the Hadoop stack
• Stores data in rows sorted by row key
• Uses regions as ordered sets of rows
• Regions sharded by row key bounds
• Regions managed by region servers, collocated with DataNodes (data is stored on HDFS)
• Oversized regions split into two regions
• Values stored in columns, with no fixed schema as in an RDBMS
• Columns grouped into column families

Our experimental cluster – HBase architecture

Row key     | CF0: user                           | CF1: event
(user UID)  | C0: IP | C2: browser | C3: e-mail   | C0: time | C1: type | C2: web site
------------+--------+-------------+--------------+----------+----------+-------------
AAA         | value  | Firefox     | NULL         | …        | Click    | Client #0
BBB         | value  | Chrome      | NULL         | …        | Click    | Client #0
CCC         | value  | Chrome      | ccc@mail.com | …        | Display  | Client #1
DDD         | value  | IE          | NULL         | …        | Sales    | Client #2
EEE         | value  | IE          | NULL         | …        | Display  | Client #0
FFF         | value  | IE          | NULL         | …        | Display  | Client #3
∙∙∙         | ∙∙∙    | ∙∙∙         | ∙∙∙          | ∙∙∙      | ∙∙∙      | ∙∙∙
XXX         | value  | Firefox     | NULL         | …        | Sales    | Client #4
YYY         | value  | Chrome      | NULL         | …        | Bid      | Client #5
ZZZ         | value  | Opera       | zzz@mail.com | …        | Click    | Client #5

Contiguous row-key ranges form regions (R0, R1, …, R5 in the original diagram), and each region is served by a region server (RS1, RS2, …).

Our experimental cluster – HBase @ Criteo

HBase on the experimental cluster:

• 50 region servers
• 44 000+ regions
• ~90 000 requests / second from OpenTSDB

Rationale for OpenTSDB


Rationale for using OpenTSDB – Infrastructure monitoring

Metrics to monitor:

• CPU load
• Processes & threads
• RAM available/reserved
• Free/used disk space
• Network statistics
• Sockets open/closed
• Open connections with their statuses
• Network traffic

Rationale for using OpenTSDB – Service monitoring

Services to monitor, per layer:

• YARN: NodeManagers, ResourceManagers
• HDFS: DataNodes, NameNodes, JournalNodes
• ZooKeeper, Kerberos
• HBase
• Kafka, Storm

Huge diversity of services!

Rationale for using OpenTSDB – Scale

• Diversity
  • Many types of nodes & services
  • Must be simple to extend with new metrics
• Scale
  • > 2 500 servers
  • ~ 90 000 requests / second
• Storage
  • Keep fine-grained resolution (down to the minute, at least)
  • Long-term storage for analysis & investigation

Rationale for using OpenTSDB – Solution

• Suits the problem well: “Hadoop for monitoring Hadoop”
• Designed for time series: HBase schema optimized for time series queries
• Scalable and resilient, thanks to HBase
• Easily extensible: writing a data collector is easy
• Simple to query

Rationale for using OpenTSDB – Easy to query

require 'net/http'
require 'uri'
require 'json'

uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
http = Net::HTTP.start(uri.hostname, uri.port)
http.read_timeout = 300

params = {
  'start'   => '2016/04/21-10:00:00',
  'end'     => '2016/04/21-12:00:00',
  'queries' => [{
    'aggregator' => 'min',
    'downsample' => '5m-min',
    'metric'     => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
    'tags'       => {
      'cluster' => 'ams',
      'host'    => 'rm.hpc.criteo.prod'
    }
  }]
}

request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
request.body = params.to_json
response = http.request(request)
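Not shown on the slide, but as a rough sketch of what comes back: /api/query returns a JSON array of series objects, each carrying the metric name, its tags and a dps map of timestamp → value, so the response above can be consumed roughly like this (which series come back of course depends on the query):

results = JSON.parse(response.body)
results.each do |series|
  # 'dps' maps UNIX timestamps (as strings) to the aggregated values
  series['dps'].sort_by { |ts, _| ts.to_i }.each do |ts, value|
    puts "#{series['metric']} #{ts} #{value} #{series['tags']}"
  end
end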

Rationale for using OpenTSDB – Practical UI

The built-in UI lets you build a chart by picking a metric, a time range, tag keys/values, and an aggregator.

Rationale for using OpenTSDB – Design

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors
• Some TSDs are used for writing, others for reading, while tcollectors collect metrics
• TSDs are stateless
• TSDs use asyncHBase to scale
• Quiz: what are the advantages?

1. Clients never interact with HBase directly
2. Simple protocol → easy to use & extend (see the sketch below)
3. No state, no synchronization → great scalability
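To make the “simple protocol” point concrete: a TSD accepts plain-text put lines on its TCP port, one data point per line. A minimal sketch in Ruby, assuming a write TSD reachable under a hypothetical host name (Criteo’s actual write endpoints may be named differently):

require 'socket'

# Push one data point to a write TSD over its plain-text interface.
# Line format: put <metric> <unix_timestamp> <value> <tagk=tagv> ...
tsd = TCPSocket.new('0.wtsd.hpc.criteo.preprod', 4242)   # hypothetical host
tsd.puts "put proc.loadavg.15min #{Time.now.to_i} 0.42 host=#{Socket.gethostname}"
tsd.close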

Rationale for using OpenTSDB – Metrics

• A metric data point consists of:
  • a metric name
  • a UNIX timestamp
  • a value (64-bit integer or single-precision floating point)
  • tags (key-value pairs) specific to that metric instance
• Tags are useful for aggregations on time series

Example data point (a minimal collector emitting such lines is sketched after this slide):
proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Chart shown: 15-minute load average with the count aggregator (a proxy for machine count)
• Quiz: what is the chart below?
  (Chart labels: proc.loadavg.15min, then proc.loadavg.15min with cluster=*)
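A tcollector collector is just an executable that periodically prints lines in exactly the format above (metric, timestamp, value, tags) to stdout; tcollector forwards them to a TSD. A minimal sketch, assuming a Linux host and an illustrative 15-second interval:

#!/usr/bin/env ruby
# Minimal tcollector-style collector: emit the 15-minute load average
# as "metric timestamp value tag=value" lines on stdout.
require 'socket'

loop do
  load15 = File.read('/proc/loadavg').split[2]   # third field = 15-min load
  puts "proc.loadavg.15min #{Time.now.to_i} #{load15} host=#{Socket.gethostname}"
  $stdout.flush                                  # tcollector reads line by line
  sleep 15
end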

Rationale for using OpenTSDB – HBase schema

• A single data table (split into regions), named tsdb
• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
  • timestamp is rounded down to the hour
  • This schema groups data from the same metric & time bucket close together (HBase sorts rows by row key)
  • Assumption: queries filter first on time range, then metric, then tags, in that order of preference
  • Tag keys are sorted lexicographically
  • Tags should be limited in number, because they are part of the row key; usually fewer than 5 tags
• Values are stored in columns
  • Column name: 2 or 4 bytes. For 2 bytes:
    • Encodes an offset of up to 3 600 seconds within the hour → 2^12 = 4096 → 12 bits
    • 4 bits left for format/type flags
• Other tables hold metadata and name ↔ ID mappings
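As a rough illustration of that layout (3-byte UIDs as in the default OpenTSDB configuration; the UID values themselves are made up for the example):

# Sketch: pack an OpenTSDB row key and a 2-byte column qualifier.
metric_uid = 0x000001        # hypothetical 3-byte UID assigned to the metric
tagk_uid   = 0x000002        # hypothetical UID for the tag key (e.g. "host")
tagv_uid   = 0x00001a        # hypothetical UID for the tag value

timestamp = 1461781436                       # data point's UNIX timestamp
base_time = timestamp - (timestamp % 3600)   # rounded down to the hour
offset    = timestamp - base_time            # seconds within the hour (< 3600)

to3 = ->(uid) { [uid].pack('N')[1, 3] }      # low 3 bytes, big-endian

row_key   = to3.call(metric_uid) + [base_time].pack('N') +
            to3.call(tagk_uid) + to3.call(tagv_uid)
qualifier = [(offset << 4) | 0x7].pack('n')  # 12-bit offset + 4 flag bits (8-byte int)

puts row_key.unpack1('H*')     # hex dump of the 13-byte row key
puts qualifier.unpack1('H*')   # hex dump of the 2-byte qualifier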

Rationale for using OpenTSDB – HBase schema

(Figure: hexadecimal representation of a row key with two tags, and sorted row keys for the same metric, UID 000001.)

Note: row key size varies across rows, because of tags

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

• 367 513 metrics
• 30 tag keys (!)
• 86 194 tag values

Stabilizing & scaling OpenTSDB

Scaling OpenTSDB

OpenTSDB was hard to scale at first. What problem can you see?

• Answer: we’re missing data points

Scaling OpenTSDB – Lessons learned

• Analyze all the layers of the system
• Logs are your friends
• Change parameters one by one, not all at once
• Measure, change, deploy, measure. Rinse, repeat

Scaling OpenTSDB – Nifty trick

Varnish & OpenResty save the day:

• OpenResty: rewrites POST queries into GETs
• Varnish: caches responses + load-balances
• RTSD: read OpenTSDB instances
(several instances of each tier)
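Why the POST → GET rewrite helps: Varnish keys its cache on the request URL, and OpenTSDB also accepts queries as GETs with everything encoded in the query string (m=<aggregator>:[<downsample>:]<metric>{tags}). A sketch of the equivalent cacheable GET for the query from the earlier slide, assuming the same illustrative host name:

require 'net/http'
require 'uri'
require 'erb'

# Same query as the POST example, expressed as a cache-friendly GET.
m = 'min:5m-min:hadoop.resourcemanager.queuemetrics.root.AllocatedMB{cluster=ams,host=rm.hpc.criteo.prod}'
uri = URI("http://0.rtsd.hpc.criteo.preprod:4242/api/query" \
          "?start=2016/04/21-10:00:00&end=2016/04/21-12:00:00&m=#{ERB::Util.url_encode(m)}")
response = Net::HTTP.get_response(uri)   # identical GETs can now be served from Varnish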

OpenTSDB to the rescue in practice

OpenTSDB to the rescue in practice – Easier to use than logs

(Chart: hadoop.namenode.fsnamesystem.tag.HAState)

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing has changed
• Would be impossible to see with daily aggregation
• Trivia: we fixed the tcollector to get that metric

OpenTSDB to the rescue in practice – Investigation

(Chart: hadoop.nodemanager.direct.TotalCapacity)

• Huge memory capacity spike
• Node not reporting points
• Another huge spike
• No data

OpenTSDB to the rescue in practice – Superimpose charts

(Charts: hadoop.nodemanager.direct.TotalCapacity superimposed with hadoop.nodemanager.jvmmetrics.GcTimeMillis)

• Service restart – configuration change
• Service restart – OOM

Log extract: “NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)”

OpenTSDB to the rescue in practice – Hiccups

(Charts: hadoop.nodemanager.direct.TotalCapacity and hadoop.nodemanager.jvmmetrics.GcTimeMillis)

• OpenTSDB problem – not node-specific
• Node probably dead

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystem.BlocksTotal)

• File deletion (seen twice)
• File creation

OpenTSDB to the rescue in practice – NameNode rescue

(Charts: hadoop.namenode.fsnamesystem.BlocksTotal and hadoop.namenode.fsnamesystem.FilesTotal)

• Slope
• Be careful about the scale!

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes)

Quiz: what is this pattern?

• Answer: NameNode checkpoint
• Note: done at regular intervals
• Trivia: never do a failover during a checkpoint!

OpenTSDB to the rescue in practice – NameNode rescue

(Chart: hadoop.namenode.fsnamesystemstate.NumLiveDataNodes)

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!
• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old


Criteo ♥ BigData

- Very accessible: only 50 euros, which will be given to charity

- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …

https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587


Criteo is hiring!

http://labs.criteo.com/
