OpenTSDB for monitoring @ Criteo

Page 1: OpenTSDB for monitoring @ Criteo

Nathaniel Braun

Thursday, April 28th, 2016

OpenTSDB for

monitoring @ Criteo


Page 2: OpenTSDB for monitoring @ Criteo

2 | Copyright © 2016 Criteo

•Overview of Hadoop @ Criteo

•Our experimental cluster

•Rationale for OpenTSDB

•Stabilizing & scaling OpenTSDB

•OpenTSDB to the rescue in practice

Hitchhiker’s guide to this presentation

Page 3: OpenTSDB for monitoring @ Criteo

Overview of Hadoop @ Criteo


Page 4: OpenTSDB for monitoring @ Criteo

4 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo

Tokyo TY5 – PROD AS

Sunnyvale SV6 – PROD NA

Hong Kong HK5 – PROD CN

Paris PA4 – PROD / PREPROD

Paris PA3 – PREPROD / EXP

Amsterdam AM5 – PROD

Criteo’s 8 Hadoop clusters – running CDH Community Edition

Page 5: OpenTSDB for monitoring @ Criteo

5 | Copyright © 2016 Criteo

AM5: main production cluster

• In use since 2011

• Running CDH3 initially, CDH4 currently

• 1118 DataNodes

• 13 400+ compute cores

• 39 PB of raw disk storage

• 105 TB of RAM capacity

• 40 TB of data imported every day, mostly through HTTPFS

• 100 000+ jobs run daily

Overview of Hadoop @ Criteo – Production AM5

Page 6: OpenTSDB for monitoring @ Criteo

6 | Copyright © 2016 Criteo

PA4: comparable to AM5, with fewer machines

• Migration done in Q4 2015 – H1 2016

• Running CDH5

• 650+ DataNodes

• 15 600+ compute cores

• 54 PB of raw disk storage

• 143 TB of RAM capacity

• Huawei servers (AM5 is HP-based)

Overview of Hadoop @ Criteo – Production PA4

Page 7: OpenTSDB for monitoring @ Criteo

7 | Copyright © 2016 Criteo

Criteo has 3 local production Hadoop clusters

• Sunnyvale (SV6): 20 nodes

• Tokyo (TY5): 35 nodes

• Hong Kong (HK5): 20 nodes

Overview of Hadoop @ Criteo – Production local clusters

Page 8: OpenTSDB for monitoring @ Criteo

8 | Copyright © 2016 Criteo

Criteo has 3 preproduction Hadoop clusters

• Preprod PA3: 54 nodes, running CDH4

• Preprod PA4: 42 nodes, running CDH5

• Experimental: 53 nodes, running CDH5

Overview of Hadoop @ Criteo – Preproduction clusters

Page 9: OpenTSDB for monitoring @ Criteo

9 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo – Usage

Types of jobs running on our clusters

• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)

• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning

• Scalding jobs for analytics

• Hive queries for Business Intelligence

• Spark jobs on CDH5

Page 10: OpenTSDB for monitoring @ Criteo

10 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo – Special consideration

• Kerberos for security

• High-availability on NameNodes and ResourceManager (CDH5 only)

• Infrastructure installed & maintained with Chef

Page 11: OpenTSDB for monitoring @ Criteo

11 | Copyright © 2016 Criteo

Overview of Hadoop @ Criteo

How can we monitor this complex

infrastructure and services running on top

of it?

Page 12: OpenTSDB for monitoring @ Criteo

Our experimental cluster


Page 13: OpenTSDB for monitoring @ Criteo

13 | Copyright © 2016 Criteo

• Useful for testing infrastructure changes without impacting users (no SLA)

• Test environment for new technologies

• HBase

o Natural joins

o OpenTSDB for metrology & monitoring

o hRaven for detailed job data (not used anymore)

• Spark, now in production @ PA4

Our experimental cluster – Purpose

Page 14: OpenTSDB for monitoring @ Criteo

14 | Copyright © 2016 Criteo

• Based on Google BigTable paper

• Integrated with the Hadoop stack

• Stores data in rows sorted by row key

• Uses regions as an ordered set of rows

• Regions sharded by row key bounds

• Regions managed by Region servers, collocated with DataNodes (data is stored on HDFS)

• Oversize regions split into two regions

• Values stored in columns, with no fixed schema (unlike an RDBMS)

• Columns grouped in column families

Our experimental cluster – HBase features

Page 15: OpenTSDB for monitoring @ Criteo

15 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Row key      CF0: user                                CF1: event
(user UID)   C0: IP   C2: browser   C3: e-mail        C0: time   C1: type   C2: web site
AAA          value    Firefox       NULL              …          Click      Client #0
BBB          value    Chrome        NULL              …          Click      Client #0
CCC          value    Chrome        [email protected]  …          Display    Client #1
DDD          value    IE            NULL              …          Sales      Client #2
EEE          value    IE            NULL              …          Display    Client #0
FFF          value    IE            NULL              …          Display    Client #3
∙∙∙          ∙∙∙      ∙∙∙           ∙∙∙               ∙∙∙        ∙∙∙        ∙∙∙
XXX          value    Firefox       NULL              …          Sales      Client #4
YYY          value    Chrome        NULL              …          Bid        Client #5
ZZZ          value    Opera         [email protected]  …          Click      Client #5

Page 16: OpenTSDB for monitoring @ Criteo

16 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Same user/event table as on the previous slide, with the sorted rows split into regions R0, R1, …, R5 (sharded by row key bounds).

Page 17: OpenTSDB for monitoring @ Criteo

17 | Copyright © 2016 Criteo

Our experimental cluster – HBase architecture

Same table again, with regions R0, R1, …, R5 assigned to region servers RS1 and RS2 (each region server hosts several regions).

Page 18: OpenTSDB for monitoring @ Criteo

18 | Copyright © 2016 Criteo

HBase on the experimental cluster

• 50 region servers

• 44 000+ regions

• ~90 000 requests / second from OpenTSDB

Our experimental cluster – HBase @ Criteo

Page 19: OpenTSDB for monitoring @ Criteo

Rationale for OpenTSDB


Page 20: OpenTSDB for monitoring @ Criteo

20 | Copyright © 2016 Criteo

Metrics to monitor:

• CPU load

• Processes & threads

• RAM available/reserved

• Free/used disk space

• Network statistics

• Sockets open/closed

• Open connections with their statuses

• Network traffic

Rationale for using OpenTSDB – Infrastructure monitoring

Page 21: OpenTSDB for monitoring @ Criteo

21 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Service monitoring

YARN: NodeManagers, ResourceManagers

HDFS: DataNodes, NameNodes, JournalNodes

ZooKeeper, Kerberos, HBase, Kafka, Storm

Page 22: OpenTSDB for monitoring @ Criteo

22 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Service monitoring

YARN: NodeManagers, ResourceManagers

HDFS: DataNodes, NameNodes, JournalNodes

ZooKeeper, Kerberos, HBase, Kafka, Storm

Huge diversity of services!

Page 23: OpenTSDB for monitoring @ Criteo

23 | Copyright © 2016 Criteo

• Diversity

• Many types of nodes & services

• Must be extensible simply to add new metrics

• Scale

• > 2 500 servers

• ~ 90 000 requests / second

• Storage

• Keep fine-grained resolution (down to the minute, at least)

• Long-term storage for analysis & investigation

Rationale for using OpenTSDB – Scale

Page 24: OpenTSDB for monitoring @ Criteo

24 | Copyright © 2016 Criteo

• Suits the problem well: “Hadoop for monitoring Hadoop”

• Designed for time series: HBase schema optimized for time series queries

• Scalable and resilient, thanks to HBase

• Easily extensible: writing a data collector is easy

• Simple to query

Rationale for using OpenTSDB – Solution
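Writing a collector really is that simple: tcollector runs every script in its collectors directory and forwards each stdout line to a TSD. A minimal sketch of the line format (the helper name and sample values are ours):

```ruby
# tcollector line protocol: "<metric> <unix_timestamp> <value> <tag1=v1> ..."
# Helper name and sample values are illustrative.
def tsd_line(metric, value, tags, ts = Time.now.to_i)
  tag_str = tags.sort.map { |k, v| "#{k}=#{v}" }.join(' ')
  "#{metric} #{ts} #{value} #{tag_str}"
end

# A collector script emits such lines on stdout in a loop, flushing after
# each data point so tcollector picks them up immediately:
puts tsd_line('proc.loadavg.15min', 1.5, { 'host' => 'node01' }, 1461781436)
$stdout.flush
```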

Page 25: OpenTSDB for monitoring @ Criteo

25 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Easy to query

# Query OpenTSDB's HTTP API: POST /api/query with a JSON body.
require 'net/http'
require 'json'
require 'uri'

uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
http = Net::HTTP.start(uri.hostname, uri.port)
http.read_timeout = 300

params = {
  'start'   => '2016/04/21-10:00:00',
  'end'     => '2016/04/21-12:00:00',
  'queries' => [{                       # an array of sub-queries
    'aggregator' => 'min',
    'downsample' => '5m-min',
    'metric'     => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
    'tags'       => {
      'cluster' => 'ams',
      'host'    => 'rm.hpc.criteo.prod'
    }
  }]
}

request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
request.body = params.to_json
response = http.request(request)

Page 26: OpenTSDB for monitoring @ Criteo

26 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Page 27: OpenTSDB for monitoring @ Criteo

27 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Metric

Page 28: OpenTSDB for monitoring @ Criteo

28 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Page 29: OpenTSDB for monitoring @ Criteo

29 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Tag keys/values

Page 30: OpenTSDB for monitoring @ Criteo

30 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Practical UI

Time rangeMetric

Tag keys/valuesAggregator

Page 31: OpenTSDB for monitoring @ Criteo

31 | Copyright © 2016 Criteo

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors

• Some TSDs are used for writing, others for reading, while tcollectors collect metrics

• TSDs are stateless

• TSDs use asyncHBase to scale

• Quiz: what are the advantages?

Rationale for using OpenTSDB – Design

Page 32: OpenTSDB for monitoring @ Criteo

32 | Copyright © 2016 Criteo

• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors

• Some TSDs are used for writing, others for reading, while tcollectors collect metrics

• TSDs are stateless

• TSDs use asyncHBase to scale

• Quiz: what are the advantages?

Rationale for using OpenTSDB – Design

1. Clients never interact

with HBase directly

2. Simple protocol → easy

to use & extend

3. No state, no

synchronization → great

scalability

Page 33: OpenTSDB for monitoring @ Criteo

33 | Copyright © 2016 Criteo

• A metric data point consists of:

• a metric name

• a UNIX timestamp

• a value (64-bit integer or single-precision floating point)

• tags (key-value pairs) specific to that time series

• Tags are useful for aggregating time series

proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Charts: 15-minute load average with the count aggregator (a proxy for machine count)

• Quiz: what is the chart below?

Rationale for using OpenTSDB – Metrics

proc.loadavg.15min
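The same line format, prefixed with put, is what a TSD accepts on its plain TCP socket (4242 is OpenTSDB's default port). A sketch; the write-TSD hostname below is made up:

```ruby
require 'socket'

# Build the telnet-style command a TSD accepts for one data point.
def put_command(metric, timestamp, value, tags)
  tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  "put #{metric} #{timestamp} #{value} #{tag_str}"
end

# Send a data point to a write TSD over its socket.
def send_datapoint(host, *args)
  TCPSocket.open(host, 4242) { |sock| sock.puts put_command(*args) }
end

# send_datapoint('0.wtsd.hpc.criteo.preprod', 'proc.loadavg.15min',
#                1461781436, 15, 'host' => '0.namenode.hpc.criteo.prod')
```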

Page 34: OpenTSDB for monitoring @ Criteo

34 | Copyright © 2016 Criteo

• A metric data point consists of:

• a metric name

• a UNIX timestamp

• a value (64-bit integer or single-precision floating point)

• tags (key-value pairs) specific to that time series

• Tags are useful for aggregating time series

proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod

• Charts: 15-minute load average with the count aggregator (a proxy for machine count)

• Quiz: what is the chart below?

Rationale for using OpenTSDB – Metrics

proc.loadavg.15min

proc.loadavg.15min cluster=*

Page 35: OpenTSDB for monitoring @ Criteo

35 | Copyright © 2016 Criteo

• A single data table (split into regions), named tsdb

• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]

• timestamp is rounded down to the hour

• This schema helps group data from the same metric & time bucket close together (HBase sorts rows based on the row key)

• Assumption: query first on time range, then metric, then tags, in that order of preference

• Tag keys are sorted lexicographically

• Tags should be limited, because they are part of the row key: usually fewer than 5 tags

• Values are stored in columns

• Column qualifier: 2 or 4 bytes. For 2 bytes:

• Encodes the offset within the hour, up to 3 600 seconds → 2^12 = 4096 → 12 bits

• 4 bits left for format/type flags

• Other tables, for metadata and name ↔ ID mappings

Rationale for using OpenTSDB – HBase schema
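As a sketch, the row key and the 2-byte qualifier can be assembled like this (3-byte UIDs are OpenTSDB's default width; all UID values and helper names here are ours):

```ruby
# Row key: <metric_uid><base_timestamp><tagk1><tagv1>... with 3-byte UIDs
# and a 4-byte base timestamp rounded down to the hour.
def row_key(metric_uid, timestamp, tags)
  base_ts = timestamp - (timestamp % 3600)           # hourly time bucket
  key = [metric_uid].pack('N')[1, 3]                 # 3-byte metric UID
  key << [base_ts].pack('N')                         # 4-byte base timestamp
  tags.sort.each do |tagk_uid, tagv_uid|             # tag keys sorted
    key << [tagk_uid].pack('N')[1, 3] << [tagv_uid].pack('N')[1, 3]
  end
  key
end

# 2-byte column qualifier: 12-bit offset within the hour + 4 format/type bits.
def qualifier(offset_seconds, flags)
  [(offset_seconds << 4) | flags].pack('n')
end
```

With one tag this yields a 13-byte key (3 + 4 + 3 + 3): keys for the same metric and hour sort next to each other, which is exactly what makes range scans over a time window cheap.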

Page 36: OpenTSDB for monitoring @ Criteo

36 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – HBase schema

Hexadecimal representation of a row key, with two tags

Sorted row keys for the same metric (UID 000001)

Note: row key size varies across rows, because of tags

Page 37: OpenTSDB for monitoring @ Criteo

37 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

Page 38: OpenTSDB for monitoring @ Criteo

38 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

Page 39: OpenTSDB for monitoring @ Criteo

39 | Copyright © 2016 Criteo

Rationale for using OpenTSDB – Statistics

Quiz: what should we look for?

367 513 metrics

30 tag keys (!)

86 194 tag values

Page 40: OpenTSDB for monitoring @ Criteo

Stabilizing & scaling OpenTSDB

Page 41: OpenTSDB for monitoring @ Criteo

41 | Copyright © 2016 Criteo

OpenTSDB was hard to scale at first. What problem can you see?

Scaling OpenTSDB

Page 42: OpenTSDB for monitoring @ Criteo

42 | Copyright © 2016 Criteo

OpenTSDB was hard to scale at first. What problem can you see?

Scaling OpenTSDB

We’re missing data points

Page 43: OpenTSDB for monitoring @ Criteo

43 | Copyright © 2016 Criteo

• Analyze all the layers of the system

• Logs are your friends

• Change parameters one by one, not all at once

• Measure, change, deploy, measure. Rinse, repeat

Scaling OpenTSDB – Lessons learned

Page 44: OpenTSDB for monitoring @ Criteo

44 | Copyright © 2016 Criteo

Varnish & OpenResty save the day

Scaling OpenTSDB – Nifty trick

OpenResty (POST → GET) → Varnish (cache + LB) → read TSDs (RTSD)

The layer is replicated three times: each OpenResty instance rewrites POST queries into GETs, and each Varnish instance caches responses and load-balances across the read TSDs.
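Why rewrite POST into GET? Varnish caches GET requests keyed by URL, but clients hit the query API with JSON POST bodies. OpenResty does the rewrite in Lua; here is a Ruby sketch of the same canonicalization, using OpenTSDB's GET syntax m=<aggregator>:<downsample>:<metric>{tags} (the helper name is ours):

```ruby
require 'json'
require 'uri'

# Turn a JSON POST body for /api/query into an equivalent, canonical GET
# URL, so a cache like Varnish can use the URL as its cache key.
def post_to_get(json_body)
  q = JSON.parse(json_body)
  parts = ["start=#{URI.encode_www_form_component(q['start'])}",
           "end=#{URI.encode_www_form_component(q['end'])}"]
  q['queries'].each do |query|
    m = "#{query['aggregator']}:#{query['downsample']}:#{query['metric']}"
    tags = (query['tags'] || {}).sort.map { |k, v| "#{k}=#{v}" }.join(',')
    m += "{#{tags}}" unless tags.empty?    # tags sorted for a stable cache key
    parts << "m=#{URI.encode_www_form_component(m)}"
  end
  "/api/query?#{parts.join('&')}"
end
```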

Page 45: OpenTSDB for monitoring @ Criteo

45 | Copyright © 2016 Criteo

Varnish & OpenResty save the day

Scaling OpenTSDB – Nifty trick

OpenResty (POST → GET) → Varnish (cache + LB) → read TSDs (RTSD)

Page 46: OpenTSDB for monitoring @ Criteo

OpenTSDB to the rescue in practice

Page 47: OpenTSDB for monitoring @ Criteo

47 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

hadoop.namenode.fsnamesystem.tag.HAState

Page 48: OpenTSDB for monitoring @ Criteo

48 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

hadoop.namenode.fsnamesystem.tag.HAState

Page 49: OpenTSDB for monitoring @ Criteo

49 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

hadoop.namenode.fsnamesystem.tag.HAState

Page 50: OpenTSDB for monitoring @ Criteo

50 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

• Would be impossible to see with daily aggregation

hadoop.namenode.fsnamesystem.tag.HAState

Page 51: OpenTSDB for monitoring @ Criteo

51 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Easier to use than logs

Two NameNode failovers in one night!

• Hard to spot: in the morning, nothing had changed

• Would be impossible to see with daily aggregation

• Trivia: we fixed the tcollector to get that metric

hadoop.namenode.fsnamesystem.tag.HAState

Page 52: OpenTSDB for monitoring @ Criteo

52 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Page 53: OpenTSDB for monitoring @ Criteo

53 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Page 54: OpenTSDB for monitoring @ Criteo

54 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Page 55: OpenTSDB for monitoring @ Criteo

55 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Another huge spike

Page 56: OpenTSDB for monitoring @ Criteo

56 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Investigation

hadoop.nodemanager.direct.TotalCapacity

Huge memory capacity spike

Node not reporting points

Another huge spike

No data

Page 57: OpenTSDB for monitoring @ Criteo

57 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Page 58: OpenTSDB for monitoring @ Criteo

58 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Page 59: OpenTSDB for monitoring @ Criteo

59 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Service restart – OOM

Page 60: OpenTSDB for monitoring @ Criteo

60 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Superimpose charts

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Service restart – configuration change

Service restart – OOM

Log extract: NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)

Page 61: OpenTSDB for monitoring @ Criteo

61 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

Page 62: OpenTSDB for monitoring @ Criteo

62 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

OpenTSDB problem – not node-specific

Page 63: OpenTSDB for monitoring @ Criteo

63 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – Hiccups

hadoop.nodemanager.direct.TotalCapacity hadoop.nodemanager.jvmmetrics.GcTimeMillis

OpenTSDB problem – not node-specific

Node probably dead

Page 64: OpenTSDB for monitoring @ Criteo

64 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystem.BlocksTotal

Page 65: OpenTSDB for monitoring @ Criteo

65 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

File deletion

File deletion

hadoop.namenode.fsnamesystem.BlocksTotal

Page 66: OpenTSDB for monitoring @ Criteo

66 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

File deletion

File deletion

File creation

hadoop.namenode.fsnamesystem.BlocksTotal

Page 67: OpenTSDB for monitoring @ Criteo

67 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Page 68: OpenTSDB for monitoring @ Criteo

68 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

Slope

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Page 69: OpenTSDB for monitoring @ Criteo

69 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

Slope

hadoop.namenode.fsnamesystem.BlocksTotal hadoop.namenode.fsnamesystem.FilesTotal

Be careful about the scale!

Page 70: OpenTSDB for monitoring @ Criteo

70 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 71: OpenTSDB for monitoring @ Criteo

71 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

Page 72: OpenTSDB for monitoring @ Criteo

72 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

Page 73: OpenTSDB for monitoring @ Criteo

73 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

• Note: done at regular intervals

Page 74: OpenTSDB for monitoring @ Criteo

74 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is this pattern?

• Answer: NameNode checkpoint

• Note: done at regular intervals

• Trivia: never do a failover during a checkpoint!

Page 75: OpenTSDB for monitoring @ Criteo

75 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 76: OpenTSDB for monitoring @ Criteo

76 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Page 77: OpenTSDB for monitoring @ Criteo

77 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

Page 78: OpenTSDB for monitoring @ Criteo

78 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!

Page 79: OpenTSDB for monitoring @ Criteo

79 | Copyright © 2016 Criteo

OpenTSDB to the rescue in practice – NameNode rescue

hadoop.namenode.fsnamesystemstate.NumLiveDataNodes

Quiz: what is the problem?

• Answer: no NameNode checkpoint → no FS image!

• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old

Page 80: OpenTSDB for monitoring @ Criteo

80 | Copyright © 2016 Criteo

Criteo ♥ BigData

- Very accessible: only 50 euros, which will be given to charity

- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …

https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587

Page 81: OpenTSDB for monitoring @ Criteo

81 | Copyright © 2016 Criteo

Criteo is hiring!

http://labs.criteo.com/
