StatsCraft Monitoring Conference
website and agenda: http://statscraft.org.il
twitter: @statscraft (#statscraft)
facebook: https://www.facebook.com/statscraft.il
email: [email protected]

StatsCraft 2015: The problem (Keynote) - Nir Cohen



Agenda
1. Understand the problem.
2. Understand what monitoring is.
3. Example use-case(s)
4. A different approach
5. Learn methodologies and tools

The Problem
Nir Cohen @ Gigaspaces
@thinkops
http://github.com/nir0s

We monitor because...

We want to satisfy the customer.

(make money?)

Automated Resource Provisioning
Configuration Management
Automated Code Deployment
Continuous Whatever
Monitoring

Still underrated...
Automated Resource Provisioning
Configuration Management
Automated Code Deployment
Continuous Whatever
Monitoring

PROBLEM!

Blame the tools?

Problem origin

DISCLAIMER

We're monitoring the wrong things.

_rootCauseAnalysis:

the alternative is harder.

We're considering logs a second-class citizen.

_rootCauseAnalysis:

the alternative is harder.

Our data is lacking.

_rootCauseAnalysis:

inertia. that's how it was, that's how it is.

We separate monitoring from application

_rootCauseAnalysis:

we're not used to this. (Ops problem)

We monitor reactively, not proactively

_rootCauseAnalysis:

reaction requires less initial energy than anticipation.
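One way to picture the difference: a reactive check fires once the disk is already full, while a proactive one asks how long until it fills. The snippet below is a minimal sketch of the second idea, not something taken from the talk. The function name `hours_until_full`, the evenly spaced samples, and the straight-line growth model are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def hours_until_full(samples: Sequence[float], capacity: float,
                     interval_hours: float = 1.0) -> Optional[float]:
    """Estimate hours until `capacity` is reached (hypothetical helper).

    Assumes `samples` are usage readings taken every `interval_hours`.
    Fits a least-squares line through them and extrapolates.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Least-squares slope: growth per sample interval.
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Fitted usage at the most recent sample.
    current = intercept + slope * (n - 1)
    remaining = max(capacity - current, 0.0)
    return remaining / slope * interval_hours


if __name__ == "__main__":
    # Disk usage in GB, sampled hourly, on a 100 GB volume.
    print(hours_until_full([10, 20, 30, 40], capacity=100))
```

An alert on "less than N hours left" gives someone time to act before the outage, at the cost of collecting history and accepting that a linear fit is only a rough guess.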

We put uptime above system and product quality

_rootCauseAnalysis:

it's much easier.

We deal with hard limits.

_rootCauseAnalysis:

arbitrary numbers are easier to set.
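As an illustration of an alternative to an arbitrary fixed number, a threshold can be derived from what the metric normally looks like. The following is a minimal sketch under stated assumptions, not a method from the talk: the name `is_anomalous`, the "mean plus k standard deviations" rule, and the default `k = 3.0` are all illustrative choices.

```python
import statistics
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Flag `value` if it exceeds mean(history) + k * stdev(history).

    Hypothetical helper: the limit follows the metric's own recent
    behaviour instead of a hand-picked constant.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    return value > mean + k * stdev


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99]
    # A hard limit of, say, 500 ms would stay silent for both values below.
    print(is_anomalous(latencies_ms, 103))
    print(is_anomalous(latencies_ms, 110))
```

A rule like this still needs tuning (window size, `k`, seasonality), which is part of why the arbitrary number is the easier path.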

Monitoring is non-functional but resource hungry

_rootCauseAnalysis:

we just don't accept it.

Good monitoring requires the right people, not just Ops!

_rootCauseAnalysis:

delegation is natural. others have more important things to do.

Alert fatigue is common.

_rootCauseAnalysis:

solving issues is much easier than solving problems, and apparently, we are addicted to non-actionable alerts.

We're auto-scaling prematurely

_rootCauseAnalysis:

brute force is natural

We're choosing the wrong tools.

_rootCauseAnalysis:

it's easier to choose the tool than to choose what to monitor.

Good monitoring is hard

_rootCauseAnalysis:

systems become complex, so they're harder to monitor.

So, after all, why do we not monitor properly?

1. Simplification
2. Delegation
3. Rationalization

_rootCauseAnalysis:

No fear, [...] is here!

Let's see how we can make this all better

“If a service crashes and no one is around to monitor it, does it raise an alert?”