Intelligent Monitoring Denis A. Vieira Jr. Ricardo Clemente

DESCRIPTION

This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.


Page 1: Intelligent Monitoring

Intelligent Monitoring

Denis A. Vieira Jr.

Ricardo Clemente

Page 2: Intelligent Monitoring

Intelligent Monitoring

Summary:

Motivation

Where are we?

Where are we going?

Action Plan

Event Correlation

Page 4: Intelligent Monitoring

Motivation:

Only point-in-time monitoring available

Decrease time to repair incidents

Proactive monitoring

Realistic view of the live environment


Page 5: Intelligent Monitoring

Motivation:

Learn (identify patterns)

Automation

Store historical data with no loss

Improve credibility and Situational Awareness


Page 7: Intelligent Monitoring

Where are we?:

Lots of information (1200 servers with more than 14000 monitors)

– more than 40000 graphs being plotted

Lots of monitoring tools running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, Logs)

Difficulties with specific customizations, performance and cost

Alarms have no credibility (too many emails), although the situation is much better than before.


Page 9: Intelligent Monitoring

Where are we going?:

Use of events, e.g. appenders for log frameworks to integrate information from applications

Knowledge to anticipate undesired situations

Unified interface for monitoring

Root cause detection


Page 11: Intelligent Monitoring

Intelligent Monitoring

Action Plan:

Unify the monitoring tools with Nagios (scalability and integration)

Integrate Nagios with the correlation system using NEB (Nagios Event Broker)

available at:

code.google.com/p/neb2activemq

Map the events and systems to correlate (a manual, analytical task)

Page 12: Intelligent Monitoring

Intelligent Monitoring

Summary:

Motivation

Where are we?

Where are we going?

Action Plan

Event Correlation

Overview and system architecture

Event Bus

Correlation technique

Correlation engine

Visualization

Machine Learning

Project

Page 13: Intelligent Monitoring

Overview and system architecture

Modular and event-driven architecture

Diagram labels: EVENT BUS, COLLECTOR, CORRELATION ENGINE, MACHINE LEARNING, VISUALIZATION

Page 14: Intelligent Monitoring

What is the system architecture?

Single bus for message exchange

Modules are separate operating system processes and can run on different machines

Modules can publish to, or subscribe to, queues and topics on the bus

Why an event-driven architecture?

Loosely coupled and distributed

Less intrusive for monitored systems

Modules are independent

Overview and system architecture
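The publish / subscribe idea can be sketched with an in-process stand-in for the bus. This is only an illustration of the decoupling between modules, not the ActiveMQ setup used in the project; the class name `EventBus` and its methods are hypothetical, and every subscriber receives every event (topic behaviour).

```python
class EventBus:
    """Toy in-process bus: modules only know event types, not each other."""

    def __init__(self):
        self._subscribers = {}  # event type -> list of handler callables

    def subscribe(self, event_type, handler):
        """Register a handler that is called for each event of this type."""
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, data):
        """Deliver data to every subscriber; return how many received it."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(data)
        return len(handlers)


# A collector publishes CPU events; a correlation module consumes them.
bus = EventBus()
seen_by_correlation = []
bus.subscribe("CPU", seen_by_correlation.append)
bus.publish("CPU", {"host": "ws122", "idle": 70})
```

The collector never references the correlation module directly, which is what lets the real modules live in separate processes or machines.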

Page 15: Intelligent Monitoring

Event bus

Open source project

Chose Apache ActiveMQ:

Stable

Performance

Active community

Connectivity

JMS

STOMP

REST

XMPP (...)

Page 16: Intelligent Monitoring

Event Bus

Message format

JSON (not XML)

Simplicity

Structure

Header: channel type (queue or topic) and event type

Body: data

$ curl -d "type=queue" -d "eventtype=CPU" \
    --data-urlencode 'body={"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}' \
    http://barramento/message/events
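The same message can be assembled in code. The sketch below only builds and parses the form body (header fields `type` and `eventtype`, JSON `body`), taking the field names from the curl call above; the helper names `build_event_message` and `parse_event_message` are hypothetical, and nothing is sent over the network.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_event_message(event_type, data, channel_type="queue"):
    """Encode one event as a form body: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,
        "eventtype": event_type,
        "body": json.dumps(data, sort_keys=True),
    })


def parse_event_message(form_body):
    """Decode a form body back into (channel type, event type, data)."""
    fields = {name: values[0] for name, values in parse_qs(form_body).items()}
    return fields["type"], fields["eventtype"], json.loads(fields["body"])


message = build_event_message(
    "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"})
```

Keeping the channel and event type outside the JSON body means a consumer can route a message without parsing its data.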

Page 17: Intelligent Monitoring

Correlation Technique

CEP (Complex Event Processing)

Technology that processes multiple events in real time in order to identify meaningful events

Based on rules or queries (“SQL like”)

Queries can be created at run time

History

In 1995, Professor David Luckham of Stanford coined the term CEP while working on the Rapide project

Database research topic: Data Stream Management Systems (DSMS)

Page 18: Intelligent Monitoring

Correlation technique

Figure labels (traditional database): query processing, memory, data, persistent relations, query, answer

Figure labels (data stream system): query processing, memory, continuous data, query, answer, data stream

“Upside-down database”
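The "upside-down" idea, where the query stays put and the data flows through it, can be sketched as a standing query over a time window. This is a simplified illustration, not how Esper is implemented; `TimeWindow` and `continuous_average` are hypothetical names, and timestamps are plain seconds.

```python
from collections import deque


class TimeWindow:
    """Keep only the events seen in the last `seconds` seconds."""

    def __init__(self, seconds):
        self.seconds = seconds
        self.events = deque()  # (timestamp, value), oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.seconds:
            self.events.popleft()

    def average(self):
        return sum(value for _, value in self.events) / len(self.events)


def continuous_average(stream, seconds):
    """Standing query: emit an updated answer for every arriving event."""
    window = TimeWindow(seconds)
    for timestamp, value in stream:
        window.add(timestamp, value)
        yield timestamp, window.average()


answers = list(continuous_average([(0, 10), (30, 20), (70, 30)], seconds=60))
```

Only the window contents are held in memory; the stream itself is never stored, which is the inversion relative to a database that stores data and runs queries on demand.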

Page 19: Intelligent Monitoring

Correlation Technique

Marketing

Trend (buzz)

The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society, April 2009)

Useful wherever there are data streams and a need to extract information from them in real time

Financial markets

Logistics processes (RFID)

Airport control

ICUs

Datacenters

Page 20: Intelligent Monitoring

Correlation Technique

Big Players

Page 21: Intelligent Monitoring

Correlation Technique

Open Source Players

Academic projects:

STREAM – Stanford – 2003 (officially deprecated)

TelegraphCQ – Berkeley - 2003

Based on PostgreSQL 7.3.2

No activity

Cayuga – Cornell

From the industry:

Esper, a Codehaus project that is complete in terms of features

Compact and flexible syntax

Excellent documentation

Performance

Our choice!

Page 22: Intelligent Monitoring

Correlation Engine

If sessions rose 10% in the last 3 min, the average server CPU did not rise 5%, and MySQL slow queries are above 10, then there is a database bottleneck causing users to queue

Application

Page 23: Intelligent Monitoring

Correlation Engine

Application

Timeline figure labels: Server cpu_usr (t – 3 min to t), Vip session (t – 3 min to t), Mysql slow_query (at t)

Page 24: Intelligent Monitoring

SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session, Vip_PAST.session, Mysql.slow_query
FROM
  Server.win:time(1 min) as Server,
  Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
  Vip.win:time(1 min) as Vip,
  Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST,
  Mysql.win:time(1 min) as Mysql
HAVING
  Vip.session > Vip_PAST.session * 1.10 AND
  avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND
  Mysql.slow_query > 10

Correlation Engine

Application
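Stripped of the windowing, the rule is three comparisons. The sketch below restates it as a plain function so the thresholds are easy to check by hand; it assumes the caller has already collected the current and 3-minutes-ago values, and the function name `database_bottleneck` is hypothetical.

```python
def database_bottleneck(sessions_now, sessions_past,
                        cpu_now, cpu_past, slow_queries):
    """True when sessions grew by more than 10%, average server CPU grew
    by less than 5%, and MySQL reports more than 10 slow queries."""
    avg_cpu_now = sum(cpu_now) / len(cpu_now)
    avg_cpu_past = sum(cpu_past) / len(cpu_past)
    return (sessions_now > sessions_past * 1.10
            and avg_cpu_now < avg_cpu_past * 1.05
            and slow_queries > 10)


# Sessions up 20%, CPU almost flat, 15 slow queries: the rule fires.
alert = database_bottleneck(
    sessions_now=1200, sessions_past=1000,
    cpu_now=[40, 42], cpu_past=[40, 41],
    slow_queries=15)
```

The point of the rule is the combination: more users without more CPU work suggests requests are waiting on something else, and the slow-query count points at the database.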

Page 25: Intelligent Monitoring

Identifying an outlier

select host, free, avg(free)

from Memory.win:time(240 sec) group by host

having free < avg(free)

Events sequence

select * from
pattern [every Memory(free < 10) ->
(timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]

Schedule and extensions

select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec), CPU.win:time(30)
where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")

Correlation Engine
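For readers unfamiliar with EPL, the outlier query (free memory below its own 240-second average, grouped by host) can be approximated in a few lines. This is a rough sketch of the logic only, not of Esper's evaluation model; `MemoryOutlierDetector` is a hypothetical name and timestamps are plain seconds.

```python
from collections import defaultdict, deque


class MemoryOutlierDetector:
    """Flag a reading that is below the host's own recent average."""

    def __init__(self, window_seconds=240):
        self.window_seconds = window_seconds
        self.windows = defaultdict(deque)  # host -> (timestamp, free) pairs

    def observe(self, timestamp, host, free):
        """Add a reading; return (is_outlier, window average) for the host."""
        window = self.windows[host]
        window.append((timestamp, free))
        # Drop readings older than the time window.
        while window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        return free < average, average


detector = MemoryOutlierDetector()
detector.observe(0, "ws122", 100)
detector.observe(60, "ws122", 100)
flagged, average = detector.observe(120, "ws122", 40)
```

Because each host keeps its own window, a host that always runs with little free memory is not flagged just for being lower than the others.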

Page 26: Intelligent Monitoring

Correlation engine

Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance

Item: Specification

Esper server hardware: 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM

VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon

Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds

Number of queries: 1000

Events/s: 519 728

Latency: 99.66% < 10us

Average latency: 2.8us

Note: CPU at 85%, 70 Mbit/s

Esper performance

Page 27: Intelligent Monitoring

Correlation engine

Process inside the correlation engine

Page 28: Intelligent Monitoring

Visualization – Console

Querying the live environment

Page 29: Intelligent Monitoring

Visualization – Troubleshooting

Anticipating and solving incidents more quickly

Page 30: Intelligent Monitoring

Visualization – Dashboard

Consolidated view of the environment

Page 31: Intelligent Monitoring

What about unseen problems?

Page 32: Intelligent Monitoring

Machine Learning

Chose unsupervised, incremental algorithms

Incremental PCA

Transforms a number of possibly correlated variables into a smaller number of uncorrelated variables, the principal components

A change in the principal components means a broken correlation, or anomaly

Can be used for data compression

Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)

Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf

Implementation had two main challenges: measurements with missing values and different scales
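How a broken correlation shows up can be illustrated with a one-component streaming tracker. The update below is a common projection-tracking step from the streaming PCA literature; the slides do not say which update the project uses, so treat this as an assumption. The class name, the forgetting factor 0.96 and the start values are all hypothetical, and the sketch does not handle missing values or different scales.

```python
import math


class PrincipalComponentTracker:
    """Track one principal direction of a stream, one sample at a time."""

    def __init__(self, dimensions, forgetting=0.96):
        self.w = [1.0] + [0.0] * (dimensions - 1)  # arbitrary unit-length start
        self.energy = 1e-3                          # small positive start value
        self.forgetting = forgetting                # below 1 discounts old samples

    def update(self, x):
        """Pull the direction w toward the new sample x."""
        y = sum(wi * xi for wi, xi in zip(self.w, x))        # projection on w
        self.energy = self.forgetting * self.energy + y * y
        error = [xi - y * wi for wi, xi in zip(self.w, x)]   # part w fails to explain
        self.w = [wi + (y / self.energy) * ei for wi, ei in zip(self.w, error)]

    def residual(self, x):
        """Distance from x to the line spanned by the current direction."""
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        unit = [wi / norm for wi in self.w]
        score = sum(ui * xi for ui, xi in zip(unit, x))
        return math.sqrt(sum((xi - score * ui) ** 2 for ui, xi in zip(unit, x)))


tracker = PrincipalComponentTracker(dimensions=2)
for value in [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]:
    tracker.update([value, 2 * value])   # second signal is always twice the first
normal = tracker.residual([2.0, 4.0])    # follows the learned correlation
broken = tracker.residual([3.0, 0.0])    # breaks it
```

A sample that follows the learned relationship leaves almost no residual, while one that breaks it leaves a large residual; thresholding that residual is one simple way to raise an anomaly.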

Page 33: Intelligent Monitoring

60 input signals

Machine Learning

Page 34: Intelligent Monitoring

Summarized in 1 principal component + generation matrix

Machine Learning

Page 35: Intelligent Monitoring

Second principal component

sensitivity

three anomalies

Machine Learning

Page 36: Intelligent Monitoring

Project

Status

All functionality developed

Algorithms being validated through tests with RRDs and meetings with the operations team

Performance tests ongoing

System running in the live environment with reduced scope

Page 37: Intelligent Monitoring

Project at Globo.com – Next challenges

Scale

Event “sharding”

Rule balancing

Cache

Optimize the algorithm

Adaptive control of memory and sensitivity parameters

Add a supervised layer

Other algorithms to cooperate

Page 38: Intelligent Monitoring

Intelligent Monitoring

Final considerations

Page 39: Intelligent Monitoring

References

http://delicious.com/fisl10

Page 40: Intelligent Monitoring

Questions

Contacts

Denis A. Vieira Jr

[email protected] (www.globo.com)

Ricardo Clemente

[email protected] (www.intelie.com.br)

Globo.com stand

This afternoon

Raise your hand!