Performance and Exception Monitoring Project Tim Smith CERN/IT


Page 1: Performance and Exception Monitoring Project Tim Smith CERN/IT

Performance and Exception Monitoring Project

Tim Smith CERN/IT

Page 2: Performance and Exception Monitoring Project Tim Smith CERN/IT

2000/11/02 Tim Smith: HEPiX @ JLab 2

Overview

Motivation
Objectives
Analysis and Design
Prototyping
Perspective and Future

Page 3: Performance and Exception Monitoring Project Tim Smith CERN/IT

Motivation

[Diagram labels: Monitoring System (Local, Remote); Alarm; Recovery action; Process killer; Console; Resource planning; Accounting; Security; Inventory]

Independent systems
  No single overview
  Duplicated collection
Host based: want Service
Perceived problems not real
Scalability

Page 4: Performance and Exception Monitoring Project Tim Smith CERN/IT

Motivation

[Diagram labels: Monitoring System (Local, Remote); Alarm; Recovery action; Console; Resource planning; Accounting; Security; Inventory]

Configuration
Collection
Transport
Repository mgmt
Display

Page 5: Performance and Exception Monitoring Project Tim Smith CERN/IT

Objectives

To provide tools in which the alarms and displays are orientated to the overall service provided:
  User end-to-end views, quality of service views
  Managerial views of resource usage / evolution / failure rates
  Service provider views, and detailed machine views
  Link the alarms to both the monitoring and corrective actions
To provide service level metrics
To provide a uniform monitoring infrastructure
  Coordinated central repositories + common logging format
  Averaging and archiving of logged information
  Correlations between logged information
  Multiple input routes; extensible monitoring clients
  Modular tools; demonstrated scalability

Page 6: Performance and Exception Monitoring Project Tim Smith CERN/IT

Process

Analysis
  User Requirements Document
  Current Tools survey
    Enterprise/Cluster mgmt, public domain, other labs, building blocks, DAQ, Run Control, Slow Control
  Goal / Question / Metric formalism
  System Requirements Document
Design
  Interfaces Document
Prototyping

Page 7: Performance and Exception Monitoring Project Tim Smith CERN/IT

Goal / Question / Metric

Goal: Ensure quality of Interactive Service
Questions:
  Sufficient nodes?
  Low enough load?
  Slow to respond to commands?
  Contactable via network?
Metrics:
  Network daemons alive
  No nologin
  Free ptys
  Connection test from remote node
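The Goal / Question / Metric breakdown on this slide can be pictured as a small tree: a goal holds questions, and each question is answered by metric checks. The following is a minimal sketch under that reading, not PEM code. The class names (GqmSketch, Goal, Question) are hypothetical, the metrics are attached to the "Contactable via network?" question as an assumption, and the probes are replaced by plain values passed in by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a Goal / Question / Metric tree; not PEM code. */
class GqmSketch {

    /** A question is satisfied only when all of its metric checks pass. */
    static final class Question {
        final String text;
        final Map<String, BooleanSupplier> metrics = new LinkedHashMap<>();

        Question(String text) { this.text = text; }

        Question metric(String name, BooleanSupplier check) {
            metrics.put(name, check);
            return this;
        }

        /** Names of the metrics whose checks currently fail, in insertion order. */
        List<String> failing() {
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, BooleanSupplier> e : metrics.entrySet()) {
                if (!e.getValue().getAsBoolean()) failed.add(e.getKey());
            }
            return failed;
        }

        boolean satisfied() { return failing().isEmpty(); }
    }

    /** A goal is met when every question attached to it is satisfied. */
    static final class Goal {
        final String text;
        final List<Question> questions = new ArrayList<>();

        Goal(String text) { this.text = text; }

        Goal ask(Question q) { questions.add(q); return this; }

        boolean met() {
            for (Question q : questions) {
                if (!q.satisfied()) return false;
            }
            return true;
        }
    }

    /** Builds the slide's example; the arguments stand in for real probes. */
    static Goal interactiveService(boolean daemonsAlive, boolean nologinAbsent,
                                   int freePtys, boolean remoteConnectOk) {
        Question contactable = new Question("Contactable via network?")
            .metric("Network daemons alive", () -> daemonsAlive)
            .metric("No nologin", () -> nologinAbsent)
            .metric("Free ptys", () -> freePtys > 0)
            .metric("Connection test from remote node", () -> remoteConnectOk);
        return new Goal("Ensure quality of Interactive Service").ask(contactable);
    }

    public static void main(String[] args) {
        Goal goal = interactiveService(true, true, 0, true);
        System.out.println(goal.text + ": " + (goal.met() ? "met" : "not met"));
        for (Question q : goal.questions) {
            System.out.println("  " + q.text + " failing=" + q.failing());
        }
    }
}
```

Keeping the failing metric names, rather than a single yes/no, is one way to tie an alarm back to the measurement that caused it.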

Page 8: Performance and Exception Monitoring Project Tim Smith CERN/IT

PEM Architecture

[Diagram: PEM components UserInterface, MonitoringAgent, MonitoringBroker, MeasurementRepository, ConfigurationRepository, CorrelationEngine and AccessServer, linked by associations with multiplicities 1 and 1..n; part of the diagram is marked OutsidePEM]

Page 9: Performance and Exception Monitoring Project Tim Smith CERN/IT

Configuration Repository

Loading the DB: Host, Host type, Metrics, Services

[Diagram labels: XML documents (<TAG> ... </TAG>), XMLSchema, Parser, XML-DBMS, jdbc, RDBMS, Viewers]

Xerces from Apache
XML-DBMS freeware (tried XSU from Oracle)
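To illustrate the parser step only, here is a minimal sketch that reads host, host type and metric entries from an XML document using the JAXP DOM API bundled with the JDK. The element and attribute names (hosts, host, metric, name, type) and the class ConfigLoadSketch are assumptions made for the example; the slides do not show the actual schema, and the XML-DBMS and JDBC stages are not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of parsing a configuration document; the layout is assumed. */
class ConfigLoadSketch {

    /** One configured host: its name, host type, and the metrics to collect. */
    static final class HostConfig {
        final String name;
        final String type;
        final List<String> metrics;

        HostConfig(String name, String type, List<String> metrics) {
            this.name = name;
            this.type = type;
            this.metrics = metrics;
        }
    }

    /** Assumed layout: hosts > host[name, type] > metric[name]. */
    static final String SAMPLE =
        "<hosts>"
        + "<host name=\"node1\" type=\"interactive\">"
        + "<metric name=\"load\"/><metric name=\"free_ptys\"/>"
        + "</host>"
        + "<host name=\"node2\" type=\"batch\"><metric name=\"load\"/></host>"
        + "</hosts>";

    /** Parses the document and returns one HostConfig per host element. */
    static List<HostConfig> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<HostConfig> hosts = new ArrayList<>();
            NodeList hostNodes = doc.getElementsByTagName("host");
            for (int i = 0; i < hostNodes.getLength(); i++) {
                Element host = (Element) hostNodes.item(i);
                List<String> metrics = new ArrayList<>();
                NodeList metricNodes = host.getElementsByTagName("metric");
                for (int j = 0; j < metricNodes.getLength(); j++) {
                    metrics.add(((Element) metricNodes.item(j)).getAttribute("name"));
                }
                hosts.add(new HostConfig(
                    host.getAttribute("name"), host.getAttribute("type"), metrics));
            }
            return hosts;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers see one unchecked error type.
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
    }

    public static void main(String[] args) {
        for (HostConfig h : parse(SAMPLE)) {
            System.out.println(h.name + " (" + h.type + ") " + h.metrics);
        }
    }
}
```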

Page 10: Performance and Exception Monitoring Project Tim Smith CERN/IT

Configuration Repository

Querying the DB

[Diagram labels: XML documents (<TAG> ... </TAG>), Parser, XML-DBMS, jdbc, RDBMS, XML DB]

[Diagram labels: jdbc, ConfigurationItems, Java Objects]

Page 11: Performance and Exception Monitoring Project Tim Smith CERN/IT

Correlation Engine

To correlate metrics from the MRS according to configuration in the CRS
  Metric collections: trends + multiple machines
  Samplings: union for read efficiency from MRS

Example Java classes:
  Correlation coordinator
  Sampling cache
  Evaluators
  Timers
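One way to read the points above is sketched below: each evaluator declares the samplings it needs, the coordinator takes the union of those requests so that every sampling is read from the repository only once, and the evaluators then run against the cached data. This is an illustrative sketch, not the project's classes. All names (CorrelationSketch, SamplingCache, Evaluator, trend, meanAbove) and the "host/metric" key format are assumptions, the repository is stood in for by a plain function, timers are left out, and missing or empty series are not handled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of one correlation pass; names are assumptions, not PEM classes. */
class CorrelationSketch {

    /** Declares which samplings it needs (keys like "node1/load") and judges them. */
    interface Evaluator {
        Set<String> needs();
        boolean evaluate(Map<String, double[]> samplings);
    }

    /** Reads each requested sampling from the repository at most once. */
    static final class SamplingCache {
        private final Function<String, double[]> repository;
        private final Map<String, double[]> cache = new HashMap<>();
        int reads = 0;

        SamplingCache(Function<String, double[]> repository) {
            this.repository = repository;
        }

        Map<String, double[]> load(Set<String> keys) {
            for (String key : keys) {
                if (!cache.containsKey(key)) {
                    cache.put(key, repository.apply(key));
                    reads++;
                }
            }
            return cache;
        }
    }

    /** Least-squares slope of a series against its sample index. */
    static double slope(double[] y) {
        int n = y.length;
        if (n < 2) return 0.0;
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double v : y) meanY += v / n;
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (y[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    /** Trend evaluator: fires when one series rises faster than maxSlope per sample. */
    static Evaluator trend(String key, double maxSlope) {
        return new Evaluator() {
            public Set<String> needs() { return Collections.singleton(key); }
            public boolean evaluate(Map<String, double[]> s) {
                return slope(s.get(key)) > maxSlope;
            }
        };
    }

    /** Multi-machine evaluator: fires when the mean of the latest values exceeds limit. */
    static Evaluator meanAbove(Set<String> keys, double limit) {
        return new Evaluator() {
            public Set<String> needs() { return keys; }
            public boolean evaluate(Map<String, double[]> s) {
                double sum = 0.0;
                for (String k : keys) {
                    double[] series = s.get(k);
                    sum += series[series.length - 1];
                }
                return sum / keys.size() > limit;
            }
        };
    }

    /** Coordinator: union all needs, load them once, then run every evaluator. */
    static Map<String, Boolean> correlate(Map<String, Evaluator> evaluators,
                                          SamplingCache cache) {
        Set<String> union = new LinkedHashSet<>();
        for (Evaluator e : evaluators.values()) union.addAll(e.needs());
        Map<String, double[]> samplings = cache.load(union);
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, Evaluator> e : evaluators.entrySet()) {
            results.put(e.getKey(), e.getValue().evaluate(samplings));
        }
        return results;
    }
}
```

With a trend evaluator on "node1/load" and a multi-machine evaluator over "node1/load" and "node2/load", the cache performs two repository reads rather than three, which is the read saving the union is meant to give.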

Page 12: Performance and Exception Monitoring Project Tim Smith CERN/IT

Publish / Subscribe: Java RMI

Interfaces Document

[Diagram labels: UserInterface, MonitoringAgent, MonitoringBroker, MeasurementRepository, ConfigurationRepository, CorrelationEngine, AccessServer]

Events:
  metric stream
  metric value
  exception
  configuration
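The publish / subscribe pattern named on this slide could be expressed with RMI-style interfaces along the following lines. This is a hedged sketch: the interface and class names (Publisher, Subscriber, Event, Broker) are hypothetical and are not taken from the Interfaces Document. The interfaces extend java.rmi.Remote and the event is Serializable so that they could be exported over RMI, but the demo calls them in-process and does not export anything.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of RMI-shaped publish / subscribe; exercised in-process only. */
class PubSubSketch {

    /** The four event kinds listed on the slide. */
    enum Kind { METRIC_STREAM, METRIC_VALUE, EXCEPTION, CONFIGURATION }

    /** Events must be Serializable to cross an RMI boundary. */
    static final class Event implements Serializable {
        private static final long serialVersionUID = 1L;
        final Kind kind;
        final String source;
        final String payload;

        Event(Kind kind, String source, String payload) {
            this.kind = kind;
            this.source = source;
            this.payload = payload;
        }
    }

    /** Callback interface implemented by anything that wants events. */
    interface Subscriber extends Remote {
        void onEvent(Event event) throws RemoteException;
    }

    /** Interface a broker would offer to agents and consumers. */
    interface Publisher extends Remote {
        void subscribe(Kind kind, Subscriber subscriber) throws RemoteException;
        void publish(Event event) throws RemoteException;
    }

    /** Routes each published event to the subscribers registered for its kind. */
    static final class Broker implements Publisher {
        private final Map<Kind, List<Subscriber>> subscribers = new EnumMap<>(Kind.class);

        public synchronized void subscribe(Kind kind, Subscriber subscriber) {
            subscribers.computeIfAbsent(kind, k -> new ArrayList<>()).add(subscriber);
        }

        public void publish(Event event) throws RemoteException {
            List<Subscriber> targets;
            synchronized (this) {
                // Copy under the lock so callbacks run without holding it.
                targets = new ArrayList<>(
                    subscribers.getOrDefault(event.kind, Collections.emptyList()));
            }
            for (Subscriber s : targets) s.onEvent(event);
        }
    }

    /** Subscribes to exceptions only, publishes two events, returns what arrived. */
    static List<String> demo() {
        List<String> received = new ArrayList<>();
        Broker broker = new Broker();
        try {
            broker.subscribe(Kind.EXCEPTION, e -> received.add(e.source + ": " + e.payload));
            broker.publish(new Event(Kind.METRIC_VALUE, "node1", "load=0.7"));
            broker.publish(new Event(Kind.EXCEPTION, "node1", "no free ptys"));
        } catch (RemoteException ex) {
            throw new IllegalStateException(ex);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

To use the same interfaces across processes, the broker and subscribers would additionally be exported (for example with UnicastRemoteObject.exportObject) and looked up through a registry; that wiring is omitted here.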

Page 13: Performance and Exception Monitoring Project Tim Smith CERN/IT

Monitoring Agent/Broker I

SNMP
  Extended existing infrastructure
  Multithreaded broker loading DB
JMX / JDMK
  JMX public specification: managed resources
  Pluggable agents
  Reported several important bugs
  Demo at JavaOne conference
    Linux/NT remote reset
    Netlogger instrumentation
  Opened up license negotiations
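For readers unfamiliar with the "managed resources" idea in JMX, the sketch below exposes a resource as a standard MBean and reads it back through an MBean server. It assumes the javax.management API that ships with current JDKs, not JDMK, and it is not the project's agent. NodeLoad, its attributes, and the pem.sketch domain are hypothetical names chosen for the example.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical sketch of a JMX managed resource; not the PEM agent. */
class JmxSketch {

    /** Standard MBean convention: class NodeLoad is managed through NodeLoadMBean. */
    public interface NodeLoadMBean {
        double getLoad();      // exposed as attribute "Load"
        int getFreePtys();     // exposed as attribute "FreePtys"
        void reset();          // exposed as operation "reset"
    }

    /** The managed resource; values are set by hand instead of measured. */
    public static final class NodeLoad implements NodeLoadMBean {
        private volatile double load;
        private final int freePtys;

        NodeLoad(double load, int freePtys) {
            this.load = load;
            this.freePtys = freePtys;
        }

        public double getLoad() { return load; }
        public int getFreePtys() { return freePtys; }
        public void reset() { load = 0.0; }
    }

    /** Registers the MBean, reads both attributes, invokes reset, reads Load again. */
    static Object[] registerAndRead(String host, double load, int freePtys) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("pem.sketch:type=NodeLoad,host=" + host);
            if (server.isRegistered(name)) server.unregisterMBean(name);
            server.registerMBean(new NodeLoad(load, freePtys), name);

            Object loadBefore = server.getAttribute(name, "Load");
            Object ptys = server.getAttribute(name, "FreePtys");
            server.invoke(name, "reset", new Object[0], new String[0]);
            Object loadAfter = server.getAttribute(name, "Load");

            server.unregisterMBean(name);
            return new Object[] { loadBefore, ptys, loadAfter };
        } catch (Exception e) {
            throw new IllegalStateException("JMX sketch failed", e);
        }
    }

    public static void main(String[] args) {
        Object[] r = registerAndRead("node1", 0.7, 3);
        System.out.println("Load=" + r[0] + " FreePtys=" + r[1] + " after reset=" + r[2]);
    }
}
```

The point of the pattern is that a management client only needs the object name and attribute names, so new resource types can be plugged into an agent without changing the client.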

Page 14: Performance and Exception Monitoring Project Tim Smith CERN/IT

Monitoring Agent/Broker II

C: low overhead

[Diagram labels: SNMP, /proc, netlogger, Script; Monitoring Process, Spool, Spool Manager, Monitoring Broker]

Not yet … DMTF DMI, CIM

Page 15: Performance and Exception Monitoring Project Tim Smith CERN/IT

PEM Futures

Today: CERN CC needs it
  Prototype for ALICE MDC III in January
Tomorrow: Tier-0 RC / GRID node need it
  More complete management solutions
  Integrate into the Fabric Management WP
  ‘GRIDification’
Rapidly evolving technologies
  Lots of middleware
  Lots of companies wanting collaboration
  Still need framework

Page 16: Performance and Exception Monitoring Project Tim Smith CERN/IT

PEM in Perspective

[Diagram labels: Configuration Management; Alarm; Recovery Actions; Inventory; Resource Planning; Security]

[Diagram labels: PC Hardware; Console Mgmt; Power Mgmt/Remote Reset; OS Installation/Update; OS Configuration/Update; Application Inst/Update; Monitoring]