48
Iosif Legrand N ovember 2008 1 Iosif Legrand, Harvey Newman, Iosif Legrand, Harvey Newman, Ramiro Voicu , Costin Grigoras, Ramiro Voicu , Costin Grigoras, Catalin Cirstoiu, Ciprian Dobre Catalin Cirstoiu, Ciprian Dobre An Agent Based, Dynamic Service System to Monitor, An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems Control and Optimize Distributed Systems ACAT - November 2008 ERICE

1 Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Catalin Cirstoiu, Ciprian Dobre An Agent Based, Dynamic Service System to Monitor, Control

Embed Size (px)

Citation preview

Iosif Legrand November 20081

Iosif Legrand, Harvey Newman, Iosif Legrand, Harvey Newman, Ramiro Voicu , Costin Grigoras, Ramiro Voicu , Costin Grigoras, Catalin Cirstoiu, Ciprian DobreCatalin Cirstoiu, Ciprian Dobre

An Agent Based, Dynamic Service System to Monitor,An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed SystemsControl and Optimize Distributed Systems

ACAT - November 2008 ERICE

Iosif Legrand November 2008 22

The MonALISA FrameworkThe MonALISA Framework

MonALISA is a Dynamic, Distributed Service System capable to collect any type of information from different systems, to analyze it in near real time and to provide support for automated control decisions and global optimization of workflows in complex grid systems.

The MonALISA system is designed as an ensemble of autonomous multi-threaded, self-describing agent-based subsystems which are registered as dynamic services, and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information, in a distributed way, and to provide optimization decisions in large scale distributed applications.

Iosif Legrand November 2008

Distributed Object Systems Distributed Object Systems CORBA , DCOM CORBA , DCOM

LookupServiceStubLookup

Service Skeleton CLIENTServer

“Traditional” Distributed Object Models(CORBA, DCOM)

“IDL” Compiler

The Stub is linked to the Client.The Client must know about theservice from the beginning and needs the right stub for it

The Server and the client code must be created together !!

Iosif Legrand November 2008

Distributed Object Systems Distributed Object Systems Web Services WSDL/SOAP Web Services WSDL/SOAP

LookupService

WSDL

CLIENTServer

LookupService

Interface

SOAP

The client can dynamically generate the data structures and the interfaces for using remote objects based on WSDL

Platform independent

Iosif Legrand November 2008 5

Mobile Code and Distributed ServicesMobile Code and Distributed Services

Act as a true dynamic service and provide the necessary functionally to be used by any other services that require such information (Jini, interface to WSDL / SOAP) mechanism to dynamically discover all the “Service Units" remote event notification for changes in the any system lease mechanism for each registered unit

Dynamic Code Loading

LookupServiceProxy CLIENT

LookupService

Proxy

Service

Services can be used dynamicallyRemote Services Proxy == RMI StubMobile Agents Proxy == Entire Service “Smart Proxies” Proxy adjusts to the client

Any well suited protocol for the application

Iosif Legrand November 2008 6

MonALISA Service & Data HandlingMonALISA Service & Data Handling

6

Data Store

Data CacheService & DB

Configuration Control (SSL)

Predicates & Agents

Data (via ML Proxy)

Applications Clients or Higher Level

Services

WS Clients andservice

WebService

WSDLSOAP

LookupService

LookupService

Registration

Discovery

Postgres

AGENTSAGENTS

FILTERS / TRIGGERSFILTERS / TRIGGERS

Monitoring ModulesMonitoring ModulesCollects any type of information Dynamic Loading

Push and Pull

Iosif Legrand November 2008 7

The MonALISA ArchitectureThe MonALISA Architecture

7

Regional or Global High Level Regional or Global High Level Services, Services, Repositories & ClientsRepositories & Clients

Secure and reliable communicationSecure and reliable communicationDynamic load balancing Dynamic load balancing Scalability & ReplicationScalability & ReplicationAAA for ClientsAAA for Clients

Distributed Dynamic Distributed Dynamic Registration and Discovery-Registration and Discovery-based on a lease based on a lease mechanism and remote eventsmechanism and remote events

JINI-Lookup Services Secure & Public

MonALISA services

Proxies

HL services

Agents

Network of

Distributed System for gathering and Distributed System for gathering and analyzing information based on analyzing information based on mobile agents: mobile agents: Customized aggregation, Triggers,Customized aggregation, Triggers,ActionsActions

Fully Distributed System with no Single Point of Failure

Iosif Legrand November 2008 8

LookupService

Registration / Discovery Registration / Discovery Admin Access and AAA for ClientsAdmin Access and AAA for Clients

MonALISAService

LookupService

Client(other service)

DiscoveryRegistration

(signed certificate)

MonALISAService

MonALISAService

Services Proxy

Multiplexer

Services Proxy

Multiplexer

Client(other service)

Admin SSL connection

Trustkeystore

AAA services

Client authentication

Data Data Filters & AgentsFilters & Agents

Trustkeystore

Application

Applications

Iosif Legrand November 2008 9

Monitoring Grid sites, Running Jobs, Monitoring Grid sites, Running Jobs, Network Traffic, and ConnectivityNetwork Traffic, and Connectivity

9

TOPOLOGY

JOBS

ACCOUNTING

Running Jobs

Iosif Legrand November 2008 10

Monitoring architecture in ALICEMonitoring architecture in ALICE

10

Long HistoryDB

LCG Tools

MonALISA @Site

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

MonALISA @CERN

MonALISA

LCG Site

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn TQ

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn IS

ApMon

AliEn Optimizers

ApMon

AliEn Brokers

ApMon

MySQLServers

ApMon

CastorGridScripts

ApMon

APIServices

MonaLisaMonaLisaRepositoryRepository

Aggregated Data

rss

vsz

cputime

run

tim

e

job

slots

free

spac

e

nr.

of

file

s

op

en

files

Queued

JobAgents

cpu

ksi2k

jobstatus

disk

used

pro

cesses

loadn

etIn

/ou

t

jobsstatussockets

migratedmbytes

active

sessions

MyP

roxy

status

Alerts

Actions

Iosif Legrand November 2008 11

http://pcalimonitor.cern.ch

ALICE : Global Views, Status & JobsALICE : Global Views, Status & Jobs

Iosif Legrand November 2008 12

ALICE: Job status – history plotsALICE: Job status – history plots

Iosif Legrand November 2008 13

ALICE: Resource Usage monitoringALICE: Resource Usage monitoring

Cumulative parameters CPU Time & CPU KSI2K Wall time & Wall KSI2K Read & written files Input & output traffic (xrootd)

Running parameters Resident memory Virtual memory

Open files Workdir size Disk usage CPU usage

Aggregated per site

Iosif Legrand November 2008 14

ALICE: Job agents monitoringALICE: Job agents monitoring

From Job Agent itself Requesting job Installing packages Running job Done Error statuses

From Computing Element Available job slots Queued Job Agents Running Job Agents

Iosif Legrand November 2008 15

Monitoring the Execution of JobsMonitoring the Execution of Jobs and the Time Evolution and the Time Evolution

15

SPLIT JOBSSPLIT JOBS

LIFELINES for JOBS

Job Job

Job1

Job2

Job3

Job31

Job32

Summit a Job

DAG

Iosif Legrand November 2008 16

Two levels of decisions:

local (autonomous),

global (correlations).

Actions triggered by:

values above/below given thresholds,

absence/presence of values,

correlations between any values.

Action types:

alerts (emails/instant msg/atom feeds),

running an external command,

automatic charts annotations in the repository,

running custom code, like securely ordering a ML service to (re)start a site service.

ML ServiceML Service

ML ServiceML Service

Actions based onActions based onglobal informationglobal information

Actions based onActions based onlocal informationlocal information

• Traffic• Jobs• Hosts• Apps

• Temperature• Humidity• A/C Power• …

SensorsSensors Local Local decisionsdecisions

Global Global decisionsdecisions

Local and Global Decision FrameworkLocal and Global Decision Framework

Global ML

Services

Iosif Legrand November 2008 17

ALICE: Automatic job submissionALICE: Automatic job submissionRestarting ServicesRestarting Services

17

MySQL daemon is automatically restartedwhen it runs out of memoryTrigger: threshold on VSZ memory usage

ALICE Production jobs queue is kept full by the automatic submissionTrigger: threshold on the number of aliprod waiting jobs

Administrators are kept up-to-date on the services’ statusTrigger: presence/absence of monitored information

Iosif Legrand November 2008 18

ALICE is using the monitoring information to automatically:

resubmit error jobs until a target completion percentage is reached,

submit new jobs when necessary (watching the task queue size for each service account)

production jobs,

RAW data reconstruction jobs, for each pass,

restart site services, whenever tests of VoBox services fail but the central services are OK,

send email notifications / add chart annotations when a problem was not solved by a restart,

dynamically modify the DNS aliases of central services for an efficient load-balancing.

Most of the actions are defined by few lines configuration files.

Automatic actions in ALICEAutomatic actions in ALICE

Iosif Legrand November 2008 19

Monitoring USLHCnetMonitoring USLHCnet

Operations & management assisted by agent-based softwareOperations & management assisted by agent-based software Used on the new CIENA equipment used for network managmentUsed on the new CIENA equipment used for network managment

Iosif Legrand November 2008 20

USLHCnet: USLHCnet: Precise measurements Precise measurements for the Operational Status on the WAN Linkfor the Operational Status on the WAN Link

Operations & management assisted by agent-based softwareOperations & management assisted by agent-based software Used on the new CIENA equipment used for network managmentUsed on the new CIENA equipment used for network managment

Iosif Legrand November 2008 21

USLHCnet: Traffic on different segmentsUSLHCnet: Traffic on different segments

Iosif Legrand November 2008 22

USLHCnet: Accounting for Integrated TrafficUSLHCnet: Accounting for Integrated Traffic

Iosif Legrand November 2008 23

The UltraLight NetworkThe UltraLight Network

BNL ESnet IN /OUT

Iosif Legrand November 2008 24

Available Bandwidth MeasurementsAvailable Bandwidth Measurements

Embedded Pathload module.Embedded Pathload module.

24

Iosif Legrand November 2008 25

Monitoring Network Topology, Monitoring Network Topology, Latency, RoutersLatency, Routers

NETWORKS

AS

ROUTERS

Real Time Topology Discovery & DisplayReal Time Topology Discovery & Display

Iosif Legrand November 2008 26

EVO : Real-Time monitoring for ReflectorsEVO : Real-Time monitoring for Reflectorsand the quality of all possible connectionsand the quality of all possible connections

Iosif Legrand November 2008 27

EVO: Creating a Dynamic, Global, Minimum EVO: Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivitySpanning Tree to optimize the connectivity

Tuv

uvwTw),(

)),(()(

A weighted connected graph G = (V,E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every second.Building in near real time a minimum- spanning tree with addition constrains

Iosif Legrand November 2008 28

Dynamic MST to optimize the Dynamic MST to optimize the Connectivity for ReflectorsConnectivity for Reflectors

Frequent measurements of RTT, jitter, traffic and lost packages The MST is recreated in ~ 1 S case on communication problems.

Iosif Legrand November 2008 29

EVO: Optimize how clients connect to the EVO: Optimize how clients connect to the system for best performance and load balancingsystem for best performance and load balancing

Iosif Legrand November 2008 3030

FDT – Fast Data TransferFDT – Fast Data Transfer

FDT is an application for efficient data transfers.

Easy to use. Written in java and runs on all major platforms.

It is based on an asynchronous, multithreaded system which is using the NIO library and is able to:

stream continuously a list of files

use independent threads to read and write on each physical device

transfer data in parallel on multiple TCP streams, when necessary

use appropriate size of buffers for disk IO and networking

resume a file transfer session

Iosif Legrand November 2008 3131

FDT – Fast Data Transfer FDT – Fast Data Transfer

Pool of buffers Kernel Space

Pool of buffers Kernel Space

Data Transfer Sockets / Channels

Independent threads per device

Restore the files frombuffers

Control connection / authorization

Iosif Legrand November 2008 32

FDT featuresFDT features

April 2007 Iosif Legrand32

The FDT architecture allows to "plug-in" external security The FDT architecture allows to "plug-in" external security APIs and to use them for client authentication and APIs and to use them for client authentication and authorization. Supports several security schemes :authorization. Supports several security schemes :

• IP filtering IP filtering • SSH SSH • GSI-SSHGSI-SSH• Globus-GSI Globus-GSI • SSL SSL

User defined loadable modules for Pre and Post User defined loadable modules for Pre and Post Processing to provide support for dedicated MS system, Processing to provide support for dedicated MS system, compression … compression …

FDT can be monitored and controlled dynamically by FDT can be monitored and controlled dynamically by MonALISA servicesMonALISA services

Iosif Legrand November 2008 33October 2006 Iosif Legrand

33

FDT – Memory to Memory Tests in WANFDT – Memory to Memory Tests in WAN

CPUs Dual Core Intel

 Xenon @ 3.00  GHz,  4 GB RAM, 4 x 320 GB SATA Disks Connected with 10Gb/s Myricom

~9.0 Gb/s

~9.4 Gb/s

Iosif Legrand November 2008 34

Disk -to- Disk transfers in WANDisk -to- Disk transfers in WAN

NEW YORK GENEVA

Reads and writes on 4 SATA disks in parallel on each server

Mean traffic ~ 210 MB/s~ 0.75 TB per hour

MB

/s

CERN CALTECH

Reads and writes on two 12-port RAID Controllers in parallel on each server

Mean traffic ~ 545 MB/s~ 2 TB per hour

1U Nodes with 4 Disks 4U Disk Servers with 24 Disks

October 2007 Iosif Legrand

Lustre read/ write ~ 320 MB/s between Florida and Caltech Works with xrootd Interface to dCache using the dcap protocol

Iosif Legrand November 2008 35

Dynamic restorationof lightpath if a segment has problems

Monitoring Optical SwitchesMonitoring Optical Switches

Iosif Legrand November 2008 36

Monitoring the Topology and Optical Monitoring the Topology and Optical Power on Fibers for Optical CircuitsPower on Fibers for Optical Circuits

Port power monitoring

Controlling

Glimmerglass Switch Example

Iosif Legrand November 2008 37

““On-Demand”, End to End Optical On-Demand”, End to End Optical Path AllocationPath Allocation

37

Internet

A

>FDT A/fileX B/path/

OS path availableConfiguring interfacesStarting Data Transfer

Mo

nito

r

Co

ntro

l

TL

1

Optical Switch

MonALISAService

MonALISA Distributed Service System

BOSAgent

Active light path

Regul

ar IP

pat

hReal time monitoring

APPLICATION

LISA AGENTLISA sets up - Network Interfaces - TCP stack - Kernel parameters - RoutesLISA APPLICATION“use eth1.2, …”

LISALISA AgentAgent

DATA

CREATES AN END TO END PATH < 1s

Detects errors and automatically recreate theDetects errors and automatically recreate the path in less than the TCP timeout path in less than the TCP timeout

Iosif Legrand November 2008 38

CERNGeneva

CALTECHPasadena

Starlight

Manlan

USLHCnet

Internet2

Controlling Optical Planes Controlling Optical Planes Automatic Path RecoveryAutomatic Path Recovery

“Fiber cut” simulationsThe traffic moves from one transatlantic line to the other oneFDT transfer (CERN – CALTECH) continues uninterruptedTCP fully recovers in ~ 20s

1

23

4

FDT Transfer

4 Fiber cuts simulations

200+ MBytes/secFrom a 1U Node

4 fiber cut emulations

Iosif Legrand November 2008 39

End to End Path Provisioning End to End Path Provisioning on different layerson different layers

Layer 3

Layer 2

Layer 1

Default IP route

VCAT and VLAN channels

Optical path

Site A

Site B

Monitor layout / Setup circuit

Monitor host & end-to-end paths / Setup end-host parameters

Control transfers and bandwidth reservations

Monitor interfaces traffic

Iosif Legrand November 2008 40

APPLICATION

>FDT A/fileX B/path/

path or channel allocationConfiguring interfacesStarting Data Transfer

Regular IP path Regular IP

pathLocal VLANs

Recommended to use two NICs -one for management /one for data -- bonding two NICs to the same IP

MAP Local VLANsto WAN channels or light paths

““On-Demand”, L2 Dynamic On-Demand”, L2 Dynamic Channel and Path AllocationChannel and Path Allocation

Iosif Legrand November 2008 41

The Need for Planning and Scheduling for The Need for Planning and Scheduling for Large Data TransfersLarge Data Transfers

In Parallel Sequential

2.5 X Faster to perform the two reading tasks sequentially

Iosif Legrand November 2008 42

UserScheduling

ControlMonitoring

End Host Agents

RealtimeFeedback

Request

Channel allocation based on VO/Priority, [ + Wait time, etc.] Create on demand a End-to-end path or Channel & configure end-hosts Automatic recovery (rerouting) in case of errors Dynamic reallocation of throughputs per channel: to manage priorities,

control time to completion, where needed Reallocate resources requested but not used

Dynamic Path Provisioning Dynamic Path Provisioning Queueing and Scheduling Queueing and Scheduling

Iosif Legrand November 2008 43

Dynamic priority for FDT TransfersDynamic priority for FDT Transferson common segmentson common segments

Priority 4

Priority 2

Priority 8

Iosif Legrand November 2008 44

Bandwidth Challenge at SC2005

151 Gbs

~ 500 TB Total in 4h

Iosif Legrand November 2008 45

FDT & MonLISA Used at SC 2006FDT & MonLISA Used at SC 2006

April 2007 Iosif Legrand

17.7 Gb/s Disk to Disk 17.7 Gb/s Disk to Disk on 10 Gb/s link used inon 10 Gb/s link used inBoth directions from Both directions from Florida to CaltechFlorida to Caltech

Iosif Legrand November 2008 46

Official BWCOfficial BWCHyper BWCHyper BWC

SC2006SC2006

April 2007 Iosif Legrand

Iosif Legrand November 2008 47

SC2007SC2007

+80 Gb/s disk to disk

Iosif Legrand November 2008 48

Communities using MonALISACommunities using MonALISA

48

Major Communities

ALICE CMS ATLAS EVO LGC RUSSIA UNAM Grid (Mx) ITU

USLHCNET ULTRALIGHT GLORIAD ABILENE RoEduNET Enlightened

--

VRVSVRVSALICE

USLHCnetUSLHCnet

EVOEVO

OSGOSG

MonALISA TodayRunning 24 X 7

at ~340 Sites Collecting ~ 1 000 000

parameters in near real-time

Update rate of 20,000 parameter updates per second

Monitoring12,000 computers > 100 WAN Links

Thousands of Grid jobs running concurrently

http://monalisa.caltech.edu