CERN - European Laboratory for Particle Physics
Event Filter Farms
LHC - 28 September 1999
Javier Jaen Martinez, CERN IT/PDP


Page 1: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Javier Jaen Martinez CERN IT/PDP

Page 2: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Table of Contents

• Motivation & Goals
• Types of Farms
• Core Issues
• Examples
• JMX: A Management Technology
• Summary

Page 3: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Study Goals
• How are farms evolving in non-HEP environments?
• Do generic PC farms and filter farms share requirements for system/application monitoring, control, and management?
• Will we benefit from future developments in other domains?
• What are the emerging technologies for farm computing?

Page 4: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Introduction

According to Pfister, there are three ways to improve performance. In terms of computing technologies:
• Work harder ~ using faster hardware
• Work smarter ~ using more efficient algorithms and techniques
• Getting help ~ depending on how processors, memory, and interconnect are laid out: MPP, SMP, distributed systems, and farms

Page 5: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Motivation
• IT/PDP is already using commodity farms
• All 4 LHC experiments will use event filter farms
• Commodity farms are also becoming very popular for non-HEP applications

Page 6: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Motivation

1000s of tasks and 1000s of nodes to be controlled, monitored, and managed (a system and application management challenge).

Page 7: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Types of Farms

In our domain:
• Event Filter Farms
  – To filter data acquired in previous levels of a DAQ
  – To reduce the aggregate throughput by rejecting uninteresting events or by compressing them

[Diagram: the event building stage feeds multiple sub-farms; each sub-farm input (SFI) feeds an event filter unit (EFU) made up of processing elements (PE)]

Page 8: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Types of Farms
• Batch Data Processing
  – Each job reads data from tape, processes the information, and writes data back
  – Each job runs on a separate node
  – Job management performed by a batch scheduler
  – Nodes with good CPU performance and large disks
  – Good connectivity to mass storage
  – Inter-node communication not critical (independent jobs)
• Interactive Data Analysis
  – Analysis and data mining
  – Traverse large databases as fast as possible
  – Programs may run in parallel
  – Nodes with great CPU performance and large disks
  – High-performance inter-process communication

Page 9: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Types of Farms
• Monte Carlo Simulation
  – Used to simulate detectors
  – Simulation jobs run independently on each node
  – Similar to a batch data processing system (maybe with lower disk requirements)
• Others
  – Workgroup services
  – Central data recording farms
  – Disk server farms
  – ...

Page 10: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Types of Farms
In non-HEP environments:
• High Performance Farms (parallel)
  – A collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
  – The farm is seen as a computer architecture for parallel computation
• High Availability Farms
  – Mission-critical applications
  – Hot standby
  – Failover and failback

Page 11: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Key Issues in Farm Computing
• Size scalability (physical & application)
• Enhanced availability (failure management)
• Single system image (look-and-feel of one system)
• Fast communication (networks & protocols)
• Load balancing (CPU, net, memory, disk)
• Security and encryption (farm of farms)
• Distributed environment (social issues)
• Manageability (administration and control)
• Programmability (offered API)
• Applicability (farm-aware and non-aware applications)

Page 12: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Core Issues (Maturity)

[Diagram: core issues placed on a maturity scale, from "mature" through development to future challenge: load balancing, failure management, SSI, fast communication, manageability, monitoring]

Page 13: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Monitoring… why?

• Performance Tuning:
  – The environment changes dynamically due to the variable load on the system and the network
  – Improving or maintaining the quality of services in response to those changes
  – Reactive control monitoring acts on farm parameters to obtain the desired performance (see the sketch below)
• Fault Recovery:
  – To know the source of any failure, in order to improve robustness and reliability
  – An automatic fault recovery service is needed in farms with hundreds of nodes (migration, …)
• Security:
  – To detect and report security violation events
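A minimal Java sketch of the reactive control loop mentioned above. It assumes a Linux-style /proc/loadavg as the metric source; the threshold value and the corrective shedWork() action are hypothetical placeholders for whatever the farm actually exposes.

    // Reactive control monitoring: poll a farm metric and act when it
    // crosses a threshold. Metric source and corrective action are
    // placeholders, not a specific farm's interface.
    public class ReactiveMonitor {
        private static final double LOAD_THRESHOLD = 5.0; // illustrative value

        public static void main(String[] args) throws Exception {
            while (true) {
                double load = readLoadAverage();
                if (load > LOAD_THRESHOLD) {
                    System.err.println("Overload (" + load + "), acting on farm parameters");
                    shedWork(); // hypothetical: migrate or suspend tasks
                }
                Thread.sleep(10_000); // poll every 10 s
            }
        }

        // Reads the 1-minute load average from /proc/loadavg (Linux).
        private static double readLoadAverage() throws java.io.IOException {
            try (java.io.BufferedReader r = new java.io.BufferedReader(
                    new java.io.FileReader("/proc/loadavg"))) {
                return Double.parseDouble(r.readLine().split("\\s+")[0]);
            }
        }

        private static void shedWork() { /* placeholder corrective action */ }
    }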

Page 14: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Monitoring… why?
• Performance Evaluation:
  – To evaluate application/system performance at run time
  – Evaluation is performed off-line with data monitored on-line
• Testing:
  – To check the correctness of new applications running in a farm by:
    – detecting erroneous or incorrect operations
    – obtaining activity reports of certain functions of the farm
    – obtaining a complete history of the farm in a given period of time

Page 15: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Monitoring Types

The monitoring pipeline has four stages, each with its own design choices:
• Generation: instrumentation, collection, trace generation. Choices: pull/push, distributed/centralized, time/event-driven, collection format
• Processing: trace merging, database updating, correlation, filtering. Choices: online/offline, on demand/automatic, storage format
• Dissemination: choices: dissemination format, access type, access control, on demand/automatic
• Presentation: to users, managers, control systems. Choice: presentation format

How many monitoring tools are available?
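To make the stages above concrete, here is a minimal Java sketch of the generation and dissemination stages: a per-node agent pushes one time-driven metric sample to a central collector over UDP. The collector host, port, and sample format are illustrative assumptions.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    // Push-style, time-driven collection: each node periodically sends a
    // small metric sample to a central collector (hypothetical host/port).
    public class PushAgent {
        public static void main(String[] args) throws Exception {
            InetAddress collector = InetAddress.getByName("monitor.farm.example");
            try (DatagramSocket socket = new DatagramSocket()) {
                while (true) {
                    String sample = System.currentTimeMillis() + " load=" + load();
                    byte[] buf = sample.getBytes("US-ASCII");
                    socket.send(new DatagramPacket(buf, buf.length, collector, 9999));
                    Thread.sleep(5_000); // time-driven, not event-driven
                }
            }
        }

        // Placeholder metric; a real agent would read /proc or instrument code.
        private static double load() { return Math.random(); }
    }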

Page 16: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Monitoring Tools
Cheops, Ganymede, MeasureNet, MTR, Network Health, NextPoint, ResponseNetworks, Maple, SAS, NetLogger, …

No integrated tools for services, applications, devices, and network monitoring.

http://www.slac.stanford.edu/~cottrell/tcom/nmtf.html

Page 17: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Monitoring… Strategies?

• Define common strategies:
  – What is to be monitored?
  – Collection strategies
  – Processing alternatives
  – Display techniques
• Obtain modular implementations
  – Good example: ATLAS Back-End software
• IT Division has started a monitoring project
  – Integrated monitoring
  – Service oriented

Page 18: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Fast Communication

We have fast processors and fast networks; the time is spent crossing between them.

[Diagram: a "killer switch" connecting "killer platforms"; on each node, communication software sits above network interface hardware, with time scales ranging from ns (processor) through µs (interface) to ms (software)]

Page 19: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Fast Communication

• Remove the kernel from the critical path
• Offer user applications fully protected, virtual, direct (zero-copy sends), user-level access to the network interface
• This idea has been specified in VIA (Virtual Interface Architecture)

[Diagram: VIA stack: the application sits on a high-level communication library (MPI, ShMem put/get, PVM); send/recv/RDMA go directly to the VI network adapter, while buffer management/synchronization go through the VI kernel agent]

Page 20: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Fast Communication
• VIA's predecessors:
  – Active Messages (Berkeley NOW project, Fast Sockets)
  – Fast Messages (UCSD; MPI, Shmem put/get, Global Arrays)
• Applications using sockets, MPI, ShMem, … can benefit from these fast communication layers
• Several farms (HPVM (FM), NERSC PC cluster (M-VIA), …) already benefit from this technology

Page 21: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Fast Communication (Fast Messages)

[Chart: Fast Messages latency (µs) and bandwidth (MB/s) vs. message size (4 bytes to 64 KB), with the FM packet size marked: peak bandwidth 77.1 MB/s, minimum latency 11.1 µs]

Page 22: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Fast Communication

[Chart: one-way latency (µs, lower is better) and bandwidth (MB/s, higher is better) compared across HPVM, Power Challenge, SP-2, T3E, Origin 2K, and Beowulf]

Page 23: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Single System Image

• A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
• Strong SSI makes a farm appear like a single machine to the user, to applications, and to the network.
• The SSI level is a good measure of how tightly the nodes in a farm are coupled.
• Every farm has a certain degree of SSI (a farm with no SSI at all is not a farm).

Page 24: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Benefits of Single System Image

• Transparent usage of system resources
• Transparent process migration and load balancing across nodes
• Improved reliability and higher availability
• Improved system response time and performance
• Simplified system management
• Reduced risk of operator errors
• Users need not be aware of the underlying system architecture to use these machines effectively

(from Jain)

Page 25: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SSI Services
• Single entry point
• Single file hierarchy: xFS, AFS, ...
• Single control point: management from a single GUI
• Single memory space
• Single job management: GLUnix, Codine, LSF
• Single user interface: like a workstation/PC windowing environment
• Single I/O space (SIO):
  – Any node can access any peripheral or disk device without knowing its physical location

Page 26: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SSI Services
• Single process space (SPS)
  – Any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc., as if they were on a single node
• Every SSI has a boundary
• Single system support can exist at different levels:
  – OS level: MOSIX
  – Middleware: Codine, PVM
  – Application level: monitoring applications, Back-End software

Page 27: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Scheduling Software
• Goal: enable the scheduling of system activities and the execution of applications while transparently offering high-availability services
• Usually works completely outside the kernel, on top of the machines' existing operating systems
• Advantages:
  – Load balancing
  – Use of spare CPU cycles
  – Fault tolerance
  – In practice, increased and reliable throughput of user applications

Page 28: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SS: Generalities
The workings of a typical SS (see the sketch below):
• Create a job description file: job name, resources, desired platform, …
• The job description file is sent by the client software to a master scheduler
• The master scheduler has an overall view: the queues that have been configured plus the computational load of the nodes in the farm
• The master ensures that the resources in use are load balanced and that jobs complete successfully
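A toy Java sketch of the allocation step just described: the master keeps its overall view as a load table and dispatches each job to the least-loaded node. Real schedulers also weigh queue configuration, memory, disk, and policy; all names here are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    // Toy master scheduler: nodes report load; jobs go to the least-loaded node.
    public class MasterScheduler {
        private final Map<String, Double> nodeLoad = new HashMap<>();

        void reportLoad(String node, double load) { nodeLoad.put(node, load); }

        // Allocation step: pick the node with the smallest reported load.
        String pickNode() {
            String best = null;
            for (Map.Entry<String, Double> e : nodeLoad.entrySet()) {
                if (best == null || e.getValue() < nodeLoad.get(best)) {
                    best = e.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) {
            MasterScheduler master = new MasterScheduler();
            master.reportLoad("node01", 0.7);
            master.reportLoad("node02", 0.2);
            System.out.println("dispatch job to " + master.pickNode()); // node02
        }
    }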

Page 29: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SS: Main Features
• Application support:
  – Are batch, interactive, and parallel jobs supported?
  – Multiple configurable queues?
• Job scheduling and allocation:
  – Allocation policy: taking into account system load, CPU type, computational load, memory, disk space, …
  – Checkpointing: save state at regular intervals during job execution; a job can be restarted from the last checkpoint
  – Migration: move a job to another node in the farm to achieve dynamic load balancing, or to perform a sequence of activities on different specialized nodes
  – Monitoring / suspension / resumption

Page 30: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SS: Main Features
• Dynamics of resources:
  – Can resources, queues, and nodes be reconfigured dynamically?
  – Are there single points of failure?
  – Fault tolerance: re-run a job if the system crashes, and check for needed resources

Page 31: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SS: Packages

Commercial:
• Codine (Genias)
• LoadBalancer (Tivoli)
• LSF (Platform)
• Network Queueing Environment (SGI)
• TaskBroker (HP)

Research:
• CCS
• Condor
• Dynamic Network Queueing System (DNQS)
• Distributed Queueing System (DQS)
• Generic NQS
• Portable Batch System (PBS)
• Prospero Resource Manager
• MOSIX
• Far
• Dynamite

[Diagram: genealogy of batch systems relating NQS, NQE, PBS, Condor, DNQS, DQS, Codine, Utopia, and LSF]

Page 32: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


SS: Some Examples — CODINE & LSF
• To be used in large heterogeneous networked environments
• Dynamic and static load balancing
• Batch, interactive, and parallel jobs
• Checkpointing & migration
• Offer an API for new distributed applications
• No single point of failure
• Job accounting data and analysis tools
• Modification of resource reservations for started jobs and specification of releasable shared resources (LSF)
• MPI (LSF) vs. MPI, PVM, Express, Linda (Codine)
• Reporting tools (LSF)
• C API (LSF), ?? (Codine)
• No checkpointing of forked or signaled jobs

Page 33: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Failure Management

• Traditionally associated with scheduling software and oriented toward long-running, CPU-intensive processes
• If a CPU-intensive process crashes --> wasted CPU
• Solution:
  – Save the state of the process periodically
  – In case of failure, the process is restarted from the last checkpoint
• Strategies:
  – Store checkpoints in files using a distributed file system (slows down computation; NFS performs poorly; AFS caching of checkpoints may flush other useful data)
  – Checkpoint servers (a dedicated node with disk storage and management functions for checkpointing)

Page 34: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Failure Management
• Levels:
  – Transparent checkpointing: a checkpointing library is linked against the executable binary and checkpoints the process transparently (Condor, libckpt, Hector)
  – User-directed checkpointing: directives included in the application's code perform specific checkpoints of particular memory segments (sketched below)
• Future challenges:
  – Decoupling failure management and scheduling
  – Defining strategies for system failure recovery (at kernel level?)
  – Defining strategies for task failure recovery
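A minimal Java sketch of the user-directed style: the application itself decides when to checkpoint, serializing its own state to a file and restarting from the last checkpoint after a crash. The file name, event loop, and checkpoint interval are illustrative; transparent libraries such as Condor's achieve the equivalent without source changes.

    import java.io.*;

    // User-directed checkpointing via serialization: the job saves its own
    // progress periodically and resumes from the last saved state.
    public class CheckpointedJob implements Serializable {
        private static final long serialVersionUID = 1L;
        private long nextEvent = 0; // job progress

        void run(File ckpt) throws IOException {
            for (; nextEvent < 1_000_000; nextEvent++) {
                process(nextEvent);
                if (nextEvent % 10_000 == 0) save(ckpt); // checkpoint directive
            }
        }

        void save(File ckpt) throws IOException {
            try (ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(ckpt))) {
                out.writeObject(this);
            }
        }

        static CheckpointedJob restore(File ckpt) throws Exception {
            if (!ckpt.exists()) return new CheckpointedJob(); // fresh start
            try (ObjectInputStream in =
                    new ObjectInputStream(new FileInputStream(ckpt))) {
                return (CheckpointedJob) in.readObject();
            }
        }

        void process(long event) { /* e.g. filter or reconstruct one event */ }

        public static void main(String[] args) throws Exception {
            File ckpt = new File("job.ckpt");
            restore(ckpt).run(ckpt); // resumes from the last checkpoint if present
        }
    }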

Page 35: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Examples: MOSIX Farms

• MOSIX = Multicomputer OS for UNIX
• An OS module (layer) that provides applications with the illusion of working on a single system
• Remote operations are performed like local operations
• Strong SSI at kernel level

Page 36: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Example: MOSIX Farms

• Supervised by distributed algorithms that respond on-line to global resource availability, transparently
• Load balancing: migrate processes from over-loaded to under-loaded nodes
• Memory ushering: migrate processes from a node that has exhausted its memory, to prevent paging/swapping
• Preemptive process migration that can migrate any process, anywhere, anytime

Page 37: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Example: MOSIX Farms
• A scalable cluster configuration:
  – 50 Pentium-II 300 MHz
  – 38 Pentium-Pro 200 MHz (some are SMPs)
  – 16 Pentium-II 400 MHz (some are SMPs)
• Over 12 GB cluster-wide RAM
• Connected by a 2.56 Gb/s Myrinet LAN
• Runs Red Hat 6.0, based on kernel 2.2.7
• Download MOSIX: http://www.mosix.cs.huji.ac.il/

Page 38: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Example: HPVM Farms

• GOAL: obtain supercomputing performance from a pile of PCs
• Scalability: 256 processors demonstrated
• Networking over Myrinet interconnect
• OS: Linux and NT (going NT)

[Diagram: HPVM software stack on Illinois Fast Messages (FM): MPI, SHMEM, and Global Arrays available now; CORBA, Winsock 2, and HPF under development]

Page 39: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Example: HPVM Farms
• SSI at middleware level: MPI and LSF
• Fast communication: Fast Messages
• Monitoring: none yet
• Manageability (still poor):
  – HPVM front-end (Java applet + LSF features)
  – Symera (under development at NCSA):
    – DCOM-based management tool (NT only)
    – Add/remove nodes from the cluster
    – Logical cluster definition
    – Distributed process control + monitoring
• Others: NERSC PC Cluster and Beowulf

Page 40: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Example: Disk Server Farms
Purpose: to transfer data sets between disk and applications.
IT/PDP:
• RFIO package (optimized for large sequential data transfers)
• Each disk server runs one master RFIO daemon in the background, and new requests lead to the spawning of further RFIO daemons
• Memory space is used for caching
• SSI: weak
  – Load balancing of RFIO daemons across the different nodes of the farm
  – Single memory space + single I/O space could be useful in a disk server farm with heterogeneous machines

Page 41: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Example: Disk Server Farms

• Monitoring: RFIO daemon status, load of farm nodes, memory usage, caching hit rates, ...
• Fast messaging: RFIO techniques using TCP sockets
• Manageability: storage, daemon, and caching management
• Linux-based disk server performance is now comparable to UNIX disk servers (benchmarking study by Bernd Panzer, IT/PDP)!

DPSS (Distributed Parallel Storage Server):
• A collection of disk servers which operate in parallel over a wide-area network to provide logical block-level access to large data sets
• SSI:
  – Applications are not aware of declustered data
  – Load balancing if data is replicated
• Monitoring: Java agents for monitoring and management
• Fast messaging: dynamic TCP buffer size adjustment (see the sketch below)
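A small Java sketch of the buffer-adjustment idea: request larger TCP socket buffers before connecting, so the window can cover the bandwidth-delay product of a wide-area link. The host, port, and 1 MB size are illustrative; the OS may grant less than requested.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // WAN-oriented TCP tuning: enlarge socket buffers before connecting.
    public class TunedClient {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket()) {
                s.setReceiveBufferSize(1 << 20); // request ~1 MB receive buffer
                s.setSendBufferSize(1 << 20);    // request ~1 MB send buffer
                s.connect(new InetSocketAddress("dpss.example", 7000));
                System.out.println("granted rcvbuf = " + s.getReceiveBufferSize());
            }
        }
    }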

Page 42: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

JMX: A Management Technology

JMX: Java Management Extensions (basics):
• Defines a management architecture, APIs, and management services, all under a single specification
• Resources can be made manageable regardless of how their manager is implemented (SNMP, CORBA, Java manager)
• Based on dynamic agents
• Platform and protocol independent
• JDMK 3.2

[Diagram: the three JMX levels: the managed resource at the instrumentation level (JMX resource), the agent level (JMX agent), and the management application at the manager level (JMX manager)]
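As an illustration of the instrumentation level, here is a minimal standard MBean in Java: the managed resource exposes an attribute and an operation through the JMX API, without knowing which manager (SNMP, CORBA, Java) will ultimately talk to it. The "farm:..." object name and the FilterNode resource are invented for the example.

    // --- FilterNodeMBean.java: management interface (JMX naming convention) ---
    public interface FilterNodeMBean {
        double getLoad(); // read-only attribute "Load"
        void drain();     // operation: stop accepting new events
    }

    // --- FilterNode.java: the managed resource ---
    public class FilterNode implements FilterNodeMBean {
        private volatile boolean draining = false;
        public double getLoad() { return 0.42; } // placeholder metric
        public void drain() { draining = true; }
    }

    // --- AgentMain.java: register the resource with an MBean server ---
    import javax.management.MBeanServer;
    import javax.management.MBeanServerFactory;
    import javax.management.ObjectName;

    public class AgentMain {
        public static void main(String[] args) throws Exception {
            MBeanServer server = MBeanServerFactory.createMBeanServer();
            server.registerMBean(new FilterNode(),
                    new ObjectName("farm:type=FilterNode,name=node01"));
            // A protocol adaptor (e.g. from JDMK) would now expose this MBean
            // to HTML, SNMP, or RMI managers.
            System.out.println("registered: " + server.queryNames(null, null));
        }
    }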

Page 43: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


JMX: Components

Page 44: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

JMX: Applications

• Implement distributed SNMP monitoring infrastructures
• Manage heterogeneous farms (NT + Linux)
• Environments where management "intelligence" or requirements change over time
• Environments where management clients may be implemented using different technologies

Page 45: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP

Summary

• Farm scale and intended uses will grow in the coming years
• We presented a set of factors to compare different farm computing approaches
• Developments from non-HEP domains can be used in HEP farms:
  – Fast networking
  – Monitoring
  – System management
• However, application and task management is very dependent on the particular domain

Page 46: CERN - European Laboratory for Particle Physics LHC - 28 September 1999 Javier Jaen Martinez CERN IT/PDP


Summary
The EFF community should:
• Share common experiences (specific subfields in future meetings)
• Define common monitoring requirements and mechanisms, SSI requirements, and management procedures (filtering, reconstruction, compression, …)
• Follow developments in the management of high-performance computing farms (the same challenge of managing thousands of processes/threads)
• Obtain, if possible, modular implementations of these requirements that constitute an EFF management approach