Planning the LCG Fabric at CERN

openlab TCO Workshop

November 11th 2003

Tony.Cass@CERN.ch


Fabric Area Overview

– Infrastructure: Electricity, Cooling, Space

– Network

– Batch system (LSF, CPU server)

– Storage system (AFS, CASTOR, disk server)

– Purchase, Hardware selection, Resource planning

– Installation, Configuration + monitoring, Fault tolerance

– Prototype, Testbeds

– Benchmarks, R&D, Architecture

– Automation, Operation, Control

– GRID services !?

Coupling of components through hardware and software


Agenda

Building Fabric

Batch Subsystem

Storage subsystem

Installation and Configuration

Monitoring and control

Hardware Purchase


Building Fabric: I

B513 was constructed in the early 1970s and the machine room infrastructure has evolved slowly over time.

– Like the eye, the result is often not ideal…


Current Machine Room Layout

Problem: Normabarres run one way, services run the other…

[Figure: current machine room floor plan, with four areas labelled “Services”]


Future Machine Room Layout

[Figure: future machine room floor plan, with the labels below]

– 18m double rows of racks: 12 shelf units or 36 19” racks

– 9m double rows of racks for critical servers

– Aligned normabarres

– Figure labels: 528 box PCs, 105kW; 1440 1U PCs, 288kW; 324 disk servers, 120kW(?)
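The loads quoted for the future layout (528 box PCs at 105kW, 1440 1U PCs at 288kW, 324 disk servers at 120kW) imply an average load per system. The sketch below does only that division; it assumes each quoted figure is the total load of the quoted number of systems, which the slide does not state explicitly, and the disk server figure is marked “(?)” in the original. The function and dictionary names are illustrative.

```python
# Figure labels from the slide: (number of systems, total load in kW).
# Assumption: each kW value is the total load for that many systems.
LAYOUT_LOADS = {
    "box PCs": (528, 105.0),
    "1U PCs": (1440, 288.0),
    "disk servers": (324, 120.0),  # marked "(?)" on the slide
}


def watts_per_system(count: int, total_kw: float) -> float:
    """Average electrical load per system, in watts."""
    return total_kw * 1000.0 / count


def per_system_loads(loads=LAYOUT_LOADS) -> dict:
    """Map each system type to its average load in watts."""
    return {name: watts_per_system(n, kw) for name, (n, kw) in loads.items()}


if __name__ == "__main__":
    for name, watts in per_system_loads().items():
        print(f"{name}: {watts:.0f} W per system")
```

On these numbers a PC draws roughly 200W in either packaging, and a disk server roughly 370W.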




With the preparations for LHC we have the opportunity to remodel the infrastructure.

– Arrange services in clear groupings associated with power and network connections.

» Clarity for general operations plus ease of service restart should there be any power failure.

– Isolate critical infrastructure such as networking, mail and home directory services.

– Clear monitoring of planned power distribution system.

Just “good housekeeping”, but we expect to reap the benefits during LHC operation.


Building Fabric: II

Beyond good housekeeping, though, there are building fabric issues that are intimately related to recurrent equipment purchase.

– Raw power: We can support a maximum equipment load of 2.5MW. Does the recurrent additional cost of blade systems avoid investment in additional power capacity?

– Power efficiency: Early PCs had power factors of ~0.7 and generated high levels of 3rd harmonics. Fortunately, we now see power factors of 0.95 or better, avoiding the need to install filters in the PDUs. Will this continue?

– Many sites need to install 1U or 2U rack mounted systems for space reasons. This is not a concern for us at present but may become so eventually.

» There is a link here to the previous point: the small power supplies for 1U systems often have poor power factors.
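To see why the power factor matters for the power distribution, recall that the apparent power the supply must carry is the real power divided by the power factor. The sketch below applies that relation to the numbers on this slide. It assumes the 2.5MW figure is real power and ignores harmonic currents, so it is an illustration rather than an electrical design calculation.

```python
def apparent_power_mva(real_power_mw: float, power_factor: float) -> float:
    """Apparent power (MVA) drawn for a given real power (MW) and power factor."""
    if not 0.0 < power_factor <= 1.0:
        raise ValueError("power factor must be in (0, 1]")
    return real_power_mw / power_factor


if __name__ == "__main__":
    # Assumption: the 2.5MW equipment load quoted on the slide is real power.
    for pf in (0.7, 0.95):
        print(f"PF {pf}: {apparent_power_mva(2.5, pf):.2f} MVA")
```

With early-PC power factors of ~0.7 the same 2.5MW of equipment would need about 3.57MVA of supply capacity, against about 2.63MVA at 0.95.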





Fabric Architecture: Level of complexity

[Figure: components are coupled physically (hardware) and logically (software) into successively larger units]

– CPU, Disk. Hardware coupling: motherboard, backplane, bus, integrating devices (memory, power supply, controller, ..). Software coupling: operating system, driver.

– PC; storage tray, NAS server, SAN element. Hardware coupling: network (Ethernet, fibre channel, Myrinet, ….), hubs, switches, routers. Software coupling: batch system, load balancing, control software, hierarchical storage systems.

– Cluster. Hardware coupling: wide area network. Software coupling: Grid middleware.

– World wide cluster.




Batch Subsystem

Looking purely at batch system issues, TCO is reduced as the efficiency of node usage increases. What are the dependencies?

– The load characteristics

» Not much we in IT can do here!

– The batch scheduler

» LSF is pretty good here, fortunately.

– Chip technology

» Take hyperthreading, for example. Tests have shown that, for HEP codes at least, hyperthreading wastes 20% of the system performance running two tasks on a dual processor machine. There are no clear benefits to running with hyperthreading enabled when running three tasks. What is the outlook here?

– Processors/box

» At present, a single 100baseT NIC would support the I/O load of a quad processor CPU server. Quad processor boxes would halve the cost of networking infrastructure, but they come at a hefty price premium (XEON MP vs XEON DP, heftier chassis, …). What is the outlook here?

And total system memory becomes an issue.

– The operating system

» Linux is getting better, but things such as processor affinity would be nice.

Relationship to hyperthreading…

– Others?
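Two of the dependencies above reduce to simple arithmetic, sketched below. The first function shows how the number of nodes to buy grows as usage efficiency falls; treating the 20% hyperthreading loss as an efficiency of 0.8 is an assumption made for illustration, and it implies about 25% more nodes for the same delivered work. The second function counts switch ports under the slide's premise of one NIC per box. Function names and the workload units are hypothetical.

```python
import math


def nodes_needed(required_work: float, work_per_node: float, efficiency: float) -> int:
    """Nodes to buy so that the delivered work meets the requirement."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return math.ceil(required_work / (work_per_node * efficiency))


def network_ports_needed(total_processors: int, processors_per_box: int) -> int:
    """Switch ports needed, assuming one NIC (one port) per box."""
    return math.ceil(total_processors / processors_per_box)


if __name__ == "__main__":
    # Hypothetical requirement: 1000 units of work, 1 unit per node.
    print(nodes_needed(1000, 1.0, 1.0), "nodes at full efficiency")
    print(nodes_needed(1000, 1.0, 0.8), "nodes at 80% efficiency")
    # Hypothetical farm of 4000 processors.
    print(network_ports_needed(4000, 2), "ports with dual processor boxes")
    print(network_ports_needed(4000, 4), "ports with quad processor boxes")
```

The port count halves when moving from dual to quad processor boxes; whether that saving outweighs the price premium of the boxes is the open question on the slide.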





Storage subsystem

Simple building blocks:

– “Desktop+” node == CPU server

– CPU server + larger case + 6*2 disks == Disk server

– CPU server + Fiber Channel Interface + tape drive == Tape server




Storage subsystem: Disk Storage

TCO: Maximise available online capacity within fixed budget (material & personnel).

– IDE based disk servers are much cheaper than high end SAN servers. But are we spending too much time on maintenance?

» Yes, at present, but we need to analyse carefully the reasons for the current load.

Complexities of Linux drivers seem under control, but numbers have exploded. And are some problems related to a particular batch of hardware?

– Where is the optimum? Switching to fibre channel disks would reduce capacity by a factor of ~5.

» Naively, buy, say, 10% extra systems to cover failures. Sadly, this is not as simple as for CPU servers; active data on the servers must be reloaded elsewhere.

» Always have duplicate data? => purchase 2x required space. Still cheaper than SAN? How does this relate to …
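The options above can be compared on usable capacity for the same hardware budget. The sketch below takes the raw IDE capacity a budget buys and applies the three factors mentioned on the slide: 10% extra systems as spares, full duplication (2x), and the ~5x capacity reduction for fibre channel. It covers material cost only and leaves out the personnel cost that the slide identifies as the real concern; the strategy names are hypothetical.

```python
# Divisors applied to the raw IDE capacity a fixed budget would buy.
# Assumption: material cost only; personnel effort is not modelled.
CAPACITY_DIVISORS = {
    "ide_with_spares": 1.10,  # buy ~10% extra systems to cover failures
    "ide_duplicated": 2.0,    # always hold duplicate data
    "fibre_channel": 5.0,     # ~5x less capacity for the same money
}


def usable_capacity_tb(raw_ide_capacity_tb: float, strategy: str) -> float:
    """Usable capacity (TB) under a strategy, for a fixed hardware budget."""
    return raw_ide_capacity_tb / CAPACITY_DIVISORS[strategy]


if __name__ == "__main__":
    # Hypothetical budget that buys 100 TB of raw IDE capacity.
    for strategy in CAPACITY_DIVISORS:
        print(f"{strategy}: {usable_capacity_tb(100.0, strategy):.1f} TB")
```

On these factors alone, fully duplicated IDE storage still offers 2.5 times the usable capacity of fibre channel disks for the same hardware spend.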


Storage System: Tapes

The first TCO question is “Do we need them?” Disk storage costs are dropping…


Disk Price/Performance Evolution

[Figure: price in SFr per GByte (log scale, 1 to 100) against time since Jan 2000 (ticks 1 to 38), with curves for 40 GB, 60 GB, 80 GB, 120 GB, 160 GB, 180 GB and 200 GB disks, for a disk server and for a non-mirrored disk server. Annotations: “factor 6 in 3 years” and “factor 2.5 difference”.]
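The “factor 6 in 3 years” annotation can be turned into an annual rate and a halving time. The sketch below assumes a steady exponential decline over the period, which the plot suggests but the slide does not state; the function names are illustrative.

```python
import math


def annual_price_factor(total_factor: float, years: float) -> float:
    """Yearly price reduction factor, assuming a steady exponential decline."""
    return total_factor ** (1.0 / years)


def halving_time_months(total_factor: float, months: float) -> float:
    """Months for the price to halve, under the same assumption."""
    return months * math.log(2.0) / math.log(total_factor)


if __name__ == "__main__":
    print(f"price falls by a factor {annual_price_factor(6.0, 3.0):.2f} per year")
    print(f"price halves every {halving_time_months(6.0, 36.0):.1f} months")
```

A factor 6 in 3 years corresponds to a factor of about 1.8 per year, or a halving roughly every 14 months.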


Storage System: Tapes

The first TCO question is “Do we need them?” Disk storage costs are dropping… But

– Disk servers need system administrators, idle tapes sitting in a tape silo don’t.

– With disk only solution, we need storage for at least twice the total data volume to ensure no data loss.

– Server lifetime of 3-5 years; data must be copied periodically.

» Also an issue for tape, but the lifetime of a disk server is probably still less than the lifetime of a given tape media format.

Assumption today is that tape storage will be required.


Storage System: Tapes

Tape robotics is easy.

– Bigger means better cost/slot.

Tape drives: High end vs LTO

– TCO issue: LTO drives are cheaper than high end IBM and STK drives, but are they reliable enough for our use?

» c.f. the IDE disk server area.

Real problem, though, is tape media.

– The vast portion of the data is accessed rarely but must be stored for a long period. Strong pressure to select a solution that minimises an overall cost dominated by tape media.


Storage System: Managed Storage

Should CERN build or buy software systems? How to measure the value of a software system?

– Initial cost:

» Build: Staff time to create required functionality.

» Buy: Initial purchase cost of system as delivered plus staff time to install and configure for CERN.

– Ongoing cost:

» Build: Staff time to maintain system and add extra functionality.

» Buy: License/maintenance cost plus staff time to track releases. Extra functionality that we consider useful may or may not arrive.

Choice:

– Batch system: Buy LSF.

– Managed storage system: Build CASTOR.

Use this model as we move on to consider system management software.
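The build or buy model above amounts to an initial cost plus an ongoing cost per year. A minimal sketch follows, with all costs expressed in one unit (say, staff-months or their monetary equivalent). The class name, field names and every number in the usage example are hypothetical, and the model does not capture the value of extra functionality that may or may not arrive.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One side of a build or buy decision, with costs in a common unit."""
    name: str
    initial_cost: float  # build: staff time to create; buy: purchase + install/configure
    annual_cost: float   # build: maintenance staff time; buy: licence + tracking releases

    def total_cost(self, years: int) -> float:
        """Cost of ownership over the given number of years."""
        return self.initial_cost + self.annual_cost * years


def cheaper(a: Option, b: Option, years: int) -> Option:
    """Return the option with the lower total cost over the period."""
    return a if a.total_cost(years) <= b.total_cost(years) else b


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show a crossover.
    build = Option("build", initial_cost=300.0, annual_cost=50.0)
    buy = Option("buy", initial_cost=100.0, annual_cost=120.0)
    for years in (1, 4):
        print(years, "years:", cheaper(build, buy, years).name)
```

With these invented numbers buying wins over one year and building wins over four, which is why the planning horizon matters as much as the initial price.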





Installation and Configuration

Reproducibility and guaranteed homogeneity of system configuration is a clear method to minimise ongoing system management costs. A management framework is required that can cope with the numbers of systems we expect.

We faced the same issues as we moved from mainframes to RISC systems. Vendor solutions offered then were linked to hardware, so we developed our own solution.

Is a vendor framework acceptable if we have a homogeneous park of Linux systems?

– Being honest, why have we built our own again?


Installation and Configuration

Installation and configuration is only part of the overall computer centre management:

[Figure: ELFms architecture, comprising Node Configuration System, Monitoring System, Installation System and Fault Mgmt System]

Systems provided by vendors cannot (yet) be integrated into such an overall framework.

And there is still a tendency to differentiate products on the basis of management software, not raw hardware performance.

– This is a problem for us as we cannot ensure we always buy brand X rack mounted servers or blade systems.

– In short, life is not so different from the RISC system era.





Monitoring and Control

Assuming that there are clear interfaces, why not integrate a commercial monitoring package into our overall architecture?

Two reasons:

– No commercial package meets (met) our requirements in terms of, say, long term data storage and access for analysis.

» This could be considered self serving: we produce requirements that justify a build rather than buy decision.

– Experience has shown, repeatedly, that monitoring frameworks require effort to install and maintain, but don’t deliver the sensors we require.

» Vendors haven’t heard of LSF, let alone AFS.

» A good reason!


Hardware Management System

A specific example of the integration problem. Workflows must interface to local procedures for, e.g., LAN address allocation. Can we integrate a vendor solution? Do complete solutions exist?

[Figure: workflow diagram for machine installation. Steps and responsible groups, as labelled in the figure:]

– Request New Machine Install [FIO/IS]

– Decide New Identity [FIO/OPT]

– Install [FIO/IS]

– Request Physical Machine Install [FIO/OPT]

– Physically Install Machine [DCS]

– Connect to Network [CS]

– Check and Update Information [FIO/OPT]

– Request Network Connection [FIO/OPT]

[Other figure labels: Remedy/HMS, Remedy/PRMS, Remedy/DCS, FIO/OPT, FIO/IS, DCS, CS; Import Node Map; Retire Node; Move Machine; Raise Ticket; Perform db updates & checks; Install S/W & put in prod'n; Req. n/w conn & dns entry; Update CS DB & DNS; Confirmation email; Observe; Change Status; Close Ticket]


Console Management

Done poorly now:


We will do better:

TCO issue: Do the benefits of a single console management system outweigh costs of developing our own? How do we integrate vendor supplied racks of preinstalled systems?

[Figure: planned console management architecture, with the labels below]

– User applications (Userapp) on xxx, pcitfionnn, lxplusnnn

– CDB config service: machine to port @ head node mapping; user to machine authorisations

– Console server 1 … Console server 75, each with a server process (Serverproc), conf and log

– RS/232 connections from each console server to its machines: Machine 1.1 … Machine 1.44, …, Machine 75.1 … Machine 75.44

– Console log repository
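The two kinds of information held by the config service in the figure (machine to port @ head node mapping, and user to machine authorisations) suggest a simple lookup before a console connection is opened. The sketch below is one way this could look; it is not the CERN implementation. The machine and console server names, the in-memory dictionaries standing in for CDB, and both function names are hypothetical; only the 75 servers and 44 machines per server come from the figure labels.

```python
from typing import Dict, NamedTuple, Set


class ConsolePort(NamedTuple):
    console_server: str  # hypothetical head node name
    port: int            # serial port number on that head node


def build_port_map(n_servers: int = 75, machines_per_server: int = 44) -> Dict[str, ConsolePort]:
    """Stand-in for the machine to port @ head node mapping."""
    return {
        f"machine-{s}.{m}": ConsolePort(f"console-server-{s}", m)
        for s in range(1, n_servers + 1)
        for m in range(1, machines_per_server + 1)
    }


def resolve_console(
    user: str,
    machine: str,
    port_map: Dict[str, ConsolePort],
    authorisations: Dict[str, Set[str]],
) -> ConsolePort:
    """Check the user's authorisation, then return where the console lives."""
    if machine not in authorisations.get(user, set()):
        raise PermissionError(f"{user} is not authorised for {machine}")
    return port_map[machine]


if __name__ == "__main__":
    port_map = build_port_map()
    auth = {"operator": {"machine-75.44"}}  # hypothetical authorisation table
    print(len(port_map), "machines covered")
    print(resolve_console("operator", "machine-75.44", port_map, auth))
```

The layout in the figure covers 75 × 44 = 3300 machines, which gives a sense of the scale a single console management system has to handle.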





Hardware Purchase

The issue at hand: How do we work within our purchasing procedures to purchase equipment that minimises our total cost of ownership?

At present, we eliminate vast areas of the multi-dimensional space by assuming we will rely on ELFms for system management and CASTOR for data management. Simplified[!!!] view:

– CPU: White box vs 1U vs blades; install or ready packaged

– Disk: IDE vs SAN; level of vendor integration

HELP! Can we benefit from management software that comes with ready built racks of equipment in a multi-vendor environment?
