20090525 DESY Seminar-2.ppt

The infrastructure of Grid at KIT

Angela Poschlad

Steinbuch Centre for Computing

The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)


Outline

• What is a grid?
• Grid at KIT
• GridKa
• Resources
• Services
• Cluster layout
• Network
• Monitoring
• On-call-duty
• Preproduction system
• Service Challenges

Angela Poschlad | Steinbuch Centre for Computing | 25.05.2009

What is a grid

A grid is a global allocation of resources (storage and CPUs) at local computing centres with defined services, connected by an efficient network. Through the middleware, use of the grid is decoupled from the local batch system. This gives all users access without needing information about the individual site setups. The users of a grid are grouped into Virtual Organisations (VOs), which are communities sharing the same goals.

For example, the participants in the CMS experiment are organised in the VO cms.

The grid resources are shared within these VOs. A grid centre can support various VOs, and the concept of VO membership allows simple authorisation of users at the different sites.
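The VO-based authorisation described above can be illustrated with a minimal sketch. The site names, user names and the `is_authorised` helper are hypothetical; they only model the idea that a site checks VO membership instead of maintaining a list of individual users. Real middleware performs this check with certificates and VO membership services, which is not shown here.

```python
# Hypothetical data: the VOs each site supports and the VO each user belongs to.
SITE_SUPPORTED_VOS = {
    "site-a": {"cms", "atlas"},
    "site-b": {"cms"},
}

USER_VO = {
    "user1": "cms",
    "user2": "atlas",
    "user3": "auger",
}


def is_authorised(user, site):
    """Return True if the user's VO is supported by the given site."""
    vo = USER_VO.get(user)
    if vo is None:
        return False  # unknown users are never authorised
    return vo in SITE_SUPPORTED_VOS.get(site, set())
```

Adding a new user to a VO then grants access at every site supporting that VO, without any per-site change.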


Grid structure

Resources
• Computing
• Storage

Services
• Service discovery mechanism
• Resource Broker
• Portal machines to resources
  • Storage Elements (SE)
  • Computing Elements (CE)
• Catalogue Service for data


WLCG structure

• The LHC Computing Grid (LCG) comprises grid centres at different levels of importance
• The Worldwide LHC Computing Grid (WLCG) is based on the gLite middleware


Grid @ KIT

Sites: Campus North (SCC North), Campus South (SCC South)

Installations:
• GridKa
• Campus Grid
• D-Grid reference installation
• WLCG Tier 3 maintained by EKP


Campus Grid

Project for virtualization of a heterogeneous computing environment
• Scalar processors
• Vector processors
• Different memory architectures (distributed, shared)

Utilization of the resources with grid technology (Globus)
• OpusIB (Opteron cluster with InfiniBand, Linux)
  • Open-MPI supported
• AIX server (Power4 and PowerPC)
• Circa 240 cores


DGI Reference Installation

Reference installation for D-Grid

• Support for grid installation
• Well-documented example installation of a grid site
• Description of the favored architecture and infrastructure
• Provision of powerful monitoring
• User management


WLCG Tier 3 – Uni Karlsruhe (EKP/SCC)

• Currently providing a DPM storage element
• A batch cluster is in preparation
  • Using a cluster shared between many institutes
  • The cluster is located at SCC South
  • Middleware services will be maintained at Campus North
• Worker Nodes can only be provided using virtualization techniques
  • Other institutes require different operating systems


GridKa - WLCG Tier 1 and more

• supports all 4 LHC experiments

• supports non-LHC experiments: CDF, D0, BaBar, Compass …

• supports several D-Grid and non-HEP VOs, e.g. Auger, Astrogrid, Medigrid, …

• located near Karlsruhe on the KIT north campus

• operated by the Steinbuch Centre for Computing


Resources at GridKa

Computing resources
• The computing resources can be used by more than 30 VOs
• Cluster of SL4 32-bit and SL5 64-bit Worker Nodes
• 8620 cores
• More than 12 TB memory

Storage
• Tape libraries
• Disk space


Services at GridKa

• Storage Element: dCache
• TopLevel BDII
• LFC
• FTS
• Computing Elements: lcg-CE, CREAM-CE, ARC-CE, Unicore, Globus
• VOBoxes
• MyProxy
• User Interface
• Resource Broker (WMS/LB)


Computing Elements I

The Computing Element is a portal to the local batch system
• PBSPro at GridKa

Various middleware flavors are supported at GridKa

• gLite
  • lcg-CE and CREAM-CE
  • Used by EGEE/WLCG and D-Grid
• ARC-CE
  • Installation currently ongoing
  • Requested by Atlas for evaluation
• Unicore
  • Used by D-Grid
• Globus Toolkit 4
  • Used by D-Grid


Computing Elements II

• Users are mapped to local accounts in gLite
  • To ensure the same mapping on all CEs, the central mapping directory is mounted via NFS (single point of failure)
• Special users have permission to install software in the VO-specific software area
  • A special queue is implemented to ensure installation jobs run with high priority on the cluster
• Problems with the file system
  • Hit the ext3 file system limit of max 32k links in each directory
  • With rising computing resources the number of jobs rises
  • Using xfs instead
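The ext3 limit mentioned above can be checked before it is reached. The following is a minimal sketch, assuming the limit was hit through subdirectories (for example one per job), which the slide does not state. The constant 32000 is the commonly cited ext3 per-inode link limit; the function names are illustrative only.

```python
import os

# ext3 allows at most 32000 hard links per inode (the "32k links" on the slide).
EXT3_MAX_LINKS = 32000


def count_subdirectories(path):
    """Count the direct subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def remaining_subdirectories(path, limit=EXT3_MAX_LINKS):
    """Return how many more subdirectories fit before the link limit is hit.

    A directory's link count is 2 (its own "." entry and its name in the
    parent) plus one ".." link for every subdirectory it contains.
    """
    used_links = 2 + count_subdirectories(path)
    return max(limit - used_links, 0)
```

A monitoring sensor could call `remaining_subdirectories` on the affected directory and raise a warning once the headroom drops below a chosen threshold.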


Computing Elements III

Jobs last week at GridKa
• Most computing groups use grid techniques
• Some groups submit local jobs

Different number of jobs on the gLite CEs

CREAM is only used by alice


Storage Element

• Storage is provided by dCache systems
  • Petabytes of disk and tape storage
• Two instances in production
  • One supporting various VOs
  • One supporting Atlas
    • Recently split from the old instance
• Plans
  • Third instance planned supporting all D-Grid VOs
  • Virtual tape library
    • Reduces the risk of writing problems
    • Rising risk of reading problems for whole data sets


LFC and FTS with Oracle DB

• Round robin for the FTS web service
• Channels are defined on one node
  • In case of a dropout another machine has to take over the channel
• LHCb has a read-only LFC (replica of the CERN LHCb LFC)
  • Stream from CERN
• 3 database back-ends for all frontends

Diagram labels: LFC, LHCb LFC, FTS node 1, FTS node 2, FTS node 3, 2x FTS + 1x LFC, LFC DB, FTS DB (hot standby), read-only Oracle DB, CERN LHCb LFC DB


File transfer service I

• The file transfer service (FTS) provides dedicated transfer channels between two grid centres
  • It is also possible to define a connection to everywhere (“STAR”)
• Recently a new FTS instance was installed with version 2.1
  • SLC 3 -> SL 4
  • srmls usage configurable
    • Reduces load on the SRM/pnfs dCache storage
  • New File Transfer Monitoring (FTM) available


File transfer service II

• 9 VOs supported but only two really active (cms and atlas)
• The VOs have different data distribution models
  • CMS often uses “Site”-STAR channels
  • Atlas has a dedicated channel for each site
• More than 70 channels defined
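The difference between dedicated and STAR channels can be sketched as a lookup with fallback. This is an illustration only: the site names are hypothetical, and the order in which the fallbacks are tried is an assumption, not the documented FTS matching logic.

```python
def select_channel(source, destination, channels):
    """Pick a channel for a transfer from a set of (source, destination) pairs.

    Assumed preference order: dedicated channel first, then a channel with
    one side set to STAR ("everywhere"), then STAR-STAR.
    """
    candidates = [
        (source, destination),  # dedicated channel between two sites
        (source, "STAR"),       # this source to everywhere
        ("STAR", destination),  # everywhere to this destination
        ("STAR", "STAR"),       # catch-all channel
    ]
    for candidate in candidates:
        if candidate in channels:
            return candidate
    return None  # no channel defined for this pair
```

With one dedicated channel per site, as described for Atlas, the first candidate always matches; a “Site”-STAR setup, as described for CMS, relies on the second or third.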


Information system I

• All these services have to be published to a central service discovery system
  • In gLite, LDAP is used for this purpose
• A BDII service has to be run at each site
  • Berkeley Database Information Index
• The site BDII collects all information about local services


Information System II

• A central BDII queries a list of site BDIIs and provides this information to the users or to other services such as the Resource Broker
• In EGEE the list is automatically created from a central database where all sites have to register
• D-Grid sites are maintained by hand at the moment
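Since a BDII is an LDAP server, it can be queried with the standard `ldapsearch` client. The sketch below only builds the command line and does not run it. The host name is hypothetical, and port 2170, the base `o=grid` and the `GlueService` object class are the values conventionally used with gLite BDIIs; they are assumptions here, not taken from the slides.

```python
def bdii_query_command(host, base="o=grid", port=2170,
                       ldap_filter="(objectClass=GlueService)"):
    """Build an ldapsearch argument list for querying a BDII.

    -x    use simple (anonymous) authentication
    -LLL  print plain LDIF without comments
    -H    LDAP URI of the server
    -b    search base
    """
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{host}:{port}",
        "-b", base,
        ldap_filter,
    ]
```

The returned list could be passed to `subprocess.run` on a machine that has the OpenLDAP client tools installed.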


Information System III

• There can be various TopLevel BDIIs for a community
  • EGEE TopLevel BDIIs are located at CERN, at each Tier 1 (offered for regional sites) and at some other sites such as DESY
• At GridKa we have differently configured BDIIs
  • For WLCG production a round robin with 4 BDIIs is available
    • This service has to scale with the GridKa resources and the regional resources
  • To support D-Grid, the Resource Broker (WMS) has a round robin of two BDIIs collecting information from EGEE and D-Grid
  • For monitoring purposes a TopLevel BDII is installed providing all sites
    • Certified, uncertified, D-Grid, PPS sites
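The round-robin idea used for the BDIIs can be sketched in a few lines. The slides do not say how the round robin is implemented (for example via DNS); the class below is a generic in-process illustration with hypothetical host names.

```python
from itertools import cycle


class RoundRobin:
    """Hand out hosts from a fixed list in rotating order."""

    def __init__(self, hosts):
        hosts = list(hosts)
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(hosts)

    def next_host(self):
        """Return the next host in the rotation."""
        return next(self._hosts)
```

Each client request is sent to the next host in the list, so the load is spread evenly across the instances, and capacity grows by adding hosts.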


Resource Broker I

• Users should not have to locate resources themselves
• In gLite a Resource Broker is used to find proper resources matching the user's requirements
  • Wall time
  • Operating system
  • Free slots, small queue
  • Installed software
  • …
• The current Resource Broker is called WMS
  • Workload Management System
• It works together with an LB
  • Logging and Bookkeeping system

The WMS gets the information on the available resources by querying a TopLevel BDII
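The matchmaking step can be illustrated with a small sketch. The data model, the field names and the ranking rule (most free slots first, then shortest queue) are assumptions made for this example; the actual WMS matchmaking is more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    max_wall_time_hours: int
    operating_system: str
    free_slots: int
    queued_jobs: int
    software: set = field(default_factory=set)


@dataclass
class JobRequirements:
    wall_time_hours: int
    operating_system: str
    software: set = field(default_factory=set)


def match(job, resources):
    """Return the resources that satisfy the job, best candidate first."""
    suitable = [
        r for r in resources
        if r.max_wall_time_hours >= job.wall_time_hours
        and r.operating_system == job.operating_system
        and job.software <= r.software  # all required software is installed
    ]
    # Prefer many free slots, then a short queue.
    return sorted(suitable, key=lambda r: (-r.free_slots, r.queued_jobs))
```

In the real system the resource descriptions would come from the TopLevel BDII rather than from hand-written objects.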


Resource Broker II

• At GridKa we had up to 6500 jobs operated by one instance at one time and up to 2500 jobs running in parallel
• The service is not entirely stable and …


GridKa Computing Cluster I

The GridKa cluster provides 8640 cores

Local disk space, and physical and virtual memory, available on worker nodes:

-----------------------------------------------------------------------------------
Batch (CPU type)   | /tmp + /tmp/home      | Phys. Mem. | Virt. Mem. | # job slots
-----------------------------------------------------------------------------------
AMD Opteron 270    | 130                   | 4          | 12         | 4
Intel Xeon 5148    | 165                   | 16         | 24         | 4
Intel Xeon 5160    | 210                   | 6          | 14         | 4
-------------------|-----------------------|------------|------------|-------------
Intel Xeon E5345   | 175 + 230             | 16         | 48         | 8
Intel Xeon E5430   | 175 + 230             | 16         | 48         | 8
Intel Xeon L5420   | 170 + 225             | 16         | 48         | 8
-----------------------------------------------------------------------------------

The compute nodes are located in two rooms, like most other resources


GridKa Computing Cluster II

Waste heat is cooled completely with water

High computer density on a small area without a complex air conditioning system

The rooms have air conditioning, so some racks can be open

Each rack is rated for a heat load of 10 kW

New fileserver racks open automatically in case of cooling problems

Picture label: free for upcoming storage resources


GridKa Computing Cluster III

• Ventilators for air circulation
• Rack manager
• Switch
• Power supply for nodes
• Heat exchanger


GridKa Computing Cluster IV

• The Worker Nodes are organized rack by rack
  • Each rack has its own private subnet
    • 10.1.rack.host
  • Each rack has a rack manager for logging (syslog) and configuration
• Cluster installation is done centrally with the Rocks Toolkit
  • Also for the rack managers
• For central administration and configuration cfengine is used
  • Distributing certificates and configuration files to the rack manager
  • The rack manager distributes files to the Worker Nodes
• Software area mounted read-only on most Worker Nodes
  • On the first node of each rack the software area is mounted in read-write mode so software installation can be done
  • Dedicated sgm queue limited to these hosts
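The `10.1.rack.host` addressing scheme can be written down as a small helper. The function names are illustrative, and the /24 prefix length for a rack subnet is an assumption that follows from using one octet per rack; the slides do not state the netmask.

```python
import ipaddress


def worker_node_address(rack, host):
    """Return the address 10.1.<rack>.<host> for a Worker Node."""
    for label, value in (("rack", rack), ("host", host)):
        if not 0 <= value <= 255:
            raise ValueError(f"{label} must be between 0 and 255, got {value}")
    return ipaddress.IPv4Address(f"10.1.{rack}.{host}")


def rack_subnet(rack):
    """Return the private subnet of a rack, assuming a /24 per rack."""
    if not 0 <= rack <= 255:
        raise ValueError(f"rack must be between 0 and 255, got {rack}")
    return ipaddress.IPv4Network(f"10.1.{rack}.0/24")
```

With this scheme the rack of a node can be read directly from the third octet of its address, which simplifies locating faulty hardware.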


GridKa Computing Cluster V

• Currently we have two sub-clusters installed
  • One half with SL4 32-bit
  • Other half with SL5 64-bit
• This is necessary since not all VOs can handle SL5 yet
• We decided not to have dedicated queues but dedicated CEs for the different clusters

SL4 Cluster
• 2x lcg-CEs
• 1x CREAM-CE
• 1x Unicore
• 1x Globus

SL5 Cluster
• 2x lcg-CEs
• 1x CREAM-CE
• 1x Globus

CPU time measurement problems on some SL5 compute nodes


Accounting

WLCG:
• Publishing all grid jobs to a central R-GMA

D-Grid:
• Published via DGAS

April     | # jobs | wall time | cpu time
Atlas     | 466214 | 890288.50 | 722175.35
Alice     | 299744 | 969445.26 | 806379.47
cms       | 88030  | 433567.15 | 371238.03
Astrogrid | 36982  | 706115.22 | 412857.02
LHCb      | 33962  | 82700.83  | 76948.39
Belle     | 1076   | 16507.72  | 14802.11
Auger     | 1811   | 7421.19   | 7064.37
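One quantity that can be derived from the accounting table is the CPU efficiency per VO, defined here as CPU time divided by wall time. This metric is not computed on the slide; the sketch below only uses the April numbers shown above, and since both columns share the same (unstated) unit the ratio is unit-free.

```python
# April accounting numbers from the table: VO -> (jobs, wall time, cpu time).
ACCOUNTING_APRIL = {
    "Atlas": (466214, 890288.50, 722175.35),
    "Alice": (299744, 969445.26, 806379.47),
    "cms": (88030, 433567.15, 371238.03),
    "Astrogrid": (36982, 706115.22, 412857.02),
    "LHCb": (33962, 82700.83, 76948.39),
    "Belle": (1076, 16507.72, 14802.11),
    "Auger": (1811, 7421.19, 7064.37),
}


def cpu_efficiency(wall_time, cpu_time):
    """Return the fraction of wall time that was spent on the CPU."""
    if wall_time <= 0:
        raise ValueError("wall time must be positive")
    return cpu_time / wall_time


efficiencies = {
    vo: cpu_efficiency(wall, cpu)
    for vo, (_jobs, wall, cpu) in ACCOUNTING_APRIL.items()
}

# Print the VOs from lowest to highest efficiency.
for vo, eff in sorted(efficiencies.items(), key=lambda item: item[1]):
    print(f"{vo:10s} {eff:.1%}")
```

A low ratio indicates jobs that spend much of their time waiting, for example on I/O, rather than computing.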


Accounting – LHC VOs


Network


Network II


Network III - Automatic failover

In March:

Link to SARA was down

Failover

Automatic routing over CERN


WLCG Tier 1 GridKa

• As a Tier 1, GridKa has to provide an availability of 98 %
  • Problematic with many updates
  • Maintenance in the computing centre can affect GridKa services (DNS, power supply …)
  • Sometimes non-functional updates
• On-call-duty 24x7 required
  • Some requirements on reaction time cannot be met with on-call-duty
  • Automation needed
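To get a feeling for what 98 % availability means, the permitted downtime can be computed for a given period. The 30-day period in the usage note is an assumption for illustration; the slide does not state the interval over which availability is measured.

```python
def allowed_downtime_hours(availability, period_hours):
    """Return the downtime budget in hours for a required availability.

    availability is a fraction between 0 and 1, e.g. 0.98 for 98 %.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours
```

For example, `allowed_downtime_hours(0.98, 30 * 24)` gives a budget of roughly 14.4 hours per 30 days, which a single longer maintenance or incident can already use up.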


Monitoring I

• Central monitoring page at GridKa
• Providing information on different resources
  • dCache
  • FTS
  • SAM results
    • Ops
    • LHC VOs
• Status board
  • Interventions
  • Incidents
• Links to other web sites
  • Ganglia
  • Nagios
  • VO dashboards


Monitoring II

The central monitoring tool used at GridKa is Nagios


On-call-duty I

• To provide 24x7 support, different on-call circles had to be implemented
  • Infrastructure
  • Network
  • Data management and databases
  • Middleware services and GGUS
  • Hardware and server (still missing)
• Nagios triggers an alarm
  • SMS to the on-call engineer
  • Creates a ticket in the internal ticket system
    • Documentation for the persons involved and for the next incident


On-call-duty II

Business process view in Nagios used for problem definition


On-call-duty III


On-call-duty IV

• Nagios process schema for “Middleware services and GGUS”
  • Sensors for GGUS still missing
• The on-call engineer is called if
  • Any BDII has problems
  • Fewer than 2 CEs are OK
  • Any LFC or FTS problem occurs
• External sensors are problematic since false positives cannot be controlled
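The alarm conditions listed above can be expressed as a single boolean rule. This is a sketch of the logic only, not the Nagios business process configuration itself; the function name and the representation of sensor results as lists of booleans are assumptions.

```python
def should_call_engineer(bdii_ok, ce_ok, lfc_ok, fts_ok, min_ces_ok=2):
    """Decide whether the on-call engineer has to be called.

    Each argument is a list of booleans, one per service instance,
    where True means the instance is healthy.
    """
    any_bdii_problem = not all(bdii_ok)
    too_few_ces = sum(ce_ok) < min_ces_ok  # True counts as 1
    lfc_or_fts_problem = not (all(lfc_ok) and all(fts_ok))
    return any_bdii_problem or too_few_ces or lfc_or_fts_problem
```

The rule tolerates single failing CEs as long as two remain healthy, whereas any single BDII, LFC or FTS failure triggers a call.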


On-call-duty V

• Issues concerning the on-call engineer (OCE)
  • Every employee has to rest 11 hours without interruption between two working days (German law)
    • This also applies to the on-call engineer
• What happens in case of an incident during the night?
  • Example: the OCE works until 6 pm, then the on-call duty starts
  • Incident from 3 am until 4 am
    • Only 9 hours of rest
    • The rest time has to start again
  • The OCE cannot come to work before 3 pm
  • But he has to work 40 hours per week
  • The missing hours have to be made up on other working days
    • Maximum of 10 hours a day allowed
• If this happens too often, the OCE cannot reach the normal working hours and may have to come in on Saturdays …
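The rest-time example can be checked with a short calculation. The calendar dates are arbitrary placeholders; only the times of day and the 11-hour rule come from the slide.

```python
from datetime import datetime, timedelta

REQUIRED_REST = timedelta(hours=11)  # uninterrupted rest between working days


def rest_before_incident(work_end, incident_start):
    """Return the rest the engineer had before the incident started."""
    return incident_start - work_end


def earliest_return(incident_end):
    """Return the earliest time to resume work after an incident.

    The rest period restarts when the incident ends.
    """
    return incident_end + REQUIRED_REST


work_end = datetime(2009, 5, 25, 18, 0)       # works until 6 pm
incident_start = datetime(2009, 5, 26, 3, 0)  # incident at 3 am
incident_end = datetime(2009, 5, 26, 4, 0)    # resolved at 4 am

rest = rest_before_incident(work_end, incident_start)
needs_restart = rest < REQUIRED_REST
back_at_work = earliest_return(incident_end)
```

Here `rest` is 9 hours, which is below the required 11, so the rest period restarts and `back_at_work` falls on 3 pm of the same day, matching the slide.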


VOBoxes and VO logging host

• Many VOs have their own login node at GridKa
  • Used for VO-internal monitoring or file transfer agents
• Sometimes a VO would like access to some log files, e.g. FTS logs or gridftp server logs
  • Central logging host implemented at GridKa
  • Some information is only accessible to ‘local VO supporters’ (who must have signed a ‘Datenschutzerklaerung’, a data protection declaration), other information is readable by VO members, and some information is available to all.


Pre-Production Service

• To minimize the impact on production, most updates and changes are tested in the preproduction service (PPS)
  • PPS services run as small virtual machines
    • Enough for functional tests
  • New services such as the CREAM CE are also introduced and tested in PPS
• In so-called ‘service pilots’ a new system is tested at dedicated sites with the community interested in the service
  • Advantages:
    • Enough time to get used to a new service
    • Good contact with the developers
    • Early discovery of site-specific problems


CREAM CE Pilot with Alice

CREAM: Computing Resource Execution And Management

Operated through a VOBOX in parallel to the already existing service at GridKa

Access to the CREAM CE

Initially 30 CPUs (PPS) available for testing

Temporarily raised to 300 cores for more load

Later moved to the production ALICE queue

2000 concurrent jobs were managed by CREAM over several days

[Diagram labels: VOBOX, CREAM CLI, CREAM CE, GridFTP]


CREAM CE Pilot II

[Plots: ALICE production jobs via CREAM CE (ca. 2000); ALICE jobs via lcg-CE]

The two CEs used have the same hardware


CREAM CE Pilot III

Ongoing testing

Load test

Test CREAM CE with 5000 managed jobs at the same time

For each batch system there is a CREAM CE available in this pilot

Problems with WMS to be solved


gLExec/SCAS Pilot

gLExec is used for "pilot jobs"

The mapping on the WN requires a central authorization tool (SCAS)

gLExec on all production Worker Nodes for scalability test

[Diagram: Worker Nodes (WN), each running gLExec, arranged around the SCAS, which provides central credential mapping]
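The idea of central credential mapping can be illustrated with a toy model. It does not reflect the real gLExec or SCAS interfaces: the classes CentralMappingService and WorkerNode, the example DN, and the account name are all hypothetical. The sketch only shows why one central service yields the same mapping on every worker node.

```python
class CentralMappingService:
    """Toy stand-in for a central authorization and mapping service."""

    def __init__(self, mapping):
        # mapping: user DN -> local account name (illustrative data only)
        self._mapping = dict(mapping)

    def map_credential(self, dn: str) -> str:
        """Return the local account for `dn`, or refuse unknown DNs."""
        try:
            return self._mapping[dn]
        except KeyError:
            raise PermissionError(f"DN not authorized: {dn}") from None


class WorkerNode:
    """Toy worker node that asks the central service before running a payload."""

    def __init__(self, name: str, service: CentralMappingService):
        self.name = name
        self.service = service

    def run_payload(self, payload_dn: str, command: str) -> str:
        # Every node consults the same service, so the mapping is consistent.
        account = self.service.map_credential(payload_dn)
        return f"{self.name}: running '{command}' as {account}"


if __name__ == "__main__":
    scas = CentralMappingService({"/DC=example/CN=Some User": "alice001"})
    for node in (WorkerNode("wn1", scas), WorkerNode("wn2", scas)):
        print(node.run_payload("/DC=example/CN=Some User", "analysis.sh"))
```

Because every worker node queries the same service, the load on that service grows with the number of nodes, which is what a scalability test on all production worker nodes would exercise.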


Service Challenge

Soon operation in real production

Tests for real incidents ongoing

Test ALARM Tickets

Raised by each LHC VO with a theoretical incident

Workflow test for possible incidents

Proceed as if it were a real incident

First test ticket had no incident specified

Second ticket "reported" failing jobs and assumed a problem with NFS software mounts


Security Service Challenge

Email: THIS IS A TEST: Consider specific DN as corrupted

Workflow

Check for user activity and ban user

Analyze the user's activity, e.g. active jobs, which UI was used …; kill jobs after analyzing

Report activity
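The workflow can be sketched as a small in-memory model. This is a sketch under assumptions, not the procedure actually used at GridKa: the Job and Site classes, the field names, and the report layout are hypothetical. Only the order of the steps (ban, analyze, kill, report) follows the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    owner_dn: str
    submitted_from: str  # UI host the job was submitted from
    running: bool = True


@dataclass
class Site:
    jobs: list = field(default_factory=list)
    banned_dns: set = field(default_factory=set)


def handle_compromised_dn(site: Site, dn: str) -> dict:
    """Ban the DN, analyze its activity, kill its jobs, and return a report."""
    # 1. Ban the user so that no new activity is accepted.
    site.banned_dns.add(dn)
    # 2. Analyze activity first: active jobs and the UIs they came from.
    active = [job for job in site.jobs if job.owner_dn == dn and job.running]
    uis = sorted({job.submitted_from for job in active})
    # 3. Kill the jobs only after the analysis is complete.
    for job in active:
        job.running = False
    # 4. Report the findings.
    return {
        "dn": dn,
        "banned": dn in site.banned_dns,
        "killed_jobs": [job.job_id for job in active],
        "uis": uis,
    }
```

Killing the jobs after the analysis step keeps the evidence (which jobs were active, from which UI) available for the report.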


Site admin’s everlasting questions/concerns

New concepts for redundancy?

Improve efficiency for service maintenance?

How much virtualization?

How much PPS engagement?

What enhancements for scalability can be done?


Thank you for your attention!
