The infrastructure of Grid at KIT
Angela Poschlad
Steinbuch Centre for Computing
The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
Outline
What is a grid?
Grid at KIT
GridKa
  Resources
  Services
  Cluster layout
  Network
Monitoring
On-call-duty
Preproduction system
Service Challenges
What is a grid?
A grid is a global allocation of resources (storage and CPUs) at local computing centres with defined services, connected by an efficient network. Through the middleware, the usage of the grid is decoupled from the local batch system. This allows access for all users without their needing information about the different site setups. The users of a grid are grouped in Virtual Organisations (VOs), which are communities pursuing the same goals.
For example, the participants at the CMS experiment are organised in the VO cms.
The grid resources are shared within these VOs. A grid centre can support various VOs, and the concept of VO membership allows a simple authorisation of users at the different sites.
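As an illustration of this VO-based authorisation, the sketch below creates a VOMS proxy for the VO cms (taken from the example above) and prints the attached VO attributes. It is a minimal sketch wrapping the standard VOMS command-line clients, assuming they are installed and the user holds a valid grid certificate.

    # Sketch: obtain a VOMS proxy for a VO and inspect its attributes.
    # Assumes the VOMS CLI tools (voms-proxy-init, voms-proxy-info) are
    # installed and the user holds a valid grid certificate.
    import subprocess

    def make_vo_proxy(vo="cms"):
        # Request a proxy carrying the VO membership attributes.
        subprocess.run(["voms-proxy-init", "--voms", vo], check=True)
        # Print the FQANs (VO/group/role attributes) that sites evaluate
        # when authorising the user.
        out = subprocess.run(["voms-proxy-info", "--fqan"],
                             check=True, capture_output=True, text=True)
        print(out.stdout)

    if __name__ == "__main__":
        make_vo_proxy()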
Grid structure
Resources
  Computing
  Storage
Services
  Service discovery mechanism
  Resource Broker
  Portal machines to resources: Storage Elements (SE), Computing Elements (CE)
  Catalogue Service for data
WLCG structure
The LHC Computing Grid (LCG) is composed of grid centres at different levels of importance. The Worldwide LHC Computing Grid is based on the middleware gLite.
Grid @ KIT

(Diagram: grid activities across the two KIT campuses.)
Campus North (SCC North): GridKa
Campus South (SCC South): Campus Grid, D-Grid reference installation, WLCG Tier 3 maintained by EKP
Campus Grid
Project for the virtualization of a heterogeneous computing environment:
  Scalar processors
  Vector processors
  Different memory architectures (distributed, shared)
Utilization of the resources with grid technology (Globus):
  OpusIB (Opteron cluster with InfiniBand, Linux; Open MPI supported)
  AIX servers (Power4 and PowerPC)
  Circa 240 cores
DGI Reference Installation
Reference installation for D-Grid
Support for grid installations:
  A well-documented example installation of a grid site
  Description of the favoured architecture and infrastructure
  Provision of powerful monitoring
  User management
WLCG Tier 3 – Uni Karlsruhe (EKP/SCC)
Currently providing a DPM storage element; a batch cluster is in preparation.
The cluster is shared between many institutes and is located at SCC South; the middleware services will be maintained at Campus North.
Worker Nodes can only be provided using virtualization techniques, since other institutes require a different OS.
GridKa - WLCG Tier 1 and more
• supports all 4 LHC experiments
• supports non-LHC experiments: CDF, D0, BaBar, Compass …
• supports several D-Grid HEP and non-HEP VOs, e.g. Auger, Astrogrid, Medigrid, …
• located near Karlsruhe on the KIT north campus
• operated by the Steinbuch Centre for Computing
Resources at GridKa
Computing resources
  Can be used by more than 30 VOs
  Cluster of SL4 32-bit and SL5 64-bit Worker Nodes
  8620 cores
  More than 12 TB memory
Storage
  Tape libraries
  Disk space
Services at GridKa
(Diagram of the services operated at GridKa:)
Storage Elements (dCache)
TopLevel BDII
LFC
FTS
Computing Elements: lcg-CE, CREAM-CE, ARC-CE, Unicore, Globus
VOBoxes
MyProxy
User Interfaces
Resource Broker (WMS/LB)
Computing Elements I
The Computing Element is a portal to the local batch system (pbspro at GridKa).
Various middleware flavors are supported at GridKa
gLite: lcg-CE and CREAM-CE, used by EGEE/WLCG and D-Grid
ARC-CE: installation currently ongoing; Atlas asked to try it
Unicore: used by D-Grid
Globus Toolkit 4: used by D-Grid
Computing Elements II
Users are mapped to local accounts in gLite. To ensure the same mapping on all CEs, the central mapping directory is mounted via NFS (a single point of failure). Special users have permission to install software in the VO-specific software area.
A special queue is implemented to run installation jobs with high priority on the cluster.
Problems with the file system: the ext3 limit of max. 32k links per directory was hit, and with rising computing resources the number of jobs rises; xfs is used instead.
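A common workaround for such per-directory link limits is to shard job directories into hashed buckets; the sketch below is illustrative only (the spool path and bucket count are hypothetical, not GridKa's actual layout).

    # Sketch: spread per-job directories over hashed buckets so no single
    # directory approaches ext3's limit of ~32k links (subdirectories).
    # The base path and bucket count are hypothetical.
    import hashlib
    import os

    BASE = "/var/spool/batch/jobdirs"   # hypothetical spool area
    BUCKETS = 256                       # keeps each directory far below 32k

    def job_dir(job_id: str) -> str:
        bucket = int(hashlib.md5(job_id.encode()).hexdigest(), 16) % BUCKETS
        path = os.path.join(BASE, "%02x" % bucket, job_id)
        os.makedirs(path, exist_ok=True)
        return path

    print(job_dir("job-1234567.example"))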
Computing Elements III
Jobs last week at GridKa: most computing groups use grid techniques; some groups submit local jobs.
The numbers of jobs differ between the gLite CEs.
CREAM is only used by alice.
Storage Element
Storage is provided by dCache systems: petabytes of disk and tape storage.
Two instances in production: one supporting various VOs, one supporting Atlas (recently split off from the old instance).
Plans: a third instance supporting all D-Grid VOs, and a virtual tape library.
  Reduces the risk of writing problems, but raises the risk of reading problems for whole data sets.
LFC and FTS with Oracle DB

Round robin for the FTS web service; the channels are defined on one node. In case of a drop-out, another machine has to resume the channel.
LHCb has a read-only LFC (a replica of the CERN LHCb LFC).
(Diagram: three front-end nodes, each running 2x FTS + 1x LFC; three Oracle database back-ends serve all front-ends: the LFC DB, the FTS DB with a hot standby, and a read-only CERN LHCb LFC DB fed by an Oracle stream from CERN.)
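To illustrate how a client rides such a round-robin alias, a minimal sketch (the alias name and port are hypothetical placeholders): resolve all A records behind the alias and fall back to the next front-end if one has dropped out.

    # Sketch: pick a live front-end behind a DNS round-robin alias.
    # Alias name and port are hypothetical placeholders.
    import socket

    ALIAS, PORT = "fts-rr.example.org", 8443

    def pick_live_frontend(alias=ALIAS, port=PORT, timeout=3.0):
        # getaddrinfo returns every A record behind the round-robin alias.
        addrs = {info[4][0] for info in socket.getaddrinfo(alias, port)}
        for ip in sorted(addrs):
            try:
                with socket.create_connection((ip, port), timeout=timeout):
                    return ip        # first node that accepts a connection
            except OSError:
                continue             # dropped-out node: try the next one
        raise RuntimeError("no front-end reachable behind %s" % alias)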
File transfer service I
The file transfer service (FTS) provides dedicated transfer channels between two grid centres.
It is also possible to define a connection to everywhere ("STAR").
Recently a new FTS instance was installed with version 2.1:
  SLC 3 -> SL 4
  srmls usage configurable, reducing the load on the SRM/pnfs of the dCache storage
  New File Transfer Monitoring (FTM) available
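For illustration, a transfer could be submitted to such an FTS instance with the gLite transfer CLI of that era; a hedged sketch (the endpoint and SRM URLs are hypothetical placeholders, and a valid VOMS proxy is assumed):

    # Sketch: submit a transfer to an FTS instance and poll its state
    # using the gLite transfer CLI. Endpoint and SRM URLs are hypothetical.
    import subprocess

    FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"
    SRC = "srm://source.example.org/pnfs/example.org/data/file1"
    DST = "srm://dest.example.org/pnfs/example.org/data/file1"

    job_id = subprocess.run(["glite-transfer-submit", "-s", FTS, SRC, DST],
                            check=True, capture_output=True,
                            text=True).stdout.strip()
    state = subprocess.run(["glite-transfer-status", "-s", FTS, job_id],
                           check=True, capture_output=True,
                           text=True).stdout.strip()
    print(job_id, state)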
File transfer service II
9 VOs are supported, but only two are really active (cms and atlas). The VOs have different data distribution models: CMS often uses "Site"-STAR channels, while Atlas has a dedicated channel for each site. More than 70 channels are defined.
Information system I
All these services have to be published into a central service discovery system.
In gLite, LDAP is used for this purpose.
A BDII (Berkeley Database Information Index) service has to be implemented at each site.
The site BDII collects all information about local services.
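A site BDII is plain LDAP, so it can be queried directly; a minimal sketch with the Python ldap3 package (the host name is a hypothetical placeholder; port 2170 and the o=grid base are the usual gLite conventions):

    # Sketch: query a site BDII over LDAP for its published services.
    # Port 2170 and base "o=grid" are the usual gLite conventions; the
    # host name is a hypothetical placeholder. Requires the ldap3 package.
    from ldap3 import Server, Connection, ALL

    server = Server("sitebdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # BDIIs allow anonymous binds

    conn.search("o=grid", "(objectClass=GlueService)",
                attributes=["GlueServiceType", "GlueServiceEndpoint"])
    for entry in conn.entries:
        print(entry.GlueServiceType, entry.GlueServiceEndpoint)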
Information System II
A central BDII queries a list of site BDIIs and provides this information to the users or to other services such as the Resource Broker.
In EGEE the list is automatically created from a central database where all sites have to register; D-Grid sites are maintained by hand at the moment.
Information System III
There can be various TopLevel BDIIs for a community. EGEE TopLevel BDIIs are located at CERN, at each Tier 1 (offered for regional sites), and at some other sites such as DESY.
At GridKa we have differently configured BDIIs. For WLCG production a round robin of 4 BDIIs is available.
This service has to scale with the GridKa resources and the regional resources.
To support D-Grid, the Resource Broker (WMS) has a round robin of two BDIIs collecting information from EGEE and D-Grid. For monitoring purposes a TopLevel BDII is installed that provides all sites: certified, uncertified, D-Grid, and PPS sites.
Resource Broker I
Users should not locate resources themselves. In gLite a Resource Broker is used to find proper resources matching the user's requirements:
  Wall time
  Operating system
  Free slots, small queue
  Installed software
  …
The current Resource Broker is called WMS (Workload Management System).
It works together with an LB (Logging and Bookkeeping) system.
The WMS gets the information on the available resources by querying a TopLevel BDII
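Such requirements are written in JDL and matched by the WMS against the GLUE attributes published in the BDII; a minimal sketch (the file name and values are illustrative, and a valid VOMS proxy is assumed):

    # Sketch: a minimal JDL job description with requirements the WMS
    # matches against BDII information, submitted via the gLite CLI.
    import subprocess

    JDL = """
    Executable    = "/bin/hostname";
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};
    Requirements  = other.GlueCEPolicyMaxWallClockTime > 720
                    && other.GlueHostOperatingSystemName == "ScientificSL";
    Rank          = other.GlueCEStateFreeCPUs;
    """

    with open("hostname.jdl", "w") as f:
        f.write(JDL)

    # -a delegates the proxy automatically for this single submission.
    subprocess.run(["glite-wms-job-submit", "-a", "hostname.jdl"], check=True)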
Resource Broker II
At GridKa we had up to 6500 jobs operated by one instance at one time, and up to 2500 jobs running in parallel. The service is not stable throughout.
GridKa Computing Cluster I
The GridKa cluster provides 8640 cores. Local disk space, physical and virtual memory available on worker nodes:

------------------------------------------------------------------------------------------
Batch (CPU type)  | /tmp + /tmp/home (GB) | Phys. Mem. (GB) | Virt. Mem. (GB) | # job slots
------------------|-----------------------|-----------------|-----------------|-----------
AMD Opteron 270   | 130                   | 4               | 12              | 4
Intel Xeon 5148   | 165                   | 16              | 24              | 4
Intel Xeon 5160   | 210                   | 6               | 14              | 4
------------------|-----------------------|-----------------|-----------------|-----------
Intel Xeon E5345  | 175 + 230             | 16              | 48              | 8
Intel Xeon E5430  | 175 + 230             | 16              | 48              | 8
Intel Xeon L5420  | 170 + 225             | 16              | 48              | 8
------------------------------------------------------------------------------------------
The compute nodes, like most other resources, are located in two rooms.
GridKa Computing Cluster II
Waste heat is cooled completely with water.
High computer density on a small area without a complex air-conditioning system.
The rooms have air conditioning, so some racks can be open.
Each rack is rated for a heat load of 10 kW.
New fileserver racks open automatically in case of cooling problems.
Space is kept free for coming storage resources.
GridKa Computing Cluster III
(Photo of a rack, labelled: rack manager, ventilators for air circulation, power supply, switch, power supply for nodes, heat exchanger.)
GridKa Computing Cluster IV
The Worker Nodes are organized rack by rack. Each rack has its own private subnet: 10.1.rack.host.
Each rack has a rack manager for logging (syslog) and configuration.
Cluster installation is done centrally with the Rocks Toolkit (also for the rack managers).
For central administration and configuration cfengine is used: it distributes certificates and configuration files to the rack managers, and the rack managers distribute the files to the Worker Nodes.
The software area is mounted read-only on most Worker Nodes. On the first node of each rack the software area is mounted read-write so that software installation can be done; a dedicated sgm queue is limited to these hosts.
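The 10.1.rack.host scheme makes rack and host derivable from a node's IP address alone; a small sketch:

    # Sketch: derive rack and host number from a Worker Node's private IP,
    # following the 10.1.rack.host addressing scheme described above.
    def rack_and_host(ip: str):
        octets = ip.split(".")
        assert octets[:2] == ["10", "1"], "not a WN private address"
        return int(octets[2]), int(octets[3])   # (rack, host)

    print(rack_and_host("10.1.12.34"))   # -> (12, 34): host 34 in rack 12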
GridKa Computing Cluster V
Currently we have two sub-clusters installed: one half with SL4 32-bit, the other half with SL5 64-bit.
This is necessary since not all VOs can handle SL5 yet. We decided not to have dedicated queues but dedicated CEs for the different clusters:
  SL4 cluster: 2x lcg-CE, 1x CREAM-CE, 1x Unicore, 1x Globus
  SL5 cluster: 2x lcg-CE, 1x CREAM-CE, 1x Globus
CPU time measurement problems occur on some SL5 compute nodes.
Accounting
WLCG: all grid jobs are published to a central RGMA.
D-Grid: published via DGAS.
April       # jobs    wall time    cpu time
Atlas       466214    890288.50    722175.35
Alice       299744    969445.26    806379.47
cms          88030    433567.15    371238.03
Astrogrid    36982    706115.22    412857.02
LHCb         33962     82700.83     76948.39
Belle         1076     16507.72     14802.11
Auger         1811      7421.19      7064.37
Accounting – LHC VOs
Network
Network II
Network III - Automatic failover
In March the link to SARA was down.
Failover: automatic routing over CERN.
WLCG Tier 1 GridKa
As a Tier 1, GridKa has to provide an availability of 98 %. This is problematic with many updates: maintenance in the computing centre can affect GridKa services (DNS, power supply, …), and updates are sometimes non-functional.
On-call duty 24x7 is required. Some requirements on reaction time cannot be met with on-call duty alone; automation is needed.
Monitoring I
Central monitoring page at GridKa, providing information on different resources:
  dCache
  FTS
  SAM results (ops and LHC VOs)
  Status board: interventions, incidents
  Links to other web sites: Ganglia, Nagios, VO dashboards
Monitoring II
The central monitoring tool used at GridKa is nagios
On-call-duty I
To provide 24x7 support, different on-call circles had to be implemented:
  Infrastructure
  Network
  Data management and databases
  Middleware services and GGUS
  Hardware and server (still missing)
Nagios triggers the alarm: an SMS is sent to the on-call engineer, and a ticket is created in the internal ticket system (documentation for the involved persons and for the next incident).
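A nagios notification command is just an external script, so the alarm chain can be sketched as follows; nagios passes host, service, state, and output as arguments configured via macros in the notification command definition. The SMS gateway URL and the ticket-creation step are hypothetical placeholders, since the actual systems are not named here.

    # Sketch: a nagios notification handler alerting the on-call engineer.
    # Arguments arrive via macros in the notification command definition.
    # The SMS gateway and the ticket interface are hypothetical.
    import sys
    import urllib.parse
    import urllib.request

    def notify(host, service, state, output):
        msg = "%s/%s is %s: %s" % (host, service, state, output)
        query = urllib.parse.urlencode({"to": "oncall", "text": msg})
        # Hypothetical HTTP SMS gateway of the computing centre.
        urllib.request.urlopen("http://smsgw.example.org/send?" + query)
        # Placeholder for creating a ticket in the internal ticket system.
        print("would create internal ticket:", msg)

    if __name__ == "__main__":
        notify(*sys.argv[1:5])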
On-call-duty II
The business process view in nagios is used for problem definition.
On-call-duty III
On-call-duty IV
Nagios process schema for "Middleware services and GGUS" (sensors for GGUS are still missing).
The on-call engineer is called if (see the sketch below):
  any BDII has problems,
  fewer than 2 CEs are OK, or
  any LFC or FTS problem occurs.
External sensors are problematic, since false positives cannot be controlled.
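The paging rule is essentially a boolean aggregation over per-service states, as this sketch shows (the service names are illustrative):

    # Sketch: the business-process paging rule described above, evaluated
    # over per-service states ("ok" / "critical"). Names are illustrative.
    def must_page(states: dict) -> bool:
        bdii_bad = any(v != "ok" for k, v in states.items()
                       if k.startswith("bdii"))
        ces_ok = sum(1 for k, v in states.items()
                     if k.startswith("ce") and v == "ok")
        lfc_fts_bad = any(v != "ok" for k, v in states.items()
                          if k.startswith(("lfc", "fts")))
        return bdii_bad or ces_ok < 2 or lfc_fts_bad

    print(must_page({"bdii1": "ok", "ce1": "ok", "ce2": "critical",
                     "ce3": "ok", "lfc1": "ok", "fts1": "ok"}))  # False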
On-call-duty V
Issues concerning the on-call engineer (OCE):
Every employee has to rest 11 hours without interruption of work between two working days (German law).
This also applies to the on-call engineer.
What happens in case of an incident in the night? Example: the OCE works until 6 pm, then the on-call duty starts; an incident occurs at 3 am and lasts until 4 am.
  Only 9 hours of rest so far, so the rest time has to start again.
  The OCE cannot come to work before 3 pm, but still has to work 40 hours that week. The missing hours have to be made up on other working days, with a maximum of 10 hours per day allowed.
If this happens too often, the OCE cannot reach the normal working hours and may have to come in on Saturdays …
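The arithmetic of this example can be checked directly; a small sketch with Python's datetime (the date is illustrative):

    # Sketch: the rest-time arithmetic of the example above. The incident
    # interrupts the rest period, so the 11 hours restart when it ends.
    from datetime import datetime, timedelta

    REST = timedelta(hours=11)
    incident_end = datetime(2009, 5, 26, 4, 0)   # handled until 4 am
    earliest_return = incident_end + REST        # rest restarts at 4 am
    print(earliest_return.time())                # 15:00 -> not before 3 pm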
VOBoxes and VO logging host
Many VOs have their own login node at GridKa, used for VO-internal monitoring or file transfer agents.
Sometimes a VO would like to have access to certain log files, e.g. FTS logs or gridftp server logs.
A central logging host is implemented at GridKa. Some information is only accessible to 'local VO supporters' (who must have signed a 'Datenschutzerklaerung', a data privacy statement); other information is readable by VO members, and some information is available to all.
Pre-Production Service
To minimize the impact on production, most updates and changes are tested in the preproduction service (PPS). PPS services run as small virtual machines, which is enough for functional tests.
New services such as the CREAM CE are also introduced and tested in PPS.
In so-called 'service pilots' a new system is tested at dedicated sites together with the community interested in the service. Advantages:
  Enough time to get used to a new service
  Good contact with the developers
  Early discovery of site-specific problems
CREAM CE Pilot with Alice
CREAM: Computing Resource Execution And Management
Operated through a VOBox in parallel to the already existing services at GridKa.
(Diagram: a VOBox running the CREAM CLI accesses the CREAM CE; transfers go via gridftp.)
Initially 30 CPUs (PPS) were available for the testing; for more load this was temporarily raised to 300 cores and later moved to the production ALICE queue.
2000 jobs managed concurrently by CREAM worked during several days.
CREAM CE Pilot II
(Chart: ALICE production jobs via the CREAM CE (ca. 2000) vs. ALICE jobs via the lcg-CE.)
The two CEs used have the same hardware
CREAM CE Pilot III
Ongoing testing: load test.
  Test the CREAM CE with 5000 managed jobs at the same time.
  For each batch system a CREAM CE is available in this pilot.
Problems with the WMS remain to be solved.
gLExec/SCAS Pilot
gLExec is used for "pilot jobs". The mapping on the WN requires a central authorization tool (SCAS).
gLExec is deployed on all production Worker Nodes for a scalability test.
(Diagram: many Worker Nodes, each running gLExec, all querying one central SCAS service for credential mapping.)
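For illustration, a pilot job switches to the real user's identity by wrapping its payload in glexec; a hedged sketch (the environment variable names follow the conventional GLEXEC_* interface, and all paths are hypothetical placeholders):

    # Sketch: run a payload under the mapped user's identity via gLExec,
    # which consults SCAS for the central credential mapping. Variable
    # names follow the conventional GLEXEC_* interface; paths are
    # hypothetical placeholders.
    import os
    import subprocess

    env = dict(os.environ,
               GLEXEC_CLIENT_CERT="/tmp/payload_user.proxy",
               GLEXEC_SOURCE_PROXY="/tmp/payload_user.proxy")

    # gLExec re-executes the payload under the account that SCAS maps
    # the payload user's proxy to.
    subprocess.run(["/opt/glite/sbin/glexec", "/bin/id"], env=env, check=True)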
Service Challenge
Operation in real production will start soon; tests for real incidents are ongoing.
Test ALARM tickets:
  Raised by each LHC VO with a theoretical incident, as a workflow test for possible incidents.
  Proceed as if it were a real incident.
  The first test ticket had no incident specified; the second ticket "reported" failing jobs and assumed a problem with NFS software mounts.
Security Service Challenge
Email: "THIS IS A TEST: Consider a specific DN as corrupted."
Workflow:
  Check for user activity and ban the user.
  Analyze the user's activity, e.g. active jobs, which UI was used, …; after the analysis, kill the jobs.
  Report the activity.
Site admin’s everlasting questions/concerns
New concepts for redundancy?
Improve efficiency for service maintenance?
How much virtualization?
PPS engagement?
What enhancements for scalability can be done?
Thank you for your attention!