EU 2nd Year Review – 04-05 Feb. 2003 – Title – n° 1
WP8: Progress and testbed evaluation
F Harris (Oxford/CERN)
(WP8 coordinator)
Outline of the presentation
Overview of the objectives for the 2nd project year, and the corresponding achievements
Activities of funded and unfunded effort
Ongoing work on use cases
Data Challenge work with Atlas and CMS
Comments on the key points of work in the other 4 WP8 experiments
The organisation for D 8.3 ‘Testbed assessment for HEP applications’
The planning for the 3rd project year, and some associated issues
QUESTIONS
The objectives for 2nd project year, and the corresponding achievements
OBJECTIVES
Use and exploitation of Testbed1
Validation of releases + feedback
Participation in the Architecture group (ATF), and the elaboration of use cases
ACHIEVEMENTS
Babar and D0 have joined the 4 LHC experiments, and NA48 will soon join. 5 experiments have used the applications testbed. All WP8 experiments have continued to develop their distributed computing infrastructure in Europe and USA
Both EIPs and the experiments have given continual feedback to middleware from both generic and experiment specific evaluations
ATF is very active and executes regular ‘scenario playing’ reviews. Use case documents have been produced and will develop in the context of EDG/LCG
Overview of objectives for 2nd project year, and the corresponding achievements
OBJECTIVES
Design of a common middleware layer for WP8 experiments
Use of EDG middleware in experiment Data Challenges (DCs)
Development of tutorials and documentation for the user community
ACHIEVEMENTS
This has moved into the LHC Computing Grid (LCG) project
Atlas and then CMS experiments have achieved significant pioneering work in the use of EDG middleware for DCs, and in producing detailed evaluations
WP8 has played a substantial role in course design, implementation and delivery
Activities of funded and unfunded effort
WP8 used 51 funded man-months instead of the projected 43.5 (January - November)
Complemented with 350 unfunded man-months from experiments, which have largely concentrated on experiment-specific activities
The EIP (Experiment Independent Persons) have been involved in:
• Functionality* and stress testing
• Middleware debugging campaigns*
• Configuration and testing of Storage Elements and Virtual Organisations*
• Data Challenges of the ATLAS and CMS experiments
• Organisation of WP8 Integration Team* and Architectural Task Force
* Activities unforeseen in original mandate
Ongoing work on use cases
‘Common Use Cases for A HEP Common Application Layer’ (HEPCAL)
(Document produced for LCG; chaired and largely manned by WP8 people,
and only possible thanks to WP8 experience)
General (authorisation, login, browse resources): 4 use cases
Data Management (metadata and data operations): 19 use cases
Job Management (submission, control, monitoring, errors, resource estimation, job splitting…): 16 use cases
VO Management (resource reservation, user rights, software publishing…): 4 use cases
• EDG 1.4.3 satisfies use cases for a basic system (authorisation/authentication, data handling, job submission)
• EDG 2.0 will satisfy more advanced data handling (e.g. metadata) and HEP data transformation
• There are other areas for discussion, e.g. virtual data, experiment s/w publishing
This work will continue within EDG and LCG
In ATF there is regular scenario playing for use cases to check existing and future design
Overview of data challenge work
ATLAS (pioneers!)
Specific Goals
• Compare results with those obtained without Grid in previous months for ~100 ‘long’ detector simulation jobs
• Make a prioritised list of recommendations to EDG for bug-fixes and future developments in an evaluation report
Organization
• Joint Atlas/EDG/LCG effort
Resources used (and functions)
• Sites (CERN, RAL, Lyon, Nikhef, CNAF) + (Karlsruhe)
• Several UIs: Milan, CERN, Cambridge
• RB: CERN (shared)
• RC: Originally shared with CMS. Finally separate one at CNAF
CMS
Specific Goals
• Aim for as many simulated events as possible for physics analysis, with 1000’s of ‘short’ event generation and ‘long’ detector simulation jobs, using the full production system
• Measure performance, efficiencies and reasons for job failures to give detailed feedback to middleware in a detailed report
Organization
• This was a joint effort involving CMS, EDG, EDT and LCG people
Resources used (and functions)
• Sites (CERN, RAL, Lyon, Nikhef, CNAF) + (Legnaro, Padova, Ecol. Poly, I.C)
• Several UIs: CNAF, Padova, Ecol. Poly, I.C
• Several RBs: CNAF (CMS), CNAF (shared), CERN (CMS), IC (CMS+Babar)
• RC: Originally shared with Atlas. Finally separate one at CNAF
History: relating applications work to TB versions
Version Date
1.1.2 27 Feb 2002
1.1.3 02 Apr 2002
1.1.4 04 Apr 2002
1.2.a1 11 Apr 2002
1.2.b1 31 May 2002
1.2.0 12 Aug 2002
1.2.1 04 Sep 2002
1.2.2 09 Sep 2002
1.2.3 25 Oct 2002
1.3.0 08 Nov 2002
1.3.1 19 Nov 2002
1.3.2 20 Nov 2002
1.3.3 21 Nov 2002
1.3.4 25 Nov 2002
1.4.0 06 Dec 2002
1.4.1 07 Jan 2003
1.4.2 09 Jan 2003
1.4.3 14 Jan 2003
RC Changes
Mixed Globus 2.0/2.2
RB/JSS Upgrade
Known Problems:
• GASS Cache Coherency
• Race Conditions in Gatekeeper
• Unstable MDS
Successes:
• Improved MDS Stability
• FTP Transfers OK
Known Problems:
• Interactions with RC
Real Use by Applications!
Limitations:
• Resource Exhaustion
• Size of Logical Collections
Successes:
• Matchmaking/Job Mgt.
• Basic Data Mgt.
Known Problems:
• High Rate Submissions
• Long FTP Transfers
ATLAS commence phase 1 tests
CMS start stress tests Nov 30, which continue till Dec 20
• Problems with long jobs
• Instability in MDS
• Long file transfers unreliable
CMS and Atlas evaluate 1.4.3
Atlas evaluations (August and Dec/Jan) (DETAILED PAPER IN PREPARATION)
RESULTS (see Atlas jobs in DEMO tomorrow)
• Atlas software was used in the EDG Grid environment
• Several hundred simulation jobs of length 4-24 hours were executed, and data was replicated using grid tools
• Results of simulation agreed with ‘non-Grid’ runs
OBSERVATIONS
• Good interaction with EDG middleware providers and with WP6/8
• With a substantial effort it was possible to perform the jobs
• Showed up bugs and performance limitations (fixed or to be fixed in EDG 2.0)
  WP1: Many ‘Long Jobs’ failed (now much better)
  WP2: Replication Tools were difficult to use and unreliable
  WP3: Information Service based on MDS gave poor performance (affected WP1)
  WP4: We need to separate out application and system software installations (fixed in 1.4.3)
• We need EDG 2.0 release for use in large scale data challenges
RECOMMENDATIONS (see combined ATLAS/CMS recommendations…)
CMS production components interfaced to EDG middleware (more details in DEMO)
[Diagram. Components: UI with IMPALA/BOSS (CMS production tools on UI: job creation, job submission and monitoring); BOSS DB; RefDB; Workload Management System; Replica Manager; CEs and WNs with CMS software (RPM-based, installed on CEs/WNs); SEs. Flow labels: JDL; parameters; data registration; job output filtering; runtime monitoring; input data location. Legend: push data or info; pull info.]
Main results and observations from CMS work (detailed doc in preparation)
RESULTS
Could distribute and run CMS s/w in EDG environment
Generated ~250K events for physics with ~10,000 jobs in a 3-week period
OBSERVATIONS
Were able to quickly add new sites to provide extra resources
Fast turnaround in bug fixing and installing new software
Test was labour intensive (since software was developing and the overall system was fragile)
WP1: At the start there were serious problems with long jobs; recently improved
WP2: Replication Tools were difficult to use and not reliable, and the performance of the Replica Catalogue was unsatisfactory
WP3: The Information System based on MDS performed poorly with increasing query rate
The system is sensitive to hardware faults and site/system mis-configuration
The user tools for fault diagnosis are limited
EDG 2.0 should fix the major problems (see talks by R Jones and E Laure) providing a system suitable for full integration in distributed production
CMS event production in December 2002 using EDG software and applications TB
[Plot: number of events vs. time]
http://cmsdoc.cern.ch/cms/production/www/html/general/index.html
CMS/EDG Summary of Stress Test Preliminary Analysis
CMKIN jobs (short jobs)
Status                  EDG evaluation   CMS evaluation   EDG ver 1.4.3 (After Stress Test – Jan 03)
Finished Correctly      5518             4601             604
Crashed or bad status   818              1099             65
Total number of jobs    6336             5700             669
Efficiency              0.87             0.81             0.90

CMSIM jobs (long jobs)
Status                  EDG evaluation   CMS evaluation   EDG ver 1.4.3 (After Stress Test – Jan 03)
Finished Correctly      1678             2147             394
Crashed or bad status   2662             934              104
Total number of jobs    4340             3081             498
Efficiency              0.39             0.70             0.79
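The efficiency figures in this summary appear to be the number of jobs that finished correctly divided by the total number of jobs. The short Python sketch below recomputes them from the job counts on this slide as a consistency check. The dictionary layout and the function names are illustrative choices, not part of the CMS or EDG tooling.

```python
# Job counts copied from the stress-test summary on this slide, as
# (finished correctly, crashed or bad status). Keys are labels for this sketch.
STRESS_TEST = {
    "CMKIN (short)": {
        "EDG evaluation": (5518, 818),
        "CMS evaluation": (4601, 1099),
        "EDG ver 1.4.3": (604, 65),
    },
    "CMSIM (long)": {
        "EDG evaluation": (1678, 2662),
        "CMS evaluation": (2147, 934),
        "EDG ver 1.4.3": (394, 104),
    },
}


def efficiency(finished, crashed):
    """Fraction of submitted jobs that finished correctly."""
    return finished / (finished + crashed)


def summarise(table):
    """Return {job type: {column: (total jobs, efficiency rounded to 2 places)}}."""
    return {
        job_type: {
            column: (ok + bad, round(efficiency(ok, bad), 2))
            for column, (ok, bad) in columns.items()
        }
        for job_type, columns in table.items()
    }


if __name__ == "__main__":
    for job_type, columns in summarise(STRESS_TEST).items():
        for column, (total, eff) in columns.items():
            print(f"{job_type:14s} {column:15s} total={total:5d} efficiency={eff:.2f}")
```

Under that assumption the recomputed totals and efficiencies match the values quoted on the slide for both job types.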
EDG reasons of failure (categories)
Preliminary analysis of pre Xmas (1.4.0)

CMKIN (short) jobs
Status                                                   Totals
Crashed jobs                                             818
Reasons of Failure for Crashed jobs
No matching resource found                               509
Generic Failure: MyProxyServer not found in JDL expr.    102
Running                                                  74
Failure while executing job wrapper                      37
Other failures                                           96

CMSIM (long) jobs
Status                                                   Totals
Crashed jobs                                             2662
Reasons of Failure for Crashed jobs
Failure while executing job wrapper                      1476
No matching resource found                               722
Globus Failure: Globus down/Submit to globus failed      144
Running                                                  116
Globus Failure                                           90
Other failures                                           114
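To see which failure categories dominate, the counts on this slide can be turned into shares of the crashed jobs. The following Python sketch does this. It is an illustration added here rather than part of the original analysis, and the function name `failure_shares` is hypothetical.

```python
# Crash counts copied from the preliminary failure analysis on this slide.
FAILURES = {
    "CMKIN (short)": {
        "No matching resource found": 509,
        "Generic Failure: MyProxyServer not found in JDL expr.": 102,
        "Running": 74,
        "Failure while executing job wrapper": 37,
        "Other failures": 96,
    },
    "CMSIM (long)": {
        "Failure while executing job wrapper": 1476,
        "No matching resource found": 722,
        "Globus Failure: Globus down/Submit to globus failed": 144,
        "Running": 116,
        "Globus Failure": 90,
        "Other failures": 114,
    },
}


def failure_shares(reasons):
    """Return (total, [(reason, count, percent)]) sorted by descending count."""
    total = sum(reasons.values())
    ranked = sorted(reasons.items(), key=lambda item: item[1], reverse=True)
    return total, [(reason, n, round(100 * n / total, 1)) for reason, n in ranked]


if __name__ == "__main__":
    for job_type, reasons in FAILURES.items():
        total, rows = failure_shares(reasons)
        print(f"{job_type}: {total} crashed jobs")
        for reason, n, pct in rows:
            print(f"  {pct:5.1f}%  {n:5d}  {reason}")
```

The category counts sum to the crashed-job totals quoted above (818 and 2662). On these numbers, "No matching resource found" accounts for about 62% of the short-job crashes, while job wrapper failures account for about 55% of the long-job crashes.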
Joint recommendations from Atlas/CMS work
There are essential developments (see EDG 2.0) needed in:
• Data Management (robustness and functionality)
• Information Systems (robustness and scalability)
• Workload Management (scalability for high rates, batch submissions, output file specification)
• Mass Storage Support (gridified support due in EDG 2.0)
We must maintain and strengthen joint Experiment/EDG work in the evaluation of system components AND the architecture (both will need to evolve – GRID developments are R/D)
Once middleware providers have done their ‘unit tests’ the applications must work with them in the areas of:
• Performance evaluation for the user with increasing rates of job submission and data handling, and an expanding TB configuration
• Streamlining procedures for feedback to middleware providers
EDG should provide site validation and monitoring procedures
EDG should provide good user tools for fault detection and diagnosis (what is job status? why did it fail? …)
Some key points of work in the other experiments
ALICE
• developed scripts for the installation of ALICE software on EDG/CEs
• developed a WEB interface to automatically submit jobs to the testbed and evaluate its "efficiency" (currently in use)
• Current development of the AliEn/EDG interface (included effort from DataTAG) able to send jobs to EDG via AliEn
• Currently completing the tests for registering/accessing data on/from both catalogues (AliEn and EDG), which is required for the interoperability
LHCb
• consolidation of basic job submission capability demonstrated at EU review, and at the opening of National E-science Center, Edinburgh, 25 April
• made RPMs for LHCb environment
• included DataGrid in new LHCb distributed production system (DIRAC) and demonstrated that short DataGrid jobs can be submitted and managed via DIRAC
Babar
• Deployment of the BaBar VO: VO and RC at Manchester; RB at IC; CE/SE/WN at SLAC, In2p3, RAL and Ferrara.
• Deployment and adaptation of EDG software at SLAC (the EDG scripts had to be modified for the WN inside the Internet Free Zone)
• Successfully tested BaBar analysis and simulation jobs within the EDG framework.
• Next step is to run full scale analysis on the Grid.
D0
• A D0 replica catalogue and VO server have been set up at Nikhef
• A 124-CPU farm has been successfully used with EDG s/w
• D0 support was added to the official EDG release, and several sites now support D0 jobs and have installed the RPMs.
• Will try the newer release (and true Grid production) when RH 7.2 support appears
The key content for D 8.3 ‘Testbed assessment for HEP applications’
‘Datagrid as HEP production environment’
• Detailed evaluations of Atlas and CMS Task Forces
• Evaluations by other LHC experiments (Alice, LHCb)
• Evaluations from non-LHC experiments (Babar, D0)
Mapping of evaluations to the ‘common use cases’
• General use cases
• Data management
• Job Management
• VO management
Summary of lessons learned for future EDG development, and statement of priorities for the experiments
The planning for the 3rd project year, and associated issues
PLANNING
• Continue work with experiments using the successful Task Force Model for Data Challenges
• Complete D8.3 for end March 2003 (based on release 1.4.3)
• Continue architecture work in ATF, and participate in LCG use case/architecture activities
• Evaluate EDG 2.0 software, and port it to experiment software environments for use in the data challenges
• Complete D8.4 by Dec 2003 (based on release 2.x)
SOME IMPORTANT ISSUES
• Must organise detailed test sessions involving experiments and the providers of middleware for information systems, data management and mass storage handling in the context of moving to EDG 2.0
• We look for improved diagnostic information from middleware in case of problems
• WP8 will work increasingly with experiments rather than in generic testing, which will be taken up by the WP6 Testing Group
• We must relate EDG/WP8 work to the use by experiments of the forthcoming LCG Prototype, in terms of software, hardware and user support
• We should re-activate inter-application WG (8+9+10)