Production test on EDG-1.4
Goal 1: simulate and reconstruct 5000 Pb-Pb central events
1 job/event
Output size: about 1.8 GB/event, so 9 TB
Job duration: 12-24 hrs
Goal 2: analyse these events
1 job to analyse all the events
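The storage estimate for Goal 1 follows directly from the per-event output size. A minimal arithmetic check, assuming decimal units (1 TB = 1000 GB), which the slides do not specify:

```python
# Back-of-the-envelope check of the Goal 1 storage estimate.
# Assumption: decimal units (1 TB = 1000 GB); the slides do not say which.
N_EVENTS = 5000       # Pb-Pb central events, one job per event
GB_PER_EVENT = 1.8    # approximate output size per event


def total_output_tb(n_events: int, gb_per_event: float) -> float:
    """Return the total output volume in TB for n_events single-event jobs."""
    return n_events * gb_per_event / 1000.0


print(f"{total_output_tb(N_EVENTS, GB_PER_EVENT):.1f} TB")
```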
Status
Started on March 15th
Stopped on May 31st
About 450 central Pb-Pb events simulated (6 jobs/day) :-(
Output registered in the EDG Alice RC
Output stored on:
EDG disk SE's (300)
EDG MSS SE's (150)
CASTOR at CNAF and CERN (all, registered in the AliEn Data Catalogue)
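The quoted rate of about 6 jobs/day can be cross-checked against the start and stop dates and the number of simulated events. This is only a sketch: the year is not given on the slides, so 2003 is an assumption used to build valid dates (the interval from March 15th to May 31st has the same length in any year).

```python
from datetime import date

# Assumption: the year is not stated on the slides; 2003 is a placeholder.
START = date(2003, 3, 15)
STOP = date(2003, 5, 31)
EVENTS_DONE = 450  # approximate number of simulated events, from the slide


def events_per_day(done: int, start: date, stop: date) -> float:
    """Average number of completed events per calendar day."""
    return done / (stop - start).days


rate = events_per_day(EVENTS_DONE, START, STOP)
print(f"{(STOP - START).days} days, {rate:.1f} events/day")
```

The average comes out slightly below 6 events per day, consistent with the figure on the slide.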
Comments
Average Efficiency: 35%
More jobs would mean lower efficiency
Application Testbed unstable on the time scale of our job duration (24 h)
Most of the jobs failed because of service failures
It takes a long time to track down the errors and recover (i.e., clean up the RC by hand when needed)
Failure reasons: RB overloaded
When the service crashes, jobs are lost even while executing on a WN, and they can no longer be tracked or monitored
stdout/stderr can't be monitored during execution
The job might complete correctly and store/register the output on/in the SE/RC
No Output Sandbox available
No change of job status
Failure reasons: WN disk space full
Alice jobs produce a 2 GB output
Sometimes the available disk space on the executing WN is filled up and the job crashes
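One way a job wrapper could guard against this failure is to check the free space on the WN before starting. The sketch below is purely illustrative and is not something the slides describe: `has_room` and the 20% safety margin are hypothetical, and the 2 GB figure is taken as decimal gigabytes.

```python
import shutil

# Approximate per-job output from the slide; decimal GB is an assumption.
OUTPUT_BYTES = 2 * 10**9


def has_room(path: str, needed_bytes: int, margin: float = 1.2) -> bool:
    """Return True if the filesystem holding `path` has at least
    needed_bytes * margin free. The margin is a hypothetical safety factor."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * margin


if __name__ == "__main__":
    # Check the current working directory, where the job would write.
    print("enough space" if has_room(".", OUTPUT_BYTES) else "not enough space")
```

Such a check only narrows the window: other jobs sharing the same disk can still fill it while the job is running.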
Failure reasons: The "Lyon" problem
WN's publish the total available memory in the IS
The JDL memory requirement is compared to the published values
When more than one job is allowed on the WN, the memory is shared. AliRoot jobs break because they need more memory than is actually available
Workaround by F. Hernandez
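The mismatch behind the "Lyon" problem can be illustrated with a toy calculation. All numbers and function names below are hypothetical, and an even split of memory between concurrent jobs is assumed for simplicity; the slides give no actual values and do not describe the workaround.

```python
# Toy illustration of the "Lyon" problem. All values are hypothetical.

def passes_matchmaking(published_mb: float, required_mb: float) -> bool:
    """The JDL requirement is compared to the total memory published in the IS."""
    return required_mb <= published_mb


def memory_per_job(published_mb: float, concurrent_jobs: int) -> float:
    """Memory actually available to each job, assuming an even split."""
    return published_mb / concurrent_jobs


published = 1024  # total WN memory published in the IS (hypothetical)
required = 600    # memory requirement stated in the JDL (hypothetical)
slots = 2         # jobs allowed to run concurrently on the WN (hypothetical)

matched = passes_matchmaking(published, required)
fits = memory_per_job(published, slots) >= required
print(f"matched by the broker: {matched}, actually fits: {fits}")
```

With these numbers the job is matched to the WN because 600 MB is below the published 1024 MB, yet only 512 MB is available to it once two jobs share the node.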
Behaviours not understood
Some jobs go to "OutputReady" status after 6-8 days
MSS jobs fail more frequently (and job information is available only for CNAF jobs)
MSS jobs
OK             74
LDAP failure   23
RC failure     35
Disk full      16
Lost           32
Wrapper        39
Running        36
Submit         15
---------------------
Total         270
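The breakdown of MSS jobs can be turned into percentages with a short tally. The counts are those on the slide; the dictionary keys simply reuse the slide's labels.

```python
# MSS job outcomes as listed on the slide.
mss_jobs = {
    "OK": 74,
    "LDAP failure": 23,
    "RC failure": 35,
    "Disk full": 16,
    "Lost": 32,
    "Wrapper": 39,
    "Running": 36,
    "Submit": 15,
}

total = sum(mss_jobs.values())

# Print each outcome with its count and share of all MSS jobs.
for status, count in mss_jobs.items():
    print(f"{status:<14}{count:>4}{100 * count / total:7.1f}%")
print(f"{'Total':<14}{total:>4}")
```

The counts add up to the stated total of 270, and the jobs marked OK account for roughly 27% of them.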
Conclusions
The EDG Application Testbed is not suitable for large productions (lack of resources)
Its use is very frustrating: instability, limited functionality, low efficiency
at the present rate, it would take 18 months to complete the production :-(
functionality for data analysis is now missing
The application testbed is being closed
use AliEn for data analysis and wait for LCG-1