Production test on EDG-1.4

Goal 1: simulate and reconstruct 5000 Pb-Pb central events

1 job/event

Output size: about 1.8 GB/event, so 9 TB

Job duration: 12-24 hrs

Goal 2: analyse these events

1 job to analyse all the events
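
The sizing figures above follow from simple arithmetic. A minimal sketch, using only the numbers on this slide; the function and variable names are illustrative, and 1 TB is taken as 1000 GB:

```python
def production_totals(n_events, gb_per_event, hours_min, hours_max):
    """Return total output in TB and the (min, max) total job-hours.

    Assumes one job per event, as stated on the slide.
    """
    total_tb = n_events * gb_per_event / 1000.0  # assumption: 1 TB = 1000 GB
    job_hours = (n_events * hours_min, n_events * hours_max)
    return total_tb, job_hours


total_tb, (hours_lo, hours_hi) = production_totals(5000, 1.8, 12, 24)
print(f"Total output: {total_tb:.1f} TB")
print(f"Total job time: {hours_lo}-{hours_hi} job-hours")
```

The job-hours range is only the sum of the quoted per-job durations; it says nothing about wall-clock time, which depends on how many jobs run concurrently.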

Status

Started on March 15th

Stopped on May 31st

About 450 central Pb-Pb events simulated (6 jobs/day) :-(

Output registered in the EDG Alice RC

Output stored on:

EDG disk SE's (300)

EDG MSS SE's (150)

CASTOR at CNAF and CERN (all, registered in the AliEn Data Catalogue)
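
As a rough cross-check of the status figures, the sketch below recomputes the daily rate from the start and stop dates. It reads the bracketed numbers as event counts, which is an interpretation, and the year is a placeholder because the slides do not give one (the March to May span has the same length in any year):

```python
from datetime import date

YEAR = 2003  # assumption: placeholder year, not stated on the slides

start = date(YEAR, 3, 15)
stop = date(YEAR, 5, 31)
days = (stop - start).days

# Assumed reading: 300 events on disk SE's plus 150 on MSS SE's.
events_done = 300 + 150
rate = events_done / days

print(f"{events_done} events in {days} days = {rate:.1f} events/day")
```

This agrees with the quoted "about 450" events and roughly 6 jobs/day.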

Comments

Average Efficiency: 35%

More jobs would mean lower efficiency

Application Testbed unstable on the time scale of our job duration (24 h)

Most of the jobs failed because of service failures

It takes a long time to track down the errors and recover (i.e., clean up the RC by hand when needed)

Failure reasons: RB overloaded

Service crash: jobs get lost even while executing on a WN, and can no longer be tracked or monitored

stdout/stderr can't be monitored during execution

The job may still complete correctly and store its output on the SE and register it in the RC

No Output Sandbox available

No change of job status

Failure reasons: WN disk space full

Alice jobs produce a 2 GB output

Sometimes the available disk space on the executing WN is filled up and the job crashes
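
One possible guard against this failure is a free-space check on the WN before the job starts. The slides do not say this was done; the sketch below is purely illustrative, and the helper name and the 2 GB threshold (taken from the output size quoted above) are assumptions:

```python
import shutil

OUTPUT_BYTES = 2 * 1024**3  # about 2 GB of output per job, per the slide


def has_room(path, needed_bytes=OUTPUT_BYTES, free_bytes=None):
    """Return True if the filesystem holding `path` can take the job output.

    `free_bytes` can be passed explicitly (e.g. for testing); otherwise the
    value is read from the operating system.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_bytes


if __name__ == "__main__":
    print("Enough scratch space:", has_room("."))
```

A check like this only narrows the window: other jobs sharing the same WN can still fill the disk while the job is running.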

Failure reasons: The "Lyon" problem

WN's publish the total available memory in the IS

The JDL memory requirement is compared to the published values

When more than one job is allowed on a WN, the memory is shared. AliRoot jobs break because they need more memory than is actually available

Workaround by F. Hernandez
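
The mismatch described above can be illustrated with a toy model. This is not the EDG matchmaking code: the function names, the numbers, and the assumption that memory is split evenly between concurrent jobs are all hypothetical. The workaround itself is not described on the slide and is not shown here.

```python
def passes_matchmaking(required_mb, published_mb):
    """The comparison described above: JDL requirement vs. published total."""
    return required_mb <= published_mb


def actually_fits(required_mb, published_mb, concurrent_jobs):
    """Assumption for illustration: memory is split evenly among the jobs."""
    return required_mb <= published_mb / concurrent_jobs


# Hypothetical values, chosen only to show the effect.
published_mb = 1024    # total memory the WN publishes in the IS
required_mb = 600      # memory requested in the JDL
concurrent_jobs = 2    # jobs allowed to run on the WN at once

matched = passes_matchmaking(required_mb, published_mb)
fits = actually_fits(required_mb, published_mb, concurrent_jobs)
print(f"matched={matched}, fits={fits}")
```

With these numbers the job is matched to the WN but only gets half of the published memory, which is the situation in which the slides report AliRoot jobs breaking.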

Behaviours not understood

Some jobs go to "OutputReady" status after 6-8 days

MSS jobs fail more frequently (and job information only available for CNAF jobs)

MSS jobs

Status         Jobs
OK               74
LDAP failure     23
RC failure       35
Disk full        16
Lost             32
Wrapper          39
Running          36
Submit           15
---------------------
Total           270
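
The MSS job counts can be checked and turned into percentages with a short script. The counts are copied from the slide; the variable names are illustrative:

```python
# Job counts per final status, as listed for the MSS jobs.
mss_jobs = {
    "OK": 74,
    "LDAP failure": 23,
    "RC failure": 35,
    "Disk full": 16,
    "Lost": 32,
    "Wrapper": 39,
    "Running": 36,
    "Submit": 15,
}

total = sum(mss_jobs.values())
shares = {status: 100.0 * count / total for status, count in mss_jobs.items()}

# Print the categories from most to least frequent.
for status, count in sorted(mss_jobs.items(), key=lambda kv: -kv[1]):
    print(f"{status:<14}{count:>4}  {shares[status]:5.1f}%")
print(f"{'Total':<14}{total:>4}")
```

The categories add up to the quoted total of 270, and the "OK" share comes out at about 27% of the MSS jobs.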

Conclusions

The EDG Application Testbed is not suitable for large productions (lack of resources)

Its use is very frustrating: instability, limited functionality, low efficiency

At the present rate, it would take 18 months to complete the production :-(

Functionality for data analysis is currently missing

The application testbed is being closed

Use AliEn for data analysis and wait for LCG-1