SAM Job Submission • What is SAM? • sam submit …… • Data Management • Details. • Conclusions. Rod Walker, 10 th May, Gridpp, Manchester.

SAM Job Submission

• What is SAM?

• sam submit ……

• Data Management

• Details.

• Conclusions.

Rod Walker, 10th May, Gridpp, Manchester.

What is SAM?

• SAM is Sequential data Access via Meta-data
• Project started in 1997 to handle D0’s needs for Run II data system.
• Current SAM team includes:

– Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)

• http://d0db.fnal.gov/sam

SAM is a Distributed System

[Diagram] Components: Database Server(s) (Central Database); Name Server; Global Resource Manager(s); Log server; Station 1 Servers, Station 2 Servers, Station 3 Servers, … Station n Servers; Mass Storage System(s).

Legend: Shared Globally; Local; Shared Locally. Arrows indicate control and data flow.

Job Submission

• Executable
  – Runtime environment
    • Executable & assoc. files (user specific).
    • Experiment environment.
• Data
  – Dataset definition
    • Select by metadata.
    • Converted to LFNs at submit time, i.e. datasets change.
    • Build SQL query…then…execute query.

Dataset
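The "select by metadata, build SQL query, execute query" step can be sketched as follows. This is only an illustration: the slides say the real catalogue lives in an Oracle database, whereas this sketch uses an in-memory SQLite table, and the table name `files`, the columns `data_tier` and `run`, and the function names are all hypothetical. It shows why a dataset "changes": the same definition resolved at a later submit time can return more LFNs.

```python
import sqlite3

# Hypothetical metadata columns; the real SAM schema is not shown in the slides.
ALLOWED_COLUMNS = {"data_tier", "run"}

def build_query(constraints):
    """Turn a metadata selection into a parameterised SQL query."""
    unknown = set(constraints) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown metadata columns: {sorted(unknown)}")
    clauses = " AND ".join(f"{column} = ?" for column in constraints)
    sql = f"SELECT lfn FROM files WHERE {clauses} ORDER BY lfn"
    return sql, list(constraints.values())

def resolve_dataset(conn, constraints):
    """Execute the query: the dataset is the list of LFNs matching right now."""
    sql, params = build_query(constraints)
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (lfn TEXT, data_tier TEXT, run INTEGER)")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("mydata1.dat", "reco", 1001),
    ("mydata2.dat", "reco", 1001),
    ("other1.dat", "raw", 1001),
])

definition = {"data_tier": "reco", "run": 1001}
first = resolve_dataset(conn, definition)    # resolved at the first submit

# A new file matching the same metadata is catalogued later.
conn.execute("INSERT INTO files VALUES (?, ?, ?)", ("mydata3.dat", "reco", 1001))
later = resolve_dataset(conn, definition)    # same definition, larger dataset
```

Storing the definition rather than a fixed file list is what lets the dataset grow as new files are catalogued.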

Job Running & Job Control

[Diagram] Components: Client; Local SM (Station Master); Batch System; Process Manager (SAM wrapper script); User Task; Job Manager (Project Master); User exe (several instances).

Numbered arrows, in order:

1. sam submit –defname=mydata –script=myexe
2. submit to SM
3. invoke
4. submit to BS
5. Submission ok
6. start job
7. Started
8. invoke
9. setJobCount/stop
10. resubmit

Unnumbered arrow: jobEnd

Annotation: (Run this exe | on this data)

Job control

[Diagram] Participants: User exe; Replica Catalogue; Stager; BS. States: Wait, Finished. Arrows are numbered 1 to 4.

• User exe: getNextFile()
• Here's the path to a local file: /sam/cache1/boo/mydata1.dat
• Arrow labels: LFN, PFN, Fetch PFN, Release
• Physics & wrapper
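From the executable's side, the exchange above amounts to a simple loop: ask for the next file, process the local path that comes back, release it, and repeat until nothing is left. The sketch below is a minimal illustration under stated assumptions. `ProjectMaster`, `get_next_file`, `release` and `user_task` are hypothetical Python names, not the real SAM API, and the staging that the diagram labels "Wait" is not modelled: every file is treated as already present in the cache.

```python
from collections import deque

class ProjectMaster:
    """Hypothetical stand-in for the job manager: hands out local file paths."""

    def __init__(self, lfns, cache_dir):
        self._pending = deque(lfns)
        self._cache_dir = cache_dir
        self.released = []

    def get_next_file(self):
        """Return the local path of the next file, or None when finished."""
        if not self._pending:
            return None
        lfn = self._pending.popleft()
        # A real system would block here until the file has been staged.
        return f"{self._cache_dir}/{lfn}"

    def release(self, path):
        """Tell the manager the file is no longer needed by this job."""
        self.released.append(path)

def user_task(project, process):
    """The user executable: it only ever sees local paths."""
    results = []
    while (path := project.get_next_file()) is not None:
        results.append(process(path))
        project.release(path)
    return results

project = ProjectMaster(["mydata1.dat", "mydata2.dat"], "/sam/cache1/boo")
seen = user_task(project, lambda path: path)
```

The executable never handles an LFN or a remote location, which is the point made in the conclusions: all knowledge of data transfer stays in the job manager.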

Data Management

• Replica Catalogue

• Replication

• Cache Management

Replica Catalogue

• Combined with Metadata in an Oracle database, although logically distinct
  – Query on metadata to create a dataset
    • list of LFNs
    • Experiment specific (D0/CDF).
  – Query on LFN to locate physical file.
    • Generic replica catalogue.
    • node:/path/to/cache/myfile.dat
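The second kind of query, from LFN to physical location, can be pictured as a lookup that returns zero or more `node:/path` entries. The sketch below assumes exactly the `node:/path` form shown on the slide; the catalogue is a plain dictionary standing in for the database, and the node names `node1` and `node2` are invented for illustration.

```python
from typing import NamedTuple

class Replica(NamedTuple):
    node: str
    path: str

def parse_location(location):
    """Split a 'node:/path' entry into its node and absolute path."""
    node, _, path = location.partition(":")
    if not node or not path.startswith("/"):
        raise ValueError(f"not a node:/path location: {location!r}")
    return Replica(node, path)

def locate(catalogue, lfn):
    """Return every known replica of an LFN (empty list if uncached)."""
    return [parse_location(entry) for entry in catalogue.get(lfn, [])]

# Hypothetical catalogue contents: one LFN with two cached copies.
catalogue = {
    "myfile.dat": [
        "node1:/path/to/cache/myfile.dat",
        "node2:/path/to/cache/myfile.dat",
    ],
}

replicas = locate(catalogue, "myfile.dat")
missing = locate(catalogue, "unknown.dat")
```

Nothing in this lookup depends on the experiment, which matches the slide's point that the replica catalogue is generic while the metadata is experiment specific.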

Replica Catalogue

600,000 files increasing at 3000/day, 120TB.

150,000 in cache

5000 files per day replicated, 5000 destroyed.

½ million queries per day, (90% SELECT).
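A little arithmetic puts these figures in proportion. The sketch assumes decimal units (1 TB = 1,000,000 MB) and an even spread of queries over the day, neither of which the slide states; all inputs are the numbers quoted above.

```python
files_total = 600_000        # files in the catalogue
volume_tb = 120              # total data volume
files_in_cache = 150_000     # files currently in cache
queries_per_day = 500_000    # "½ million queries per day"
select_fraction = 0.90       # share of queries that are SELECTs

# Mean file size, assuming decimal units (1 TB = 1,000,000 MB).
mean_file_mb = volume_tb * 1_000_000 / files_total

# Fraction of catalogued files that have a cached copy.
cached_fraction = files_in_cache / files_total

# Average database load, assuming queries are spread evenly over 24 hours.
queries_per_second = queries_per_day / 86_400
selects_per_day = queries_per_day * select_fraction

print(f"mean file size : {mean_file_mb:.0f} MB")
print(f"files in cache : {cached_fraction:.0%}")
print(f"query rate     : {queries_per_second:.1f} per second")
print(f"SELECTs per day: {selects_per_day:.0f}")
```

So the typical file is about 200 MB, a quarter of all files have a cached copy, and the database sees roughly six queries per second on average.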

Cache Management

• 13.6TB, in several 100 individually managed caches.
• 1TB in and out/day (10k files)
• Cache lifetime ~10 days
• Various prescriptions for cache replacement, e.g. 1st in, 1st out, last use.

70% hit rate (~6000 files/day)
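The difference between two of the replacement prescriptions can be seen in a toy cache. This sketch reads "last use" as least-recently-used eviction, which is an interpretation rather than something the slide defines, and it counts capacity in files rather than bytes for simplicity. `FileCache` and its methods are hypothetical names.

```python
from collections import OrderedDict

class FileCache:
    """Toy cache with first-in-first-out or least-recently-used eviction."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("fifo", "lru"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self._files = OrderedDict()   # oldest entry first
        self.hits = 0
        self.misses = 0

    def request(self, lfn):
        """Return True on a cache hit, False if the file had to be fetched."""
        if lfn in self._files:
            self.hits += 1
            if self.policy == "lru":
                self._files.move_to_end(lfn)   # mark as most recently used
            return True
        self.misses += 1
        if len(self._files) >= self.capacity:
            self._files.popitem(last=False)    # evict the oldest entry
        self._files[lfn] = True
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

requests = ["a", "b", "a", "c", "a"]
lru = FileCache(capacity=2, policy="lru")
fifo = FileCache(capacity=2, policy="fifo")
for lfn in requests:
    lru.request(lfn)
    fifo.request(lfn)
```

On this request sequence the least-recently-used cache keeps the popular file `a` and scores two hits, while first-in-first-out evicts it and scores one.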

Replication

• Easy – use your favourite ftp.

• BUT……what could go wrong.
  – Cache space – Cache Management.
  – network, dead node, corrupted file – retries.
  – dead disk, uncached – fail-over.
  – sluggish robot, slow delivery – hold job.

• A stroll through my log file.

05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:01:51 by eworker
  In the context: executed process samcp
    cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000
    imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp
    d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000
    /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled.
    trying normal rcp (/usr/bsd/rcp)
    WARNING: NO ENCRYPTION!
    d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds

Retry

05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:02:35 by eworker
  In the context: executed process samcp
    cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000
    imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp
    d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000
    /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled.
    trying normal rcp (/usr/bsd/rcp)
    WARNING: NO ENCRYPTION!
    d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations: (cab:d0cs015.fnal.gov:/sam/cache/boo)
05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all (prl733.24)

Give up on this source.

Avoid this location. Get another location from RC, and retry.
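The behaviour visible in the two log excerpts, a bounded number of attempts per source followed by fail-over to another catalogued location, can be sketched as below. This is not the SAM implementation: `deliver` and `fake_copy` are hypothetical, the limit of two attempts per source is only read off the log (one failure, one retry, then "Maximum number of retrials exceeded"), and the copy step is a stand-in that simulates the refused connection.

```python
import time

def deliver(lfn, locations, copy, max_attempts=2, retry_delay=0.0):
    """Try each location in turn; give up on a source after max_attempts."""
    avoided = []
    for location in locations:
        for attempt in range(1, max_attempts + 1):
            try:
                copy(location, lfn)
                return location, avoided
            except OSError:
                # Transient failure: wait, then retry the same source.
                if retry_delay and attempt < max_attempts:
                    time.sleep(retry_delay)
        # Retries exhausted: avoid this location and fail over to the next.
        avoided.append(location)
    raise RuntimeError(f"no usable replica for {lfn}; avoided {avoided}")

attempts = []

def fake_copy(location, lfn):
    """Stand-in for the copy tool: the cached copy refuses connections."""
    attempts.append(location)
    if location.startswith("cab:"):
        raise ConnectionRefusedError("Connection refused")

locations = [
    "cab:d0cs015.fnal.gov:/sam/cache/boo",
    "enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all",
]
source, avoided = deliver(
    "reco_all_0000151193_021.raw_p10.15.01_000", locations, fake_copy
)
```

Keeping the list of avoided locations separate from the retry counter mirrors the log, where the station first exhausts its retries and only then records the location to avoid.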

05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:10:53 by eworker
  In the context: executed process samcp
    enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
    imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=1369320147
    LABEL=PRL859
    LOCATION=0000_000000000_0000067
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=160.38
    SEEK_TIME=73.47
    MOUNT_TIME=25.36
    QWAIT_TIME=65.79
    TIME2NOW=329.78
    STATUS=ok
  STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec.
    Overall rate = 3.96 MB/sec.
    Drive rate = 8.14 MB/sec.
    Network rate = 8.13 MB/sec.
    Exit status

Got it
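The rates in the successful delivery can be reproduced from the other fields in the same log entry, which also shows where the time went. The sketch assumes that "MB" in the log means 2^20 bytes and that the drive rate is the file size divided by TRANSFER_TIME; both are inferences from the fact that the numbers then match, not something the log states.

```python
MIB = 1024 * 1024            # assumption: the log's "MB" is 2**20 bytes

filesize = 1369320147        # FILESIZE, bytes
elapsed = 329.720216036      # seconds, from the "Completed transferring" line
timings = {                  # seconds, from the STDOUT fields
    "TRANSFER_TIME": 160.38,
    "SEEK_TIME": 73.47,
    "MOUNT_TIME": 25.36,
    "QWAIT_TIME": 65.79,
}

overall_rate = filesize / elapsed / MIB                   # whole delivery
drive_rate = filesize / timings["TRANSFER_TIME"] / MIB    # tape streaming only
accounted = sum(timings.values())                         # time the fields explain
overhead = accounted - timings["TRANSFER_TIME"]           # queue + mount + seek

print(f"overall rate: {overall_rate:.2f} MB/sec")
print(f"drive rate  : {drive_rate:.2f} MB/sec")
print(f"overhead    : {overhead:.2f} s of {elapsed:.2f} s")
```

Roughly half of the delivery time is queue wait, tape mount and seek, which is why the overall rate is about half the drive rate.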

05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
--------------------------
05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:00:56 by eworker
  In the context: executed process samcp
    enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
    imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=788805399
    LABEL=PRL829
    LOCATION=0000_000000000_0000025
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=90.08
    SEEK_TIME=45.05
    MOUNT_TIME=27.14
    QWAIT_TIME=225.50
    TIME2NOW=392.28
    STATUS=ok
  STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec.
    Overall rate = 1.92 MB/sec.
    Drive rate = 8.35 MB/sec.
    Network rate = 8.35 MB/sec.
    Exit status = 0., method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk

Hold in queue until 1st file delivered.

Release

File arrives
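The hold-then-release pattern in this excerpt can be sketched as a small tracker: remember which batch job is held for a project, and run `qrls` on it when the first file for that project is delivered. `HeldJobTracker` and its methods are hypothetical names, not the station's real batch adapter. The command runner is injected so the sketch can be exercised without a PBS installation; a real deployment might pass `subprocess.run` instead.

```python
class HeldJobTracker:
    """Remember held batch jobs and release each one on first delivery."""

    def __init__(self, run_command):
        self._run = run_command    # callable taking an argv-style list
        self._held = {}            # project name -> held batch job id

    def remember_held(self, project, job_id):
        self._held[project] = job_id

    def file_delivered(self, project):
        """Release the project's job if it is still held; else do nothing."""
        job_id = self._held.pop(project, None)
        if job_id is None:
            return None            # already released, or never held
        command = ["qrls", job_id] # PBS command to release a held job
        self._run(command)
        return command

executed = []                      # records commands instead of running them
tracker = HeldJobTracker(executed.append)
tracker.remember_held("61983_sam_", "1760.gw39.hep.ph.ic.ac.uk")

first = tracker.file_delivered("61983_sam_")    # first file: release the job
second = tracker.file_delivered("61983_sam_")   # later files: nothing to do
```

Holding the job until data is local means the batch slot is not occupied while the tape robot is busy, which addresses the "sluggish robot, slow delivery" case listed under Replication.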

Conclusions

• Executable is stupid - no knowledge of data transfer. Job manager does the clever stuff.

• SAM has a fully featured, tried and tested data management system.

• No GSI, GridFTP, or CondorG as yet,

…but you need more than G's to make a grid!