
Data management for ATLAS, ALICE and VOCE in the Czech Republic

L. Fiala, J. Chudoba, J. Kosina, J. Krasova, M. Lokajicek, J. Svec, J. Kmunicek, D. Kouril, L. Matyska, M. Ruda, Z. Salvet, M. Mulac

Overview

• Supported VOs (VOCE, ATLAS, ALICE)

• DPM as our choice of SRM-based Storage Element

• Issues encountered with DPM

• Results of transfers

• Conclusion

VOCE

• Virtual Organization for Central Europe

• In the scope of the EGEE project

• Provision of distributed Grid facilities to non-HEP scientists

• Austrian, Czech, Hungarian, Polish, Slovak and Slovenian resources involved

• The design and implementation of the VOCE infrastructure were done solely on Czech resources

ALICE, ATLAS

• Virtual Organizations for the LHC experiments

Storage Elements

• Classical disk-based SEs

• Participating in Service Challenge 4 – need for an SRM-enabled SE

• No tape storage available for the Grid at the moment – DPM chosen as the SRM-enabled SE

• 1 head node, 1 disk server on the same machine

• Separate nodes with disk servers planned

• 5 TB on 4 filesystems (3 local, 1 NBD); see the registration sketch below
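
For illustration, registering such a layout with DPM's administration commands might look like the following sketch (the pool name, host and mount points are placeholders, and the exact options should be verified against the installed DPM version):

  # Create a pool and register each filesystem as a separate partition
  dpm-addpool --poolname Permanent
  dpm-addfs --poolname Permanent --server golias100.example.cz --fs /data01
  dpm-addfs --poolname Permanent --server golias100.example.cz --fs /data02
  dpm-addfs --poolname Permanent --server golias100.example.cz --fs /data03
  dpm-addfs --poolname Permanent --server golias100.example.cz --fs /nbd01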

DPM issues – srmCopy()

• DPM does not currently support the srmCopy() method (work in progress)

• When copying from a non-DPM SRM SE to a DPM SE using srmcp, the pushmode=true flag must be used

• Local temporary storage or globus-url-copy can be used to avoid a direct SRM-to-SRM 3rd-party transfer using srmCopy(); both approaches are sketched below
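
As a sketch (the hostnames and paths below are placeholders, and the client options should be checked against the installed srmcp/globus versions), the push-mode copy and the workarounds could look like this:

  # 3rd-party copy from a non-DPM SRM SE into DPM: force push mode
  srmcp -pushmode=true \
    srm://dcache.example.org:8443/pnfs/example.org/data/atlas/file1 \
    srm://golias100.example.cz:8443/dpm/example.cz/home/atlas/file1

  # Workaround 1: stage through local temporary storage
  srmcp srm://dcache.example.org:8443/pnfs/example.org/data/atlas/file1 file:////tmp/file1
  srmcp file:////tmp/file1 srm://golias100.example.cz:8443/dpm/example.cz/home/atlas/file1

  # Workaround 2: direct GridFTP 3rd-party copy between the SEs
  globus-url-copy \
    gsiftp://dcache.example.org:2811/pnfs/example.org/data/atlas/file1 \
    gsiftp://golias100.example.cz:2811/dpm/example.cz/home/atlas/file1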

DPM issues – pools on NFS (1)

• Our original setup – a disk array attached to an NFS server (64-bit Opteron, Fedora Core OS with a 2.6 kernel)

• Disk array NFS-mounted on the DPM disk server (no need to install the disk server on Fedora)

• Silent file truncation when copying files from pools located on NFS

DPM issues – pools on NFS (2)

• Using strace (see the sketch below) we found that at some point during the copying process, read() returns an EACCES error

• Unable to reproduce using standard utilities (cp, dd, simple read()/write() programs)

• Problem occurs only with a 2.4 kernel on the client and a 2.6 kernel on the server (verified on various versions)
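
The diagnosis needs only standard strace options; a sketch of the kind of invocation involved (the pid is a placeholder for the GridFTP server process handling the transfer):

  # Follow the transfer process and its children, logging read() calls;
  # the failures show up as "read(...) = -1 EACCES" in the trace
  strace -f -e trace=read -o /tmp/gridftp.trace -p <gridftp-server-pid>
  grep EACCES /tmp/gridftp.trace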

DPM issues – pools on NFS (3)

• Problem reported to the DPM developers

• Verified to be an issue also with the new VDT 1.3 (globus4, gridftp2)

• Our workaround – use NBD instead of NFS

– Important: DPM requires every filesystem in the pool to be a separate partition (free-space calculation)

– NBD is a suitable solution for the case of a shared filesystem (setup sketched below)
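
A minimal sketch of such an NBD setup (device names, port and mount point are placeholders; the syntax follows the classic nbd-tools of that era and should be checked against the installed version):

  # On the machine with the disk array: export one partition over NBD
  nbd-server 2000 /dev/sdb1

  # On the DPM disk server: attach the export as a local block device,
  # create a filesystem and mount it as a dedicated pool partition
  nbd-client arrayserver.example.cz 2000 /dev/nbd0
  mkfs.ext3 /dev/nbd0
  mount /dev/nbd0 /nbd01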

DPM issues – rate limiting

• The SRM implementation in DPM currently doesn't support rate limiting of concurrent new SRM requests (unlike dCache or CASTOR2)

• On the DPM TODO list

• Despite these issues, we have quite good results using DPM as an SE for the ATLAS, ALICE and VOCE VOs…

Atlas CSC

• Golias100 receives data from the Atlas CSC production

• Defined in some lexor (Atlas LCG executor) instances as a reliable storage element

[Chart: data volume per day on golias100, May 2006 – GB retrieved and stored, y-axis 0–1800 GB, days 1–31]

[Chart: connections per day, May 2006 – retrieve connections, store connections and unique clients, y-axis 0–4000, days 1–31]

Data transfers via FTS

• CERN – FZU, tested in April using the FTS server at CERN

Data transfers via srmcp

– FTS channel available only to the associated Tier1 (FZK)

– Tests to another Tier1 possible only via transfers issued "by hand"

– Tests SARA – FZU:

• Bulk copy from SARA to FZU, now with only one srmcp command (sketched below)

• 10 files: max speed 200 Mbps, average 130 Mbps

• 200 files: only 66 finished, the rest failed due to a "Too many transfers" error

• Speed OK
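
The single-command bulk copy can be expressed with srmcp's copy-job file; a sketch with placeholder SURLs (the option name should be checked against the srmcp version in use):

  # copyjob.txt – one "source destination" pair per line, e.g.:
  #   srm://srm.example-sara.nl:8443/pnfs/example-sara.nl/data/atlas/file001 srm://golias100.example.cz:8443/dpm/example.cz/home/atlas/file001
  srmcp -copyjobfile=copyjob.txt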

Tests Tier1 – Tier2 via FTS

• FZU (Prague) is a Tier2 associated to the Tier1 FZK (GridKa, Karlsruhe, Germany)

• FTS (File Transfer Service) operated by the Tier1; channels FZK-FZU and FZU-FZK managed by FZK and FZU

• Tunable parameters (a submission example follows this list):

– Number of files transferred simultaneously

– Number of streams

– Priorities between different VOs (ATLAS, ALICE, DTEAM)
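
The channel parameters above are set by the channel managers through the FTS administration tools; from the user side, a job on the FZK-FZU channel is submitted roughly as follows (the endpoint URL and SURLs are placeholders):

  # Submit a transfer job to the Tier1 FTS instance and poll its state
  glite-transfer-submit \
    -s https://fts.example-fzk.de:8443/glite-data-transfer-fts/services/FileTransfer \
    srm://srm.example-fzk.de:8443/pnfs/example-fzk.de/data/atlas/file001 \
    srm://golias100.example.cz:8443/dpm/example.cz/home/atlas/file001

  glite-transfer-status \
    -s https://fts.example-fzk.de:8443/glite-data-transfer-fts/services/FileTransfer <job-id>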

Results not stable:

• Transfer of 50 files, each file 1 GB – starts fast, then timeouts occur

• Transfer of 100 files, each file 1 GB – started when the load on the Tier1 disk servers was low

ATLAS Tier0 test – part of SC4

• Transfers of RAW and AOD data from Tier0 (CERN) to the 10 ATLAS Tier1s and to the associated Tier2s

• Managed by the ATLAS system DQ2, which uses the FTS at Tier0 for Tier0 – Tier1 transfers and each Tier1's FTS for Tier1 – Tier2 transfers

• First data copied to FZU this Monday

ALICE plans an FTS transfer test in July

Conclusion

• DPM is the only available "light-weight" Storage Element with an SRM frontend

• It has issues, but none of them are "show stoppers", and the code is under active development

• Using DPM, we were able to reach significant, non-trivial transfer results in the scope of LCG SC4