Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft

WLCG Collaboration Workshop, Jan. 24th 2007

Report
Tier-1 + associated Tier-2s

Andreas Heiss
[email protected]
www.gridka.de



Talk Outline

● GridKa “cloud” / DECH overview

● Tier-1 CPU usage and data transfer tests

● Middleware issues

● Site availability

● SC4 and experiments' exercises

● Reports of (some) Tier-2 sites

● Conclusion


GridKa Tier-1

● supports all 4 LHC experiments

● supports 4 non-LHC experiments: CDF, D0, BaBar, Compass

● located near Karlsruhe/Germany on the FZK (soon: KIT) campus

● Operated by the Institute for Scientific Computing (soon: “Steinbuch Computing Centre”)


GridKa associated Tier-2 sites spread over 3 EGEE regions. (4 LHC Experiments, 5 (soon: 6) countries, >20 T2 sites)


Region DECH

[Figure: sites in region DECH by experiment (LHCb, CMS, Alice, Atlas); scale marker: 1000 kSI2k]


[Figure: GridKa and its associated sites, labelled by experiment (atlas, cms, lhcb, alice)]


Usage of CPU time through grid and local job submission

[Figure: monthly CPU usage in 2006 by LHC experiment (Alice, Atlas, CMS, LHCb), grid and local submission shown separately; x axis: month (J to D), y axis: kSI2k * days, 0 to 25000]

● April CPU milestone: + approx. 650 kSI2k, delayed due to cooling and BIOS issues

● Fraction of CPU usage by LHC experiments [%], as listed along the month axis: 12 35 31 17 46 37 50 57 34 43 33

● Ratio of grid/non-grid jobs of LHC experiments: > 76% since April 2006

● ~ 2000 CPU cores available (2087 kSI2k)


● cooling failure: up and running after ~2 days → too long!

● PBS shutdown due to security problem in pbs_mom

● update to gLite 3.0

Overall good utilisation of GridKa CPUs. Increasing fraction of grid jobs.


Data transfers November 2006

Hourly averaged dCache I/O rates and tape transfer rates

● achieved 477 MB/s peak (1-hour average) data rate; > 440 MB/s during 8 hours (T0→T1 + T1→T1)

● > 200 MB/s to tape achieved with 8 LTO3 drives

● Higher tape throughput already in October 2006
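To put these rates in perspective, the arithmetic can be sketched as below. This is only an illustration: it assumes decimal units (1 TB = 10^6 MB), treats the quoted lower bounds (> 440 MB/s, > 200 MB/s) as exact values so the results are lower bounds as well, and assumes the tape rate is split evenly across the drives. The helper names are made up for this sketch.

```python
# Back-of-the-envelope figures for the November 2006 transfer tests.
# Assumptions: decimal units (1 TB = 1e6 MB); ">" lower bounds used as exact values.

SECONDS_PER_HOUR = 3600
MB_PER_TB = 1_000_000


def volume_tb(rate_mb_s: float, hours: float) -> float:
    """Data volume in TB moved at a constant rate over the given number of hours."""
    return rate_mb_s * hours * SECONDS_PER_HOUR / MB_PER_TB


def per_drive_rate(total_mb_s: float, drives: int) -> float:
    """Average rate per tape drive in MB/s, assuming an even split across drives."""
    return total_mb_s / drives


sustained_tb = volume_tb(440, 8)      # volume moved in the 8-hour window (lower bound)
lto3_share = per_drive_rate(200, 8)   # rate per LTO3 drive (lower bound)

print(f"8 h at 440 MB/s: {sustained_tb:.1f} TB")
print(f"200 MB/s over 8 drives: {lto3_share:.0f} MB/s per drive")
```

Under these assumptions the 8-hour window corresponds to somewhat more than 12 TB, and each LTO3 drive averaged at least 25 MB/s.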


Gridview T0→FZK Plots for Nov. 14-15th

high CMS transfer rates > 200 MB/s


Multi-VO transfers December 06

Target: Alice 24 MB/s, Atlas 83.3 MB/s, CMS 26.3 MB/s → SUM: 134 MB/s

[Plot annotations: CMS disk-only pools at FZK full; LFC down; FTS failed. RED = ATLAS]
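The combined target can be checked with a few lines of Python. The per-VO rates are the ones quoted above; the dictionary and variable names are illustrative only.

```python
# Per-VO target rates for the December 2006 multi-VO test, in MB/s (from the slide).
targets_mb_s = {"Alice": 24.0, "Atlas": 83.3, "CMS": 26.3}

# Combined target and the share each VO contributes to it.
total = sum(targets_mb_s.values())
shares = {vo: rate / total for vo, rate in targets_mb_s.items()}

print(f"Sum of targets: {total:.1f} MB/s")
for vo, share in shares.items():
    print(f"{vo}: {share:.0%} of the combined target")
```

The exact sum is 133.6 MB/s, which the slide rounds to 134 MB/s; Atlas accounts for the largest share of the combined target.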


gLite middleware issues

● gLite-3 (LCG-flavour) CE on a single-CPU Opteron machine in June
  → machine under very high load
  → CE frequently not published in site BDII
  → Beginning of August: hardware replaced by a dual dual-core Opteron server with 4 GB RAM

● Still infosystem problems

● Info provider script was by far too slow (ran > 25 min. but was started every minute)
  → A modified script supplied by RAL/Imperial College solved this problem ... and the next problem was recognized:

● Scripts were run by different users (edginfo, rgma, edginfo w/ globus-mds environment); pbs commands were missing in the globus-mds environment → empty ldif file and CE disappeared.

[Plot annotations: gLite 3.0; BDII on extra machine; downtime dCache update]
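A script that runs for more than 25 minutes but is started every minute ends up with many overlapping instances. One generic way to guard against this is a non-blocking file lock, sketched below. This is not the fix supplied by RAL/Imperial College (the slides do not describe it); `run_exclusively` and the lock path are hypothetical names, and `fcntl` is available on Unix-like systems only.

```python
import fcntl
import os


def run_exclusively(lock_path, job):
    """Run job() only if no other run holds the lock.

    Returns (True, result) if the job ran, or (False, None) if an earlier
    run is still active and this start was skipped.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False, None
        try:
            return True, job()
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A periodic start would call something like `run_exclusively("/tmp/infoprovider.lock", publish)`, where `publish` stands for the slow information-provider step; a start that finds the lock taken returns at once instead of adding load to the machine.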


Availability

General problems:

● Timeouts of top level BDII. Always: BDII query response times 2-4 sec.

● high load on top level BDII

● dCache: hanging gridftp doors caused SFT failures (timeouts)

● lcg-rm timeouts (600s)

[Plot annotations: DNS entries vanished (1/2 day); firewall overloaded due to test program]


Experiments' views



ATLAS

SC4 results

● Throughput to T1 sites during week 11/08/2006

● Goal was achieved during peak times but not sustained.

● Suffered from high load (>90) on VO box → new machine provided by GridKa

● Initially only 4 TB disk(-only) space in GridKa dCache available → another ≈34 TB of additional disk provided at the beginning of October


Dedicated test-week for DDM, October 4-10

[Plot annotations: tape server problem @ GridKa; CERN server problem; problem with Atlas certificate]

● Nominal 72 MB/s transfer rate Cern-GridKa achieved, but not sustained over a long time.

● Peak rates of 150 MB/s


DDM tests: Tier-1 + Tier-2 “cloud”

Participating Tier-2s: DESY-HH, DESY-ZN, Wuppertal, FZU, CSCS, Cyfronet

3 steps functional tests:

1. 1 dataset subscribed to each Tier-2 + one add. dataset to all Tier-2s → 100% files transferred

2. 2 datasets to each Tier-2 → Problem w/ Atlas VO at Wuppertal, few replication failures.

3. 1 dataset in each Tier-2 subscribed to GridKa → 100% files transferred.

Parallel subscription of datasets (few 100 GBs) to all Tier-2s. (Dec. 06)

Throughput tests to be done!


Atlas data aggregation at GridKa

Status as of the beginning of December:

● All available AODs subscribed

● 26098 / 31148 files at GridKa compared to 26347 / 30949 at CERN CAF (approx. 2891 GB)

● RDOs: 1185 GB (mostly for calibration studies)

● ESDs: 506 GB
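The two file counts are easier to compare as fractions. The sketch below assumes that each pair "a / b" means files present versus files expected at the site; the slide does not spell this out, so the interpretation and the function name are assumptions.

```python
def completeness(present: int, expected: int) -> float:
    """Fraction of expected files that are present at a site."""
    return present / expected


# Assumed reading of the slide: "present / expected" AOD files per site.
gridka = completeness(26098, 31148)
cern_caf = completeness(26347, 30949)

print(f"GridKa:   {gridka:.1%}")
print(f"CERN CAF: {cern_caf:.1%}")
```

Under this reading both sites hold roughly 84 to 85% of their AOD files, with GridKa a little over one percentage point behind the CERN CAF.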


Alice

PDC’06 - site contributions

[Figure: site contributions to PDC’06, with FZK marked]


Nov. 16-22: No 'competitor' concerning T0-GridKa transfers except dteam, but low overall Cern export rate.


Multi-VO transfer tests, Dec 11th - 14th


CMS

● Sufficiently high transfer rates possible over longer periods of time.

● Good transfer quality ...

● ... until dCache upgrade

[Plot annotation: dCache upgrade]

Beginning of CSA06 went very well with good transfer rates from our connected T1 FZK. When FZK experienced problems with the dcache upgrade, we noticed how reliant we as a T2 were on our T1. We were able to get parts of the desired data from FNAL, ASGC and RAL but never at the speed as initially from FZK.

Derek Feichtinger, CSCS (Swiss T2)


~ 50 TB / 21 days

● Good transfer rates when no dCache problems occur

Other problems encountered:

● low dCache output rates to worker nodes → suboptimal configuration of dCache pools for read operations.

● Problem with stage out of files > 2GB → preload lib (ls -l on /pnfs)
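The quoted volume of about 50 TB in 21 days can be turned into an average rate. The sketch assumes decimal units (1 TB = 10^6 MB) and a transfer spread evenly over the whole period, which the slide does not claim; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000


def average_rate_mb_s(volume_tb: float, days: float) -> float:
    """Average transfer rate in MB/s for a volume moved evenly over a number of days."""
    return volume_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


rate = average_rate_mb_s(50, 21)
print(f"~50 TB in 21 days: about {rate:.1f} MB/s on average")
```

This gives an average of roughly 28 MB/s over the three weeks, including any periods in which transfers were stalled.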


LHCb

[Figures: LHCb jobs; LHCb jobs @ GridKa. Running jobs, snapshot of Nov. 9th, 2006]

● Good cooperation with GridKa, phone meetings if necessary.

● GridKa fraction of LHCb MC production increased from 1.2% until June to 5.4% since July


Upgrades in 2007

● Install additional CPUs (April)
  ● LHC experiments: 1027 kSI2k + 837 kSI2k = 1864 kSI2k
  ● non-LHC experiments: 1060 kSI2k + 210 kSI2k = 1270 kSI2k

● Add tape capacity (April)
  ● LHC experiments: 393 TB + 614 TB = 1007 TB
  ● non-LHC experiments: 545 TB + 40 TB = 585 TB
  • GRAU Datasystems XT library
  • 5400 slots
  • 16 LTO3 drives (IBM) (expandable to 60)
  • support for TSM
  • dCache interfaced to TSM via TSS


● Add disk capacity (July)
  ● LHC experiments: 284 TB + 594 TB = 878 TB
  ● non-LHC experiments: 353 TB + 90 TB = 443 TB
  • Storage units of 20 TB
  • 2 servers connected to 1 storage controller
  • 2 (at 2 Gbit) servers for every 20 TB
  • dCache pool node on GPFS file system

2007: LHC experiments will have biggest fraction of the GridKa resources!
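The statement about the LHC share can be checked against the CPU, tape and disk numbers given for the 2007 upgrades. The sketch below reads each sum as existing capacity plus the 2007 addition, which is an interpretation of the slides; the dictionary layout and names are illustrative.

```python
# Planned 2007 totals per resource as (LHC, non-LHC), using the sums quoted on the slides.
planned_2007 = {
    "CPU [kSI2k]": (1027 + 837, 1060 + 210),
    "tape [TB]": (393 + 614, 545 + 40),
    "disk [TB]": (284 + 594, 353 + 90),
}

# LHC fraction of the total for each resource.
lhc_fraction = {
    resource: lhc / (lhc + non_lhc)
    for resource, (lhc, non_lhc) in planned_2007.items()
}

for resource, fraction in lhc_fraction.items():
    lhc, non_lhc = planned_2007[resource]
    print(f"{resource}: LHC {lhc}, non-LHC {non_lhc}, LHC share {fraction:.0%}")
```

With these totals the LHC share is about 59% for CPU, 63% for tape and 66% for disk, so it exceeds half of each resource.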


● Extend dCache mass storage
  ● dedicated nodes to write to tape
  ● group of nodes to read/write disk-only and read from tape

[Figure: dCache setup at FZK, dated 9/28/2006. SRM node gridka-dcache.fzk.de and dCache head node; pool groups A to D with the labels “tape W”, “tape R + W”, “disk only R + W” and “tape R”; connections to worker nodes, T0/T1 OPN (10 Gb) and T2 and Internet (10 Gb); private net and public net]


● Extend LAN/WAN router mesh and WAN connections.

● add WAN router for redundancy

● add LAN router (already installed, testing)

● build 10 Gb/s p2p links to several other Tier-1 sites (CNAF: ready; SARA: we have light; IN2P3: 2007), in addition to the existing dedicated 10 Gb/s link to Cern and a 10 Gb/s uplink to DFN/X-Win.


Tier-2 partners



CMS T2 Desy-Aachen Federation

● significant contributions to CMS SC4 and CSA06 challenges

● stable data transfers
  ● transferred 55 TB to DESY/Aachen disk within 45 days, 45 TB to DESY tape

● Aachen CMS muon and computing groups successfully demonstrated full “grid-chain” from data taking at T0 to user analysis at T2 for the first time.

● 14% of total CMS grid MC production

● 2007/2008:
  ● MC prod. / Calib. in Aachen, MC prod. and user analysis at Desy
  ● Significant upgrade of resources
  ● Further improve cooperation between German CMS centers (including Uni KA and GridKa)


Polish Federated Tier-2

● 3 computing centres, each supporting mainly one experiment:
  ● Kraków - Atlas, LHCb
  ● Warsaw - CMS, LHCb
  ● Poznań - Alice

● connected via Pionier academic network

● 1 Gb/s p2p network link to GridKa in place

● successful participation in Atlas SC4 T1↔T2 tests:
  - Up to 100 MB/s transfer rates from Krakow to GridKa, 50% slower in other direction.
  - 100% file transfer efficiency

● 1000 kSI2k CPU and 250 TB disk will be provided by Polish Tier-2 Federation at LHC startup.
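The quoted transfer rates can be set against the capacity of the 1 Gb/s link. The sketch assumes decimal units, 8 bits per byte and no protocol overhead, and reads "50% slower" as half the rate; none of these assumptions is stated on the slide.

```python
LINK_GBIT_S = 1.0  # p2p link to GridKa, from the slide

# Raw link capacity in MB/s (assumes 1 Gb = 1000 Mb, 8 bits per byte, no overhead).
link_mb_s = LINK_GBIT_S * 1000 / 8

to_gridka_mb_s = 100.0                  # peak rate Krakow -> GridKa, from the slide
from_gridka_mb_s = to_gridka_mb_s / 2   # "50% slower", read here as half the rate

utilisation = to_gridka_mb_s / link_mb_s
print(f"Link capacity: {link_mb_s:.0f} MB/s")
print(f"Krakow -> GridKa peak: {utilisation:.0%} of the link")
print(f"GridKa -> Krakow (assumed): {from_gridka_mb_s:.0f} MB/s")
```

Under these assumptions the peak rate towards GridKa used about 80% of the raw link capacity.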


FZU Prague

[Figure: Nr. of ATLAS jobs submitted to Golias per month, Jan to Nov; y axis: nr. of jobs, 0 to 10000]

[Figure: CPU equivalent usage (average number of CPUs used continuously) per month, Jan to Nov; y axis: # CPU equivalent, 0 to 100]

Successful participation in Atlas DDM tests!


Conclusions and further remarks

● Successful participation in SC4 and experiments' exercises.

● Still problems with the stability of the storage system.
  → Recent upgrade to dCache 1.7. Improvement?

● Site availability still below target → complex issue

● Massive upgrade of GridKa CPU and storage in 2007
  → LHC fraction of total resources > 50% in 2007

● Additional 10 Gb/s (backup) links to other Tier-1 sites.

● Atlas and CMS communities around GridKa well organized. (Alice/LHCb have 1/0 Tier-2s so far.)


Thanks to the contributors:

Thomas Kress, Günter Quast (German CMS T2 Federation)

Kilian Schwarz (GSI Darmstadt, Alice)

Jiri Chudoba (Prague, Atlas)

Andrzej Olszewski (Krakow, Polish federated Tier-2 sites)

John Kennedy, Günter Duckeck (Munich, Atlas)

...