
LT2

London Tier2 Status

Olivier van der Aa

LT2 Team

M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, W. Hay, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh

Jan 27th June 2006


Outline

• LT2 Usage
• LT2 Sites updates
• LT2 SC4 activity
• Conclusion


Number of Running Jobs

[Plots: number of running jobs, January and February]


Number of running jobs

[Plots: number of running jobs, March and April]


Number of running jobs

[Plot: number of running jobs, May]

• LHCB increased its usage of the infrastructure last month.
• This stressed the system and caused very slow MDS responses.


Usage and efficiency per VO [2006-01-01, 2006-04-30]

WallTime consumption

• ATLAS, LHCB, BIOMED and CMS are the top consumers.

Efficiency: fraction of the total time that results in a successful state.

• Efficiency in decreasing order: BIOMED, ATLAS, LHCB, CMS.
• The efficiency pattern is not yet understood. Why is BIOMED more efficient (i.e. why does it cause fewer middleware failures)?

[Plots: WallTime consumption and efficiency per VO; labelled VOs: BIOMED, ATLAS, LHCB, CMS]
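The efficiency definition on this slide can be made concrete with a small Python sketch. It assumes accounting records of the form (VO, wall time, succeeded flag); the record layout, the function name and the sample numbers are illustrative assumptions, not the accounting data behind the plots.

```python
from collections import defaultdict

def efficiency_per_vo(records):
    """Return {vo: successful wall time / total wall time}.

    `records` is an iterable of (vo, wall_time_seconds, succeeded) tuples.
    This record layout is a hypothetical one chosen for the example.
    """
    total = defaultdict(float)
    good = defaultdict(float)
    for vo, wall_time, succeeded in records:
        total[vo] += wall_time
        if succeeded:
            good[vo] += wall_time
    # VOs with zero total wall time are skipped to avoid dividing by zero.
    return {vo: good[vo] / total[vo] for vo in total if total[vo] > 0}

# Made-up sample records, for illustration only.
sample = [
    ("biomed", 3600, True),
    ("biomed", 400, False),
    ("cms", 1000, True),
    ("cms", 1000, False),
]
print(efficiency_per_vo(sample))
```

Weighting by wall time rather than by job count means that one long failed job lowers the efficiency more than many short ones, which matches the "fraction of total time" wording.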


Usage and efficiency per CE [2006-01-01, 2006-04-30]

WallTime

• QMUL provides 55% of the total WallTime.
• It is delicate to meet the Service Level Agreement of 95% availability for the LT2 with 1 FTE at QMUL.

Efficiency

• UCL-CENTRAL is the most efficient in job success rate.
• This can be explained by the fact that it mainly attracts BIOMED jobs.

[Plots: WallTime and efficiency per CE; labelled CEs: ucl-central, Brunel, QMUL, RHUL, IC-HEP, UCL-HEP, IC-LESC]
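The concern about the 95% target can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that the LT2 availability is the WallTime-weighted average of the site availabilities; the definition used in the Service Level Agreement is not stated on the slide. Under that assumption, and with every other site at 100%, it computes the minimum availability that a site with a 55% share would need.

```python
def min_availability(share, target, others=1.0):
    """Minimum availability needed by a site holding WallTime fraction
    `share` for the weighted average to reach `target`, when the
    remaining sites run at availability `others`.

    Assumes availability is a WallTime-weighted average (an assumption
    made for this example, not the SLA's stated definition).
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return (target - (1 - share) * others) / share

# The 55% share and the 95% target are taken from the slide.
print(f"{min_availability(0.55, 0.95):.3f}")
```

Under these assumptions the largest site alone would have to stay above roughly 91% availability, which shows how strongly a single site weighs on the overall target.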


CE / VO view

• In London we support 18 VOs (sixt has not been used).
• The plots on the right show the relative VO usage for each CE.
• The size of each box is proportional to the total wall clock time.

[Plots: relative VO usage per CE; labelled CEs: ucl-central, Brunel, QMUL, RHUL, IC-HEP, UCL-HEP, IC-LESC]


GridLoad

Tool to monitor the sites:

– Updates every 5 minutes.
– Uses the RTM data and stores it in rrd files.
  • Shows the number of jobs in any state.
  • VO view: stacks the jobs by VO.
  • CE view: stacks the jobs by CE.

https://gfe03.hep.ph.ic.ac.uk:4175/cgi-bin/load

Still a prototype. Will add:

• View by GOC and ROC.
• Error checking.
• Usage (running CPU / total CPU).
• Improved look and feel.

Could interface with NAGIOS to raise alarms (e.g. on a high abort rate).
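The VO and CE views amount to counting jobs per (group, state) at each update. A minimal Python sketch of that counting step follows; the job record layout and the function name are assumptions made for the example and do not describe the actual RTM data format or the GridLoad code.

```python
from collections import Counter

def stack_jobs(jobs, key):
    """Count jobs per (group, state), where group is the "vo" or "ce" field.

    `jobs` is an iterable of dicts with "vo", "ce" and "state" keys; this
    layout is a hypothetical stand-in for the RTM data GridLoad reads.
    """
    if key not in ("vo", "ce"):
        raise ValueError("key must be 'vo' or 'ce'")
    return Counter((job[key], job["state"]) for job in jobs)

# Made-up snapshot, as if taken at one 5-minute update.
snapshot = [
    {"vo": "lhcb", "ce": "QMUL", "state": "Running"},
    {"vo": "lhcb", "ce": "RHUL", "state": "Running"},
    {"vo": "atlas", "ce": "QMUL", "state": "Scheduled"},
]
by_vo = stack_jobs(snapshot, "vo")
by_ce = stack_jobs(snapshot, "ce")
print(by_vo[("lhcb", "Running")], by_ce[("QMUL", "Running")])
```

Each counter would then be written out as one time-series sample per (group, state) pair, giving the stacked plots their layers.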


GridLoad (cont)

The GridLoad plots can be useful to spot problems.

Example: a high abort rate was observed at one site for LHCB jobs.

• It helped us to be proactive for the VO: we could spot the problem before receiving a ticket.

[Plots: number of aborted jobs and number of running jobs]
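Spotting a high abort rate could be automated along the following lines. This is a sketch under stated assumptions: the input layout, the threshold and the minimum job count are hypothetical and are not values used by GridLoad or NAGIOS.

```python
def abort_rate(aborted, finished):
    """Fraction of finished jobs that aborted in a time window."""
    return aborted / finished if finished else 0.0

def sites_to_flag(counts, threshold=0.5, min_jobs=10):
    """Return the (site, vo) pairs whose abort rate exceeds `threshold`.

    `counts` maps (site, vo) -> (aborted, finished) for one time window.
    The threshold and the minimum job count are illustrative defaults,
    not values used by GridLoad.
    """
    return sorted(
        key for key, (aborted, finished) in counts.items()
        if finished >= min_jobs and abort_rate(aborted, finished) > threshold
    )

# Made-up counts for one time window.
window = {
    ("site-A", "lhcb"): (45, 50),
    ("site-B", "lhcb"): (2, 60),
    ("site-A", "atlas"): (1, 5),
}
print(sites_to_flag(window))
```

The minimum job count keeps a site with only a handful of jobs from raising an alarm on one or two failures.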


LT2 Usage: Conclusions

• We now have an additional tool to monitor the LT2 CPU activity in real time.
• The overall usage is increasing.
• We need to understand the efficiency patterns: what causes the differences between the VOs?
• We need similar real-time monitoring tools for the storage.

[Plot: usage, Jan - May]




Brunel site update

– New cluster provided by Streamline Computing:
  • Supermicro dual-processor, dual-core AMD Opteron nodes
  • 40 x 1.8 GHz, 4 GB memory, 80 GB disk
  • Head node: 2 GHz, 8 GB memory, 320 GB disk
  • Total: 164 cores
  • Is in the process of being configured
– Gb connection?
  • 1 Gb WAN at Brunel in 65 days from now.
  • They are currently buying appropriate switches and related hardware.
  • There will be a throttling router that limits the LCG traffic when the university demand is high. When the university demand is low, the LCG will get a higher allocation.
  • The Brunel site is expected to have a 10 times faster connection (200 Mb) by September.
– SRM:
  • Best rate was 59 Mb/s.
  • Will remove any nfs-mounted filesystem. No real showstopper there.
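The throttling behaviour described for the router can be pictured with a toy model. The policy below (university traffic is served first, LCG gets the remainder but never less than a guaranteed floor) is an assumption made for illustration; the slide does not give the router's real configuration, and all numbers are made up.

```python
def lcg_allocation(link_mb, university_demand_mb, lcg_floor_mb=0.0):
    """Bandwidth left for LCG traffic after the university demand is served.

    All values are in Mb/s. The policy (university traffic first, LCG gets
    the remainder but never less than `lcg_floor_mb`) is an assumed one.
    """
    remainder = max(link_mb - university_demand_mb, 0.0)
    return min(link_mb, max(remainder, lcg_floor_mb))

print(lcg_allocation(1000, 900, lcg_floor_mb=50))  # high university demand
print(lcg_allocation(1000, 100, lcg_floor_mb=50))  # low university demand
```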


IC Sites update

• HEP:
  – Old IBM cluster (60 CPUs) running smoothly, almost full of jobs for the last two months.
  – Will build a new cluster with off-the-shelf boxes:
    • 40 dual-core AMD
    • 40 TB of disk (non-RAID)
    • Will use SGE as the job manager.


IC Sites update

• Investigated FTS performance issues with dCache<->dCache transfers.
• FTS using urlcopy causes high iowait.

[Plots: FTS/urlcopy (130 Mb/s) and FTS/srmcp (179 Mb/s); axis labels: iowait, Block, Time]
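One way to quantify the iowait seen during a transfer is to sample the aggregate "cpu" line of /proc/stat twice and compare the counters. The Python sketch below assumes a Linux kernel whose "cpu" line includes the iowait field; the function names and the sample values are illustrative and this is not the method used to produce the plots.

```python
def cpu_times(stat_line):
    """Parse the aggregate "cpu" line of /proc/stat into a list of ints."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]

def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples.

    The iowait counter is the fifth value on the "cpu" line
    (user, nice, system, idle, iowait, ...). Only the first eight
    counters are summed, because the later guest fields are already
    included in user and nice.
    """
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))][:8]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0

# Two made-up samples; on a Linux host they would be read from /proc/stat
# a few seconds apart while a transfer is running.
first = "cpu 100 0 50 800 50 0 0 0"
second = "cpu 150 0 80 900 220 0 0 0"
print(iowait_fraction(first, second))
```

Running the same measurement during a urlcopy transfer and during an srmcp transfer would give two directly comparable numbers.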


IC Sites update

• LESC:
  – 33% of 400 1.8 GHz Opterons.
  – Running RHEL3 64 bit.
  – SGE job manager.
  – DPM storage with a small disk partition.
  – Currently porting DPM to Solaris to avoid nfs mounting the file systems used for the SRM.
    • See the work in progress at http://www.gridpp.ac.uk/wiki/DPM-on-Solaris
  – Difficulties: improving usage. Several VOs are not comfortable with the 64-bit architecture even though the 32-bit libraries are there.
• ICT:
  – Deploying a new 200 Xeon cluster running PBS for College use.
  – Will have a 30% share of that cluster for LCG.
  – 30 TB of RAID storage that will be shared.
  – Difficulties: they want to use GT 3.2.1.


QMUL site update

• Lots of activity with the commissioning of their new cluster provided by Viglen:
  – 280 dual-core Opterons (270) 2 GHz
  – All nodes have 2 x 250 GB disks = 140 TB!
    • What filesystem to use with that environment? Will consider Lustre.
  – All nodes are 1 Gb connected, with 10 Gb inter-switch links.
  – Now online with ~1600 job slots
  – Problems:
    • Site stability under high job load: nfs-mounted software area not coping.
    • RAID boxes giving hardware errors. Seemed to be due to loose SATA connectors. The disks were tested OK with SMART. Not yet clear what it is due to.
    • Reliability of DPM on Poolfs


UCL sites update

• CCC
  – Have successfully moved to the SGE job manager to service 364 slots (91 dual CPU + hyper-threading).
  – Improved their SRM performance by using a direct Fibre Channel link to the RAID array from the head node.
    • Write bandwidth went from 90 Mb/s to 238 Mb/s.
  – Will have 40 additional nodes (160 slots) soon.
  – Moving their cluster from one building to another will start on July 3 and take 1 week.
• HEP
  – New Gb switches have been bought. Need to cable them to the head node.
  – Will have 1 or 2 boxes with mirrored 120 GB disks with a DPM pool installed on them to support non-ATLAS VOs.
  – ATLAS will still be using nfs-mounted storage.
  – Problem: performance for ATLAS storage


RHUL site update

• Cluster running smoothly
  – 142 job slots almost full for two months. All VOs targeting that site.
  – No more nfs-mounted disks with write access from DPM.
  – Broad VO usage
• Update on the 1 Gb connection:
  – Purchase order was signed yesterday.
  – Discussions are now starting as to when it will be installed.
• Problems:
  – Need to be able to drain a pool to remove the read-only nfs-mounted filesystem


Transfers throughput status

Site         Inbound (Mb/s)  Outbound (Mb/s)  Update
Brunel       57              59               Gb connection signed (200 Mb by September)
IC-HEP       80              190              FTS performance problem not yet understood
IC-LeSC      156             95               DPM being built for Solaris
QMUL         118             172              Poolfs needs to be recompiled with round robin feature
RHUL         59              58               Gb connection signed
UCL-HEP      71              63               Gb switches there
UCL-CENTRAL  90              309              Move to direct Fibre Channel connection. Rate is now 238 Mb/s




SC4 Activity

• CMS: target is CSA06 (Computing, Software and Analysis Challenge)
  – CSA06 objective: "A 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS"
    • Will test the new CMS reconstruction framework for large production.
    • Need 20 MB/s bandwidth to T2 storage.
    • Will start on 15 Sept.
    • More information can be found at: https://twiki.cern.ch/twiki/bin/view/CMS/CSA06
  – IC-HEP and IC-LESC are preparing for CSA06.
  – Strategy is to help other sites once IC is OK.
    • New PhEDEx installed that uses FTS
      – Need to solve the FTS performance issues
    • ProdAgent configuration prepared for IC-LESC and IC-HEP
  – Brunel involved in PhEDEx.
• ATLAS:
  – No commitment yet.
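The CSA06 requirement is quoted in MB/s while the measured site rates are quoted in Mb/s. The sketch below shows the conversion, assuming that "MB" means megabytes and "Mb" means megabits; the slides do not state this explicitly, and the function names and the two sample rates are illustrative.

```python
def mbytes_to_mbits(rate_mbyte_per_s):
    """Convert a rate in megabytes/s to megabits/s (8 bits per byte)."""
    return rate_mbyte_per_s * 8

def meets_target(inbound_mbit_per_s, target_mbyte_per_s=20):
    """True if an inbound rate in Mb/s reaches the target given in MB/s."""
    return inbound_mbit_per_s >= mbytes_to_mbits(target_mbyte_per_s)

print(mbytes_to_mbits(20))
# Two sample inbound rates in Mb/s, one above and one below the target.
print(meets_target(238), meets_target(59))
```

Under this reading of the units, 20 MB/s corresponds to 160 Mb/s of sustained inbound rate to the T2 storage.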


Conclusions

• Real-time monitoring of the LT2 job states is in place.
  – The usage is increasing.
• Site evolution
  – SGE deployed at UCL-CENTRAL
  – QMUL more than doubled its number of job slots
  – Brunel: Gb connection on the right track; commissioning a new cluster (160 cores)
  – IC: spotted FTS performance issues; porting of DPM to Solaris ongoing; will commission a new cluster at HEP
  – RHUL: very stable site; Gb connection signed
• General storage evolution: in the process of removing nfs mounts.
• SC4: involvement in the CMS SC4 activity is ongoing. Need a volunteer for the ATLAS SC4.


Thanks to all of the Team

M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh
