
Data Transfer Optimization for the DFCTI Grid Site

Mihai Ciubancan, Serban Constantinescu and Mihnea Dulea Department of Computational Physics and Information Technologies (DFCTI)

‘Horia Hulubei’ National Institute for R&D in Physics and Nuclear Engineering (IFIN-HH)

Magurele, Romania

Abstract — The paper first reviews the main achievements of the Romanian LCG Federation (RO-LCG) since its founding, emphasizing its recent performance within the WLCG collaboration and also discussing the current limitations of the network infrastructure. The presentation then focuses on the technical organization of DFCTI's main grid site, RO-07-NIPNE, and on the requirements it must satisfy in order to ensure the quality and reliability necessary for WAN communication with the Tier-1 centers and for efficient data handling within the LAN. The design of a scalable Storage Element optimized for ATLAS analysis is proposed. The maximization of the disk-write throughput on the worker nodes by means of congestion control algorithms is also investigated.

Keywords—LHC Computing Grid; WLCG; RO-LCG; data transfer

I. INTRODUCTION

The worldwide grid activity recorded during the last ten years on the EGEE/EGI accounting portal [1] is strongly dominated by the virtual organizations (VOs) related to the LHC experiments conducted at CERN, which are supported by the LHC Computing Grid (LCG) [2]. The more than 80% share that High Energy Physics (HEP) holds in the overall grid production is due, besides the consistent research programme at CERN, to the intensive use of the many applications developed within the HEP community, which fit the distributed computing paradigm perfectly, and to the permanent improvement and optimization of the LCG infrastructure. The Worldwide LCG (WLCG) collaboration has provided a global computing environment that has continuously evolved to adapt to the LHC experiments' requirements and to the steady progress of networking, computing hardware, and software.

Recent advances in wideband technologies allowed the LCG topology to change from the hierarchical (MONARC) model of Tier-0/1/2 resource centers [3] to a mesh model [4], in which each Tier-1 center can exchange data with any other Tier-1 center, and a Tier-2 center can interact with many Tier-1 and Tier-2 centers. This major initiative is expected to significantly improve the performance and reliability of data transfers between the centers.

No less important for the successful execution of the grid workflows is the optimization of data handling within each resource center, including the Tier-2s, which together host half of the overall processing capacity of the WLCG. As there is no generally valid recipe, each center implements specific measures adapted to its local architecture and technical conditions.

The Romanian contribution to the WLCG collaboration consists of resources and services jointly provided by seven grid sites that form the RO-LCG Federation, an organizational entity coordinated by DFCTI/IFIN-HH and assimilated to a Tier-2 center. This article first presents a short overview of the technical setup of RO-LCG and of the results obtained during its eight years of activity. Then, the solutions adopted for improving the efficiency of job execution through the optimization of data transfers at DFCTI's grid site are reviewed.

II. RO-LCG SETUP AND RESULTS

A. Requirements and Setup

RO-LCG provides disk storage and processing power for the Monte Carlo simulations and data analysis required by three of the main LHC experiments/VOs: ALICE, ATLAS, and LHCb [5]. The WLCG Memorandum of Understanding (MoU) [6] specifies the service-level agreement (a minimum annual availability of 95%) and at least 10 Gb/s connectivity to the GÉANT network. Moreover, in order to meet the ever increasing need for storage and computing capacity of the LHC experiments, annual upgrades of the resources are pledged.

The LHC VOs are supported by grid sites owned by three R&D institutions and one university, which are listed in Table I.

TABLE I. RO-LCG CONSORTIUM'S MEMBERS AND GRID SITES

Currently, the seven grid sites listed in Table I jointly provide to the WLCG collaboration more than 6000 CPU cores and a disk storage capacity of 2.26 PB. These resource centers are connected through network links of at least 10 Gb/s to the 100 Gb/s backbone of the RoEduNet NREN, which also ensures the required 10 Gb/s link to GÉANT. While the centers from Cluj-Napoca and Iasi have exclusive access to their 10 Gb/s links to RoEduNet, the five centers owned by IFIN-HH and ISS, which are located on the Magurele Platform, must share a common 10 Gb/s connection to RoEduNet's NOC.

The basic software configuration of the centers must fulfill LCG's infrastructure demands and keep pace with its frequent upgrades. The main requirements implemented at the sites are: a) operating system - Scientific Linux (SL) release [7], version 6.4; b) middleware - European Middleware Initiative (EMI) release 3 [8]; c) computing elements (CE) - running the CREAM job management service [9], with PBS/TORQUE distributed resource manager (queuing system) and Maui job scheduler; d) storage elements (SE) - with Disk Pool Manager (DPM) for disk storage management [10]. Two RO-LCG sites (NIHAM and RO-13-ISS) provide resources to the ALICE collaboration by also using AliEn middleware [11].

B. Performance and Current Limitations

The steady increase of the LHC production during the First Run (2010-2012), the annual upgrades of the computing and storage resources, and the continuous optimization of the network infrastructure have led to higher levels of grid production in the RO-LCG federation.

Since its establishment in 2006, RO-LCG has processed more than 39.7 million LHC jobs, summing to more than 99 million CPU kSI2k-hours [12] (i.e., 396 million HEP-SPEC06-hours [13]). The representation of the annual grid production as a function of time (Fig. 1) clearly highlights the start of the LHC production in 2010.
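The two totals are mutually consistent under the commonly used conversion of 4 HEP-SPEC06 per kSI2k (the factor is assumed here, as it is not quoted in the text):

$99 \times 10^{6}\ \text{kSI2k-hours} \times 4\ \tfrac{\text{HEP-SPEC06}}{\text{kSI2k}} = 396 \times 10^{6}\ \text{HEP-SPEC06-hours}$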

Fig. 1. Total CPU use (in kSI2k-hours) and number of jobs recorded yearly by EGI [1] for RO-LCG during September-August intervals of time

Also, beginning with the Sept. 2012 - Aug. 2013 interval, a significant increase of the ratio between the CPU hours and the number of jobs becomes visible. This is a consequence of the processors being used at full capacity by jobs that consume long CPU times (such as, e.g., LHCb reprocessing), which thereby blocked the access to grid resources of the short but very frequent pilot jobs (which count significantly in the job statistics).

The above-mentioned results place RO-LCG in a good position at both the national and international levels. Thus, during Jan. 2006 - Sept. 2013, RO-LCG ran 99% of the grid jobs and delivered 98% of the CPU time realized in Romania. Also, if the sum of the three VOs (ALICE, ATLAS, and LHCb) is considered then, according to Fig. 2, in 2013 RO-LCG ranks 8th among the 34 national Tier-2s monitored on EGI's accounting portal [1] with respect to the cumulated number of jobs, and 11th with respect to the CPU time, with shares of 2.76% and 2.18% of the totals, respectively.

Preserving this leading position strongly depends on the future upgrade of the computing and network capacities.

Fig. 2. National shares of total Tier-2 activity recorded in Jan. - Sept. 2013 on ALICE + ATLAS + LHCb VOs (only top 15 represented): CPU time (above) and number of jobs (below)

Indeed, the 10 Gb/s bandwidth of the external link is already becoming critical for the five RO-LCG sites located on the Magurele Platform, which provide most of the national grid traffic. As Fig. 3 demonstrates, during massive grid data transfers the traffic reaches values in the 9-10 Gb/s range, i.e. the maximum transfer capacity of the link.

Fig. 3. Peaks with values close to the maximum bandwidth, recorded by AARNIEC for incoming (above) and outgoing (below) traffic on the Ethernet interface that connects the grid sites located at Magurele to RoEduNet.

These results show that urgent measures must be taken in order to preclude the congestion of the external link.


III. RO-07-NIPNE

A. Site Topology

DFCTI directly contributes to the WLCG collaboration with the grid site RO-07-NIPNE [14], which provides a processing power of 1500 CPU cores and a storage capacity of 525 TB. The site is connected to RoEduNet through a Cisco ASR 9006 router that manages one 10 Gb/s link for all five sites located at Magurele.

For reasons of available space and power supply, the grid resources are distributed in two computer rooms (denoted by 'CD' and 'CC' in Fig. 4 below); while the first room hosts the data disk servers (DDs), the second one is dedicated to the worker nodes (WNs). The two parts of the cluster are connected through multiple 10 Gb/s links.

Fig. 4. Topology of the RO-07-NIPNE cluster. GW = grid gateway, CE = computing elements, SE = storage element. The thick lines denote 10 Gb/s links. In blue: stacked switches.

Job management is currently ensured by three CREAM-CE computing elements: one for ALICE production (tbit01), one dedicated to ATLAS analysis jobs (tbit07), and one for LHCb and ATLAS production (tbit03). The storage element is a DPM/SRM head node that manages various DPM DDs. Each worker node has 8-32 cores, 2 GB RAM and 4 GB virtual memory per core, a minimum of 400 GB disk scratch space (800 GB on the 32-core nodes), and 1-2 Gb/s network links. The hardware is heterogeneous, originating from various manufacturers (Fujitsu, IBM, Dell, Supermicro, Tyan, etc.).

B. Data Transfer Requirements

The quality and reliability of the file transfers through WANs and inside the centers' LANs are decisive for the efficiency of the Tier-2 sites. As previously discussed in [15], the external link should provide error-free support for receiving large input files for on-site data analysis, and for sending result files to the Tier-1s / users and job logs to the users. The success of data analysis also relies on the quality of the data handling within the site.

The file transfers within the ATLAS analysis at RO-07-NIPNE present features that are common to all the Tier-2 sites:

• The Tier-1 centers distribute large (~10 GB) Analysis Object Data (AOD data sets) to the storage element (SE). The transfer speed depends on the end-to-end bandwidth of the Tier1-Tier2 connection; therefore it is sensitive to any degradation of network parameters.

• The users may require the reading (transfer) of many AODs from the SE's disk data servers (DDs) to the worker nodes (WNs), from which the physical events of interest are selected. The speed of copying onto the WNs depends on the available bandwidth between the SE's DDs and the WNs, which must be high enough for each file transfer to complete before reaching the timeout limit at which the job is aborted (a minimal estimate of this constraint is sketched after this list).

• The analysis results (~10 GB files) are stored on SE's DDs and can be later transmitted to the Workload Management System (WMS).

• An analysis job may also require the local use, on the WN, of large simulation files stored on the SE's DDs.
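To illustrate the timeout constraint mentioned in the second item above, the following sketch estimates whether a given per-stream bandwidth allows an AOD copy to finish in time. The file size is the ~10 GB quoted in the text, while the timeout and bandwidth values are illustrative assumptions rather than site parameters.

```python
# Minimal sketch: check whether a DD-to-WN copy can beat the job's
# file-transfer timeout. Timeout and bandwidth values are illustrative.

def transfer_time_s(file_size_gb: float, bandwidth_mbps: float) -> float:
    """Time (s) to copy file_size_gb gigabytes at bandwidth_mbps megabits/s."""
    size_bits = file_size_gb * 8e9            # GB -> bits (decimal units)
    return size_bits / (bandwidth_mbps * 1e6)

def fits_timeout(file_size_gb: float, bandwidth_mbps: float, timeout_s: float) -> bool:
    return transfer_time_s(file_size_gb, bandwidth_mbps) <= timeout_s

if __name__ == "__main__":
    aod_gb = 10.0       # ~10 GB AOD file (figure taken from the text)
    timeout_s = 3600.0  # hypothetical 1-hour transfer timeout
    for mbps in (20, 100, 1000):  # 20 Mb/s corresponds to the 2.5 MB/s minimum per-file read speed
        minutes = transfer_time_s(aod_gb, mbps) / 60
        print(f"{mbps:>5} Mb/s -> {minutes:6.1f} min, within timeout: {fits_timeout(aod_gb, mbps, timeout_s)}")
```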

The efficiency of the above operations depends on the performance and tuning of the local/external network and hardware, on the stress resistance of the system, and on the coordination between the Tier-1s and the associated Tier-2 centers.

C. Data Transfer Optimization

Measures were first taken to investigate the possibilities of increasing the on-site capacity for ATLAS analysis by improving the reliability of the external connection to the French Cloud. These led to the adoption of new networking initiatives.

First, the external link to RoEduNet was replaced by a more reliable one.

Second, the DFCTI site and RO-02-NIPNE have been integrated into the LHC Open Network Environment (LHCONE) [16], which provides a private network for the Tier-1/2/3 centers that ensures better access to datasets for the HEP community. To this end, a separate VLAN was defined and dynamic routing (through BGP) was enabled by creating a point of presence (PoP) of RoEduNet at DFCTI. At present, all the ATLAS sites of RO-LCG are using LHCONE, and the other centers from Magurele will benefit from it when the experiments decide so.

Third, tests were started for the qualification of the site as a Directly connected Tier-2 (T2D) site [17] (see Fig. 5). T2Ds are reliable sites that can exchange data with different clouds. A site qualifies as T2D if the data transfer rates with at least 9 Tier-1s exceed 5 MB/s over well defined periods of time. Tests are currently being performed to find the weak points of the Magurele - RoEduNet NOC - GÉANT link, with three pairs of perfSONAR servers (measuring bandwidth and latency) located at DFCTI, at UPB, and in RoEduNet's NOC.

Fig. 5. Results of file transfer tests between RO-07-NIPNE and up to 11 Tier-1 grid centers.
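As a worked illustration of the T2D criterion, the short sketch below checks whether a set of measured average transfer rates qualifies the site; the Tier-1 names and rate values are hypothetical placeholders, not the measurements of Fig. 5.

```python
# Minimal sketch of the T2D qualification rule: more than 5 MB/s
# towards at least 9 Tier-1 centers. All rates below are hypothetical.

T2D_MIN_RATE_MBPS = 5.0   # per-Tier-1 threshold, MB/s
T2D_MIN_TIER1S = 9        # minimum number of qualifying Tier-1s

measured_rates_mbps = {   # hypothetical average rates, MB/s
    "T1-A": 7.2, "T1-B": 6.1, "T1-C": 4.8, "T1-D": 9.5, "T1-E": 5.6,
    "T1-F": 8.0, "T1-G": 5.2, "T1-H": 6.7, "T1-I": 5.9, "T1-J": 3.1,
    "T1-K": 6.3,
}

qualifying = [t1 for t1, rate in measured_rates_mbps.items() if rate > T2D_MIN_RATE_MBPS]
is_t2d = len(qualifying) >= T2D_MIN_TIER1S

print(f"Qualifying Tier-1s: {len(qualifying)}/{len(measured_rates_mbps)} -> T2D: {is_t2d}")
```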

Fourth, an upgrade of the external bandwidth to 40-100 Gb/s is planned to be performed with the help of AARNIEC.

Studies have been performed at the LAN level in order to optimize the Storage Element (SE) solution. The SE should: a) provide the storage capacity required by the users; b) allow the files stored on the SE's DDs to be read by the WNs at the minimum per-file read speed stipulated in the experiment's computing model (2.5 MB/s); c) be scalable with respect to upgrades of the storage capacity. In the case of DFCTI the problem is further complicated by the use of 1 Gb/s links and by the distribution of the hardware in two separate rooms (Fig. 4).

To solve the latter problem, the switch stacks (in blue in Fig. 4) and the bandwidth of the connection between the rooms can be upgraded as needed.

For reasons of scalability, it is recommended to use multiple DDs, each with a storage capacity no larger than what can still be read from the WNs at 2.5 MB/s per file. The SE is thus connected to multiple DDs, each of which has the network capability to be read from the WNs at this rate; a balance must therefore exist between the storage capacity and the network throughput of each DD. A rather simple computation then shows that, for each TB of DD storage capacity, 200 Mb/s of network throughput must be provided.
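Taking the 200 Mb/s per TB rule at face value, the sketch below computes the network throughput a DD of a given capacity should expose and how many 1 Gb/s or 10 Gb/s links this implies; the capacities used are illustrative, not the actual DFCTI hardware.

```python
# Minimal sketch of the DD sizing rule stated in the text:
# each TB of DD storage capacity requires 200 Mb/s of network throughput.
import math

MBPS_PER_TB = 200.0  # rule of thumb from the text

def required_throughput_gbps(capacity_tb: float) -> float:
    """Network throughput (Gb/s) that a data disk server of capacity_tb should provide."""
    return capacity_tb * MBPS_PER_TB / 1000.0

def links_needed(capacity_tb: float, link_gbps: float) -> int:
    """Number of links of link_gbps needed to satisfy the rule (bonding overhead ignored)."""
    return math.ceil(required_throughput_gbps(capacity_tb) / link_gbps)

if __name__ == "__main__":
    for capacity_tb in (10, 25, 50):  # illustrative DD capacities
        gbps = required_throughput_gbps(capacity_tb)
        print(f"{capacity_tb:>3} TB -> {gbps:4.1f} Gb/s "
              f"({links_needed(capacity_tb, 1.0)} x 1 Gb/s or {links_needed(capacity_tb, 10.0)} x 10 Gb/s links)")
```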

This result was used for computing the optimal configurations of the new storage equipment to be procured for DFCTI, and helped to improve, for example, the performance of an SE on which a transfer speed of more than 8 Gb/s could not previously be reached (Fig. 6).

Fig. 6. Reaching an overall traffic of more than 11 Gb/s between the SE's DDs and the WNs during the concurrent running of 500 analysis jobs.

At the other end of the connection, the disk-write throughput on the WNs must be maximized. This is not trivial for servers with many multi-core processors.

As the transfer speed at the level of the disk controller cannot be increased (except by choosing, e.g., RAID 0, where possible), the solution is to tune the transfer speed between the Ethernet adapter and memory. Tests performed at DFCTI on various servers have shown that in this case the TCP congestion control algorithm matters. GKrellM [18] was used to monitor the RAID partition(s) that participate in the transfer and the Ethernet adapter (Fig. 7).

Fig. 7. Testing the "Vegas" congestion control algorithm. The disk activity stops when the Ethernet activity stops, because the buffer is empty.

It was shown that replacing the "BIC" (Binary Increase Congestion control) algorithm with "Vegas" can significantly improve the performance of file transfers.
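On Linux the congestion control algorithm is usually selected system-wide (the net.ipv4.tcp_congestion_control sysctl); the sketch below shows the per-socket alternative via the TCP_CONGESTION socket option, assuming the tcp_vegas kernel module is loaded. The host name and port are placeholders, and this only illustrates the mechanism, not the exact procedure used at DFCTI.

```python
# Minimal sketch: select the TCP congestion control algorithm per socket on Linux.
# Assumes the "tcp_vegas" module is available (e.g. loaded with modprobe tcp_vegas).
import socket

def connect_with_cc(host: str, port: int, algorithm: bytes = b"vegas") -> socket.socket:
    """Open a TCP connection that uses the given congestion control algorithm."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_CONGESTION is Linux-specific; raises OSError if the algorithm is not available.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algorithm)
    sock.connect((host, port))
    return sock

if __name__ == "__main__":
    s = connect_with_cc("dd01.example.local", 2811)  # hypothetical data disk server and port
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # e.g. b'vegas\x00...'
    s.close()
```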

IV. CONCLUSIONS

The promising results recently obtained regarding the optimization of the local grid infrastructure create the premises for the successful participation of RO-LCG in the computational support of the next LHC running period.

ACKNOWLEDGMENT

Collaboration within RO-LCG with C. Schiaua and G. Stoicea (IFIN-HH), F. Farcas (ITIM-CJ), C. Pinzaru (UAIC), I. Stan (ISS), A. Herisanu ('Politehnica' Univ. of Bucharest), and with AARNIEC's staff is acknowledged. This work was partly funded by the Ministry of National Education / Institute of Atomic Physics under contracts no. 8/2012 PNII-Capacities-M3-CERN and C1-06/2010-IFA, and by the Hulubei-Meshcheryakov collaboration, JINR Order 82/06.02.2012.

REFERENCES

[1] EGI accounting portal, http://accounting.egi.eu/egi.php
[2] Large Hadron Collider Computing Grid, http://wlcg.web.cern.ch
[3] http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=92416
[4] I. Bird et al., "Update of the Computing Models of the WLCG and the LHC Experiments", draft, version 1.7, Sept. 2013.
[5] http://public.web.cern.ch/public/en/lhc/LHCExperiments-en.html
[6] WLCG MoU, https://espace.cern.ch/WLCG-document-repository/MoU/admin/archive/WLCGMoU.pdf
[7] Scientific Linux, https://www.scientificlinux.org/, http://linux.web.cern.ch/linux
[8] European Middleware Initiative, http://www.eu-emi.eu/
[9] CREAM service, https://wiki.italiangrid.it/CREAM
[10] Disk Pool Manager, https://www.gridpp.ac.uk/wiki/Disk_Pool_Manager
[11] AliEn middleware, http://alien2.cern.ch/
[12] SPEC CPU2000 v1.3, http://www.spec.org/cpu2000/
[13] HEP-SPEC06 benchmark, http://w3.hepix.org/benchmarks/doku.php/
[14] http://gstat2.grid.sinica.edu.tw/gstat/site/RO-07-NIPNE/
[15] M. Dulea, S. Constantinescu, and M. Ciubancan, "Support for multiple virtual organizations in the Romanian LCG Federation", in Proceedings of the 5th RO-LCG Conference "Grid, Cloud & High Performance Computing in Science", 25-27 Oct. 2012, INCDTIM, Cluj-Napoca, Romania, pp. 59-62, ISBN 978-1-4673-2242-3, IEEE Xplore: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?reload=true&punumber=6522576
[16] LHCONE, http://lhcone.net/
[17] A. F. Casaní et al., Proceedings of Science (ISGC 2012), 011
[18] GKrellM, http://freecode.com/projects/gkrellm