
Scalla/xrootd WAN globalization tools: where we are

Fabrizio Furano (1), Andrew Hanushevsky (2)

(1) Conseil Europeen Recherche Nucl. (CERN), Switzerland
E-mail: [email protected]

(2) SLAC National Accelerator Laboratory, CA (US)
E-mail: [email protected]

Abstract. The Scalla/Xrootd software suite is a set of tools and suggested methods for building scalable, fault-tolerant and high-performance storage systems for POSIX-like data access. One of the most important recent development efforts has been to implement technologies able to deal with the characteristics of Wide Area Networks, and to find solutions that allow data analysis applications to access remote data repositories directly and efficiently. This contribution describes the current status of the features and mechanisms implemented in the Scalla/Xrootd software suite which allow one to create, and efficiently access, 'global' data repositories obtained by aggregating multiple sites through Wide Area Networks. One of these mechanisms is the ability of the clients to exploit high-latency, high-throughput WANs efficiently and to access remote repositories in read/write mode for analysis-like tasks. We also discuss the possibility of making distant data sub-repositories cooperate, the aim being to give a unique view of their content and eventually to allow external systems to coordinate and trigger data movements among them. Experience in using Scalla/Xrootd remote data repositories is also reported.

1. Introduction
This contribution deals with the possibilities offered by the Scalla/Xrootd platform for efficient access to data repositories through both WAN and LAN. Wide area connectivity between high energy physics sites is steadily improving, and there are several attempts to evaluate to what extent these links can be used at their maximum. The approach of the Scalla/Xrootd suite is, in this respect, quite similar to that of the World Wide Web: a client directly accesses (via TCP and a suitable application-level protocol) only the portions of the data it needs, thus avoiding the transfer of data it does not need. In this light, the common architectural choice of pre-transferring all the data locally (in the form of complete, huge data files) does not meet these efficiency requirements and is not considered in this work, although it always remains possible.

In recent years, several development milestones in the Scalla/Xrootd suite have made it possible to create fast and coherent WAN-wide data repositories and to reach a performance level for WAN-wide direct data access which is more than sufficient to analyze data files that are not local to the site where the computation runs. Of course, this is not to say that every computation has to run against remote repositories, but rather that a


typical High Energy Physics computation against a remote repository is possible and can reach performance levels very close to those reachable in a LAN.

This work briefly describes the achievements related to these milestones, and presents the results of a significant performance test performed on a 10Gb wide area network available for such tests. A few use cases and examples of production-level usage of these new possibilities are also outlined.

2. WAN specifics
Even if a WAN may have a very high throughput (even higher than a typical LAN), reaching data access throughputs sufficient for High Energy Physics computing has never been easy. This is mainly because, through WANs, each client/server response comes much later (e.g. 180 ms later). This is a very difficult obstacle for applications which perform millions of interactions, reading a few bytes each: an application issuing, say, one million synchronous reads over a link with 180 ms round-trip time spends at least 10^6 x 0.18 s = 50 hours merely waiting for the network, regardless of the available bandwidth.
It is well known, however, that there are techniques to optimize this kind of data access without losing the advantage of reading only the needed chunks (efficiency). As a rule of thumb, with well-tuned WANs (i.e. able to give a good throughput to a single connection) one needs applications and tools built with WANs in mind, i.e. able to overcome the limitations imposed by their latency; otherwise those limitations are walls impossible to climb.
When speaking about applications, the first class able to work well over WANs is, of course, that of copy-like applications, often known as 'bulk data transfer applications'. There are several examples of those (e.g. gridftp, bbcp, xrdcp, fdt), and they are widely used in the High Energy Physics domain with good results. But, as said, there are use cases which are much more interesting than blindly copying huge data files, because they can give more benefit to the users; last but not least, a reduced need to keep track of too many data movements/replicas, together with a much higher, World Wide Web-like, overall simplicity and system availability.
An example set of requirements for good analysis performance through WAN (neither exhaustive nor complete, but reflecting a typical High Energy Physics use case) is the following:

• Usage of the ROOT framework on the user side.
• The data files are in the form of TTrees.
• The protocol used for data access is the xrootd one.
• The data servers are built around a Scalla/xrootd system, possibly integrated with other systems (e.g. the recent CASTOR releases).
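To make these requirements concrete, here is a minimal sketch of this access mode as a ROOT macro; the host name, file path and tree name are hypothetical placeholders. Only the portions of the file that the application actually touches travel over the network:

   #include <cstdio>
   #include "TFile.h"
   #include "TTree.h"

   void readRemote() {
      // For root:// URLs, TFile::Open dispatches to the xrootd client layer.
      TFile *f = TFile::Open("root://dataserver.example.org//store/data/run123.root");
      if (!f || f->IsZombie()) return;
      // Only the baskets of the branches actually read are transferred.
      TTree *t = (TTree*)f->Get("Events");
      if (t) printf("Remote tree has %lld entries\n", t->GetEntries());
      f->Close();
   }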

3. What can be done, what can be desired
Basically, with an XROOTD-based front-end we can do two things via WAN:

• Access remote data.
• Aggregate remote storage elements into a unique one. This allows building a unique storage pool with sub-clusters in different sites.

For the storage aggregation features there are no practical size limits, since the native xrootd architecture can, in theory, accommodate up to 262K servers without scaling limitations. The basic idea for an application is that it does not need to know the location of a file, just its file name, in a globally coherent name space.

There are several aspects to consider when dealing with this kind of feature, but there is a great benefit as well, mainly in the form of performance and overall robustness, linked also to the fact that no third-party software is needed. In this section we try to highlight the kind of benefit we are aiming for, in the form of two relatively simple use cases.


3.1. Use case 1: the traveling physicist
A physicist is waiting for the results of his analysis jobs on the GRID. There are many jobs, which produce several output files to be saved to a storage element, e.g. at CERN. This physicist's laptop is configured to access those storage elements and draw his histograms with ROOT.
He leaves for a conference in a distant place, and the jobs finish while he is on the plane. Once there, he simply wants to draw the results from his laptop, and save the new modified histograms in the same storage element. Of course there is no time to lose in tweaking things in order to get a remote copy of everything; to avoid confusion, everything must stay where it is.
What can this physicist expect? Can he do it? The answer is: technologically, yes, it is possible.

3.2. Use case 2: efficient data access for batch jobs
This use case reflects the actual production choices of the ALICE analysis on the GRID. Each job reads about 100-150MB of conditions data from a storage element called ALICE::CERN::SE. These conditions data are accessed directly, not copied as files, and the access is very efficient, i.e. only the needed bytes are read: each job reads only what it needs from a much bigger repository. ROOT/AliROOT has a maximum read speed of about 20MB/s, with a 100% utilization of one CPU core.

Moreover, sometimes the data files themselves are accessed from a remote SE, especially when a file is lost or the local SE is down. From the monitoring pages, these mechanisms turned out to be very robust and efficient, even if there clearly is some room for further optimizations. This fits perfectly with the current status of the development of the xrootd WAN data access features.

3.3. What improved recently
Up to now the WAN speedup was possible with ROOT+XrdClient+XROOTD by activating the so-called "multi-stream WAN mode". As already described, it can give a performance increase of up to 100-150x with respect to basic client-server protocols (including the xrootd one with the WAN mode off).
Its only drawback is that it needs to be activated on the client side before the client starts connecting to a server. Deciding when to switch it on is, in general, far from easy. Hence this kind of optimization, even if it needed just a flag, always proved quite difficult to automate.
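For reference, the activation amounted to something like the following at the ROOT prompt, before the first connection is opened; the parameter name follows the XrdClient-era ROOT configuration and should be treated as version-dependent:

   // Sketch: switch on the multi-stream WAN mode for subsequent connections.
   gEnv->SetValue("XNet.ParStreamsPerPhyConn", 15);   // e.g. 15 parallel TCP streams
   TFile *f = TFile::Open("root://remote.example.org//store/file.root");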

The recent developments changed the scenario in a more usable direction:

• The multiple TCP streams-based WAN mode is the most performant for bulk data transfers; hence it is used, in general, only for them.

• The new internal improvements use the efficiency of the newer kernels and TCP stacks. Now, by default, the xrootd client can read/write data via WAN at speeds much higher than the maximum speed ROOT can achieve, even in the worst case of a very distant repository (provided that there is enough bandwidth).

One consequence of the new milestone is that an interactive analysis does not need different parameters with respect to one performed through a LAN. This potentially fulfills the requirements of both the discussed use cases, and is the topic of the next section.

Hence, by deploying:

• The new xrootd client (bundled in ROOT)


• A recent server version (available through xrd-installer, together with the new configuration)
• Client and server machines with TCP parameters suitable for WANs (nearly all the most recent mainstream Linux distributions are already reasonably well configured for this),

we can expect a good improvement over the past, without having to change anything in the applications.
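As an indication of what "TCP parameters suitable for WANs" means in practice, a typical Linux tuning for links with a high bandwidth-delay product looks like the following sketch; the values are illustrative assumptions, not settings taken from the test machines:

   # /etc/sysctl.conf fragment (illustrative values)
   net.ipv4.tcp_window_scaling = 1          # allow TCP windows larger than 64 KB
   net.core.rmem_max = 16777216             # max socket receive buffer (16 MB)
   net.core.wmem_max = 16777216             # max socket send buffer (16 MB)
   net.ipv4.tcp_rmem = 4096 87380 16777216  # min/default/max receive buffer
   net.ipv4.tcp_wmem = 4096 65536 16777216  # min/default/max send buffer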

4. Reference test on a 10Gb WAN
In order to obtain a good estimate of the performance which a simple analysis task can reach when accessing data via a Wide Area Network, we set up a testbed composed as follows:

• A 10Gb network, owned by Caltech, at the CERN site
• A client machine and a server machine, with TCP stacks tuned for maximum performance
• A selectable WAN latency, with a round-trip time of either 0.1 ms (like a very fast 10Gb LAN) or 180 ms (like a worst-case WAN, with the client at CERN and the data server in California).
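The latency on our testbed was a property of the reference link itself. For readers who want to reproduce a comparable setup without such a link, an added round-trip delay can also be emulated in software, e.g. with Linux netem (our assumption, not the method used in this test):

   # Add 180 ms to the RTT by delaying egress packets on one endpoint
   tc qdisc add dev eth0 root netem delay 180ms
   # Restore the original behaviour
   tc qdisc del dev eth0 root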

We set up various tests to be performed:

• Using various bulk data transfer tools, populate a 30GB repository of 10 ROOT files, and read it back.
• Draw various histograms, spanning the whole repository:
  - A very light one (Draw small), reading all the TTrees in the files but only a minimal fraction of the data.
  - A heavier one (Draw fewcalc), reading a large fraction of the data (about 30%).
  - Like the previous one, but artificially adding a large number of calculations to be applied to every read value (Draw heavycalc).
  - One which reads every byte in the files (Get all entries).
• From the read data, write a reasonably sized compressed output histogram (about 600 MB) to the same remote server.
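As a sketch, the drawing tests correspond to a ROOT macro along these lines; the tree name, file names and branch expressions are hypothetical placeholders, not the actual test code:

   #include "TChain.h"
   #include "TString.h"

   void drawTests() {
      // The 10 test files, read remotely via the xrootd protocol.
      TChain ch("Events");
      for (int i = 0; i < 10; ++i)
         ch.Add(Form("root://server.example.org//testrepo/file%d.root", i));

      ch.Draw("pt");                        // "Draw small": one light variable
      ch.Draw("sqrt(px*px+py*py+pz*pz)");   // "Draw fewcalc"-like: more data, some math

      // "Get all entries": force every byte of every event to be read.
      const Long64_t n = ch.GetEntries();
      for (Long64_t i = 0; i < n; ++i) ch.GetEntry(i);
   }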

These tests exercise ROOT features and use cases which are widely used in real data analyses, but in a fashion which is much heavier than the normal use cases. The purpose was to obtain a more precise performance measurement, thanks to the longer run times.

One thing to point out very clearly is that these tests were not intended as a so-called "bandwidth race": the goal was not to fill the 10Gb bandwidth, but to obtain useful measurements on a reference network, avoiding the uncertainties coming from the performance of a shared one.

Rather, the outcome of these tests can answer some or all of the following questions:

• Can we use this kind of technology to live better with data, i.e. to get more efficiency and more robustness in a generic HEP computing model?

• How does a "normal" task perform in a LAN/WAN environment, and how does it compare to the performance of a local disk?

The tests compare three data access technologies:

• Local disk (a performant RAID-5 able to sustain up to 600MB/s)
• XROOTD-based data access, with the TCP multistreaming (XRD 15streams) versus the (now default) usage of the TCP window scaling mechanism (XRD wscale)
• HTTP (Apache2) based data access, in the same configurations as the XROOTD case.


Figure 1. Performance comparison (on a 10Gb 180ms RTT reference WAN) of the various tools used to populate/read back the whole test repository.


An often-posed question is why we chose Apache2 (and the underlying HTTP protocol) for this job, even though we know that the product's features do not completely match the requirements of High Energy Physics. We chose Apache2/HTTP because, for a test repository based on access to single files via LAN and WAN, it is the most powerful opponent; indeed:

• It is an efficient and robust product, working well in both LAN and WAN.
• Its client (and, relatively speaking, the server) is a very lightweight piece of software.
• Used with ROOT, data accesses through the HTTP protocol do not consume more bandwidth than they should, unlike most of the distributed file systems we preliminarily evaluated.
• It is very well integrated in ROOT, supporting all the advanced data access features of the platform (except writes, which are not supported).
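In ROOT, this HTTP access path is exercised simply by opening an http:// URL, which dispatches to the TWebFile plugin; the host and path below are hypothetical:

   #include "TFile.h"

   void readHttp() {
      // For http:// URLs TFile::Open uses TWebFile (read-only access).
      TFile *f = TFile::Open("http://dataserver.example.org/testrepo/file0.root");
      if (f && !f->IsZombie())
         f->ls();   // list the file's contents
   }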

The tests' results are shown in Figures 1, 2, 3 and 4.

4.1. How does direct access behave?
The tests' results are very interesting; here, for space reasons, we highlight only some of the most important aspects:

• In both LAN and WAN, the various data analyses performed through xrootd took a number of seconds of the same order of magnitude as the local disk case.

• This is especially true when writing outputs, a task which seems particularly efficient with xrootd via WAN towards a remote storage element. In this case, the achievable performance is very close to that of a fast local RAID disk.


Figure 2. Performance comparison (on a 10Gb 180ms RTT reference WAN) of the various tools used to populate/read back the whole test repository.

Figure 3. A compared estimation (on a 10Gb 180ms RTT reference WAN) of the write performance and the fixed-overheads impact for the various tools against local disk access.


Figure 4. Performance comparison (on a 10Gb 180ms RTT reference WAN) in analysing the data with three different algorithms.

• For data analyses of sufficient size (more than a few megabytes), the window-scaling technology largely outperforms the multiple-streams one in the given use case. This was not a surprise, since TCP multistreaming technologies are known to work better in the bulk data transfer use case.

The idea behind these features is not to force every computation to be performed against remote data. Hence, we do not believe that local clusters can be replaced at the present time, especially when the data is accessed by a large computing farm. But the "historical" constraint that a local farm can only access a storage element very close to it is no longer forced, as demonstrated for example in a production environment by the ALICE computing model [15].

Moreover, in a modern computing model dealing with data access for High Energy Physics experiments (but probably not only High Energy Physics), there are other very interesting use cases which can be easily accommodated with well-working WAN-wide data access, for example:

• Interactive data access, especially when using analysis applications which exploit the computing power of modern multicore architectures [13]. This would be even more precious for tasks like, e.g., debugging an analysis without having to deal with copying the data it accesses.

• Letting a batch analysis job continue processing if it lands in a farm whose local storage element has lost the data files which were supposed to be present.


• Any other usage which could come to the mind of a user of the World Wide Web, which made interactivity available in a very easy way via WAN.

5. About storage globalization
The other interesting item among the recent WAN-oriented features of the Scalla/xrootd suite is the possibility of building a unique repository composed of several sub-clusters residing in different sites.
The xrootd architecture, based on a B-64 tree where servers are organised (or self-organise) into clusters [5], by construction accommodates the fact that the connections among servers can be made through a wide area network. All this is accomplished with the definition of a so-called meta-manager host, to which the remote sites subscribe. This new role of a server machine constitutes, for the clients, the entry point of a unique meta-cluster which contains all the subscribed sites, potentially spread anywhere.

Of course, one important requirement is that the remote sites are actually able to connect to this meta-manager; an even more important one is that all the sub-clusters expose the same coherent name space. In other words, a given data file must be exported with the same file name everywhere.
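As a sketch of how such an aggregation is declared, an xrootd/cmsd configuration along the following lines subscribes a site to a global meta-manager; the host names are hypothetical and directive details may vary between Scalla/xrootd releases:

   # On the meta-manager host (the global entry point):
   all.role meta manager
   all.export /store

   # On each site's local manager, subscribing the site to the meta-manager:
   all.role manager
   all.manager meta metamgr.example.org:1213
   all.export /store      # the same coherent name space everywhere

A client then simply opens root://metamgr.example.org//store/<filename> and is transparently redirected to a site which holds the file.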

With the possibility of accessing a very performant, coherent repository without having to know the location of a file, many ideas can be implemented. For instance, this mechanism is being used in a subset of the storage elements belonging to ALICE in order to quickly fetch a file which was supposed to be present. Figure 5 shows schematically how this is performed in a way which is completely transparent to the client requesting the file.

6. Conclusions
By using the two described mechanisms (the possibility of efficiently accessing remote data and the possibility of creating inter-site meta-clusters) many things become possible, as they are very generic mechanisms which encapsulate the technicalities but are not tied in any way to particular deployments.
For example, building a true and robust federation of sites now starts becoming easier. For instance, one site could host the storage part and another the worker nodes, without having to worry too much about the performance loss due to the latency of the link between the sites. Or both might host a part of the overall storage and appear as one, without having to deal with (generally less stable and very complex) "glue code" in order to build an artificial higher-level view of the resources.
So far, all the tests and the production usages have been very successful, from the point of view of both performance and robustness. We strongly believe that providing a storage system able to support WANs properly and efficiently is a major accomplishment which will bring many benefits, especially for user-level interactive analysis.

References
[1] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google file system. Proceedings of the nineteenth ACM Symposium on Operating Systems Principles, Oct. 2003, ACM Press.
[2] Steve R. Waterhouse, David M. Doolin, Gene Kan, Yaroslav Faybishenko. Distributed Search in P2P Networks. IEEE Internet Computing Journal, 6, 2002.
[3] Matei Ripeanu, Ian Foster and Adriana Iamnitchi. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal, 1, Jan-Feb 2002, Springer-Verlag.



Figure 5. Example diagram of the Virtual Mass Storage System architecture, based on the globalization of the ALICE storage.

[4] The Scalla/xrootd Software Suite. http://savannah.cern.ch/projects/xrootd and http://xrootd.slac.stanford.edu/.

[5] Fabrizio Furano, Andrew Hanushevsky. Managing commitments in a Multi Agent System using Passive Bids. Proceedings of the 2005 IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT'05), pp. 698-701, 2005.

[6] Hanushevsky, A. and Weeks, B. Designing high performance data access systems: invited talk abstract. Proceedings of the 5th International Workshop on Software and Performance (Palma, Illes Balears, Spain, July 12-14, 2005). WOSP '05. ACM, New York, NY, 267-267. DOI=http://doi.acm.org/10.1145/1071021.1071053.

[7] Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. XROOTD/TXNetFile: a highly scalable architecture for data access in the ROOT environment. Proceedings of the 4th WSEAS International Conference on Telecommunications and Informatics (Prague, Czech Republic, March 13-15, 2005). M. Husak and N. Mastorakis, Eds. World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, 1-6.

[8] A. Dorigo, P. Elmer, F. Furano, and A. Hanushevsky. Xrootd - A highly scalable architecture for data access.WSEAS Transactions on Computers, Apr. 2005.

[9] XRootd explained. Computing seminar at CERN, http://indico.cern.ch/conferenceDisplay.py?confId=38458.

[10] Hanushevsky, A. Are SE architectures ready for LHC? Proceedings of ACAT 2008: XII International Workshop on Advanced Computing and Analysis Techniques in Physics Research. http://acat2008.cern.ch/.

[11] F. Furano and A. Hanushevsky. Data access performance through parallelization and vectored access. Some results. CHEP07: Computing for High Energy Physics. Journal of Physics: Conference Series, Volume 119 (2008) 072016 (9pp).

[12] ROOT: An Object-Oriented Data Analysis Framework http://root.cern.ch


[13] M. Ballintijn, R. Brun, F. Rademakers and G. Roland. Distributed Parallel Analysis Framework with PROOF. http://root.cern.ch/twiki/bin/view/ROOT/PROOF.

[14] ALICE: Technical Design Report of the Computing. June 2005. ISBN 92-9083-247-9. http://aliceinfo.cern.ch/Collaboration/Documents/TDR/Computing.html

[15] L. Betev, F. Carminati, F. Furano, C. Grigoras, P. Saiz. The ALICE computing model: an overview. Third International Conference "Distributed Computing and Grid-technologies in Science and Education", GRID2008, http://grid2008.jinr.ru/

[16] D. Feichtinger, A.J. Peters. Authorization of Data Access in Distributed Storage Systems. The 6th IEEE/ACM International Workshop on Grid Computing, 2005. http://ieeexplore.ieee.org/iel5/10354/32950/01542739.pdf?arnumber=1542739

[17] Patterson R. H., Gibson G. A., Ginting E., Stodolsky D., and Zelenka J. Informed prefetching and caching. Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.