“Xrootd” Storage
Some new directions
From the xrootd and Scalla perspective
In the ALICE Computing
Fabrizio Furano
CERN IT/GS
11-July-08
http://savannah.cern.ch/projects/xrootd
http://xrootd.slac.stanford.edu
So many new directions
Designers and users unleashed their imagination
And helped improve the quality of the framework…
◦What is Scalla
◦The “many” paradigm
◦Direct WAN data access
◦Clusters globalization
◦Virtual Mass Storage System and 3rd party fetches
◦Conclusion
Outline
11-July-2008 Fabrizio Furano - Data access and Storage: new directions
The evolution of the xrootd project
Data access with HEP requirements in mind
◦But a very generic platform nonetheless
Structured Cluster Architecture for Low Latency Access
◦Low Latency Access to data via xrootd servers
POSIX-style byte-level random access
By default, arbitrary data organized as files
Hierarchical directory-like name space
Protocol includes high performance features
◦Structured Clustering provided by cmsd servers (formerly olbd)
Exponentially scalable and self organizing
◦Tools and methods to cluster, harmonize, connect, …
What is Scalla?
High speed access to experimental data
◦Small block sparse random access (e.g., root files)
◦High transaction rate with rapid request dispersement (fast concurrent opens)
Wide usability
◦Generic Mass Storage System Interface (HPSS, Castor, etc)
◦Full POSIX access
◦Server clustering (up to 200K per site) with linear scalability
Low setup cost
◦High efficiency data server (low CPU/byte overhead, small memory footprint)
◦Linearly-scaling configuration requirements
◦No 3rd party software needed (avoids messy dependencies)
Low administration cost
◦Robustness
◦Non-Assisted fault-tolerance (the jobs recover from failures – “no” crashes! – any factor of redundancy possible on the server side)
◦Self-organizing servers remove need for configuration changes
◦No database requirements (high performance, no backup/recovery issues)
Scalla Design Points
Very carefully crafted, fully multithreaded
◦Server side: promote speed and scalability
High level of internal parallelism + stateless
Exploits OS features (e.g. async i/o, sendfile, polling and selecting nuances)
Many, many speed+scalability oriented features
Supports thousands of client connections per server
◦Client: Handles the state of the communication
Reconstructs everything to present it as a simple interface
Fast data path
Network pipeline coordination + latency hiding
Supports connection multiplexing + intelligent server cluster crawling
Server and client exploit multi core CPUs natively
Single point performance
Server side
◦If servers go, the overall functionality can be fully preserved
Redundancy, MSS staging of replicas, …
“Can” means that unusual deployments may give this up
E.g. storing in an external DB the physical endpoint addresses for each file.
Client side (+protocol)
◦The application never notices errors
Totally transparent, until they become fatal
i.e. when it becomes really impossible to get to a working endpoint to resume the activity
◦Typical tests (try it!)
Disconnect/reconnect network cables
Kill/restart servers
Fault tolerance
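The client-side behaviour described here (errors stay invisible until no working endpoint is left) can be illustrated with a small retry loop. This is only a sketch under stated assumptions, not the xrootd client code: `read_with_failover`, `NoWorkingEndpoint`, `fake_fetch` and the host names are hypothetical names chosen for the illustration.

```python
from typing import Callable, Sequence


class NoWorkingEndpoint(Exception):
    """Raised only when every known endpoint has failed (the 'fatal' case)."""


def read_with_failover(endpoints: Sequence[str],
                       fetch: Callable[[str], bytes]) -> bytes:
    """Try each endpoint in turn; the caller sees an error only if all fail."""
    failures = []
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except ConnectionError as exc:
            # Transparent recovery: remember the failure and try the next one.
            failures.append((endpoint, str(exc)))
    raise NoWorkingEndpoint(failures)


# Hypothetical endpoints: the first server has been "killed", the second works.
def fake_fetch(endpoint: str) -> bytes:
    if endpoint == "srv1.example.org":
        raise ConnectionError("server down")
    return b"data from " + endpoint.encode()


result = read_with_failover(["srv1.example.org", "srv2.example.org"], fake_fetch)
```

The application only receives `result`; the failure of the first server is handled inside the loop, which is the spirit of the "kill/restart servers" test.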
Creating big clusters scales linearly
The throughput and the size, keeping latency low
We like the idea of disk-based cache
The bigger (and faster), the better
So, why not use the disk of every WN?
In a dedicated farm
500GB * 1000WN = 500TB
The additional cpu usage is anyway quite low
Can be used to set up a huge cache in front of a MSS
No need to buy a bigger MSS, just lower the miss rate!
Adopted at BNL for STAR (up to 6-7PB online)
See Pavel Jakl’s (excellent) thesis work
They also optimize MSS access to nearly double the staging performance
Points of contact with the PROOF approach to storage
Only storage. PROOF is very different for the computing part.
The “many” paradigm
This big disk cache
◦Shares the computing power of the WNs
◦Shares the network of the WNs pool
i.e. No SAN-like bottlenecks (… and reduced costs)
Exploits a complete graph of connections (not 1-2)
Handled by the farm’s network switch
◦The performance boost varies, depending on:
Total disk cache size
Total “working set” size
It is very well known that most accesses are to a fraction of the repo at a time.
In HEP the data locality principle is valid. Caches work!
Throughput of a single application
Can have many types of jobs/apps
The “many” paradigm
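How much the boost depends on cache size versus working set size can be seen with a toy simulation. The sketch below assumes an LRU replacement policy and uniform random access inside the working set; neither assumption comes from the slides, and `lru_hit_rate` is a hypothetical helper. It also redoes the 500GB * 1000WN sizing.

```python
import random
from collections import OrderedDict


def lru_hit_rate(cache_slots: int, working_set: int,
                 accesses: int = 20000, seed: int = 1) -> float:
    """Fraction of accesses served from an LRU cache holding `cache_slots` files."""
    rng = random.Random(seed)
    cache: OrderedDict = OrderedDict()
    hits = 0
    for _ in range(accesses):
        file_id = rng.randrange(working_set)  # uniform access inside the working set
        if file_id in cache:
            hits += 1
            cache.move_to_end(file_id)        # mark as most recently used
        else:
            cache[file_id] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)     # evict the least recently used file
    return hits / accesses


aggregate_tb = 500 * 1000 / 1000  # 500 GB per WN * 1000 WNs, expressed in TB
fits = lru_hit_rate(cache_slots=1000, working_set=800)        # working set fits
too_small = lru_hit_rate(cache_slots=1000, working_set=4000)  # working set 4x the cache
```

When the working set fits in the cache only the first access to each file misses; when it is several times larger, most accesses fall through to the MSS.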
We want to make WAN data analysis convenient◦A process does not always read every byte in a file
◦Often, direct access is more practical, faster and more robust
◦The typical way in which HEP data is processed is (or can be) often known in advance
TTreeCache does an amazing job for this
◦xrootd: fast and scalable server side
Makes things run quite smoothly
Gives room for improvement at the client side
About WHEN transferring the data
There might be better moments to trigger a chunk xfer with respect to the moment it is needed
The app does not have to wait while it receives data… in parallel
WAN direct access – Motivation
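The idea of triggering a chunk transfer before the moment it is needed can be sketched as a one-chunk read-ahead loop. This is an illustration only, not TTreeCache or the xrootd client: `process_with_read_ahead`, `read_chunk` and `process` are hypothetical names, and a background thread stands in for an asynchronous network request.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_read_ahead(read_chunk, n_chunks, process):
    """Process chunks in order while the next chunk is already being fetched."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, 0) if n_chunks else None
        for i in range(n_chunks):
            data = pending.result()  # blocks only if chunk i has not arrived yet
            if i + 1 < n_chunks:
                # Request the next chunk now, before it is needed.
                pending = pool.submit(read_chunk, i + 1)
            results.append(process(data))  # compute while the next chunk is in flight
    return results


# Toy usage: chunk i is four bytes of value i; "processing" reads the first byte.
chunks = process_with_read_ahead(lambda i: bytes([i]) * 4, 5, lambda d: d[0])
```

With a real remote `read_chunk`, the network latency of chunk i+1 overlaps with the computation on chunk i instead of adding to it.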
WAN direct access – hiding latency
[Figure: timelines of data access and data processing for three strategies]
Pre-xfer data “locally”: overhead; need for potentially useless replicas, and a huge bookkeeping!
Legacy remote access: latency, wasted CPU cycles, but easy to understand
Remote access+: interesting! Efficient, practical
Multiple streams
[Figure: Client1, Client2 and Client3 of one application reach the server through one TCP (control) connection plus several TCP (data) streams]
Clients still see one physical connection per server
Async data gets automatically split
It is not a copy-only tool to move data
◦Can be used to speed up access to remote repos
◦Transparent to apps making use of *_async reqs
The app computes WHILE getting data, fully exploited by ROOT
xrdcp uses it (-S option)
◦Results comparable to other cp-like tools
For now only reads fully exploit it
◦Writes (by default) use it to a lesser degree
Not easy to keep the client side fault tolerance with writes…
Heading towards a non trivial solution
Automatic agreement of the TCP windowsize
◦You set up servers to support the WAN mode
If requested… fully automatic.
WAN direct access - Multiple streams
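Splitting one asynchronous read over several data streams can be sketched as follows. This is a simplified model, not the xrootd wire protocol: `split_range`, `parallel_read` and `read_at` are hypothetical names, and threads stand in for the parallel TCP data streams.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(offset, length, n_streams):
    """Split [offset, offset+length) into at most n_streams contiguous sub-ranges."""
    base, extra = divmod(length, n_streams)
    ranges = []
    start = offset
    for i in range(n_streams):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first streams
        if size:
            ranges.append((start, size))
            start += size
    return ranges


def parallel_read(read_at, offset, length, n_streams):
    """Fetch the sub-ranges concurrently and reassemble them in order."""
    ranges = split_range(offset, length, n_streams)
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        parts = pool.map(lambda r: read_at(*r), ranges)  # map preserves input order
        return b"".join(parts)


# Toy usage: an in-memory "remote file".
blob = bytes(range(256)) * 4
data = parallel_read(lambda off, n: blob[off:off + n], 10, 1000, 4)
```

The caller still issues a single read; the split and the reassembly stay hidden, as the clients still see one connection per server.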
Recent improvements
◦Parallelized initialization -> more than 10x faster than before
File open through WAN: from 45(!) to 3-4 “latencies”
more than 10x faster than before
◦Windowsize studies (Fabrizio, Leo [PH/SFT])
E.g. SLAC->CERN (160ms RTT) 7MB/s->13MB/s
Going to be incorporated in the setup
To avoid forcing everybody to tweak parameters
A puzzle is still there
1 Apache TCP stream looks just too fast (and with ramp-up effects) via WAN
Suspect: relation with “root” capabilities to adjust TCP parameters
Xrootd with WAN multistreaming is anyway 2-3x faster (SLAC->CERN)
With no ramp-up effects
But we’d like to use the same trick, if possible, to enhance even more
WAN direct access - news
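The figures quoted for SLAC->CERN can be put in perspective with two lines of arithmetic. The sketch assumes the standard bandwidth-delay relation (window = throughput * RTT) for a single TCP stream; that this relation explains the measured 7MB/s and 13MB/s is an assumption, not something stated in the slides. `window_bytes` and `open_time_s` are hypothetical helpers.

```python
RTT_S = 0.160  # SLAC->CERN round trip time quoted above


def window_bytes(throughput_mb_s, rtt_s=RTT_S):
    """Bytes that must be in flight to sustain a throughput on one TCP stream."""
    return throughput_mb_s * 1e6 * rtt_s


def open_time_s(round_trips, rtt_s=RTT_S):
    """Time spent waiting on the network for an open costing `round_trips` latencies."""
    return round_trips * rtt_s


open_before = open_time_s(45)    # about 7.2 s per file open
open_after = open_time_s(4)      # about 0.64 s per file open
window_7mb = window_bytes(7)     # about 1.12 MB in flight
window_13mb = window_bytes(13)   # about 2.08 MB in flight
```

Going from 45 to 4 round trips is an 11x reduction in open time at this RTT, consistent with the "more than 10x" figure.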
Up to now, xrootd clusters could be populated
◦With xrdcp from an external machine
◦Writing to the backend store (e.g. CASTOR/DPM/HPSS etc.)
E.g. FTD in ALICE now uses the first. It “works”…
Load and resources problems
All the external traffic of the site goes through one machine
Close to the dest cluster
If a file is missing or lost
◦Because of a disk and/or catalog screwup
◦Job failure... manual intervention needed
With 10^7 online files finding the source of a problem can be VERY tricky
Cluster globalization
Purpose:
◦A request for a missing file comes at cluster X,
◦X assumes that the file ought to be there
And tries to get it from the collaborating clusters, from the fastest one
Note that X itself is part of the game
◦And it is composed of many servers
The idea is that
◦Each cluster considers the set of ALL the others like a very big online MSS
◦This is much easier than it seems
And the tests around report high robustness…
Very promising, still in alpha test, but not for much longer.
Virtual MSS
Global redirector acts as a WAN xrootd meta-manager
Local clusters subscribe to it
◦And declare the path prefixes they export
◦Local clusters (without local MSS) treat the globality as a very big MSS
◦Coordinated by the Global redirector
Load balancing, negligible load
Priority to files which are online somewhere
Priority to fast, least-loaded sites
Fast file location
True, robust, realtime collaboration between storage elements!
◦Very attractive for tier-2s
Many pieces… (apparently)
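The selection rules listed for the global redirector (subscription with exported prefixes, priority to files that are online, then to the least-loaded site) can be modelled in a few lines. This is a toy model, not the cmsd algorithm: `Cluster`, `MetaManager`, the `load` metric and the example paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    prefixes: tuple   # path prefixes declared when subscribing
    load: float       # 0.0 (idle) .. 1.0 (saturated), hypothetical metric
    online: set = field(default_factory=set)  # files currently on disk


class MetaManager:
    def __init__(self):
        self.clusters = []

    def subscribe(self, cluster):
        self.clusters.append(cluster)

    def locate(self, path):
        """Return the cluster to redirect to, or None if nobody exports the path."""
        exporters = [c for c in self.clusters
                     if any(path.startswith(p) for p in c.prefixes)]
        if not exporters:
            return None
        # First priority: the file is online there; second: the least-loaded site.
        return min(exporters, key=lambda c: (path not in c.online, c.load))


mm = MetaManager()
mm.subscribe(Cluster("CERN", ("/alice/",), load=0.9, online={"/alice/a.root"}))
mm.subscribe(Cluster("GSI", ("/alice/",), load=0.1))
mm.subscribe(Cluster("Prague", ("/other/",), load=0.0))
```

Here `mm.locate("/alice/a.root")` picks CERN despite its load, because the file is online there, while a path that no site has online goes to the least-loaded exporter.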
Cluster Globalization… an example
[Figure: xrootd clusters at Prague, NIHAM, CERN, GSI and any other site, each running cmsd + xrootd, subscribed to the ALICE global redirector]
ALICE global redirector (alirdr), root://alirdr.cern.ch/
all.role meta manager
all.manager meta alirdr.cern.ch:1312
Includes CERN, GSI, and other xroot clusters
Each site cluster manager (cmsd + xrootd):
all.role manager
all.manager meta alirdr.cern.ch:1312
Meta Managers can be geographically replicated
Can have several in different places for region-aware load balancing
The Virtual MSS Realized
[Figure: the same clusters (Prague, NIHAM, CERN, any other), each running cmsd + xrootd, subscribed to the ALICE global redirector]
ALICE global redirector:
all.role meta manager
all.manager meta alirdr.cern.ch:1312
Each site cluster manager:
all.role manager
all.manager meta alirdr.cern.ch:1312
Local clients work normally
But missing a file? Ask the global metamgr
Get it from any other collaborating cluster
Powerful mechanism to increase reliability
◦Data replication load is widely distributed
◦Multiple sites are available for recovery
Allows virtually unattended operation
◦Automatic restore due to server failure
Missing files in one cluster fetched from another
Typically the fastest one which has the file really online
No costly out of time (and sync!) DB lookups
◦File (pre)fetching on demand
Can be transformed into a 3rd-party GET (by asking for a specific source)
◦Practically no need to track file location
But does not stop the need for metadata repositories
Virtual MSS – The vision
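The "missing file is fetched from another cluster" behaviour can be sketched as an open call that falls back to the rest of the federation. This is a toy model only: `LocalCluster`, `make_global` and the file names are hypothetical, and for simplicity the sketch takes the first cluster that has the file rather than the fastest one.

```python
class LocalCluster:
    def __init__(self, name, files, ask_global):
        self.name = name
        self.files = dict(files)      # path -> bytes currently online here
        self.ask_global = ask_global  # callable(path, requester) -> bytes or None
        self.staged = []              # paths restored from other clusters

    def open(self, path):
        if path in self.files:
            return self.files[path]   # local clients work normally
        # Missing file: treat all the other clusters as one very big MSS.
        data = self.ask_global(path, self.name)
        if data is None:
            raise FileNotFoundError(path)
        self.files[path] = data       # automatic restore of the missing file
        self.staged.append(path)
        return data


def make_global(clusters):
    """Stand-in for the meta manager: look the file up in the other clusters."""
    def ask(path, requester):
        for cluster in clusters:
            if cluster.name != requester and path in cluster.files:
                return cluster.files[path]
        return None
    return ask


sites = []
ask = make_global(sites)
cern = LocalCluster("CERN", {"/alice/run1.root": b"payload"}, ask)
gsi = LocalCluster("GSI", {}, ask)
sites.extend([cern, gsi])
data = gsi.open("/alice/run1.root")
```

After the first open the file is online at the requesting cluster, so later opens are purely local and no external catalogue lookup is involved.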
No evidence of architectural problems
Striving to keep code quality at maximum level
Awesome collaboration
BUT... If used “outside” of the ALICE CM
The architecture can prove itself to be ultra-bandwidth-efficient
Or greedy, as you prefer
◦Need of a way to coordinate the remote connections
In and Out
We designed the Xrootd BWM and the Scalla DSS
Problems? Not yet.
Directed Support Services Architecture (DSS)
◦Clean way to associate external xrootd-based services
Via ‘artificial’, meaningful pathnames
A simple way for a client to ask for a service
◦E.g. an intelligent queueing service for WAN xfers!
◦Which we called BWM
Just an xrootd server with a queueing plugin
◦Can be used to queue incoming and outgoing traffic
In a cooperative and symmetrical manner
So, clients ask to be queued for xfers at both ends
◦Design ok, dev work in progress!
The Scalla DSS and the BWM
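The slides only say that clients ask to be queued for transfers at both ends. One way to picture that is the sketch below, where a transfer may start once both the source and the destination queue have granted it a slot. Everything here is an assumption made for illustration: `BandwidthManager`, `may_start`, the slot counts and the FIFO policy are hypothetical and do not describe the actual BWM plugin.

```python
from collections import deque


class BandwidthManager:
    """Grants at most `slots` concurrent transfers; the others wait in FIFO order."""

    def __init__(self, slots):
        self.slots = slots
        self.active = set()
        self.waiting = deque()

    def request(self, transfer_id):
        """Return True if the transfer gets a slot now, False if it is queued."""
        if len(self.active) < self.slots:
            self.active.add(transfer_id)
            return True
        self.waiting.append(transfer_id)
        return False

    def release(self, transfer_id):
        """Free a slot; return the id promoted from the queue, if any."""
        self.active.discard(transfer_id)
        if self.waiting and len(self.active) < self.slots:
            promoted = self.waiting.popleft()
            self.active.add(promoted)
            return promoted
        return None


def may_start(transfer_id, source, destination):
    """Symmetrical rule: both ends must have granted a slot."""
    return transfer_id in source.active and transfer_id in destination.active


src, dst = BandwidthManager(slots=1), BandwidthManager(slots=2)
for tid in ("t1", "t2"):
    src.request(tid)
    dst.request(tid)
first_ok = may_start("t1", src, dst)       # t1 holds a slot at both ends
second_ok = may_start("t2", src, dst)      # t2 is still queued at the source
promoted = src.release("t1")               # t1 finishes, t2 gets the source slot
second_ok_later = may_start("t2", src, dst)
```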
The mechanism is there, once it is correctly boxed◦Checkpoint reached, first setup going on!
A (potentially good) side effect:
◦Pointing an app to the “area” global redirector gives a complete, load-balanced, low latency view of all the repo
◦An app using the “smart” WAN mode can just run
Probably a full scale production won’t, for now
But what about an interactive small analysis on a laptop?
After all, HEP sometimes just copies everything, useful and not
But… still probably better than certain always-overloaded SEs
I cannot say that in some years we will not have a more powerful WAN infrastructure
And using it to copy more useless data looks just ugly
If a web browser can do it, why not a HEP app? Looks just a little more difficult.
Better if used with a clear design in mind
Sometimes we call this “Computing Model”
Virtual MSS
Test instance cluster @GSI◦Subscribed to the ALICE global redirector
◦Until the xrdCASTOR instance is subscribed, GSI will get data only from voalice04 (and not through the global redirector coordination)
The mechanism seems very robust, can do even better
To get a file there, just open or prestage it
Update of the Alien Staging/Prestaging tool required (done)
FTD integration (done, not tested yet)
Incoming traffic monitoring through the XrdCpapMon xrdcp extension (which is not xrdcpapmon)… done!
Technically, no more xrdcpapmon, just xrdcp does the job, nobody noticed the change!
So, one tweak less for ALICE offline
So, what? Embryonal ALICE VMSS
Point the test instances “remote root” to the ALICE global redirector
◦As soon as the xrdCASTOR instance (at least!) is subscribed
◦No functional changes
Will continue to 'just work', hopefully
This will be accomplished by the complete revision of the setup (done, starting first serious deployment!)
◦After that, all the “pure” xrootd-based sites will have this transparently
ALICE VMSS Step 2
Not terrible dev work on
◦Cmsd
◦Mps layer
◦Mps extension scripts
◦deep debugging and easy setup
And then the cluster will honour the data source specified by FTD (or whatever)
◦Xrootd protocol is mandatory
The data source must honour it in a WAN friendly way
Technically means a correct implementation of the basic xrootd protocol
Source sites supporting xrootd multistreaming will be up to 15x more efficient, but the others still will work
ALICE VMSS Step 3 – 3rd party GET
Many new ideas are reality or coming
Typically dealing with
◦True realtime data storage distribution
◦Interoperability (Grid, SRMs, file systems, WANs…)
◦Enabling interactivity (and storage is not the only part of it)
The setup refurbishment… almost done
◦Proceeding by degrees, stability is a priority
Trying to avoid common mistakes
Both manual and automated setups are honourable and to be honoured
Going to use it for the ALICE OCDB data… now!
Conclusion
Old and new software collaborators
◦Andy Hanushevsky, Fabrizio Furano (client-side), Alvise Dorigo
◦Root: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (windows porting)
◦Alice: Derek Feichtinger, Andreas Peters, Guenter Kickinger
◦STAR/BNL: Pavel Jakl, Jerome Lauret
◦GSI: Kilian Schwartz
◦Cornell: Gregory Sharp
◦SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
◦Peter Elmer
Operational collaborators
◦BNL, CERN, CNAF, FZK, INFN, IN2P3, RAL, SLAC
Acknowledgements
Flexible, multi-protocol system
◦Abstract protocol interface: XrdSecInterface
Protocols implemented as dynamic plug-ins
Architecturally self-contained
NO weird code/libs dependencies (requires only openssl)
High quality highly optimized code, great work by Gerri Ganis
Embedded protocol negotiation
◦Servers define the list, clients make the choice
◦Servers lists may depend on host / domain
One handshake per process-server connection
◦Reduced overhead:
◦# of handshakes ≤ # of servers contacted
Exploits multiplexed connections
no matter the number of file opens per process-server
Authentication
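The negotiation rule (servers define the list, clients make the choice) and the bound on the number of handshakes can be sketched as follows. This is not the XrdSecInterface API: `choose_protocol`, `Session` and the server names are hypothetical, and only the protocol labels (pwd, gsi, krb5, unix) come from the slides.

```python
def choose_protocol(server_list, client_preference):
    """The server offers a list; the client picks its most preferred offered entry."""
    offered = set(server_list)
    for proto in client_preference:
        if proto in offered:
            return proto
    return None


class Session:
    """One handshake per process-server connection, shared by all file opens."""

    def __init__(self, client_preference):
        self.client_preference = client_preference
        self.authenticated = {}  # server -> protocol agreed at the handshake
        self.handshakes = 0

    def open(self, server, server_list, path):
        if server not in self.authenticated:
            proto = choose_protocol(server_list, self.client_preference)
            if proto is None:
                raise PermissionError(f"no common protocol with {server}")
            self.authenticated[server] = proto
            self.handshakes += 1
        # Later opens reuse the multiplexed, already authenticated connection.
        return (server, path, self.authenticated[server])


session = Session(["krb5", "gsi", "pwd"])
for name in ("a.root", "b.root", "c.root"):
    session.open("srvA", ["pwd", "gsi"], "/alice/" + name)
session.open("srvB", ["unix", "krb5"], "/alice/d.root")
```

Four opens on two servers cost two handshakes, which matches "# of handshakes ≤ # of servers contacted".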
Courtesy of Gerardo Ganis (CERN PH-SFT)
Password-based (pwd)
◦Either system or dedicated password file
User account not needed
GSI (gsi)
◦Handle GSI proxy certificates
◦VOMS support coming
◦No need of Globus libraries (and fast!)
Kerberos IV, V (krb4, krb5)
◦Ticket forwarding supported for krb5
◦Fast ID (unix, host) to be used w/ authorization
ALICE security tokens
◦Emphasis on ease of setup and performance
Available protocols
Courtesy of Gerardo Ganis (CERN PH-SFT)