
Xrootd Present & Future: The Drama Continues
Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
HEPiX, 13-October-05
http://xrootd.slac.stanford.edu


Outline

The state of performance
Single server
Clustered servers

The SRM Debate

The Next Big Thing

Conclusion


Application Design Point

Complex embarrassingly parallel analysis
Determine particle decay products
1000's of parallel clients hitting the same data

Small block sparse random access
Median size < 3K
Uniform seek across whole file (mean 650MB)

Only about 22% of the file read (mean 140MB)
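As a point of reference, here is a minimal Python sketch of a synthetic workload matching these numbers; the 650MB file size, <3K median read, and 22% coverage come from the slide, while the file path, request-size spread, and uniform distributions are assumptions for illustration only.

```python
# Hypothetical workload generator matching the access pattern above.
# Assumes a test file of at least FILE_SIZE bytes already exists at FILE_PATH.
import random

FILE_PATH = "/data/test/event-file.root"   # hypothetical test file
FILE_SIZE = 650 * 1024 * 1024              # mean file size quoted on the slide
TARGET_BYTES = int(0.22 * FILE_SIZE)       # ~22% of the file (~140MB)
MEDIAN_READ = 3 * 1024                     # median request size (<3K)

def sparse_random_reads(path: str = FILE_PATH) -> int:
    """Issue small random reads, uniformly spread over the file,
    until roughly 22% of it has been fetched."""
    read_bytes = 0
    with open(path, "rb") as f:
        while read_bytes < TARGET_BYTES:
            offset = random.randrange(0, FILE_SIZE - MEDIAN_READ)    # uniform seek
            f.seek(offset)
            size = random.randint(MEDIAN_READ // 2, 2 * MEDIAN_READ)  # assumed spread
            read_bytes += len(f.read(size))
    return read_bytes

if __name__ == "__main__":
    print(f"read {sparse_random_reads()} bytes")
```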


Performance Measurements

Goals
Very low latency
Handle many parallel clients

Test setup
Sun V20z: 1.86GHz dual Opteron, 2GB RAM
1Gb on-board Broadcom NIC (same subnet)
Solaris 10 x86
Linux RHEL3 2.4.21-2.7.8.ELsmp
Client running BetaMiniApp with analysis removed


Latency Per Request (xrootd)


Capacity vs Load (xrootd)


xrootd Server Scaling

Linear scaling relative to load
Allows deterministic sizing of server: disk, NIC, CPU, memory
Performance tied directly to hardware cost
Competitive with best-in-class commercial file servers
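A small sketch of what deterministic sizing means in practice under the linear-scaling claim; the per-server capacity and target load below are made-up placeholders, not figures from the talk.

```python
# Sizing sketch under the linear-scaling assumption: aggregate capacity is
# just per-server capacity times the number of servers.
import math

def servers_needed(target_clients: int, clients_per_server: int) -> int:
    """Smallest number of servers whose summed capacity covers the load."""
    return math.ceil(target_clients / clients_per_server)

# Hypothetical example: one server sustains 300 concurrent analysis clients,
# and the site must handle 4000.
print(servers_needed(4000, 300))   # -> 14
```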


OS Impact on Performance


Device & Filesystem Impact

Chart annotations: CPU-limited vs. I/O-limited regions; 1 event ≈ 2K
UFS good on small reads; VXFS good on big reads


Overhead Distribution

Chart: latency distribution by percentage (Linux client <-> Linux server) vs. blocksize (100-3000), broken down into server overhead CPU, xrootd sys CPU, xrootd user CPU, client overhead CPU, client application CPU, network interface, and network overhead.


Network Overhead Dominates


Xrootd Clustering (SLAC)

Diagram: client machines contact the redirectors (bbr-olb03, bbr-olb04, kanolb-a), which direct them to the data servers kan01, kan02, kan03, kan04, ..., kanxx (hidden details not shown).


Clustering Performance

Design can scale to at least 256,000 servers
SLAC runs a 1,000 node test server cluster
BNL runs a 350 node production server cluster

Self-regulating (via minimal spanning tree algorithm)
280 nodes self-cluster in about 7 seconds
890 nodes self-cluster in about 56 seconds

Client overhead is extremely low
Overhead added to meta-data requests (e.g., open):
~200 µs × log64(number of servers) / 2

Zero overhead for I/O
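Plugging a few cluster sizes into that formula gives a feel for the numbers; the 200µs per-hop cost is the figure quoted above, the rest is arithmetic (sketch only).

```python
# Meta-data (e.g., open) redirection overhead: ~200us * log64(N) / 2.
import math

def redirect_overhead_us(n_servers: int, per_hop_us: float = 200.0) -> float:
    """Expected extra latency in microseconds added to a meta-data request."""
    return per_hop_us * math.log(n_servers, 64) / 2

for n in (64, 1000, 256000):
    print(f"{n:>7} servers: ~{redirect_overhead_us(n):.0f} us per open")
# Roughly 100us at 64 servers, ~170us at 1,000, ~300us at 256,000;
# I/O itself carries no redirection overhead.
```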


Current MSS Support
Lightweight agnostic interfaces provided
oss.mssgwcmd command
Invoked for each create, dirlist, mv, rm, stat
oss.stagecmd |command
Long-running command, request-stream protocol
Used to populate disk cache (i.e., "stage-in")

Diagram: xrootd (oss layer) invokes mssgwcmd and stagecmd to reach the MSS.
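For illustration only, a minimal sketch of the kind of long-running helper the |command form implies; the slide does not spell out the request-stream protocol, so the one-path-per-line format, the acknowledgment, and the mss-copy helper are all assumptions.

```python
# Hypothetical long-running stage-in helper: reads staging requests from a
# request stream on stdin and populates the disk cache from the MSS.
import subprocess
import sys

def stage_in(path: str) -> bool:
    """Copy one file from the MSS into the disk cache (placeholder command)."""
    result = subprocess.run(["/usr/local/bin/mss-copy", path, "/cache" + path])
    return result.returncode == 0

def main() -> None:
    for line in sys.stdin:            # stay alive; one request per line (assumed)
        path = line.strip()
        if not path:
            continue
        status = "OK" if stage_in(path) else "FAIL"
        print(f"{status} {path}", flush=True)   # assumed acknowledgment format

if __name__ == "__main__":
    main()
```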


Future Leaf Node SRM

MSS interface ideal spot for SRM hook
Use existing hooks (mssgwcmd & stagecmd) or a new long-running hook: oss.srm |command
Processes external disk cache management requests
Should scale quite well

Diagram: xrootd (oss layer) with an srm component linking to the MSS and the Grid.


BNL/LBL Proposal

Diagram: generic/standard LBL components (srm, drm, das) alongside xrootd, plus BNL components (dm, rc) providing Replica Services: the Replica Registration Service & Data Mover.


Alternative Root Node SRM
Team olbd with SRM
File management & discovery
Tight management control
Several issues need to be considered
Introduces many new failure modes
Will not generally scale

Diagram: olbd (root node) teamed with srm, connecting to the MSS and the Grid.


SRM Integration Status

Unfortunately, SRM interface in flux
Heavy vs. light protocol

Working with LBL team
Working towards OSG-sanctioned future proposal

Trying to use the Fermilab SRM
Artem Trunov at IN2P3 exploring issues


The Next Big Thing

High Performance Data Access Servers
plus
Efficient large scale clustering
Allows
Novel cost-effective super-fast massive storage
Optimized for sparse random access
Imagine 30TB of DRAM
At commodity prices


Device Speed Delivery


Memory Access Characteristics

Server: zsuntwo
CPU: Sparc
NIC: 100Mb
OS: Solaris 10
UFS: Standard


The Peta-Cache

Cost-effective memory access impacts science
Nature of all random access analysis

Not restricted to just High Energy Physics
Enables faster and more detailed analysis

Opens new analytical frontiers

Have a 64-node test cluster (V20z, each with 16GB RAM)

1TB “toy” machine
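The arithmetic behind the test cluster, and what the same node size would imply for the 30TB DRAM idea (a back-of-the-envelope sketch; 16GB/node is just the test-cluster figure, not a recommendation).

```python
# 64 nodes x 16GB of RAM = the 1TB "toy" machine; the same aggregation
# shows how many such nodes a 30TB DRAM store would take.
GB_PER_NODE = 16
TEST_NODES = 64

test_tb = TEST_NODES * GB_PER_NODE / 1024      # -> 1.0 TB
nodes_for_30tb = (30 * 1024) // GB_PER_NODE    # -> 1920 nodes

print(f"test cluster: {test_tb:.0f} TB")
print(f"nodes for 30 TB: {nodes_for_30tb}")
```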


Conclusion

High performance data access systems achievable
The devil is in the details

Must understand processing domain and deployment infrastructure
Comprehensive repeatable measurement strategy

High performance and clustering are synergetic
Allows unique performance, usability, scalability, and recoverability characteristics
Such systems produce novel software architectures

Challenges
Creating application algorithms that can make use of such systems

Opportunities

Fast low cost access to huge amounts of data to speed discovery


Acknowledgements

Fabrizio Furano, INFN Padova
Client-side design & development

Bill Weeks
Performance measurement guru

100’s of measurements repeated 100’s of times

US Department of Energy Contract DE-AC02-76SF00515 with Stanford University

And our next mystery guest!