
LHC-Era Data Rates in 2004 and 2005
Experiences of the PHENIX Experiment

with a PetaByte of Data

Martin L. Purschke, Brookhaven National Laboratory
PHENIX Collaboration

[Aerial photo: RHIC from space, Long Island, NY]


RHIC/PHENIX at a glance

RHIC:
2 independent rings, one beam clockwise, the other counterclockwise
sqrt(s_NN) = 500 GeV * Z/A: ~200 GeV for heavy ions, ~500 GeV for proton-proton (polarized)

PHENIX:

4 spectrometer arms

12 Detector subsystems

400,000 detector channels

Lots of readout electronics

Uncompressed event size typically 240 / 170 / 90 KB for AuAu / CuCu / pp

Data rate 2.4 kHz (CuCu), 5 kHz (pp)

Front-end data rate 0.5 - 0.9 GB/s

Data logging rate ~300-450 MB/s, 600 MB/s max
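As a quick cross-check on how these numbers hang together, the front-end rate is just event size times event rate (KB x kHz = MB/s); a minimal sketch using the quoted figures:

```cpp
#include <cstdio>

int main() {
    // Quoted (approximate) PHENIX figures: uncompressed event size in KB and
    // event rate in kHz.  Since KB * kHz = MB/s, the product is the
    // uncompressed front-end data rate.
    struct Species { const char* name; double size_kb; double rate_khz; };
    const Species runs[] = { {"CuCu", 170.0, 2.4}, {"pp", 90.0, 5.0} };

    for (const Species& s : runs) {
        double frontend_mbps = s.size_kb * s.rate_khz;
        std::printf("%-4s: %.0f MB/s uncompressed front-end rate\n", s.name, frontend_mbps);
    }
    // Prints ~408 MB/s (CuCu) and ~450 MB/s (pp) -- the same few-hundred-MB/s
    // ballpark as the quoted 0.5 - 0.9 GB/s front-end range.
    return 0;
}
```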


Where we are w.r.t. others

[Bar chart, all values in MB/s and all approximate: logging rates of ATLAS, CMS, LHCb, ALICE, and CDF compared with PHENIX at 600 MB/s; the quoted values range from ~25 to ~1250 MB/s]

400-600 MB/s is not so sci-fi these days


Outline

• What was planned, and what we did
• Key technologies
• Online data handling
• How offline / the users cope with the data volumes

Related talks:

Danton Yu, "BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plan", this (Tuesday) afternoon, 16:20

Chris Pinkenburg, “Online monitoring, calibration, and reconstruction in the PHENIX Experiment”, Wednesday 16:40 (OC-6)


A Little History

• In its design phase, PHENIX had settled on a “planning” number of 20MB/s data rate (“40 Exabyte tapes, 10 postdocs”)

• In Run 2 (2002) we started to approach that 20MB/s number and ran Level-2 triggers

• In Run 3 (2003) we faced the decision to either go full steam ahead with Level-2 triggers… or first try something else

• Several advances in various technologies conspired to make first 100MB/s, then multi-100MB/s data rates possible

• We are currently logging at about 60 times higher rates than the original design called for


But is this a good thing?

We had a good amount of discussion about the merits of going to high data rates. Are we drowning in data? Will we be able to analyze the data quickly enough? Are we mostly recording "boring" events? Would it not be better to trigger and reject?

• In heavy-ion collisions, the rejection power of Level-2 triggers is limited (high multiplicity, etc.)

• Triggers take a lot of time to study, and developers usually welcome a delay in the onset of the actual rejection mode

• The would-be rejected events are by no means "boring"; there is high-statistics physics in them, too


...good thing? Cont’d

In the end we convinced ourselves that we could (and should) do it.

The increased data rate helped defer the onset of the LVL2 rejection mode in Runs 4, 5 (didn’t run rejection at all in the end)

Saved a lot of headaches… we think

• Get the data while the getting is good - the detector system is evolving and is hence unique for each run; better to get as much data with it as you can

• Get physics that you simply can’t trigger on

• Don’t be afraid to let data sit on tape unanalyzed - computing power increases, time is on your side here


Ingredients to achieve the high rates

We implemented several main ingredients that made the high rates possible.

• We compress the data before they get on a disk the first time (cram as much information as possible into each MB)

• Run several local storage servers (“Buffer boxes”) in parallel, and dramatically increased the buffer disk space (40TB currently)

• Improved the overall network connectivity and topology

• Automated most of the file handling so the data rates in the DAQ become manageable


Event Builder Layout

[Diagram: Event Builder -- ten Sub-Event Buffers (SEBs) feed ten Assembly-and-Trigger Processors (ATPs) through a Gigabit crossbar switch; the assembled events go to the Buffer Boxes and on to HPSS]

The data are stored on "buffer boxes" that ride out the ebb and flow of the data stream and service interruptions of the HPSS tape system, and help send a steady stream to the tape robot


Sub-event buffers (SEBs) see parts of a given event, following the granularity of the detector electronics

With a combined ~50% uptime of PHENIX * RHIC, we can log at 2x the maximum HPSS data rate in the DAQ. It just needs enough buffer depth (40 TB)
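A rough back-of-the-envelope check of that claim, using assumed round numbers (600 MB/s logging while taking data, 300 MB/s sustained to HPSS, 50% duty factor) rather than measured figures:

```cpp
#include <cstdio>

int main() {
    // Assumed round numbers for illustration only.
    const double daq_rate_mbps  = 600.0;  // compressed logging rate while taking data
    const double hpss_rate_mbps = 300.0;  // sustained drain to the tape robot
    const double duty_factor    = 0.5;    // fraction of wall time the DAQ actually logs
    const double buffer_tb      = 40.0;   // total buffer-box disk space

    // Long-term balance: with a 50% duty factor the average inflow (300 MB/s)
    // matches the HPSS drain, so the buffers do not fill up over time.
    double avg_in = daq_rate_mbps * duty_factor;
    std::printf("average inflow %.0f MB/s vs HPSS drain %.0f MB/s\n", avg_in, hpss_rate_mbps);

    // While logging, the buffers fill at the net (600 - 300) MB/s, so 40 TB
    // absorbs ~37 h of continuous running, or a full HPSS outage of ~18 h at
    // full logging rate.
    double fill_hours   = buffer_tb * 1e6 / (daq_rate_mbps - hpss_rate_mbps) / 3600.0;
    double outage_hours = buffer_tb * 1e6 / daq_rate_mbps / 3600.0;
    std::printf("ride-out: %.0f h at net fill, %.0f h full-rate HPSS outage\n",
                fill_hours, outage_hours);
    return 0;
}
```

The same 40 TB divided by an average compressed inflow of roughly 280 MB/s also corresponds to about 40 hours of file residence on the buffer disks, the figure quoted a few slides later.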

The pieces are sent to a given Assembly-and-Trigger Processor (ATP) on a next-available basis; the fully assembled event is available here for the first time

Some technical background...


Data Compression

[Diagram: a file normally looks like a sequence of buffers. On writing, each buffer is run through the LZO algorithm and a new buffer header is added, so the compressed buffer becomes the payload of a new buffer; the file then looks like a sequence of such compressed buffers. On readback, the LZO unpack step restores the original uncompressed buffer.]

All this is handled completely in the I/O layer; the higher-level routines just receive a buffer as before.

Found that the raw data are still gzip-compressible after zero-suppression and other data reduction techniques

Introduced a compressed raw data format that supports a late-stage compression
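The buffer-envelope scheme can be sketched as follows. This is an illustration using liblzo2, not the actual PHENIX I/O code; the header layout, function names, and magic word are invented for the example:

```cpp
#include <lzo/lzo1x.h>

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Invented envelope header for this sketch; the real PHENIX buffer header differs.
struct EnvelopeHdr {
    uint32_t marker;          // flags the buffer as LZO-compressed
    uint32_t original_bytes;  // size before compression
    uint32_t payload_bytes;   // size of the LZO payload that follows
};

static const uint32_t kCompressedMarker = 0x4C5A4F31;  // arbitrary magic word

// Compress one raw buffer and wrap it in a new buffer with the envelope header.
// (Non-const inputs to match liblzo2's pointer typedefs.)
std::vector<unsigned char> compress_buffer(std::vector<unsigned char>& in) {
    if (lzo_init() != LZO_E_OK) throw std::runtime_error("lzo_init failed");
    std::vector<unsigned char> wrk(LZO1X_1_MEM_COMPRESS);
    // Reserve the documented worst-case output size plus our header.
    std::vector<unsigned char> out(sizeof(EnvelopeHdr) + in.size() + in.size() / 16 + 64 + 3);
    lzo_uint out_len = 0;
    lzo1x_1_compress(in.data(), in.size(),
                     out.data() + sizeof(EnvelopeHdr), &out_len, wrk.data());
    EnvelopeHdr hdr = {kCompressedMarker,
                       static_cast<uint32_t>(in.size()),
                       static_cast<uint32_t>(out_len)};
    std::memcpy(out.data(), &hdr, sizeof hdr);
    out.resize(sizeof hdr + out_len);
    return out;
}

// On readback the I/O layer undoes the envelope; higher-level code never
// notices whether the buffer on disk was compressed or not.
std::vector<unsigned char> read_buffer(std::vector<unsigned char>& stored) {
    EnvelopeHdr hdr;
    std::memcpy(&hdr, stored.data(), sizeof hdr);
    if (hdr.marker != kCompressedMarker) return stored;  // plain buffer, pass through
    std::vector<unsigned char> out(hdr.original_bytes);
    lzo_uint out_len = out.size();
    lzo1x_decompress(stored.data() + sizeof hdr, hdr.payload_bytes,
                     out.data(), &out_len, nullptr);
    return out;
}

int main() {
    std::vector<unsigned char> raw(64 * 1024, 0x42);  // stand-in for an event buffer
    std::vector<unsigned char> stored = compress_buffer(raw);
    std::vector<unsigned char> back = read_buffer(stored);
    return back == raw ? 0 : 1;  // round trip restores the original buffer
}
```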


Breakthrough: Distributed Compression

[Same Event Builder diagram as before: the SEBs feed the ATPs through the Gigabit crossbar switch, and the ATPs write to the Buffer Boxes and on to HPSS]

The compression is handled in the "Assembly and Trigger Processors" (ATPs) and can thus be distributed over many CPUs -- that was the breakthrough


The Event builder has to cope with the uncompressed data flow, e.g. 500MB/s … 900MB/s

The buffer boxes and storage system see the compressed data stream, 250MB/s … 450MB/s
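To get a feel for why distributing the compression over the ATPs is enough, here is a toy estimate; the per-CPU LZO throughput is an assumed ballpark figure for hardware of that era, not a measured PHENIX number:

```cpp
#include <cstdio>

int main() {
    // Quoted stream rates (upper ends of the ranges above).
    const double uncompressed_mbps = 900.0;  // into the event builder
    const double compressed_mbps   = 450.0;  // onto the buffer boxes
    std::printf("effective compression factor: ~%.1f\n",
                uncompressed_mbps / compressed_mbps);

    // Assumption for illustration only: a single CPU of that era LZO-compresses
    // on the order of 100 MB/s of raw data.
    const double per_cpu_mbps = 100.0;
    std::printf("CPUs needed for the full stream: ~%.0f\n",
                uncompressed_mbps / per_cpu_mbps);
    // With roughly ten ATPs sharing the load (as in the sketch above), each node
    // only compresses its own slice of the stream, so no single machine has to
    // keep up with the full 900 MB/s.
    return 0;
}
```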


Buffer Boxes

• We use COTS dual-CPU machines, some with SuperMicro or Tyan motherboards designed for data throughput

• SATA 400 GB disk arrays (16 disks), SATA-to-SCSI RAID controller in striping (RAID-0) mode

• 4 file systems of 1.7 TB each, striped across 4 disks, on each of the 6 boxes, named a, b, c, d. All "a" (and likewise b, c, d) file systems across the machines form one "set", written to or read from at the same time (see the sketch after this list)

• File system set either written to or read from, not both

• Each buffer box adds a quantum of 2x 100 MB/s rate capability (in and out) on 2 interfaces; 6 buffer boxes = a demonstrated 600 MB/s. Expansion is cheap, all commodity hardware
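The either-written-or-read rule for the file-system sets can be pictured as a small state machine; the scheduling logic below is a hypothetical illustration, not the actual buffer-box software:

```cpp
#include <array>
#include <cstdio>

// Hypothetical model of the buffer-box file-system "sets": the "a" file systems
// of all six boxes form one set, the "b" file systems the next, and so on.
// At any time exactly one set is being written; a full set is handed to the
// HPSS movers for draining and never serves both roles at once.
enum class SetState { Writing, Draining, Idle };

struct FsSet {
    char     label;        // 'a' .. 'd'
    SetState state;
    double   capacity_tb;  // 6 boxes x 1.7 TB each
};

int main() {
    std::array<FsSet, 4> sets = {{
        {'a', SetState::Writing, 6 * 1.7},
        {'b', SetState::Idle,    6 * 1.7},
        {'c', SetState::Idle,    6 * 1.7},
        {'d', SetState::Idle,    6 * 1.7},
    }};

    // When the writing set fills up, flip it to Draining and promote the next
    // idle set to Writing -- enforcing "written to or read from, not both".
    auto rotate = [&](int full) {
        sets[full].state = SetState::Draining;
        for (auto& s : sets) {
            if (s.state == SetState::Idle) { s.state = SetState::Writing; break; }
        }
    };

    rotate(0);  // pretend the 'a' set just filled up
    for (const auto& s : sets) {
        std::printf("set %c (%.1f TB): %s\n", s.label, s.capacity_tb,
                    s.state == SetState::Writing  ? "writing" :
                    s.state == SetState::Draining ? "draining to HPSS" : "idle");
    }
    return 0;
}
```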


Buffer Boxes (2)

• Short-term data rates are typically as high as ~450MB/s

• But the longer you average, the smaller the average data rate becomes (include machine development days, outages, etc)

• Run 4: highest rate 450MB/s, whole run average about 35MB/s

• So: The longer you can buffer, the better you can take advantage of the breaks in the data flow for your tape logging

• Also, a given file can linger on the disk until the space is needed, 40 hours or so -- the opportunity to do calibrations, skim off some events for fast-track reconstruction, etc

• “Be ready to reconstruct by the time the file leaves the countinghouse”


Run 5 Event statistics

Just shy of 0.5 PetaByte of raw data on tape in Run 5:

CuCu 200 GeV     2.2 billion events    157 TB on tape
CuCu 62.4 GeV    630 million events     42 TB on tape
CuCu 22 GeV       48 million events      3 TB on tape
p-p 200 GeV      6.8 billion events    286 TB on tape
------------------------------
Total: 488 TB on tape for Run 5

Had about 350 TB in Run 4 and about 150 TB in the runs before.


How to analyze these amounts of data?

• Priority reconstruction and analysis of “filtered” events (Level2 trigger algorithms offline, filter out the most interesting events)

• Shipped a whole raw data set (Run 5 p-p) to Japan to a regional PHENIX Computing Center (CCJ)

• Radically changed the analysis paradigm to a “train-based” one


Near-line Filtering and Reconstruction

We didn't run Level-2 triggers in the ATPs, but ran the "triggers" on the data on disk and extracted the interesting events for priority reconstruction and analysis

The buffer boxes where the data are stored for ~40 hours come in handy

The filtering was started and monitored as part of the shift operations, overseen by two or three experts

The whole CuCu data set was run through the filters

The so-reduced data set was sent to Oak Ridge where a CPU farm was available

Results were shown at the Quark Matter conference in August 2005. That's record speed -- the CuCu part of Run 5 ended in April.


Data transfers for p-p

Going into the proton-proton half of the run, the BNL Computing Center RCF was busy with last year’s data and the new CuCu data set

This was already the run-up to Quark Matter (QM), the most important conference in the field -- gotta have data, can't wait and make room for other projects

no adequate resources for pp reconstruction at BNL

Historically, large-scale reconstruction could only be done at BNL, where the raw data are stored; only smaller DST data sets could be exported… but times change

Our computing center in Japan, CCJ, had resources… we had plans to set up a "FedEx-tape network" to move the raw data there for speedy reconstruction

But then the GRID made things easier...

“This seems to be the first time that a data transfer of such magnitude was sustained over many weeks in actual production, and was handled as part of routine operation by non-experts.”


Shift Crew Sociology

The overall goal is a continuous, standard, and sustainable operation, avoid heroic efforts.

The shift crews need to do all routine operations, let the experts concentrate on expert stuff.

Automate. Automate. And automate some more.

Prevent the shift crews from getting "innovative" -- we find that we usually later throw out the data where crews deviated from the prescribed path

In Run 5 we seemed to have settled into a nice routine operation

Encourage the shift crews to CALL THE EXPERT!

Given that, my own metric for how well it works is how many nights I and a half-dozen other folks get uninterrupted sleep


Shifts in the Analysis Paradigm

After a good amount of experimenting, we found that simple concepts work best. PHENIX reduced the level of complexity in the analysis model quite a bit.

We found:

You can overwhelm any storage system by having it run as a free-for-all. Random access to files at those volumes brings anything down to its knees.

People don’t care too much what data they are developing with (ok, it has to be the right kind)

Every once in a while you want to go through a substantial dataset with your established analysis code


Shifts in the Analysis Paradigm

• We keep some data of the most-wanted varieties on disk.

• That disk-resident dataset mostly remains the same; we add data to the collection as we get newer runs.

• The stable disk-resident dataset has the advantage that you can immediately compare the effect of a code change while you are developing your analysis module

• Once you think it’s mature enough to see more data, you register it with a “train”.


Analysis Trains

• After some hard lessons with the more free-for-all model, we established the concept of analysis trains.

• Pulling a lot of data off the tapes is expensive (in terms of time/resources)
• Once you go to the trouble, you want to get as much "return on investment" for that file as possible - do all the analysis you want while it's on disk

• We also switched to tape (cartridge)-centric retrievals - once a tape is mounted, get all the files off while the getting is good (see the sketch at the end of this list)

• Hard to put a speed-up factor on this, but we went from impossible to analyze to an "ok" experience. On paper the speed-up is about 30.

• So now the data gets pulled from the tapes, and any number of registered analysis modules run over it -- very efficient

• You can still opt out for certain files you don’t want or need

If you don't do that, the number of file retrievals explodes - I request the file today, the next person requests it tomorrow, a third person next week
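A minimal sketch of the tape-centric retrieval mentioned above; the types and file names are invented stand-ins for the real file catalog and staging machinery:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Invented stand-ins: in reality the cartridge ID would come from the HPSS
// file catalog, and the printouts would be calls into the real staging and
// analysis-train machinery.
struct FileRequest {
    std::string path;
    std::string cartridge;  // tape the file lives on
};

void process_requests(const std::vector<FileRequest>& requests) {
    // Group all requested files by cartridge first...
    std::map<std::string, std::vector<std::string>> by_tape;
    for (const auto& r : requests) by_tape[r.cartridge].push_back(r.path);

    // ...then mount each tape exactly once and read everything it holds,
    // instead of mounting it again for every individual analysis request.
    for (const auto& [tape, files] : by_tape) {
        std::printf("mount %s, stage %zu file(s)\n", tape.c_str(), files.size());
        for (const auto& f : files) {
            std::printf("  run all registered train modules on %s\n", f.c_str());
        }
    }
}

int main() {
    process_requests({
        {"run5/cucu/file_0001.prdf", "TAPE042"},
        {"run5/cucu/file_0002.prdf", "TAPE042"},
        {"run5/pp/file_0100.prdf",   "TAPE117"},
    });
    return 0;
}
```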


Analysis Train Etiquette

Your analysis module has to obey certain rules:
• be of a certain module kind to make it manageable by the train conductor
• be a "good citizen" -- mostly enforced by inheriting from the module parent class, starting from templates, and code review
• Code mustn't crash, have no endless loops, no memory leaks
• pass prescribed Valgrind and Insure tests
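As an illustration of the module-parent-class rule, a hypothetical sketch -- the class and method names below are invented and do not claim to match the actual PHENIX framework interface:

```cpp
#include <string>

// Hypothetical train-module interface: the train conductor only ever talks to
// this base class, which is what makes modules manageable and interchangeable.
class TrainModule {
public:
    explicit TrainModule(std::string name) : name_(std::move(name)) {}
    virtual ~TrainModule() = default;

    virtual int Init() { return 0; }                  // book histograms, open output
    virtual int ProcessEvent(const void* event) = 0;  // called once per event
    virtual int End() { return 0; }                   // write and close output

    const std::string& name() const { return name_; }

private:
    std::string name_;
};

// A user analysis registers by deriving from the parent class and filling in
// only the event-level work; the train conductor owns the file loop.
class MyDimuonAnalysis : public TrainModule {
public:
    MyDimuonAnalysis() : TrainModule("MyDimuonAnalysis") {}
    int ProcessEvent(const void* /*event*/) override {
        // ... reconstruct pairs, fill histograms ...
        return 0;  // never throw, never loop forever -- the whole train waits
    }
};
```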

This train concept has streamlined the PHENIX analysis in ways that are hard to overstate.

After the train, the typical output is relatively small and fits on disk

Made a mistake? Or forgot to include something you need? Bad, but not too bad… fix it, test it, the next train is leaving soon. Be on board again.


Other Advantages

When we started this, there was no production-grade GRID yet, so we had to have something that “just worked”

Less ambitious than current GRID projects

The train allows someone else to run your analysis (and many others' at the same time, with essentially the same effort) on your behalf

Can make use of other, non-Grid-aware, installations

Only one debugging effort to adapt to slightly different environments


Summary (and Lessons)

• We exceeded our design data rate by a factor of 60 (2 by compression, 30 by logging rate increases)

• We wrote 350 TB in Run 4, 500TB in Run 5, ~150 TB in the runs before -> ~1PB

• Increased data rate simplifies many tough trigger-related problems

• Get physics that one can't trigger on
• Additional hardware relatively inexpensive and easily expandable
• Offline and analysis cope via the "train" paradigm shift


Lessons

• First and foremost, compress. Do NOT let “soft” data hit the disks. CPU power is comparably cheap. Distribute the compression load.

• Buffer the data. Even if your final storage system can deal with the instantaneous rates, take advantage of having the files around for a decent amount of time for monitoring, calibration, etc

• Don't be afraid to take more data than you can conceivably analyze before the next run. There's a new generation of grad students and post-docs coming, and technology is on your side

• Do not allow random access to files on a large scale. Impose a concept that maximizes the “return on investment” (file retrieval)

The End
