The SAMGrid Data Handling System
Outline: What Is SAMGrid? · Use Cases for SAMGrid in Run II Experiments · Current Operational Load · Stress Testing · Future Project Plans
Wyatt Merritt (FNAL/CD/DØ) for the SAMGrid Team (talk created by Adam Lyon)
What is SAMGrid?
A data handling system for Run II DØ and CDF.
SAMGrid manages file storage (replica catalogs): data files are stored in tape systems at FNAL and elsewhere (most use ENSTORE at FNAL), and files are cached around the world for fast access.
SAMGrid manages file delivery: users at FNAL and remote sites retrieve files out of file storage. SAMGrid handles caching for efficiency, so you don't care about file locations.
SAMGrid manages file metadata cataloging: the SAMGrid DB holds metadata for each file, so you don't need to know the file names to get data.
SAMGrid manages analysis bookkeeping: SAMGrid remembers what files you ran over, what files you processed successfully, what applications you ran, and when and where you ran them.
SAMGrid Terms and Concepts
A project runs on a station and requests delivery of a dataset to one or more consumers on that station.
Station: processing power + disk cache + (connection to tape storage) + network access to the SAMGrid catalog and other station caches. Example: a Linux analysis cluster at DØ.
Dataset: a metadata description that is resolved through a catalog query to a list of files. Datasets are named. Examples (syntax not exact): data_type physics and run_number 78904 and data_tier raw; request_id 5879 and data_tier thumbnail.
Consumer: a user application (one or many executable instances). Examples: a script to copy files; a reconstruction job.
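The dataset idea above can be sketched in a few lines: a dataset definition is a set of metadata constraints that a catalog query resolves to a file list. This is an illustrative toy, not the real SAMGrid catalog or client API; the catalog records and the `resolve` helper are invented, and the field names simply mirror the slide's examples.

```python
# Toy metadata catalog: each record describes one file (illustrative data).
CATALOG = [
    {"file": "raw_078904_001.dat", "data_type": "physics",
     "run_number": 78904, "data_tier": "raw"},
    {"file": "raw_078904_002.dat", "data_type": "physics",
     "run_number": 78904, "data_tier": "raw"},
    {"file": "tmb_005879_001.dat", "request_id": 5879,
     "data_tier": "thumbnail"},
]

def resolve(dataset_query):
    """Return files whose metadata matches every constraint in the query."""
    return [rec["file"] for rec in CATALOG
            if all(rec.get(k) == v for k, v in dataset_query.items())]

# "data_type physics and run_number 78904 and data_tier raw"
raw_dataset = resolve({"data_type": "physics", "run_number": 78904,
                       "data_tier": "raw"})
# "request_id 5879 and data_tier thumbnail"
tmb_dataset = resolve({"request_id": 5879, "data_tier": "thumbnail"})
```

The point is that users describe files by metadata, never by file name; the catalog does the translation.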
Physics Analysis Use Cases
I. Process Simulated Data
II. Process Unskimmed Collider Data
III. Process Skimmed Collider Data
IV. Process Missed/New Data
V. Analyze Sets of Individual Events ("Pick Events")
Note that additional use cases for simulated-data production and reconstruction are not covered here.
I. Process Simulated Data
Look up the simulation request with parameters of interest, e.g. request 5874 was generated using Pythia with m_t = 174 GeV/c².
Define a dataset (via command line or GUI): request_id 5874 and data_tier thumbnail.
Submit the project to a SAMGrid station and submit executable instance(s) to the batch system (our tools make that easy). The consumer is started; the station delivers files to the executable instance(s) and marks which files were delivered and consumed successfully and which had errors.
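The delivery-and-bookkeeping step at the end of this workflow can be sketched as a simple loop: the station hands each file in the snapshot to the consumer and records which files were consumed successfully and which had errors. All names here are hypothetical stand-ins, not SAMGrid internals.

```python
def run_project(snapshot, process):
    """Deliver each file in the snapshot; record consumed vs. errored."""
    status = {}
    for f in snapshot:
        try:
            process(f)             # consumer works on the delivered file
            status[f] = "consumed"
        except Exception:
            status[f] = "error"    # bookkeeping: file was not consumed
    return status

def process(f):
    """Placeholder consumer; fails on one file to show error bookkeeping."""
    if f == "bad.dat":
        raise IOError("unreadable file")

snapshot = ["tmb_5874_001.dat", "tmb_5874_002.dat", "bad.dat"]
status = run_project(snapshot, process)
```

This per-file status record is what later enables the "process missed data" use case: the system knows exactly which files were never consumed.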
II. Process Unskimmed Collider Data
Define a dataset by describing the files of interest (not by listing file names), using the command line or GUI: data_tier thumbnail and version p14.06.01 and run_type physics and run_qual_group MUO and run_quality GOOD.
Submit the project to a SAMGrid station and submit executable instance(s) to the batch system (our tools make that easy). The consumer is started; the station delivers files to the executable instance(s) and marks which files were delivered and consumed successfully and which had errors.
III. Process Skimmed Collider Data
Someone (a physics group, the Common Skimming Group, or an individual) has produced skimmed files and created a dataset that describes them.
You can submit projects/jobs using their dataset name, OR create a new dataset based on theirs with additional constraints: __set__ DiElectronSkim and run_number 168339.
Submission is the same as before.
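Refining an existing dataset, as in "__set__ DiElectronSkim and run_number 168339" above, amounts to intersecting the named set's file list with an extra metadata constraint. A minimal sketch, with an invented skim and run-number lookup (neither is real SAMGrid data or API):

```python
# Hypothetical named dataset and per-file run-number metadata.
SKIMS = {"DiElectronSkim": ["skim_168339_001.dat", "skim_168339_002.dat",
                            "skim_170001_001.dat"]}
RUN_OF = {"skim_168339_001.dat": 168339, "skim_168339_002.dat": 168339,
          "skim_170001_001.dat": 170001}

def refine(set_name, run_number):
    """Intersect a named dataset with a run_number constraint."""
    return [f for f in SKIMS[set_name] if RUN_OF[f] == run_number]

subset = refine("DiElectronSkim", 168339)
```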
IV. Process Missed/New Data
The set of files that satisfy the dataset query at a given time is a snapshot, and it is remembered with the SAMGrid project information.
One can make new datasets with: files that satisfy the dataset but are newer than the snapshot (new since the project ran), or files that should have been processed by the original project but were not delivered or not consumed:
__set__ myDataSet minus (project_name myProject and consumed_status consumed and consumer lyon)
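The "minus" query above is a set difference: take the files that satisfy the dataset now and subtract those the original project already consumed. A sketch with invented file names:

```python
def unprocessed(current_files, consumed_files):
    """Files in the dataset now that the original project did not consume."""
    consumed = set(consumed_files)
    return [f for f in current_files if f not in consumed]

current = ["a.dat", "b.dat", "c.dat", "d.dat"]   # satisfies myDataSet today
consumed = ["a.dat", "c.dat"]                    # consumed by myProject
todo = unprocessed(current, consumed)
```

The result covers both cases at once: files newer than the snapshot and files that errored out of the original run, since neither appears in the consumed list.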
V. Analyze Individual Events ("Pick Events")
Users want to analyze individual events with lower-level data tiers (e.g. raw): event displays, special investigations.
The SAMGrid catalog keeps information on each raw event (run number, event number, trigger bits, file).
Users run a pick-events tool that creates a dataset and submits an executable to extract and process these events.
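The per-event catalog described above makes pick-events a lookup problem: map each (run, event) pair to the file holding it, then stage only the distinct files needed. A toy sketch with an invented event index:

```python
# Hypothetical event index: (run_number, event_number) -> containing file.
EVENT_INDEX = {
    (168339, 1001): "raw_168339_001.dat",
    (168339, 1002): "raw_168339_001.dat",
    (170001, 42):   "raw_170001_007.dat",
}

def pick_events(events):
    """Return the distinct files containing the requested events."""
    return sorted({EVENT_INDEX[ev] for ev in events if ev in EVENT_INDEX})

files = pick_events([(168339, 1001), (168339, 1002), (170001, 42)])
```

Two events in the same file cost only one file delivery, which is why the tool builds a dataset of files rather than fetching event by event.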
Current SAMGrid Production Configurations
Large SMP: a 10 TB central cache with 128 attached processors; mostly used for Pick Events; local job submission (not Grid).
Linux production farms: reconstruction, Monte Carlo production, reprocessing. Various cache arrangements: distributed on worker nodes, NFS-mounted from a central head node, or routed through a central head node and then distributed. Local or Grid (Condor) job submission.
Linux analysis clusters: analysis jobs. Various cache arrangements: distributed on worker nodes, routed through a central head node and then distributed, or cached on the central node only. Local job submission or remote submission (with experiment-specific tools, not Condor).
SAMGrid Statistics - Usage Data
9000 Projects! 233 Different Users!
Data from January 6 until February 24 at DØ.
SAMGrid Statistics - Usage Data
[Chart: data consumed by tier (raw, thumbnails + …): 256 TB, 8.3 billion events]
Data from January 6 until February 24 at DØ.
SAMGrid Statistics - Operations Data
Time between "Request Next File" and "Open File":
For the CAB and CABSRV1 analysis-farm stations, 50% of enstore (tape) transfers occur within 10 minutes, 75% within 20 minutes, and 95% within 1 hour.
For CENTRAL-ANALYSIS (SMP station) and CLUED0 (desktop cluster station), 95% of enstore (tape) transfers occur within 10 minutes.
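Percentile figures like the ones above could be derived from a log of request-to-open wait times. A minimal nearest-rank percentile sketch; the wait-time sample is invented, not DØ data:

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of values."""
    s = sorted(values)
    k = max(0, int(round(p / 100.0 * len(s))) - 1)
    return s[k]

waits = [2, 4, 5, 7, 9, 12, 15, 18, 25, 55]  # minutes per transfer (toy data)
median = percentile(waits, 50)   # "50% within N minutes"
p95 = percentile(waits, 95)      # "95% within N minutes"
```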
Stress Testing
There are many station parameters to tune: maximum parallel transfers, maximum concurrent enstore requests, configuration of cache disks, …
We're moving away from d0mino to Linux. How robust are these Linux machines? How many projects can they run? How many concurrent file transfers can they handle?
We are running a test harness on a small cluster to explore the SAMGrid parameter space.
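Exploring a parameter space with a test harness typically means sweeping a grid of settings and recording a metric per combination. The sketch below uses the two parameter names from this slide; the value ranges and scoring function are placeholders, not the real harness:

```python
import itertools

# Candidate settings to sweep (illustrative values only).
max_parallel_transfers = [5, 10, 20]
max_enstore_requests = [2, 4]

def run_trial(transfers, enstore):
    """Placeholder for one test-harness run; returns a mock throughput."""
    return transfers * enstore  # stand-in metric, not a real measurement

results = {(t, e): run_trial(t, e)
           for t, e in itertools.product(max_parallel_transfers,
                                         max_enstore_requests)}
best = max(results, key=results.get)  # best-scoring (transfers, enstore) pair
```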
SAMGrid Projects
Recently completed: Python & C++ client API, improved DB schema, batch system adapters, 1st-generation monitoring.
Active & future: DB-middleware improvements, 2nd-generation monitoring, improved query language, Grid submission, conversion to the SRM interface for cache functions.
Summary
SAMGrid has been successfully used at DØ for all data handling. Over the past year, 44 stations consumed (remote = outside Fermilab):
3.6 million files (0.46 million files remote)
40 billion events (3 billion events remote)
1.6 petabytes of data (137 TB of data remote)
~25 million MC events produced remotely with SAMGrid; ~90 million events reprocessed remotely with SAMGrid.
SAMGrid will soon become the data handling system for CDF and later MINOS.
SAMGrid deployment on grids is contributing improvements to Grid tools & interfaces.
ENSTORE Statistics
0.6 petabytes in tape storage!
[Charts: data sizes (terabytes) and tape counts for 9940B, 9940A, and LTO media]
Only 5 files unrecoverable (5 GB total; 8 ppm loss)! One of them was a RAW file.
Some SAMGrid Buzzwords
Dataset definition: a set of requirements to obtain a particular set of files, e.g. data_tier thumbnail and run_number 181933. Datasets can change over time: more files that satisfy the dataset may be added to SAMGrid.
Snapshot: the files that satisfy a dataset at a particular time (e.g. when you start an analysis job). Snapshots are static.
Project: the running of an executable over files in SAMGrid. Consists of the dataset definition, the snapshot from that dataset definition, and application information. Bookkeeping data is kept: how many files did you successfully process, where did your job run, how long did it take.
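The three buzzwords above relate naturally as data structures: a mutable definition, an immutable snapshot taken from it, and a project tying a snapshot to an application plus bookkeeping. A minimal sketch; field names are illustrative, not the real SAMGrid schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetDefinition:
    name: str
    query: str            # e.g. "data_tier thumbnail and run_number 181933"

@dataclass(frozen=True)   # snapshots are static, so the type is immutable
class Snapshot:
    dataset: str
    files: tuple          # file list frozen at snapshot time

@dataclass
class Project:
    name: str
    snapshot: Snapshot
    application: str
    consumed: list = field(default_factory=list)   # bookkeeping grows as files finish

dd = DatasetDefinition("myTmb", "data_tier thumbnail and run_number 181933")
snap = Snapshot(dd.name, ("tmb_181933_001.dat",))
proj = Project("myProject", snap, "analyze_v1")
proj.consumed.append("tmb_181933_001.dat")
```

Making only the snapshot frozen mirrors the slide's distinction: the dataset definition may match new files tomorrow, but the snapshot a project ran over never changes.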