Upload
jasper-johnes
View
216
Download
2
Tags:
Embed Size (px)
Citation preview
1
Large Scale Computing at PDSF
Iwona SakrejdaNERSC User Services Group
February ??, 2006
2
Outline
• Role of PDSF in HENP computing.• Integration with other NERSC computational and storage
systems.• User management and user oriented services at NERSC• PDSF layout.• Workload management (batch systems)• File System implications of data intensive computing .• Operating system selection with CHOS.• Grid use at PDSF (Grid3, OSG, ITB)• Conclusions
33
PDSF Mission
PDSF (Parallel Distributed Systems Facility) is a networked distributed computing environment used to meet the detector simulation and data analysis requirements of large scale High Energy Physics (HEP) and Nuclear Science (NS) experiments.
4
PDSF Principle of Operation
• Multiple groups pool their resources together• Need for resources varies through the year – conferences,
data taking periods at different times (Quark Mater vs PANIC for example).
• Peak resource availability enhanced.• Idle cycles minimized by allowing groups with small
resources (cycle scavenging).• Software installation and license sharing (Totalview, IDL,
PGI)
5
PDSF at NERSC
HPSSIBM AIX Server
50 TB of cache disk8 STK robots, 44,000 tape slots,
max capacity 9 PB
PDSF~700 processors ~1.5 TF, .7 TB of
Memory~300 TB of Shared
Disk
HPSS
FC Disk
STKRobots
Testbeds and servers
SGI HPSS
Global Filesystem
Storage Fabric
Jumbo 10 Gigabit Ethernet
10 gigabit ethernet
Opteron Cluster – Jacquard
640 processors (peak: 2.8 Tflop/s
Opteron/Infiniband 4X/12X
3.1 TF/ 1.2 TB memorySSP - .41 Tflop/s
30 TB Disk
IBM POWER5 – Bassi888 processors (peak:
6.7 Tflop/s) SSP - .8 Tflop/s
2 TB Memory70 TB disk
IBM POWER3 - Seaborg6,080 processors (peak 9.1
TFlop/s)SSP – 1.35 Tflop/s
7.8 Terabyte Memory55 Terabytes of Shared Disk
Analytics Server - DaVinci32 Processors192 GB Memory
25 Terabytes Disk
6
User Management and Support at NERSC
• With >500 users and >10 projects a database management system needed.– Active user management (disabling, password expiration…)– Allocation management (especially mass storage accounting)
• PIs partly responsible for user management (from their own projects)– Adding users– Assigning users to groups– Removing users
• Users managing their own info, groups, certificates….• Account support • User Support and the trouble ticket system.
– Call center– Trouble ticket system
7
Overview of PDSF Layout
8
PDSF Layout
….. Interactive nodespdsf.nersc.gov pdsf.nersc.gov
Grid
gatekeepers
Batch pool – several generations of Intel and AMD processors
~1200 1GHz
Pool of disk vaults
GPFS file systemsHPSS
9
Workload Management (Batch)
• Effective resource sharing via batch workload management• Fair share principle links shares to groups financial
contributions– Fairness concept by groups and within groups– Concept at the heart of PDSF design
• Unused resources split among running users • Group sharing places additional requirement on batch
systems.• Impact of batch system
– LSF good scalability, performance and documentation, met requirements, costly
– Condor (concept of a group share not implemented when transition was considered – 2 years ago)
– SGE met requirements, scales reasonably, documentation lacking at times
• Changes minimized by SUMS (STAR)
10
Shares System at Work
STAR’s 70% share “pushes out” KamLAND (9% share)
SNO (1%, light blue), Majorana (no contribution) get time when the big share owners do not use it.
11
File System implications of data intensive computing - NFS
• NFS – cost effective solution but – scales poorly
– data corruption during heavy use
– data safety (raidset helps but not 100%)
• Disk vault are cheap IDE based centralized storage– Dvio batch-level “resource” integrated with the batch system – defined to limit number of simultaneous read/write access streams – hard to a priori asses load
• Ganglia facilitates load monitoring and the dvio requirement assessment – available to the users..
12
Usage per discipline
IO and data dominated by Nuclear Physics
13
File System implications of data intensive computing – local storage
• Local storage on batch nodes– Cheap storage (large and cheap hard drives)
– Very good I/O performance
– Limited to jobs running on the node
– Diversity of the user population does not facilitate batch node sharing
• users wary of Xrootd daemons
– No redundancy, drive failure causes data loss
– File catalog aids in job submission – SUMS does the rest
14
File System implications of data intensive computing - GPFS
• NERSC purchased GPFS software licenses for PDSF – Reliable (raid underneath)
– Good performance (striping)
– Self repairing• Even after disengaging under load comes back on-line• compare with “NFS stale file handles” (had to be fixed by either admin
or a cron job)
– Expensive
• PDSF hosts will host several GPFS file systems – 7 already in place
– ~15TB/filesystem – not enough experience with GPFS on linux
15
File System implications of data intensive computing – beta testing
• file system (open software version) testing – File system performed reasonably well under high load– support and maintenance manpower intensive
• Storage units from commercial vendors made available for beta testing– Support provided by vendors– Users get cutting edge, highly capable, storage appliances to use
for extensive periods of time– Staff obliged to produce reports – additional workload (light)– Units too expensive to purchase – work related to data uploading– Affordable units from new companies – uncertainty of support
continuity
16
Role of mass storage in data management
• Data intensive experiments require “smart backup”– Only $HOME, system and application area are automatically
backed up
– PDSF storage media reliable – but not disaster-proof.
– Groups have allocation in mass storage to selectively store their data
– Users have individual accounts in mass storage to backup their work
• Network bandwidth (10GB to HPSS)– large HPSS cache and large number of tape movers facilitate quick
access to stored data
– number of drives still an issue
17
Physical Sciences Dominate Storage Use
Percent of HPSS Allocation to Science Areas
0
5
10
15
20
25
30
35
40
45
50
acce
l phy
s
astro
phys
ics
chem
istry
clim
ate
+ en
vir
CS + m
ath
fusi
on e
nergy
geo +
eng
rHEP
QCD
life
scie
nces
mat
sci
nucle
ar p
hys
20022003200420052006
18
Operating system selection with CHOS
• PDSF is a secondary computing facility for most of the user groups– not free to independently select operating system– tied to the Tier0 selection
• PDSF projects originated at various times (in the past or still to come)– Tier0s embraced different operating systems, evolution
• PDSF accommodates needs of diverse groups with CHOS– framework for concurrently running multiple Linux environments
(distributions) on single node. – accomplished through a combination of the chroot system call, a
Linux kernel module, and some additional utilities. – can be configured so that users are transparently presented with
their selected distribution on login.
19
Operating system selection with CHOS (cont)
• Support for operating systems based on same kernel version.– RH7.2– RH8– RH9– SL 3.0.2
• Base system – SL 3.03– provides security– More info about CHOS available at:
http://www.nersc.gov/nusers/resources/PDSF/chos/faq.php
CHOS protected PDSF from fragmentation of resources – Unique approach to multi-group
support.Sharing possible even when diverse OS required.
20
Who Has Used the Grid at NERSC
• PDSF pioneered introduction of Grid services at NERSC.
• Participation in the Grid3 project
• Mostly PDSF (Parallel Distributed Systems Facility) users, who analyze detector data and simulations:
– STAR Detector Simulations and Data Analysis
• Studies the quark-gluon plasma and proton-proton collisions
• 631 collaborators from 52 project institutions• 265 users at NERSC …
– Simulations for the ALICE experiment at CERN
• Studies ion-ion collisions • 19 NERSC users from 11 institutions
– Simulations for the ATLAS experiment at CERN
• Studies fundamental particle processes• 56 NERSC users from 17 institutions
STAR ExperimentDetector
21
Caveats - Grid usage thoughts …
• Most NERSC Users are not Using the Grid
• The Office of Science “Massively Parallel Processing” (MPP) user communities have not embraced the grid
• Even on the PDSF, only a few “production managers” use the grid; most users do not
• Site policy side effects:– ATLAS and CMS stopped using the grid at NERSC due to lack of support
for group accounts– Difficult/tedious/confusing to get a Grid certificate– Lack of support at NERSC for Virtual Organizations
• One grid user’s opinion: instead of writing the middleware and troubleshooting just use a piece of paper to keep track of jobs and pftp for file transfers
• However, several STAR users have been testing the Grid for user analysis jobs, so interest may be growing.
22
STAR Grid Computing at NERSC
Grid computing benefits to STAR:1. Bulk data transfer RCF->NERSC with Storage
Resource Management (SRM) technologies– SRM automates end-to-end transfers: increased throughput
and reliability; less monitoring effort by data managers– Source/destination can be files on disk or in HPSS mass
storage system– 60 TB transferred in CY05 with automatic cataloging– Typical transfers are ~10k files, 5 days duration, 1 TB– Doubles STAR processing power since all data at two sites
SRM-COPY(thousands of files)SRM-COPY(thousands of files)
SRM-GET (one file at a time)SRM-GET (one file at a time)
GridFTP GET (pull mode)GridFTP GET (pull mode)
stage filesstage filesarchive filesarchive files
Network transferNetwork transfer
Get listof filesFrom directory
Get listof filesFrom directory
DiskCacheDiskCache
DataMover(Command-line Interface)
HRM(performs writes)
LBNL
DiskCacheDiskCache
HRM(performs reads)
BNL
23
STAR Grid Computing at NERSC (cont.)
Grid computing benefits to STAR:2. Grid-based job submission with STAR scheduler (SUMS)
• Production grid jobs are running daily from RCF to PDSF– SUMS job xml job description -> – condor-g grid job submission -> – SGE submission to PDSF batch system
• Uses SRMs for input and output file transfers• Handles catalog queries, job definitions, grid/local job
submission, etc. • Underlying technologies largely hidden from user
24
STAR Grid Computing at NERSC (cont.)
• Goal: use SUMS to run STAR user analysis and data mining jobs on OSG sites. Issues are:
– Transparent packaging and distribution of STAR software on OSG non-STAR-dedicated sites
– SRM services need to be deployed consistently at OSG sites (preferred) or deployed along with the jobs (how to do?)
– Inconsistencies of inbound/outbound site policies– SUMS Generic interface adaptable to other VOs
running on OSG offer community support
25
NERSC Contributions to the Grid
• myproxy.nersc.gov– Users don’t have to scp their certs to different sites– Safely stores credentials; uses ssl– Anyone can use it from anywhere– myproxy-init –s myproxy.nersc.gov– myproxy-get-delegation– Part of VDT and OSG software distribution
• Management of grid-map files– NERSC users put their certs into our NERSC Information Management
system– They automatically get propagated to all NERSC resources
• garchive.nersc.gov– GSI authentications added to the HPSS pftp client and server– Users can log in to HPSS using their grid certs– Software contributed to the HPSS consortium
26
Online Certification Services (in development)
• Would allow users to use grid services without having to get a grid cert
• myproxy-logon – s myproxy.nersc.gov• Generates a proxy cert on the fly• Built on top of PAM and Myproxy• Will use radius server to authenticate users• Radius is a protocol to securely send
authentication and auditing information between sites
• Can authenticate with LDAP, One Time Password or Grid cert
• Could be used to federate sites
27
Audit Trail for Group Accounts (proposed development)
• NERSC needs to trace back sessions and commands to individual users
• Some projects need to set up a production environment managed by multiple users (who can then jointly manage the production jobs and data)
• Build an environment that accepts multiple certs or multiple username/passwords for a single account
• Keep logs that can associate PID/UIDs with the actual user
• Provide audit trail that constructs the original authentication associated with the PID/UID
28
Conclusions
• NERSC/PDSF is a fully resource sharing facility– Several storage solutions evaluated, lots of choices and some emerging
trend (distributed file systems, IO balanced systems, …)– CPU shared based on financial contributions– Fully opportunistic (if not used, can be take by others)– NERSC will base its deployment decisions on science and user driven
requirements
• A lot of ongoing research in distributed computing technologies
• NERSC can contribute to STAR/OSG efforts:– Auditing and login tracing tools– Online certification services (integrate LDAP, One Time Passwords and
Grid certs)– Testbed for OSG software on HPC architectures– User Support