Evolution of a High Performance Computing and Monitoring system onto the GRID for High Energy Experiments
T.L. Hsieh, S. Hou, P.K. Teng
Academia Sinica, Taipei, Taiwan
[Figure: CDF GRID Computing. Labels: CDF users, PacificCAF (submitter, monitor, Condor, Glide-ins, scheduler), HTTP proxy, parrot, CDFSoft2, fcp, firewall, CE and Grid nodes at IPAS_OSG and Taiwan-IPAS-LCG2. Annotations: CAF is also a portal for users; resources are shared with others and acquired from the grid if needed; jobs are self-contained.]
[Figure: dCAF Computing Model. Labels: CDF user, submitter, monitor, Condor, worker nodes, CDFSoft2. Annotation: software distribution by NFS.]
[Figure: ATLAS GRID Computing. Labels: ATLAS users, Dispatcher, Glide-ins, submitter, AtlasSoft, SE, firewall, Taiwan-IPAS-LCG2 and Taiwan-LCG2 CE, Grid nodes. Annotation: resources are shared with others.]
Conglia web browser
The Conglia monitoring system was developed as an interface to Condor. It is a web-based system that displays job status, and it is particularly useful for debugging job and system errors by tracing the progress of the jobs. Commands in a running section are printed, and the CPU history is shown in graphs.
[Figure: Integrated resources and monitoring. Labels: Condor, WNs, middlewares, ASGC_OSG, IPAS_OSG, Taiwan-IPAS-LCG2, Pacific CAF, CDF UI, ATLAS UI, other UIs, Conglia.]
A distributed computing facility has been developed at Academia Sinica for remote data processing of high energy physics experiments. We first developed a dCAF (de-centralized CDF Analysis Farm) for the Collider Detector at Fermilab experiment, which provides a “portal” to users with a single submission prototype to access a dozen dCAFs in Pacific Asia, Europe and North America. The dCAF service includes:
Submitter: accepts a user’s job, which contains a task archive (tarball), and submits it to the local batch system (Condor is in use);
Monitor: offers limited access to the job scratch area and web-browser access to job information. Users can
1. get job status, list user jobs, and show the processes of a job;
2. remove, hold, and release jobs;
3. display the files of a job.
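The submitter and monitor roles listed above can be pictured with a small in-memory model. This is only an illustrative sketch, not the dCAF implementation: the `ToyPortal` class, its method names, and the job states are hypothetical, and no Condor call is made. Showing the processes of a job is left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    HELD = "held"
    REMOVED = "removed"


@dataclass
class Job:
    job_id: int
    user: str
    tarball: str                                # path to the user's task archive
    state: JobState = JobState.IDLE
    files: list = field(default_factory=list)   # names in the job scratch area


class ToyPortal:
    """Hypothetical stand-in for the submitter and monitor roles."""

    def __init__(self):
        self._jobs = {}
        self._next_id = 1

    # Submitter role: accept a tarball and queue it as a new job.
    def submit(self, user, tarball):
        job = Job(self._next_id, user, tarball)
        self._jobs[job.job_id] = job
        self._next_id += 1
        return job.job_id

    # Monitor role, item 1: job status and per-user job list.
    def status(self, job_id):
        return self._jobs[job_id].state

    def list_jobs(self, user):
        return [j.job_id for j in self._jobs.values()
                if j.user == user and j.state is not JobState.REMOVED]

    # Monitor role, item 2: hold, release, remove.
    def hold(self, job_id):
        job = self._jobs[job_id]
        if job.state in (JobState.IDLE, JobState.RUNNING):
            job.state = JobState.HELD

    def release(self, job_id):
        job = self._jobs[job_id]
        if job.state is JobState.HELD:
            job.state = JobState.IDLE

    def remove(self, job_id):
        self._jobs[job_id].state = JobState.REMOVED

    # Monitor role, item 3: display the files of a job.
    def list_files(self, job_id):
        return list(self._jobs[job_id].files)
```

A user session would call `submit` once per tarball and then use the monitor methods with the returned job id; in a real deployment these operations would be forwarded to the batch system rather than kept in a dictionary.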
The customized dCAF has been upgraded to the PacificCAF, which can allocate resources on the LCG (LHC Computing GRID) and OSG (Open Science GRID). It is a regional resource collector that has:
a Glide-in Condor pool with CPUs of the joined GRID sites in the Pacific Asia region;
Generic Connection Brokering (GCB) to nodes in GRID sites protected by firewalls;
a PARROT HTTP service for distribution of dedicated CDF software.
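The idea of a regional resource collector can be sketched as a function that spreads pending jobs over the free CPUs of the joined sites, using local CPUs first and grid CPUs only if needed. This is a simplified illustration under stated assumptions: the `Site` type, the `allocate` function, the local-first ordering rule, and the slot counts in the usage note are hypothetical, and the actual glide-in matchmaking is done by Condor.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_slots: int          # CPUs currently available to the pool (assumed known)
    is_local: bool = False


def allocate(sites, n_jobs):
    """Return (plan, unplaced): plan maps site name to jobs assigned there."""
    plan = {}
    remaining = n_jobs
    # Local sites come first; sorted() is stable, so grid sites keep input order.
    for site in sorted(sites, key=lambda s: not s.is_local):
        if remaining == 0:
            break
        take = min(site.free_slots, remaining)
        if take > 0:
            plan[site.name] = take
            remaining -= take
    return plan, remaining
```

For example, with a hypothetical local cluster of 2 free CPUs plus 3 free CPUs at Taiwan-IPAS-LCG2 and 4 at IPAS_OSG, `allocate(sites, 6)` fills the local cluster first and then takes the remaining CPUs from the grid sites; any jobs beyond the total free capacity are reported as unplaced and would wait in the queue.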
We have also developed Tier services for applications of the ATLAS experiment on the LCG. A coherent computing service model has been constructed to share GRID resources among users of high energy experiments. ATLAS users submit jobs to a Condor-G platform with a dispatcher server, which acts as a resource collector for a Condor pool of joined LCG sites.
In migrating to the GRID, we have integrated our computing clusters into a Condor system of over 250 CPUs that serves local users and provides GRID access to the LCG and OSG. The Condor system has multiple submission nodes: some act as User Interface (UI) nodes and some as Computing Element (CE) gatekeepers. We have a 2 Gb link to the Taiwan LHC Tier-1 GRID site, several Gb links to sites in the Pacific Asia region, and a total of 10 Gb to the US and Europe.
The 9th International Conference on High Performance Computing, Grid and e-Science in Asia Pacific, September 9-12 2007, Seoul, Korea