Distributed Services for Grid Enabled Data Analysis
Scenario
• Liz and John are members of CMS
• Liz is from Caltech and is an expert in event reconstruction
• John is from Florida and is an expert in statistical fits
• They wish to combine their expertise and collaborate on a CMS Data Analysis Project
• Grid Monitoring Service: MonALISA
• Grid Resource Service: VDT Server
• Grid Execution Service: VDT Client
• Grid Scheduling Service: Sphinx
• Virtual Data Service: Chimera
• Workflow Generation Service: ShahKar
• Collaborative Environment Service: CAVE
• Remote Data Service: Clarens
• Grid-services Web Service: Clarens
• Analysis Clients: IGUANA, ROOT, Web Browser, PDA
Demo Goals
• Prototype vertically integrated system
  – Transparent/seamless experience
• Distribute grid services using a uniform web service
  – Clarens!
  – Understand system
    • latencies
    • failure modes
• Investigate request scheduling in a resource-limited and dynamic environment
  – Emphasize functionality over scalability
• Investigate interactive vs. scheduled data analysis on a grid
  – Hybrid example
  – Understand where the difficult issues are
Data Discovery
Virtual data products are pre-registered with the Chimera Virtual Data Service.
Using Clarens, Liz and John discover data products by remotely browsing the Chimera Virtual Data Service.
[Figure: "request" / "browse" view of the Chimera Virtual Data System listing y.ntpl, y.root, x.ntpl, x.root. Derivation chains: y.cards → pythia → y.ntpl → h2root → y.root; x.cards → pythia → x.ntpl → h2root → x.root]
Data Analysis
Liz wants to analyse x.root using her analysis code a.C.
// Analysis code: a.C
#include <iostream>
#include <cmath>
#include "TFile.h"
#include "TTree.h"
#include "TBrowser.h"
#include "TH1.h"
#include "TH2.h"
#include "TH3.h"
#include "TRandom.h"
#include "TCanvas.h"
#include "TPolyLine3D.h"
#include "TPolyMarker3D.h"
#include "TString.h"

void a( char treefile[], char newtreefile[] )
{
  // Buffers for the branches of the input tree
  Int_t Nhep;
  Int_t Nevhep;
  Int_t Isthep[3000];
  Int_t Idhep[3000], Jmohep[3000][2], Jdahep[3000][2];
  Float_t Phep[3000][5], Vhep[3000][4];
  Int_t Irun, Ievt;
  Float_t Weight;
  Int_t Nparam;
  Float_t Param[200];

  // Open the input file and attach the branches
  TFile *file = new TFile( treefile );
  TTree *tree = (TTree*) file -> Get( "h10 …
  tree -> SetBranchAddress( "Nhep", &Nh …
[Figure: Chimera Virtual Data System. Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root]
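Interactive Workflow Generation
Liz browses the local directory for her analysis code and the Chimera Virtual Data Service for input LFNs (logical file names)…
[Figure: workflow client with "register" and "browse" actions and the steps "Select CINT script", "Define output LFN", "Select input LFN", connected to the Chimera Virtual Data System. Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root]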
Interactive Workflow Generation
She selects and registers (to the Grid) her analysis code, the appropriate input LFN, and a newly defined output LFN.
[Figure: the client lists the local scripts a.C, b.C, c.C, d.C and the catalog entries y.ntpl, y.root, x.ntpl, x.root; the output LFN xa.root is defined. Steps: "Select CINT script", "Define output LFN", "Select input LFN". Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root]
Interactive Workflow Generation
A branch is automatically added in the Chimera Virtual Data Catalog, and a.C is uploaded into “gridspace” and registered with RLS.
[Figure: selected script a.C, input LFN x.root, output LFN xa.root. Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root → root (a.C) → xa.root]
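The new branch in the catalog can be pictured as one more derivation record that maps an output LFN to the transformation and inputs that produce it. The following is a minimal C++ sketch under that assumption; `Derivation`, `Catalog`, `provenance`, and `demoCatalog` are hypothetical names chosen for illustration and are not the Chimera API.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record: how one logical file is derived.
struct Derivation {
    std::string transformation;       // e.g. "pythia", "h2root", "root"
    std::vector<std::string> inputs;  // input LFNs and scripts
};

// Hypothetical catalog: output LFN -> derivation that produces it.
using Catalog = std::map<std::string, Derivation>;

// Depth-first walk: inputs are resolved before the step that consumes them.
// Names without a catalog entry (x.cards, a.C) are treated as primary inputs.
// A shared input would be listed once per consumer; fine for a linear chain.
void collect(const Catalog& cat, const std::string& lfn,
             std::vector<std::string>& steps) {
    auto it = cat.find(lfn);
    if (it == cat.end()) return;
    for (const auto& in : it->second.inputs) collect(cat, in, steps);
    steps.push_back(it->second.transformation + " -> " + lfn);
}

// Returns the ordered list of steps needed to materialise `lfn`.
std::vector<std::string> provenance(const Catalog& cat, const std::string& lfn) {
    std::vector<std::string> steps;
    collect(cat, lfn, steps);
    return steps;
}

// The chain shown on the slides, plus the branch Liz has just registered.
Catalog demoCatalog() {
    Catalog cat;
    cat["x.ntpl"]  = {"pythia", {"x.cards"}};
    cat["x.root"]  = {"h2root", {"x.ntpl"}};
    cat["xa.root"] = {"root",   {"a.C", "x.root"}};
    return cat;
}
```

Calling `provenance(demoCatalog(), "xa.root")` yields the three steps pythia, h2root, root in execution order, which is the provenance information a virtual data request would need to plan.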
Interactive Workflow Generation
Querying the Virtual Data Service, Liz sees that xa.root is now available to her as a new virtual data product.
[Figure: the "request" / "browse" view of the Chimera Virtual Data System now lists y.ntpl, y.root, x.ntpl, x.root, xa.root. Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root → root (a.C) → xa.root]
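Request Submission
She requests it.
[Figure: xa.root is selected in the "request" / "browse" view of the Chimera Virtual Data System, which lists y.ntpl, y.root, x.ntpl, x.root, xa.root. Derivation chain: x.cards → pythia → x.ntpl → h2root → x.root → root (a.C) → xa.root]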
Brief Interlude: The Grid is Busy and Resources are Limited!
• Busy:
  – Production is taking place
  – Other physicists are using the system
  – Use MonALISA to avoid congestion in the grid
• Limited:
  – As grid computing becomes standard fare, oversubscription to resources will be common!
• CMS gives Liz a global high priority
• Based upon local and global policies, and current Grid weather, a grid-scheduler:
  – must schedule her requests for optimal resource use
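One way to picture the scheduling decision above is as two steps: order the pending requests by priority, then place each one on the least-loaded site that still has capacity. The sketch below assumes exactly that simplified policy; the `Request` and `Site` types, the numeric priorities, and the load and slot values are hypothetical and do not describe how Sphinx itself is implemented.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct Request { std::string user; int priority; };               // higher = more urgent
struct Site    { std::string name; double load; int freeSlots; }; // load as reported by monitoring

// Higher priority first; stable_sort keeps submission order among equals.
void orderByPriority(std::vector<Request>& queue) {
    std::stable_sort(queue.begin(), queue.end(),
                     [](const Request& a, const Request& b) {
                         return a.priority > b.priority;
                     });
}

// Index of the least-loaded site with a free slot, or -1 if none is free.
int pickSite(const std::vector<Site>& sites) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(sites.size()); ++i) {
        if (sites[i].freeSlots <= 0) continue;
        if (best < 0 || sites[i].load < sites[best].load) best = i;
    }
    return best;
}

// Returns (user, site) assignments; requests that do not fit stay unscheduled.
// Simplification: a site's load figure is not updated after an assignment.
std::vector<std::pair<std::string, std::string>>
schedule(std::vector<Request> queue, std::vector<Site> sites) {
    orderByPriority(queue);
    std::vector<std::pair<std::string, std::string>> plan;
    for (const auto& r : queue) {
        int s = pickSite(sites);
        if (s < 0) break;  // oversubscribed: the rest must wait
        plan.emplace_back(r.user, sites[s].name);
        --sites[s].freeSlots;
    }
    return plan;
}
```

With a high-priority request from Liz queued behind production work, her request is placed first and lands on the least-loaded site, while lower-priority requests take what remains or wait.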
Sphinx Scheduling Server
• Nerve Centre
  – Global view of system
• Data Warehouse
  – Information driven
  – Repository of current state of the grid
• Control Process
  – Finite State Machine
    • Different modules modify jobs, graphs, workflows, etc. and change their state
  – Flexible
  – Extensible
[Figure: Sphinx Server. Components: Control Process, Job Execution Planner, Graph Reducer, Graph Tracker, Job Predictor, Graph Data Planner, Job Admission Control, Message Interface, Graph Predictor, Graph Admission Control, Data Management, Information Gatherer. Data Warehouse contents: Policies, Accounting Info, Grid Weather, Resource Prop. and status, Request Tracking, Workflows, etc.]
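The finite-state-machine idea above can be sketched as a table that maps each job state to the module responsible for it; the control process keeps handing the job to the matching module until no module claims its state. This is only an illustration: the state names and the `ControlProcess` class are hypothetical, and the three module labels are borrowed from the component names on the slide without implying how those components actually work.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class JobState { Unadmitted, Admitted, Predicted, Planned };  // hypothetical states

struct Job {
    std::string id;
    JobState state;
    std::vector<std::string> log;  // which modules touched the job, in order
};

// A module handles jobs in one state and returns the state it leaves them in.
using Module = std::function<JobState(Job&)>;

class ControlProcess {
public:
    // Extensible: adding a module is one more table entry.
    void registerModule(JobState handles, Module m) {
        modules_[handles] = std::move(m);
    }

    // Drive the job until it reaches a state that no module handles.
    // A module that returns its own input state would loop forever.
    void run(Job& job) const {
        auto it = modules_.find(job.state);
        while (it != modules_.end()) {
            job.state = it->second(job);
            it = modules_.find(job.state);
        }
    }

private:
    std::map<JobState, Module> modules_;
};

ControlProcess makeDemoProcess() {
    ControlProcess cp;
    cp.registerModule(JobState::Unadmitted, [](Job& j) -> JobState {
        j.log.push_back("admission control");
        return JobState::Admitted;
    });
    cp.registerModule(JobState::Admitted, [](Job& j) -> JobState {
        j.log.push_back("predictor");
        return JobState::Predicted;
    });
    cp.registerModule(JobState::Predicted, [](Job& j) -> JobState {
        j.log.push_back("execution planner");
        return JobState::Planned;
    });
    return cp;
}
```

Because each module only knows the state it consumes and the state it produces, modules can be added or swapped without touching the loop in `run`.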
Distributed Services for Grid Enabled Data Analysis
[Figure: deployment of the services]
• Sphinx: Scheduling Service
• Sphinx/VDT: Execution Service
• MonALISA: Monitoring Service
• Chimera: Virtual Data Service
• RLS: Replica Location Service
• ROOT: Data Analysis Client
• Fermilab, Caltech, Iowa, Florida: File Service and VDT Resource Service at each site
• Connections are labelled Clarens, Globus, GridFTP and MonALISA
Collaborative Analysis
Meanwhile, John has been developing his statistical fits in b.C by analysing the data product x.root.
[Figure: the "request" / "browse" view lists y.ntpl, y.root, x.ntpl, x.root, xa.root, xb.root, with xb.root selected. Derivation chains: x.cards → pythia → x.ntpl → h2root → x.root; x.root → root (a.C) → xa.root; x.root → root (b.C) → xb.root]
Collaborative Analysis
After Liz has finished optimising the event reconstruction, John uses his analysis code b.C on her data product xa.root to produce the final statistical fits and results!
[Figure: the "request" / "browse" view lists y.root, x.ntpl, x.root, xa.root, xb.root, xab.root. Derivation chains: x.cards → pythia → x.ntpl → h2root → x.root; x.root → root (a.C) → xa.root; x.root → root (b.C) → xb.root; xa.root → root → xab.root]
Key Features
• Distributed Services Prototype in Data Analysis
  – Remote Data Service
  – Replica Location Service
  – Virtual Data Service
  – Scheduling Service
  – Grid-Execution Service
  – Monitoring Service
• Smart Replication Strategies for “Hot Data”
  – Virtual Data w.r.t. Location
• Execution Priority Management on a Resource Limited Grid
  – Policy Based Scheduling & QoS
  – Virtual Data w.r.t. Existence
• Collaborative Environment
  – Sharing of Datasets
  – Use of Provenance
Credits
• California Institute of Technology
  – Julian Bunn, Iosif Legrand, Harvey Newman, Suresh Singh, Conrad Steenberg, Michael Thomas, Frank Van Lingen, Yang Xia
• University of Florida
  – Paul Avery, Dimitri Bourilkov, Richard Cavanaugh, Laukik Chitnis, Jang-uk In, Mandar Kulkarni, Pradeep Padala, Craig Prescott, Sanjay Ranka
• Fermi National Accelerator Laboratory
  – Anzar Afaq, Greg Graham
DMC (Data Management Component)
• Scheduling the data transfers to achieve optimal workflow execution
• The problem: Combining data and Execution scheduling
• Various kinds of data transfers
• Smart replication
  – User initiated
  – Workflow based replication
  – Automatic replication
• Hot data management
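Automatic replication of hot data could, for example, be triggered by counting remote reads: once a site has fetched the same file remotely often enough, a replica is created there. The sketch below assumes that simple threshold rule; `HotDataTracker`, its threshold, and the bookkeeping are hypothetical and are not taken from the DMC.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

class HotDataTracker {
public:
    explicit HotDataTracker(int threshold) : threshold_(threshold) {}

    // Record that `site` already holds a copy of `lfn`.
    void addReplica(const std::string& lfn, const std::string& site) {
        replicas_[lfn].insert(site);
    }

    bool hasReplica(const std::string& lfn, const std::string& site) const {
        auto it = replicas_.find(lfn);
        return it != replicas_.end() && it->second.count(site) > 0;
    }

    // Record one read of `lfn` from `site`. Returns true when this read
    // pushes the remote-read count to the threshold, in which case the file
    // is considered hot at that site and a replica is registered there.
    bool recordAccess(const std::string& lfn, const std::string& site) {
        if (hasReplica(lfn, site)) return false;  // local read, nothing to do
        int& reads = remoteReads_[std::make_pair(lfn, site)];
        if (++reads < threshold_) return false;
        replicas_[lfn].insert(site);  // a real system would start a transfer here
        reads = 0;
        return true;
    }

private:
    int threshold_;
    std::map<std::string, std::set<std::string>> replicas_;
    std::map<std::pair<std::string, std::string>, int> remoteReads_;
};
```

With a threshold of 3, the third remote read of x.root from a site without a copy triggers replication, and later reads from that site count as local.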
Monitoring in SPHINX
• Scheduler needs information to make decisions
  – The information needs to be as “current” as possible
• That brings monitoring into the picture
  – Load Average
  – Free Memory
  – Disk Space
• Virtual Organization (VO) Quota System
  – Different policies for resources
  – Needs monitoring and accounting/tracking of resource quotas
• MonALISA
  – Dynamic discovery of sites
  – Configurable monitoring service and parameters
  – View Generation using filters
  – Displays SPHINX job information
• Future Directions
  – As the grid grows, the problem of latency becomes more acute
  – Solution: Data Fusion/Aggregation
  – In line with the hierarchical views of grid (VO) and the hierarchical scheduler!
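The aggregation idea in the last bullets can be illustrated by collapsing per-node samples into one summary per site, so that a higher-level scheduler receives a few records instead of one per node. The sketch assumes samples carrying a load average and free memory, two of the monitored quantities listed above; the `NodeSample` and `SiteSummary` types and the choice of mean and sum are hypothetical and are not MonALISA's data model.

```cpp
#include <map>
#include <string>
#include <vector>

struct NodeSample {
    std::string site;
    double loadAverage;
    double freeMemoryMB;
};

struct SiteSummary {
    int nodes = 0;
    double meanLoad = 0.0;
    double totalFreeMemoryMB = 0.0;
};

// Collapse node-level samples into one record per site.
std::map<std::string, SiteSummary>
aggregateBySite(const std::vector<NodeSample>& samples) {
    std::map<std::string, SiteSummary> out;
    for (const auto& s : samples) {
        SiteSummary& sum = out[s.site];
        sum.meanLoad += s.loadAverage;  // holds the running sum until the pass below
        sum.totalFreeMemoryMB += s.freeMemoryMB;
        ++sum.nodes;
    }
    // Turn each running sum into a mean; every entry has at least one node.
    for (auto& kv : out) kv.second.meanLoad /= kv.second.nodes;
    return out;
}
```

The same function could be applied again one level up (sites into a VO-wide summary) to match a hierarchical scheduler, at the cost of losing per-node detail.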