Distributed Services for Grid Enabled Data Analysis

Page 1: Distributed Services for Grid Enabled Data Analysis

Distributed Services for Grid Enabled Data Analysis

Page 2: Distributed Services for Grid Enabled Data Analysis

Scenario

• Liz and John are members of CMS
• Liz is from Caltech and is an expert in event reconstruction
• John is from Florida and is an expert in statistical fits
• They wish to combine their expertise and collaborate on a CMS Data Analysis Project

Page 3: Distributed Services for Grid Enabled Data Analysis

[Diagram: services and the components that provide them]

• Grid Monitoring Service: MonALISA
• Grid Resource Service: VDT Server
• Grid Execution Service: VDT Client
• Grid Scheduling Service: Sphinx
• Virtual Data Service: Chimera
• Workflow Generation Service: ShahKar
• Collaborative Environment Service: CAVE
• Remote Data Service: Clarens
• Grid-services Web Service: Clarens
• Analysis Clients: IGUANA, ROOT, Web Browser, PDA

Demo Goals

• Prototype vertically integrated system
  – Transparent/seamless experience
• Distribute grid services using a uniform web service
  – Clarens!
  – Understand system
    • latencies
    • failure modes
• Investigate request scheduling in a resource limited and dynamic environment
  – Emphasize functionality over scalability
• Investigate interactive vs. scheduled data analysis on a grid
  – Hybrid example
  – Understand where the difficult issues are

Page 4: Distributed Services for Grid Enabled Data Analysis

Data Discovery

Chimera Virtual Data System

Virtual data products are pre-registered with the Chimera Virtual Data Service.

Using Clarens, Liz and John discover data products by remotely browsing the Chimera Virtual Data Service.

[Diagram: browse list (y.ntpl, y.root, x.ntpl, x.root) with request and browse controls; derivation chains y.cards → pythia → y.ntpl → h2root → y.root and x.cards → pythia → x.ntpl → h2root → x.root]
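The idea of browsing pre-registered virtual data products can be illustrated with a minimal Python sketch. It is not Chimera's API: the `Derivation` type, the `CATALOGUE` dictionary, and the `browse` and `plan` functions are hypothetical names introduced here, and the chains assume that pythia turns a cards file into an ntuple and that h2root converts the ntuple to a ROOT file, as the diagram suggests.

```python
from typing import Dict, List, NamedTuple, Tuple


class Derivation(NamedTuple):
    transformation: str      # program that produces the product
    inputs: Tuple[str, ...]  # logical file names it consumes


# Hypothetical in-memory catalogue mirroring the chains on the slide.
CATALOGUE: Dict[str, Derivation] = {
    "x.ntpl": Derivation("pythia", ("x.cards",)),
    "x.root": Derivation("h2root", ("x.ntpl",)),
    "y.ntpl": Derivation("pythia", ("y.cards",)),
    "y.root": Derivation("h2root", ("y.ntpl",)),
}


def browse(catalogue: Dict[str, Derivation]) -> List[str]:
    """List the registered virtual data products by logical file name."""
    return sorted(catalogue)


def plan(lfn: str, catalogue: Dict[str, Derivation]) -> List[str]:
    """Return the steps needed to materialise lfn, upstream steps first."""
    if lfn not in catalogue:  # a primary input such as x.cards
        return []
    derivation = catalogue[lfn]
    steps: List[str] = []
    for parent in derivation.inputs:
        steps.extend(plan(parent, catalogue))
    inputs = ", ".join(derivation.inputs)
    steps.append(f"{derivation.transformation}({inputs}) -> {lfn}")
    return steps


if __name__ == "__main__":
    print(browse(CATALOGUE))
    for step in plan("x.root", CATALOGUE):
        print(step)
```

Because each product records how it is derived, a request for x.root can be answered either by locating an existing copy or by re-running the recorded steps.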

Page 5: Distributed Services for Grid Enabled Data Analysis

Data Analysis

Liz wants to analyse x.root using her analysis code a.C

    // Analysis code: a.C
    #include <iostream.h>
    #include <math.h>

    #include "TFile.h"
    #include "TTree.h"
    #include "TBrowser.h"
    #include "TH1.h"
    #include "TH2.h"
    #include "TH3.h"
    #include "TRandom.h"
    #include "TCanvas.h"
    #include "TPolyLine3D.h"
    #include "TPolyMarker3D.h"
    #include "TString.h"

    void a( char treefile[], char newtreefile[] )
    {
      Int_t Nhep;
      Int_t Nevhep;
      Int_t Isthep[3000];
      Int_t Idhep[3000], Jmohep[3000][2], Jdahep[3000][2];
      Float_t Phep[3000][5], Vhep[3000][4];
      Int_t Irun, Ievt;
      Float_t Weight;
      Int_t Nparam;
      Float_t Param[200];

      TFile *file = new TFile( treefile );
      TTree *tree = (TTree*) file -> Get( "h10 …
      tree -> SetBranchAddress( "Nhep", &Nh …

Chimera Virtual Data System

[Diagram: derivation chain x.cards → pythia → x.ntpl → h2root → x.root]

Page 6: Distributed Services for Grid Enabled Data Analysis

Interactive Workflow Generation

Chimera Virtual Data System

Liz browses the local directory for her analysis code and the Chimera Virtual Data Service for input LFNs…

[Diagram: register and browse controls; steps: select CINT script, define output LFN, select input LFN; derivation chain x.cards → pythia → x.ntpl → h2root → x.root]

Page 7: Distributed Services for Grid Enabled Data Analysis

Interactive Workflow Generation

Chimera Virtual Data System

She selects and registers (to the Grid) her analysis code, the appropriate input LFN, and a newly defined output LFN.

[Diagram: register and browse controls; virtual data products y.ntpl, y.root, x.ntpl, x.root; local scripts a.C, b.C, c.C, d.C; output LFN xa.root; steps: select CINT script, define output LFN, select input LFN; derivation chain x.cards → pythia → x.ntpl → h2root → x.root]

Page 8: Distributed Services for Grid Enabled Data Analysis

Interactive Workflow Generation

Chimera Virtual Data System

A branch is automatically added in the Chimera Virtual Data Catalog, and a.C is uploaded into “gridspace” and registered with RLS.

[Diagram: register and browse controls; selected script a.C, input LFN x.root, output LFN xa.root; derivation chain x.cards → pythia → x.ntpl → h2root → x.root, with the new branch x.root → root (a.C) → xa.root]

Page 9: Distributed Services for Grid Enabled Data Analysis

Interactive Workflow Generation

Chimera Virtual Data System

Querying the Virtual Data Service, Liz sees that xa.root is now available to her as a new virtual data product.

[Diagram: browse list (y.ntpl, y.root, x.ntpl, x.root, xa.root) with request and browse controls; derivation chain x.cards → pythia → x.ntpl → h2root → x.root → root (a.C) → xa.root]

Page 10: Distributed Services for Grid Enabled Data Analysis

Request Submission

Chimera Virtual Data System

She requests it…

[Diagram: browse list (y.ntpl, y.root, x.ntpl, x.root, xa.root) with xa.root requested; derivation chain x.cards → pythia → x.ntpl → h2root → x.root → root (a.C) → xa.root]

Page 11: Distributed Services for Grid Enabled Data Analysis

Brief Interlude: The Grid is Busy and Resources are Limited!

• Busy:
  – Production is taking place
  – Other physicists are using the system
  – Use MonALISA to avoid congestion in the grid
• Limited:
  – As grid computing becomes standard fare, oversubscription to resources will be common!
• CMS gives Liz a global high priority
• Based upon local and global policies, and current Grid weather, a grid-scheduler:
  – must schedule her requests for optimal resource use
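One way to picture a scheduler that weighs priority against current grid conditions is the toy Python sketch below. It is not the Sphinx algorithm: the `schedule` function, the numeric priorities, and the load and slot figures are assumptions made for illustration. Requests are served in priority order, and each is sent to the least loaded site that still has a free slot.

```python
import heapq
from typing import Dict, List, Tuple

Request = Tuple[str, str, int]  # (user, requested LFN, priority); higher wins


def schedule(
    requests: List[Request],
    site_load: Dict[str, float],
    free_slots: Dict[str, int],
) -> List[Tuple[str, str, str]]:
    """Assign requests to sites: highest priority first, least loaded site first."""
    # Negate the priority because heapq is a min-heap; the submission order
    # breaks ties so equal priorities are served first come, first served.
    queue = [(-prio, order, user, lfn)
             for order, (user, lfn, prio) in enumerate(requests)]
    heapq.heapify(queue)

    slots = dict(free_slots)  # work on a copy, leave the caller's view intact
    assignments: List[Tuple[str, str, str]] = []
    while queue:
        _, _, user, lfn = heapq.heappop(queue)
        candidates = [site for site, n in slots.items() if n > 0]
        if not candidates:
            break  # the grid is oversubscribed; remaining requests must wait
        site = min(candidates, key=lambda s: (site_load[s], s))
        slots[site] -= 1
        assignments.append((user, lfn, site))
    return assignments


if __name__ == "__main__":
    demo = schedule(
        [("john", "xb.root", 1), ("liz", "xa.root", 10)],
        {"caltech": 0.9, "florida": 0.2, "fermilab": 0.5},
        {"caltech": 1, "florida": 1, "fermilab": 0},
    )
    print(demo)
```

In the demo call, Liz's higher priority lets her request claim the lightly loaded site even though John submitted first.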

Page 12: Distributed Services for Grid Enabled Data Analysis

Sphinx Scheduling Server

• Nerve Centre
  – Global view of system
• Data Warehouse
  – Information driven
  – Repository of current state of the grid
• Control Process
  – Finite State Machine
    • Different modules modify jobs, graphs, workflows, etc. and change their state
  – Flexible
  – Extensible

[Diagram: Sphinx Server]

• Components: Control Process, Message Interface, Graph Reducer, Graph Tracker, Graph Predictor, Graph Admission Control, Graph Data Planner, Job Predictor, Job Admission Control, Job Execution Planner, Data Warehouse, Data Management, Information Gatherer
• Information listed: Policies, Accounting Info, Grid Weather, Resource Prop. and status, Request Tracking, Workflows, etc.
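The finite state machine idea, where each module picks up work in one state and leaves it in another, can be sketched in a few lines of Python. The state names, the three module names, and the `PIPELINE` table are hypothetical; the slide does not give the actual Sphinx states or transition order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# state a module picks up -> (module name, state it leaves behind)
# All names here are illustrative assumptions, not the Sphinx ones.
PIPELINE: Dict[str, Tuple[str, str]] = {
    "submitted": ("admission_control", "admitted"),
    "admitted": ("predictor", "predicted"),
    "predicted": ("execution_planner", "planned"),
}


@dataclass
class Job:
    name: str
    state: str = "submitted"
    history: List[str] = field(default_factory=list)


def step(job: Job) -> bool:
    """Apply the one module responsible for the job's current state.

    Returns False when no module handles the state, i.e. the job is done.
    """
    if job.state not in PIPELINE:
        return False
    module, next_state = PIPELINE[job.state]
    job.history.append(module)  # record which module touched the job
    job.state = next_state
    return True


def run(job: Job) -> Job:
    """Drive the job through the state machine until it reaches a final state."""
    while step(job):
        pass
    return job


if __name__ == "__main__":
    print(run(Job("xa.root")))
```

Keeping the transitions in a table is what makes such a design easy to extend: adding a module means adding one entry rather than changing the control loop.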

Page 13: Distributed Services for Grid Enabled Data Analysis

Distributed Services for Grid Enabled Data Analysis

[Diagram: deployment of the services]

• Sphinx: Scheduling Service
• Sphinx/VDT: Execution Service
• Chimera: Virtual Data Service
• RLS: Replica Location Service
• MonALISA: Monitoring Service
• ROOT: Data Analysis Client
• Sites, each with a File Service and a VDT Resource Service: Fermilab, Caltech, Iowa, Florida
• Connection labels: Clarens, Globus, GridFTP, MonALISA

Page 14: Distributed Services for Grid Enabled Data Analysis

Collaborative Analysis

Meanwhile, John has been developing his statistical fits in b.C by analysing the data product x.root.

[Diagram: browse list (y.ntpl, y.root, x.ntpl, x.root, xa.root, xb.root) with xb.root requested; derivation chain x.cards → pythia → x.ntpl → h2root → x.root, with branches x.root → root (a.C) → xa.root and x.root → root (b.C) → xb.root]

Page 15: Distributed Services for Grid Enabled Data Analysis

Collaborative Analysis

After Liz has finished optimising the event reconstruction, John uses his analysis code b.C on her data product xa.root to produce the final statistical fits and results!

[Diagram: browse list (y.root, x.ntpl, x.root, xa.root, xb.root, xab.root) with request and browse controls; derivation chain x.cards → pythia → x.ntpl → h2root → x.root, with branches x.root → root (a.C) → xa.root, x.root → root (b.C) → xb.root, and xa.root → root → xab.root]

Page 16: Distributed Services for Grid Enabled Data Analysis

Key Features

• Distributed Services Prototype in Data Analysis
  – Remote Data Service
  – Replica Location Service
  – Virtual Data Service
  – Scheduling Service
  – Grid-Execution Service
  – Monitoring Service
• Smart Replication Strategies for “Hot Data”
  – Virtual Data w.r.t. Location
• Execution Priority Management on a Resource Limited Grid
  – Policy Based Scheduling & QoS
  – Virtual Data w.r.t. Existence
• Collaborative Environment
  – Sharing of Datasets
  – Use of Provenance

Page 17: Distributed Services for Grid Enabled Data Analysis

Credits

• California Institute of Technology
  – Julian Bunn, Iosif Legrand, Harvey Newman, Suresh Singh, Conrad Steenberg, Michael Thomas, Frank Van Lingen, Yang Xia
• University of Florida
  – Paul Avery, Dimitri Bourilkov, Richard Cavanaugh, Laukik Chitnis, Jang-uk In, Mandar Kulkarni, Pradeep Padala, Craig Prescott, Sanjay Ranka
• Fermi National Accelerator Laboratory
  – Anzar Afaq, Greg Graham

Page 18: Distributed Services for Grid Enabled Data Analysis

DMC (Data Management Component)

• Scheduling the data transfers to achieve optimal workflow execution
• The problem: Combining data and execution scheduling
• Various kinds of data transfers
• Smart replication
  – User initiated
  – Workflow based replication
  – Automatic replication
• Hot data management
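Automatic replication of hot data could, for instance, be driven by access counts. The Python sketch below is an illustration only, not the DMC implementation: the `plan_replicas` function, the per-site read threshold, and the sample access log are assumptions. A file is copied to a site once that site has read it often enough without holding a local replica.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def plan_replicas(
    accesses: Iterable[Tuple[str, str]],
    replicas: Dict[str, Set[str]],
    threshold: int,
) -> List[Tuple[str, str]]:
    """Decide which (lfn, site) copies to create for frequently read files.

    accesses  -- (site, lfn) pairs, one per remote or local read
    replicas  -- lfn -> sites that already hold a copy
    threshold -- reads from one site that make a file "hot" there (assumed)
    """
    counts = Counter(accesses)
    plan: List[Tuple[str, str]] = []
    for (site, lfn), reads in sorted(counts.items()):
        already_local = site in replicas.get(lfn, set())
        if reads >= threshold and not already_local:
            plan.append((lfn, site))
    return plan


if __name__ == "__main__":
    log = (
        [("florida", "x.root")] * 3
        + [("caltech", "x.root")] * 5
        + [("iowa", "xa.root")]
    )
    holdings = {"x.root": {"caltech"}, "xa.root": {"caltech"}}
    print(plan_replicas(log, holdings, threshold=3))
```

A real data management component would also have to weigh transfer cost and storage limits, which this sketch leaves out.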

Page 19: Distributed Services for Grid Enabled Data Analysis

Monitoring in SPHINX

• Scheduler needs information to make decisions.
  – The information needs to be as “current” as possible
• That brings monitoring into the picture
  – Load Average
  – Free Memory
  – Disk Space
• Virtual Organization (VO) Quota System
  – Different policies for resources
  – Needs monitoring and accounting/tracking of resource quotas
• MonALISA
  – Dynamic discovery of sites
  – Configurable monitoring service and parameters
  – View Generation using filters
  – Displays SPHINX job information
• Future Directions
  – As the grid grows, the problem of latency becomes more potent
  – Solution: Data Fusion/Aggregation
  – In line with the hierarchical views of the grid (VO) and the hierarchical scheduler!
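The two requirements above, current monitoring data and tracked quotas, can be combined in a small Python sketch. It does not use the MonALISA or SPHINX interfaces: `SiteReport`, `usable_sites`, the `max_age` cutoff, and the quota numbers are hypothetical. Sites with stale reports or an exhausted VO quota are dropped, and the rest are ranked by load average.

```python
from typing import Dict, List, NamedTuple


class SiteReport(NamedTuple):
    load_average: float
    free_memory_mb: int
    disk_space_gb: int
    timestamp: float  # when the values were measured, in seconds


def usable_sites(
    reports: Dict[str, SiteReport],
    used: Dict[str, int],
    quota: Dict[str, int],
    now: float,
    max_age: float,
) -> List[str]:
    """Sites with a fresh report and remaining VO quota, least loaded first."""
    eligible = []
    for site, report in reports.items():
        is_fresh = now - report.timestamp <= max_age
        has_quota = used.get(site, 0) < quota.get(site, 0)
        if is_fresh and has_quota:
            eligible.append(site)
    return sorted(eligible, key=lambda s: (reports[s].load_average, s))


if __name__ == "__main__":
    reports = {
        "caltech": SiteReport(0.5, 2048, 100, 990.0),
        "florida": SiteReport(0.1, 1024, 50, 900.0),   # stale report
        "iowa": SiteReport(0.3, 512, 20, 995.0),
        "fermilab": SiteReport(0.2, 4096, 200, 999.0),  # quota used up
    }
    used = {"fermilab": 10, "iowa": 1}
    quota = {"caltech": 5, "florida": 5, "iowa": 5, "fermilab": 10}
    print(usable_sites(reports, used, quota, now=1000.0, max_age=60.0))
```

Note that the least loaded site in the demo is excluded because its report is too old to trust, which is the reason the slide stresses current information.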
