Data analysis in I2U2 I2U2 all-hands meeting Michael Wilde Argonne MCS University of Chicago Computation Institute 12 Dec 2005


Data analysis in I2U2

I2U2 all-hands meeting

Michael Wilde

Argonne MCS

University of Chicago Computation Institute

12 Dec 2005


www.griphyn.org/vds

Scaling up Social Science: Parallel Citation Network Analysis

[Figure: citation network panels labeled 1975, 1980, 1985, 1990, 1995, 2000, 2002]

Work of James Evans, University of Chicago, Department of Sociology

Scaling up the analysis

Database queries of 25+ million citations
Work started on small workstations
Queries grew to month-long duration
With database distributed across U of Chicago TeraPort cluster:
  50 (faster) CPUs gave 100X speedup
Many more methods and hypotheses can be tested!
Grid enables deeper analysis and wider access

Grids Provide Global Resources To Enable e-Science

Why Grids? eScience is the Initial Motivator …

New approaches to inquiry based on
  Deep analysis of huge quantities of data
  Interdisciplinary collaboration
  Large-scale simulation and analysis
  Smart instrumentation
  Dynamically assemble the resources to tackle a new scale of problem
Enabled by access to resources & services without regard for location & other barriers
… but eBusiness is catching up rapidly, and this will benefit both domains

Technology that enables the Grid

Directory to locate grid sites and services
Uniform interface to computing sites
Fast and secure data set mover
Directory to track where datasets live
Security to control access
Toolkits to create application services
  Globus, Condor, VDT, many more


Virtual Data and Workflows

Next challenge is managing and organizing the vast computing and storage capabilities provided by Grids

Workflow expresses computations in a form that can be readily mapped to Grids

Virtual data keeps accurate track of data derivation methods and provenance

Grid tools virtualize location and caching of data, and recovery from failures


Virtual Data Process

Describe data derivation or analysis steps in a high-level workflow language (VDL)

VDL is cataloged in a database for sharing by the community

Workflows for Grid generated automatically from VDL

Provenance of derived results goes back into catalog for assessment or verification

Virtual Data Lifecycle

Describe
  Record the processing and analysis steps applied to the data
  Document the devices and methods used to measure the data
Discover
  I have some subject images - what analyses are available? Which can be applied to this format?
  I’m a new team member – what are the methods and protocols of my colleagues?
Reuse
  I want to apply an image registration program to thousands of objects. If the results already exist, I’ll save weeks of computation.
Validate
  I’ve come across some interesting data, but I need to understand the nature of the preprocessing applied when it was constructed before I can trust it for my purposes.

Virtual Data Workflow Abstracts Grid Details

Workflow - the next programming model?

Virtual Data Example: Galaxy Cluster Search

Sloan Data

[Figure: workflow DAG]

[Figure: Galaxy cluster size distribution. x-axis: Number of Galaxies (ticks 1, 10, 100); y-axis: Number of Clusters (ticks 1 to 100000)]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab, Michael Milligan, Yong Zhao, University of Chicago

A virtual data glossary

virtual data
  defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
VDS – Virtual Data System
  The tools to define, store, manipulate and execute virtual data workflows
VDT – Virtual Data Toolkit
  A larger set of tools, based on NMI, VDT provides the Grid environment in which VDL workflows run
VDL – Virtual Data Language
  A language (text and XML) that defines the functions and function calls of a virtual data workflow
VDC – Virtual Data Catalog
  The database and schema that store VDL definitions

What must we “virtualize” to compute on the Grid?

Location-independent computing: represent all workflow in abstract terms
Declarations not tied to specific entities:
  sites
  file systems
  schedulers
Failures – automated retry for data server and execution site un-availability
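The automated-retry idea can be illustrated with a minimal Python sketch. This is not how VDS or Condor implement retries; the function names, the site names "A" and "B", and the use of ConnectionError to stand in for an unavailable site or data server are all assumptions made for illustration.

```python
import time


def run_with_retry(job, sites, attempts_per_site=2, delay_s=0.0):
    """Try `job` on each site in turn; return (site, result) from the first success."""
    last_error = None
    for site in sites:
        for _ in range(attempts_per_site):
            try:
                return site, job(site)
            except ConnectionError as err:  # stand-in for "site or data server unavailable"
                last_error = err
                time.sleep(delay_s)
    raise RuntimeError("all sites failed") from last_error


def flaky_job(site):
    # Hypothetical job: pretend that site "A" is down and site "B" is up.
    if site == "A":
        raise ConnectionError(f"{site} unavailable")
    return f"ran on {site}"


site, result = run_with_retry(flaky_job, ["A", "B"])
print(site, result)
```

Because the job is described without reference to a particular site, the caller does not need to know which site eventually ran it.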


Expressing Workflow in VDL

TR grep (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }

TR sort (in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }

DV grep (a1=@{in:file1}, a2=@{out:file2});

DV sort (a1=@{in:file2}, a2=@{out:file3});

[Diagram: file1, grep, file2, sort, file3]

Define a “function” wrapper for an application
Define “formal arguments” for the application
Define a “call” to invoke application
Provide “actual” argument values for the invocation
Connect applications via output-to-input dependencies
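As a rough illustration of how output-to-input dependencies connect applications, the Python sketch below infers the edge between the grep and sort derivations from the files they read and write. The dictionary layout and the function name are assumptions for illustration, not part of VDL or the VDS tools.

```python
# Each derivation (DV) lists the logical files it reads and writes,
# mirroring the grep/sort example on the slide.
derivations = {
    "grep": {"in": ["file1"], "out": ["file2"]},
    "sort": {"in": ["file2"], "out": ["file3"]},
}


def dependency_edges(dvs):
    """Return (producer, consumer) pairs: consumer reads a file that producer writes."""
    producer_of = {f: name for name, dv in dvs.items() for f in dv["out"]}
    edges = []
    for name, dv in dvs.items():
        for f in dv["in"]:
            if f in producer_of:
                edges.append((producer_of[f], name))
    return edges


edges = dependency_edges(derivations)
print(edges)  # [('grep', 'sort')]
```

No ordering is stated anywhere in the input; it follows from file2 being an output of one derivation and an input of the other.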

Essence of VDL

Elevates specification of computation to a logical, location-independent level
Acts as an “interface definition language” at the shell/application level
Can express composition of functions
Codable in textual and XML form
Often machine-generated to provide ease of use and higher-level features
Preprocessor provides iteration and variables

Using VDL

Generated directly for low-volume usage
Generated by scripts for production use
Generated by application tool builders as wrappers around scripts provided for community use
Generated transparently in an application-specific portal (e.g. quarknet.fnal.gov/grid)
Generated by drag-and-drop workflow design tools such as Triana

Basic VDL Toolkit

Convert between text and XML representation
Insert, update, remove definitions from a virtual data catalog
Attach metadata annotations to definitions
Search for definitions
Generate an abstract workflow for a data derivation request
Multiple interface levels provided:
  Java API, command line, web service

Representing Workflow

Specifies a set of activities and control flow
Sequences information transfer between activities
VDS uses XML-based notation called “DAG in XML” (DAX) format
VDC represents a wide range of workflow possibilities
DAX document represents steps to create a specific data product
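The relationship between a catalog of possible derivations and the steps needed for one data product can be sketched in Python as a backward walk from the requested file. This is only an illustration under assumed data structures (a plain dictionary standing in for the catalog, and an acyclic set of derivations); it is not the VDS workflow generator or the DAX format itself.

```python
# A stand-in "catalog" of derivations, reusing the grep/sort example.
derivations = {
    "grep": {"in": ["file1"], "out": ["file2"]},
    "sort": {"in": ["file2"], "out": ["file3"]},
}


def plan_for(target, dvs):
    """Return derivation names, in execution order, needed to produce `target`."""
    producer_of = {f: name for name, dv in dvs.items() for f in dv["out"]}
    plan = []

    def visit(lfn):
        step = producer_of.get(lfn)
        if step is None or step in plan:
            return  # raw input file, or step already scheduled
        for needed in dvs[step]["in"]:
            visit(needed)  # schedule upstream steps first
        plan.append(step)

    visit(target)
    return plan


print(plan_for("file3", derivations))  # ['grep', 'sort']
print(plan_for("file2", derivations))  # ['grep']
```

A request for file2 yields a shorter plan than a request for file3, even though both are drawn from the same set of definitions.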

Executing VDL Workflows

[Diagram: pipeline in three stages: Workflow spec, Create Execution Plan, Grid Workflow Execution. Component labels: VDL Program; Virtual Data catalog; Virtual Data Workflow Generator; Abstract workflow; Local planner; Job Planner; Job Cleanup; Statically Partitioned DAG; Dynamically Planned DAG; DAGman DAG; DAGman & Condor-G]

OSG: The “target chip” for VDS Workflows

Supported by the National Science Foundation and the Department of Energy.

VDS Applications

| Application                  | Jobs / workflow                     | Levels | Status                 |
+------------------------------+-------------------------------------+--------+------------------------+
| ATLAS: HEP Event Simulation  | 500K                                | 1      | In Use                 |
| LIGO: Inspiral/Pulsar        | ~700                                | 2-5    | Inspiral In Use        |
| NVO/NASA: Montage/Morphology | 1000s                               | 7      | Both In Use            |
| GADU: Genomics: BLAST,…      | 40K                                 | 1      | In Use                 |
| fMRI DBIC: AIRSN Image Proc  | 100s                                | 12     | In Devel               |
| QuarkNet: CosmicRay science  | <10                                 | 3-6    | In Use                 |
| SDSS: Coadd; Cluster Search  | 40K; 500K                           | 2; 8   | In Devel / CS Research |
| FOAM: Ocean/Atmos Model      | 2000 (core app runs 250 8-CPU jobs) | 3      | In use                 |
| GTOMO: Image proc            | 1000s                               | 1      | In Devel               |
| SCEC: Earthquake sim         | 1000s                               |        | In use                 |
+------------------------------+-------------------------------------+--------+------------------------+

A Case Study – Functional MRI

Problem: “spatial normalization” of images to prepare data from fMRI studies for analysis
Target community is approximately 60 users at Dartmouth Brain Imaging Center
Wish to share data and methods across country with researchers at Berkeley
Process data from arbitrary user and archival directories in the center’s AFS space; bring data back to same directories
Grid needs to be transparent to the users: Literally, “Grid as a Workstation”

A Case Study – Functional MRI (2)

Based workflow on shell script that performs 12-stage process on a local workstation

Adopted replica naming convention for moving user’s data to Grid sites

Created VDL pre-processor to iterate transformations over datasets

Utilized resources across two distinct grids – Grid3 and Dartmouth Green Grid

Functional MRI Analysis

[Figure: workflow graph. Job nodes: align_warp/1, /3, /5, /7; reslice/2, /4, /6, /8; softmean/9; slicer/10, /12, /14; convert/11, /13, /15. File nodes: 3a, 4a, 5a, 6a (each with .h, .i, .w, .s.h, .s.i); ref.h, ref.i; atlas.h, atlas.i; atlas_x, atlas_y, atlas_z (each with .ppm and .jpg)]

Workflow courtesy James Dobson, Dartmouth Brain Imaging Center

Spatial normalization of functional run

[Figure: two workflow graphs.
Dataset-level workflow, node labels: reorientRun, reorientRun, alignlinearRun, resliceRun, softmean, alignlinear, combinewarp, reslice_warpRun, random_select, strictmean, gsmoothRun, binarize.
Expanded (10 volume) workflow, stage labels: reorient, reorient, alignlinear, reslice, softmean, alignlinear, combine_warp, reslice_warp, strictmean, binarize, gsmooth; expanded into numbered jobs from reorient/01 to reorient/57, including alignlinear/03, /07, /11, /17; reslice/04, /08, /12; softmean/13; combinewarp/21; reslice_warp/22 to /38; strictmean/39; binarize/40; gsmooth/41 to /50]

Conclusion: Motivation for the Grid

Provide flexible, cost-effective supercomputing
  Federate computing resources
  Organize storage resources and make them universally available
  Link them on networks fast enough to achieve federation
Create usable Supercomputing
  Shield users from heterogeneity
  Organize and locate widely distributed resources
  Automate policy mechanisms for resource sharing
Provide ubiquitous access while protecting valuable data and resources

Grid Opportunities

Vastly expanded computing and storage
Reduced effort as needs scale up
Improved resource utilization, lower costs
Facilities and models for collaboration
Sharing of tools, data, and procedures and protocols
Recording, discovery, review and reuse of complex tasks
Make high-end computing more readily available

fMRI Dataset processing

FOREACH BOLDSEQ
DV reorient (  # Process Blood O2 Level Dependent Sequence
  input = [ @{in: "$BOLDSEQ.img"},
            @{in: "$BOLDSEQ.hdr"} ],
  output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"}
             @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],
  direction = "y",
);
END

DV softmean (
  input = [ FOREACH BOLDSEQ
            @{in:"$CWD/FUNCTIONAL/har$BOLDSEQ.img"}
            END ],
  mean = [ @{out:"$CWD/FUNCTIONAL/mean"} ]
);
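The FOREACH iteration above can be mimicked with a short Python sketch that stamps out one derivation per dataset. This is not the actual VDL pre-processor: the template text simply copies the slide's DV reorient, and the dataset names ("bold1", "bold2") and working directory ("/data/run1") are hypothetical.

```python
from string import Template

# Template text copied from the slide's DV reorient; $BOLDSEQ and $CWD
# are the variable names used there.
DV_TEMPLATE = Template(
    'DV reorient (\n'
    '  input = [ @{in: "$BOLDSEQ.img"},\n'
    '            @{in: "$BOLDSEQ.hdr"} ],\n'
    '  output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"}\n'
    '             @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],\n'
    '  direction = "y",\n'
    ');'
)


def expand_foreach(template, names, cwd):
    """Emit one derivation per dataset name, as a FOREACH ... END loop would."""
    return [template.substitute(BOLDSEQ=name, CWD=cwd) for name in names]


# Hypothetical dataset names and working directory.
dvs = expand_foreach(DV_TEMPLATE, ["bold1", "bold2"], "/data/run1")
print("\n\n".join(dvs))
```

Generating the derivations by substitution keeps the per-dataset text identical except for the names, which is the property the pre-processor loop provides.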

fMRI Virtual Data Queries

Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     3472-3_anonymized.sliced.img

Blasting for Protein Knowledge

Blasting complete nr file for sequence similarity and function characterization

Knowledge Base

PUMA is an interface that lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (the nr file, approximately 2 million sequences).

Analysis on the Grid

The analysis of the protein sequences occurs in the background in the grid environment. Millions of processes are started, since several tools are run on each sequence: protein similarity search (BLAST), protein family domain search (BLOCKS), and analysis of the structural characteristics of the protein.

FOAM: Fast Ocean/Atmosphere Model

250-Member Ensemble Run on TeraGrid under VDS

[Diagram: for each ensemble member (1, 2, … N): Remote Directory Creation; FOAM run; Atmos Postprocessing, Ocean Postprocessing, and Coupl Postprocessing; results transferred to archival storage]

Work of: Rob Jacob (FOAM), Veronica Nefedova (Workflow design and execution)

FOAM: TeraGrid/VDS Benefits

[Figure labels: Climate Supercomputer; TeraGrid with NMI and VDS]

Visualization courtesy Pat Behling and Yun Liu, UW Madison

Small Montage Workflow

~1200 node workflow, 7 levels
Mosaic of M42 created on the Teragrid using Pegasus

LIGO Inspiral Search Application

Describe…

Inspiral workflow application is the work of Duncan Brown, Caltech, Scott Koranda, UW Milwaukee, and the LSC Inspiral group

US-ATLAS Data Challenge 2

[Figure: plot with axis label CPU-day and time markers Mid July and Sep 10]

Event generation using Virtual Data

Provenance for DC2

How much compute time was delivered?

| years | mon | year |
+-------+-----+------+
| .45   | 6   | 2004 |
| 20    | 7   | 2004 |
| 34    | 8   | 2004 |
| 40    | 9   | 2004 |
| 15    | 10  | 2004 |
| 15    | 11  | 2004 |
| 8.9   | 12  | 2004 |
+-------+-----+------+

Selected statistics for one of these jobs:
start: 2004-09-30 18:33:56
duration: 76103.33
pid: 6123
exitcode: 0
args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
utime: 75335.86
stime: 28.88
minflt: 862341
majflt: 96386

Which Linux kernel releases were used?

How many jobs were run on a Linux 2.4.28 Kernel?
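The monthly figures in the compute-time table can be totalled with a few lines of Python. This is a sketch for checking the arithmetic, not the query that produced the table. It assumes that the "years" column is years of compute time per month, and that duration, utime, and stime in the job statistics are all in seconds.

```python
# Years of compute time delivered per month in 2004, copied from the table.
years_by_month = {6: 0.45, 7: 20, 8: 34, 9: 40, 10: 15, 11: 15, 12: 8.9}

total_years = sum(years_by_month.values())
peak_month = max(years_by_month, key=years_by_month.get)

# Job statistics from the slide (assumed to be seconds).
duration, utime, stime = 76103.33, 75335.86, 28.88
cpu_fraction = (utime + stime) / duration  # share of the job's duration spent on CPU

print(f"total: {total_years:.2f} years, peak month: {peak_month}")
print(f"CPU fraction of duration: {cpu_fraction:.3f}")
```

Under these assumptions, the total comes to about 133 years of compute time for June through December 2004, with the peak in September.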