Ian Foster, Computation Institute, Argonne National Lab & University of Chicago: Towards an Open Analytics Environment

Open Analytics Environment


DESCRIPTION

I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.


Page 1: Open Analytics Environment

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

Towards an Open Analytics Environment

Page 2: Open Analytics Environment


The Computation Institute

A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.

Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).

www.ci.uchicago.edu

Faculty, fellows, staff, students, computers, projects.

Page 3: Open Analytics Environment

The Good Old Days: Astronomy ~1600

Figure labels: 30 years, ? years, 10 years, 6 years, 2 years

Page 4: Open Analytics Environment

Astronomy, from 1600 to 2000

Automation: 10^-1 to 10^8 Hz data capture
Community: 10^0 to 10^4 astronomers (10^6 amateur)
Computation: 10^-1 to 10^15 Hz peak
Data: 10^6 to 10^15 B aggregate
Literature: 10^1 to 10^5 pages/year

Page 5: Open Analytics Environment


Biomedical Research ~1600

Page 6: Open Analytics Environment

Biomedical Research ~2000

DNA sequences, alignments: ...atcgaattccaggcgtcacattctcaattcca...
Protein sequences: MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
2º structure, 3º structure
Protein-protein interactions
Metabolism pathways
Receptor-ligand, 4º structure
Polymorphism and variants
Genetic variants, individual patients
Epidemiology
Physiology, cellular biology
Biochemistry, neurobiology
Endocrinology, etc.
ESTs, expression patterns, large-scale screens
Genetics and maps: linkage, cytogenetic, clone-based

Magnitude labels in the figure (not attributable to specific items after extraction): >10^6, >10^6, >10^9, >10^6, >10^5, >10^9

From John Wooley

Page 7: Open Analytics Environment


Growth of Sequences and Annotations since 1982

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

Page 8: Open Analytics Environment


The Analyst in Denial

“I just need a bigger disk (and workstation)”

Page 9: Open Analytics Environment

An Open Analytics Environment

Results out
Data in
Programs & rules in

“No limits”: storage, computing, format, program

Allowing for: versioning, provenance, collaboration, annotation

Page 10: Open Analytics Environment


o·pen [oh-puhn] adjective

having the interior immediately accessible

relatively free of obstructions to sight, movement, or internal arrangement

generous, liberal, or bounteous

in operation; live

readily admitting new members

not constipated

Page 11: Open Analytics Environment


What Goes In (1)

Page 12: Open Analytics Environment

What Goes In (2)

Rules
Workflows
Dryad
MapReduce
Parallel programs
SQL
BPEL
Swift
SCFL
R
MATLAB
Octave

Page 13: Open Analytics Environment

How it Cooks

Virtualization: run any program, store any data
Indexing: automated maintenance
Provisioning: policy-driven allocation of resources to competing demands

Page 14: Open Analytics Environment

What Comes Out

Data

Page 15: Open Analytics Environment

Analysis as (Collaborative) Process

Transform, Annotate, Search, Add to, Tag, Visualize, Discover, Extend, Group, Share

Page 16: Open Analytics Environment

Centralized or Distributed?

Both

Page 17: Open Analytics Environment

Towards an Open Analysis Environment: (1) Applications

Astrophysics
Cognitive science
East Asian studies
Economics
Environmental science
Epidemiology
Genomic medicine
Neuroscience
Political science
Sociology
Solid state physics

Page 18: Open Analytics Environment

Towards an Open Analysis Environment: (2) Hardware

SiCortex: 6K cores, 6 Top/s
IBM BG/P: 160K cores, 500 Top/s
PADS
Interconnect label: 10-40 Gbit/s

Page 19: Open Analytics Environment

PADS: Petascale Active Data Store

500 TB reliable storage (data & metadata)
180 TB, 180 GB/s, 17 Top/s analysis
1000 TB tape backup
Data ingest
Dynamic provisioning
Parallel analysis
Remote access
Offload to remote data centers
Diverse users
Diverse data sources
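As a back-of-envelope illustration of what the analysis-tier figures imply, the sketch below estimates how long one full pass over the 180 TB analysis store would take at 180 GB/s. It assumes decimal units and that the quoted bandwidth is an aggregate rate that can be sustained; the slide states neither, and the helper name is hypothetical.

```python
# Back-of-envelope sketch. The helper name and the sustained-bandwidth
# assumption are illustrative and do not come from the slides.
TB = 10**12  # decimal terabyte (assumed)
GB = 10**9   # decimal gigabyte (assumed)


def full_scan_seconds(capacity_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Time to read every byte once at a constant aggregate bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


scan_s = full_scan_seconds(180 * TB, 180 * GB)
print(f"Full scan of the analysis store: {scan_s:.0f} s (~{scan_s / 60:.0f} min)")
```

Under these assumptions a complete scan takes on the order of a quarter of an hour, which is the kind of turnaround that makes interactive, whole-dataset analysis plausible.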

Page 20: Open Analytics Environment

Towards an Open Analysis Environment: (3) Methods

HPC systems software (MPICH, PVFS, etc.)
Collaborative data tagging (GLOSS)
Data integration (XDTM)
HPC data analytics and visualization
Loosely coupled parallelism (Swift, Hadoop)
Dynamic provisioning (Falkon)
Service authoring (Introduce, caGrid, gRAVI)
Provenance recording and query (Swift)
Service composition and workflow (Taverna)
Virtualization management
Distributed data management (GridFTP, etc.)

Page 21: Open Analytics Environment

Tagging & Social Networking

GLOSS: Generalized Labels Over Scientific data Sources

Page 22: Open Analytics Environment

XDTM: XML Data Typing & Mapping

Logical / Physical

./group23:
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC

./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa

./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL

./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img

./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img

Page 23: Open Analytics Environment

fMRI Type Definitions

type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVector { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
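To make the logical/physical split concrete, here is a minimal Python sketch (not XDTM itself) that maps the file names from the FUNCTIONAL directory listing onto the logical Run and Volume types defined above. The naming rule `<run>_<index>.<img|hdr>` and the function name `map_functional_dir` are assumptions made for illustration; the slides do not show the actual mapping descriptor.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Volume:
    img: Optional[str] = None  # physical name of the image file
    hdr: Optional[str] = None  # physical name of the header file


@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)


# Assumed naming rule: <run name>_<volume index>.<img|hdr>
_PATTERN = re.compile(r"^(?P<run>.+)_(?P<idx>\d+)\.(?P<ext>img|hdr)$")


def map_functional_dir(filenames: List[str]) -> Dict[str, Run]:
    """Group .img/.hdr pairs into Volumes, and Volumes into Runs."""
    volumes: Dict[str, Dict[int, Volume]] = defaultdict(dict)
    for name in filenames:
        m = _PATTERN.match(name)
        if m is None:
            continue  # e.g. bold1_0002.mat is not part of a Volume
        vol = volumes[m["run"]].setdefault(int(m["idx"]), Volume())
        setattr(vol, m["ext"], name)
    return {
        run: Run(v=[by_idx[i] for i in sorted(by_idx)])
        for run, by_idx in volumes.items()
    }


listing = [
    "bold1_0001.hdr", "bold1_0001.img",
    "bold1_0002.hdr", "bold1_0002.img", "bold1_0002.mat",
    "bold1_0003.hdr", "bold1_0003.img",
]
runs = map_functional_dir(listing)
print(len(runs["bold1"].v))  # number of volumes found in run "bold1"
print(runs["bold1"].v[0])
```

Once the mapping exists, analysis code can iterate over `run.v` without knowing how files are named or laid out on disk, which is the point of separating the two views.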

Page 24: Open Analytics Environment

High-Performance Data Analytics

Functional MRI

Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao

Page 25: Open Analytics Environment

SwiftScript for fMRI Data Analysis

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r , "y" );
  Run roRun = reorientRun( yroRun , "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  …
}

(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}
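The foreach in reorientRun says that every volume can be processed independently. A rough Python analogue of that pattern, using the standard library's thread pool, might look like the sketch below. The `reorient` function here is a hypothetical stand-in that only rewrites a label; it is not the real imaging tool, and this is not how Swift itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def reorient(volume: str, direction: str) -> str:
    """Hypothetical stand-in for the per-volume reorient application."""
    return f"{volume}.reoriented-{direction}"


def reorient_run(run: List[str], direction: str) -> List[str]:
    """Apply reorient to every volume independently; results keep input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda vol: reorient(vol, direction), run))


run = ["bold1_0001", "bold1_0002", "bold1_0003"]
y_run = reorient_run(run, "y")      # mirrors: reorientRun( r , "y" )
xy_run = reorient_run(y_run, "x")   # mirrors: reorientRun( yroRun , "x" )
print(xy_run[0])
```

The second call depends on the output of the first, so the two stages run in sequence while the volumes within each stage can run concurrently, which is the same data-dependency structure the script expresses.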

Page 26: Open Analytics Environment

Provenance Data Model

Entities and attributes:
Invocation: dvID, host, start, duration, exitcode, stats
Call: nmspace, name, version
Procedure: nmspace, name, version
FormalArg: argname, type, direction
ActualArg: argname, value
Workflow: wfid, fromDV, toDV
Dataset: nmspace, name
Annotation: object, pred, type/val, user, date

Relationships shown in the diagram: passes, executes, calls, binds, references, describes, uses, includes (with 1 and * cardinalities)
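As an illustration of how such a model can be queried, the following Python sketch represents two of the entities with the attribute names from the diagram and answers a simple provenance question. The link from Invocation to Call, the units, the omission of the stats attribute, the function name, and all sample records are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    nmspace: str
    name: str
    version: str


@dataclass
class Invocation:
    dvID: str
    host: str
    start: float     # seconds since an arbitrary origin (assumed unit)
    duration: float  # seconds (assumed unit)
    exitcode: int
    call: Call       # assumed link: each Invocation executes one Call


def failed_on_host(invocations: List[Invocation], host: str) -> List[str]:
    """Return dvIDs of invocations on `host` that exited with a non-zero code."""
    return [i.dvID for i in invocations if i.host == host and i.exitcode != 0]


# Made-up sample records, for illustration only.
reorient = Call(nmspace="fmri", name="reorient", version="1.0")
log = [
    Invocation("dv-001", "node17", 0.0, 12.5, 0, reorient),
    Invocation("dv-002", "node17", 13.0, 0.4, 1, reorient),
    Invocation("dv-003", "node42", 13.0, 11.9, 0, reorient),
]
print(failed_on_host(log, "node17"))
```

In a real deployment the same question would more likely be a query against a provenance database; the dataclasses only show the shape of the records.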

Page 27: Open Analytics Environment

Multi-level Scheduling

Stages: Specification, Scheduling, Execution, Provisioning

Components shown in the diagram: SwiftScript; abstract computation; Virtual Data Catalog; SwiftScript compiler; execution engine (Karajan w/ Swift runtime); Swift runtime callouts; status reporting; virtual node(s); worker nodes; launchers; files (file1, file2, file3); applications (AppF1, AppF2); provenance collector; provenance data; Falkon resource provisioner; Amazon EC2

Page 28: Open Analytics Environment

DOCK on SiCortex

CPU cores: 5760
Power: 15,000 W
Tasks: 92160
Elapsed time: 12821 sec
Compute time: 1.94 CPU years
(does not include ~800 sec to stage input data)

Ioan Raicu, Zhao Zhang
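The DOCK figures can be cross-checked with a little arithmetic. The sketch below derives core utilization, mean task length, tasks per core, and energy used from the numbers on the slide. It assumes a CPU year of 365 days and that the 15,000 W draw was constant over the elapsed time; neither is stated on the slide.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumed definition of a "CPU year"

cores = 5760
tasks = 92160
elapsed_s = 12821
compute_cpu_years = 1.94
power_w = 15_000  # assumed constant over the whole run

compute_cpu_s = compute_cpu_years * SECONDS_PER_YEAR
available_cpu_s = cores * elapsed_s

utilization = compute_cpu_s / available_cpu_s
mean_task_s = compute_cpu_s / tasks
energy_kwh = power_w * elapsed_s / 3.6e6  # joules to kWh

print(f"utilization ~ {utilization:.0%}")
print(f"mean task length ~ {mean_task_s:.0f} s")
print(f"tasks per core = {tasks / cores:.0f}")
print(f"energy ~ {energy_kwh:.0f} kWh")
```

Under these assumptions the cores were busy for roughly 83% of the elapsed time, each core ran 16 tasks of about 11 minutes on average, and the whole run used a little over 50 kWh.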

Page 29: Open Analytics Environment

LIGO Gravitational Wave Observatory

>1 Terabyte/day to 8 sites
770 TB replicated to date: >120 million replicas
MTBF = 1 month

Sites labeled on the map: Birmingham, Cardiff, AEI/Golm

Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
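For a sense of scale, the replication figures translate into sustained network rates and a typical replica size as sketched below. The calculation assumes decimal terabytes and that each of the 8 sites receives the full 1 TB/day stream; the slide leaves both open, and because its figures are lower bounds the results are approximate bounds as well.

```python
TB = 10**12  # decimal terabyte (assumed)

daily_bytes_per_site = 1 * TB   # slide says ">1 Terabyte/day", so a lower bound
sites = 8
replicated_bytes = 770 * TB
replicas = 120_000_000          # slide says ">120 million", so a lower bound

per_site_mbit_s = daily_bytes_per_site * 8 / 86_400 / 1e6
aggregate_mbit_s = per_site_mbit_s * sites
mean_replica_mb = replicated_bytes / replicas / 1e6  # upper bound on the mean

print(f"per-site sustained rate >= {per_site_mbit_s:.1f} Mbit/s")
print(f"aggregate sustained rate >= {aggregate_mbit_s:.0f} Mbit/s")
print(f"mean replica size <= {mean_replica_mb:.1f} MB")
```

The per-site rate is modest, but it has to be sustained around the clock across many small files, which is why replica management and failure handling matter more here than peak bandwidth.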

Page 30: Open Analytics Environment


Lag Plot for Data Transfers to Caltech

Credit: Kevin Flasch, LIGO

Page 31: Open Analytics Environment


SIDGrid: B. Bertenthal et al., U.Chicago, IU, UIC

Page 32: Open Analytics Environment

Social Informatics Data Grid (SIDgrid)

Collaborative, multi-modal analysis of cognitive science data

Diverse experimental data & metadata
Browse data, Search, Content preview, Transcode, Download, Analyze
TeraGrid, PADS, …

Page 33: Open Analytics Environment


ELAN

SIDGrid Portal

Page 34: Open Analytics Environment


Page 35: Open Analytics Environment

A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)

CIM-EARTH Framework
Dynamics, foresight, uncertainty, resolution, …
Agriculture, transport, taxation, …
Data (global, local, …)
(Super)computers
Community process
Open code, data

Page 36: Open Analytics Environment

Alleviating Poverty in Thailand: Modeling Entrepreneurship

Consider only wealth, access to capital
Consider also distance to 6 major cities
Match: High / Low

Rob Townsend, Victor Zhorin, et al.

Page 37: Open Analytics Environment


Text Mining

Page 38: Open Analytics Environment

GeneWays

Online Journals

Pathways

Andrey Rzhetsky et al.

Screening 250,000 journal articles

2.5M reasoning chains

4M statements

Page 39: Open Analytics Environment

Evidence Integration: Genetics & Disease Susceptibility

Identify Genes
Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4
Predictive Disease Susceptibility

Evidence sources: Physiology, Metabolism, Endocrine, Proteome, Immune, Transcriptome, Biomarker Signatures, Morphometrics, Pharmacokinetics, Ethnicity, Environment, Age, Gender

Source: Terry Magnuson

Page 40: Open Analytics Environment

James Evans, U.Chicago

Arabidopsis articles

Page 41: Open Analytics Environment

An Open Analytics Environment

Results out
Data in
Programs & rules in

“No limits”: storage, computing, format, program

Allowing for: versioning, provenance, collaboration, annotation

Page 42: Open Analytics Environment
