Globus Scientific Data Publication Services. Ben Blaiszik, Kyle Chard, Rachana Ananthakrishnan, Steve Tuecke, Ian Foster, Globus Team. [email protected] www.globus.org Computation Institute

20150126-Globus-USUK-Data-Workshop-Blaiszik-final


Page 1:

Globus Scientific Data Publication Services

Ben Blaiszik, Kyle Chard, Rachana Ananthakrishnan, Steve Tuecke, Ian Foster, Globus Team

[email protected] www.globus.org

Computation Institute

Page 2:

Overview
•  What is Globus?
•  Globus services:
–  Data publication
–  Data cataloging
–  Data transfer
–  User authentication
–  Groups
–  Sharing

2

Page 3:

Globus is ...

Research data management delivered via SaaS

> 8,000 endpoints; > 85 U.S. campuses
European Globus Community: http://www.egcf.eu/

3

Page 4:

Globus Delivers

Big data transfer, sharing, publication, and discovery…

…directly from your own storage systems OR the cloud

4

Page 5:

SaaS Market Domination...

…for your photos

…for your e-mail

…for your entertainment

…for your research data

5

Page 6:

Research data management scenarios and challenges

6

Page 7:

“I need to easily, quickly, & reliably move or mirror portions of my data to other places.”

Public Cloud, Research Computing HPC Cluster, Lab Server, Personal Laptop, XSEDE Resource, Scientific Instrumentation

7

Page 8:

“I need to easily and securely share my data with my colleagues at other institutions.”

8

Page 9:

“I need to publish my data so that others can find it and use it.”

Scholarly Publication

Reference Dataset

Active Research Collaboration

9

Page 10:

Globus Transfer
•  “Fire-and-forget” transfers
–  Optimized transfers
–  Automatic fault recovery
–  Automatic retry
–  Seamless security integration
–  128-bit checksums
•  Intuitive web GUI and powerful APIs for automation
–  REST and Python APIs

10

How it works: you submit a transfer request between two secure endpoints (A, e.g. a laptop, and B, e.g. midway); Globus moves the data for you and notifies you once the transfer is complete.
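The transfer request described above can be sketched as the JSON document a client submits to the Transfer REST API. The field names below follow the Transfer API's task-submission document; the submission ID, endpoint names, and paths are hypothetical placeholders, so treat this as an illustrative sketch rather than a copy-paste recipe.

```python
# Sketch of a Globus Transfer task-submission document, built as plain JSON.
# Endpoint names and paths are hypothetical placeholders.

def build_transfer_request(submission_id, source_endpoint, dest_endpoint, items):
    """Assemble the JSON body for a fire-and-forget transfer request."""
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,   # obtained from the API beforehand
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "verify_checksum": True,          # integrity check after transfer
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
            }
            for src, dst in items
        ],
    }

request = build_transfer_request(
    "hypothetical-submission-id",
    "user#laptop",                        # endpoint A, e.g. a laptop
    "ucrcc#midway",                       # endpoint B, e.g. midway
    [("/~/results/run42.dat", "/project/data/run42.dat")],
)
```

Once submitted, the service owns the task: retries, fault recovery, and the completion notification happen without the client staying connected.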

Page 11:

Globus, the Abridged Version
•  Metadata layer: Catalog, Data Publication, Discover (plugin point [Federation?])
•  Data layer: Endpoint file systems, Transfers
•  Cross-cutting: User Auth, Groups, Sharing

11 * REST and Python APIs throughout

Page 12:

Globus Catalog
•  Automate metadata ingestion from instrumentation and acquisition machines
–  API/CLI integration
•  Allow near real-time metadata-driven feedback to experiments
•  Allow for insert points in the workflow
–  Ingest at point of collection
–  Catalog metadata and provenance
–  Push to data store
–  Push to local or external HPC
•  Allow building and sharing of typed metadata definitions
–  e.g. build a definition set that specifically fits X-ray scattering data at your beamline
–  Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...

12
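The "T, temp, Temp, temperature" problem above is what typed metadata definitions solve: every record ends up with one field name and one type. A minimal sketch of the idea (this is illustrative only, not the Globus Catalog API; the canonical field name is an assumption):

```python
# Map the many ad hoc spellings of one quantity onto a single typed field,
# the way a shared definition set would. Illustrative sketch only.

CANONICAL = {
    "t": "temperature_kelvin",
    "temp": "temperature_kelvin",
    "temperature": "temperature_kelvin",
    "temperature_kelvin": "temperature_kelvin",
}

def normalize(record):
    """Rewrite ad hoc metadata keys to the typed definition's field names."""
    out = {}
    for key, value in record.items():
        canon = CANONICAL.get(key.lower(), key)
        # The typed definition also fixes the value's type (here: float kelvin).
        out[canon] = float(value) if canon == "temperature_kelvin" else value
    return out
```

With a shared definition set at the beamline, `normalize({"Temp": "295.4"})` and `normalize({"T": 295.4})` describe the same quantity the same way.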

Page 13:

Globus Catalog: Catalog → Datasets → Members

13

•  Group data based on use and features, not location/filename
–  Logical grouping to organize, search, and describe
•  Operate on datasets as units
•  Tag datasets with characteristics that reflect content
•  Share/move datasets for collaboration
•  Interact via REST API, Python API, GUI, and CLI

Page 14:

Globus Catalog Web User Interface

Page 15:

Near-field HEDM Workflow (Sharma, Almer)

15

Detector: up to 1,000 datasets/week; each dataset is 360 files, 4 GB total.

1: Median calc (MedianImage.c, uses Swift), 75 s (90% I/O)
2: Peak search (ImageProcessing.c, uses Swift), 15 s per file; reduced dataset: 360 files, 5 MB total
3: Generate parameters (FOP.c), 50 tasks, 25 s/task, ¼ CPU hours, concurrent
4: Analysis pass (FitOrientation.c), 10^5 tasks at 20 s/task (555 CPU hours), then 1 min/task (1,667 CPU hours), concurrent

Overnight or real-time feedback to the experiment; up to 2.2 M CPU hours per week. (Run in real time on Orthros, 4/4/2014.)

Experimenting “in the data dark”
•  Feedback during each experiment was non-existent
•  Required months to calculate relevant information for publication OR to find out an experiment was corrupted
•  Now, initial feedback over lunch using Globus, Swift, and Catalog to leverage HPC and track metadata

** Supported by Data Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik)
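The CPU-hour totals in the workflow above follow directly from task count times time per task; a quick sanity check using the slide's own figures:

```python
# Check the workflow's CPU-hour figures: tasks * seconds per task, in hours.

def cpu_hours(tasks, seconds_per_task):
    return tasks * seconds_per_task / 3600.0

first_pass = cpu_hours(10**5, 20)   # analysis pass at 20 s/task -> ~555 h
second_pass = cpu_hours(10**5, 60)  # refinement at 1 min/task  -> ~1,667 h
```

Both values match the slide, which is also why the task count reads as 10^5 rather than 105.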

Page 16:

Globus Data Publication
•  Operated as a hosted service
•  Designed for big data
•  Bring your own (per-collection) storage
•  Extensible metadata schemas and input forms
•  Customizable publication and curation workflows
•  Associate unique and persistent digital identifiers with datasets
•  Rich discovery model (in dev)

16

Workflow: the researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific); a curator reviews and approves; the data set is published on a campus or other published data store; peers and the public search and discover data sets, then access and transfer them using Globus.

Page 17:

Data Publication Dashboard

17

Page 18:

Start a New Submission

18

Policies at the Collection Level
•  Required metadata, schemas
•  Data storage location
•  Metadata curation policies

Page 19:

Describe Submission: 1) Dublin Core

19

•  Scientist or representative describes the data they are submitting
•  For this collection, Dublin Core and a collection metadata template are required
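A minimal Dublin Core description can be written as plain key/value pairs. The element names below come from the Dublin Core element set; the values are hypothetical placeholders, and a real collection would layer its own domain-specific template and required-field policy on top.

```python
# A minimal Dublin Core description of a submission. Values are hypothetical.

dublin_core = {
    "dc:title": "Example X-ray scattering dataset",
    "dc:creator": "A. Researcher",
    "dc:subject": "X-ray scattering",
    "dc:description": "Raw detector frames and reduced peak lists.",
    "dc:date": "2015-01-26",
    "dc:type": "Dataset",
    "dc:rights": "CC-BY-4.0",
}

# A collection can mark some elements as required before submission proceeds.
REQUIRED = {"dc:title", "dc:creator", "dc:date"}
missing = REQUIRED - dublin_core.keys()
assert not missing, f"missing required metadata: {missing}"
```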

Page 20:

Describe Submission: 2) Scientific Metadata

20

•  Scientist or representative describes the data they are submitting
•  For this collection, Dublin Core and a collection metadata template are required

Page 21:

Assemble the dataset

21

Page 22:

Transfer Files to Submission Endpoint

22

•  Scientist transfers dataset files to a unique publish endpoint
•  Endpoint is created on the collection-specified data store
•  Dataset may be assembled over any period of time
•  When submission is finished, the dataset is rendered immutable via checksum
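One way to picture "immutable via checksum" is a manifest of per-file digests recorded at submission time: any later change to the bytes changes a digest, so tampering is detectable. The sketch below uses MD5 (a 128-bit checksum, matching the transfer slide); it illustrates the idea and is not necessarily the service's exact mechanism.

```python
import hashlib

# Record a digest for every file at submission time; verify later that the
# bytes still match. Illustrative sketch of checksum-based immutability.

def file_digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()  # MD5 is a 128-bit checksum

def build_manifest(files):
    """files: dict of path -> bytes. Returns dict of path -> checksum."""
    return {path: file_digest(data) for path, data in sorted(files.items())}

def verify(files, manifest):
    """True if every file still matches its recorded checksum."""
    return build_manifest(files) == manifest

files = {"frames/img_000.tif": b"...", "peaks.csv": b"x,y,intensity\n"}
manifest = build_manifest(files)
```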

Page 23:

Check Dataset Assembly

23

•  Verify size, file names, etc.
•  System attempts to determine file types
•  Scientist can choose to edit, remove, or add more files
•  Scientist then accepts the collection-specified license and completes the submission (not pictured)

Page 24:

DOI Assignment

Page 25:

Submission Curation

25

If configured, a curator can approve the submission, reject it, or edit its metadata

Page 26:

Discover a Published Dataset

26

•  Search on ranged metadata
•  Link back to the published dataset
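A ranged metadata search selects datasets whose numeric metadata falls inside given intervals. A minimal sketch of the idea (illustrative only; the discovery service exposes this through its own query interface, and the field names here are hypothetical):

```python
# Select published datasets whose numeric metadata falls inside the given
# inclusive ranges. Illustrative sketch of a ranged metadata search.

def ranged_search(records, ranges):
    """records: list of metadata dicts; ranges: field -> (lo, hi) inclusive."""
    hits = []
    for rec in records:
        if all(field in rec and lo <= rec[field] <= hi
               for field, (lo, hi) in ranges.items()):
            hits.append(rec)
    return hits

published = [
    {"id": "ds-1", "temperature_kelvin": 295.4, "beam_energy_kev": 65},
    {"id": "ds-2", "temperature_kelvin": 77.0, "beam_energy_kev": 65},
]
matches = ranged_search(published, {"temperature_kelvin": (200, 300)})
```

Each hit then links back to the published dataset for access and transfer.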

Page 27:

View Downloaded Dataset

27

Use Globus Connect Personal to pull the files locally for analysis

Page 28:

Summary

...all of this via SaaS and with your own (institutional or personal) resources or cloud resources:
•  Transfer
•  User authentication, groups, sharing
•  Data publication
•  Data cataloging
•  Automation and workflows

Page 29:

Thank you to our sponsors!

U.S. DEPARTMENT OF ENERGY
Data Engines for Big Data LDRD