
Page 1:

Working with High Performance Datasets & Collections

Ben Evans, L. Wyborn, J. Wang, C. Trenham, K. Druken, R. Yang, P. Larraondo, …, and many in the NCI team

• Our partners: ANU, BoM, CSIRO and GA, and many community stakeholders

nci.org.au  @NCInews

Page 2:

What is NCI:

• Australia’s most highly integrated e-infrastructure environment

• Petascale supercomputer + highest performance research cloud

+ highest performance storage in the southern hemisphere

• Comprehensive and integrated expert service

• Experienced Advanced Computing Scientific & Technical Teams

NCI is important to Australia because it:

• Is a national and strategic capability

• Enables research that otherwise would be impossible

• Enables delivery of world-class science

• Enables interrogation of big data, otherwise impossible

• Enables high-impact research that matters; informs public policy

• Attracts and retains world-class researchers for Australia

• Catalyses development of young researchers’ skills

[NCI service stack, from Research Objectives through to Research Outcomes: Communities and Institutions / Access and Services; Expertise, Support and Development; HPC Services, Virtual Laboratories / Data-intensive Services; Integration; Compute (HPC/Cloud), Storage/Network Infrastructure]

NCI: World-class, high-end computing services for research & innovation

Particular focus on Earth System, Environment and Water Mgmt

Page 3:

Australian Research Infrastructure Funding 2006-2015

[Chart: annual spend (millions of dollars, 0-100) on Compute, Data, Tools and Networks, 2002-2014]

• Two main tranches of funding:

  • National Collaborative Research Infrastructure Strategy (NCRIS)

    – $542M for 2006-2011 ($75M for cyberinfrastructure)

  • Super Science Initiative

    – $901M for 2009-2013 ($347M for cyberinfrastructure)

• Annual operational funding of around $180M pa since 2014-2015

  – All infrastructure programmes were designed to ensure that Australian research continues to be competitive and rank highly on an international scale.

Page 4:

RDSI Phase 1 (2011): Infrastructure

• The NCI Proposal in September 2011 was for a High Performance Data Node

• The goal was to:

  • Enable dramatic increases in the scale and reach of Australian research by providing nationwide access to enabling data collections;

  • Specialise in nationally significant research collections requiring high-performance computational and data-intensive capabilities for their use in effective research methods; and

  • Realise synergies with related national research infrastructure programs.


Page 5:

Example of Letter of Support for NCI HPD Node

• “will work with the partners to develop a shared data environment” … where … “there will be agreed standards to enable interoperability between the partners”

• “ it now make sense to explore these new opportunities within the NCI partnership, rather than as a separate agenda that GA runs independently”

…Chris Pigram, CEO, Geoscience Australia 26 July 2011

Page 6:

[Diagram: the Organisational Steward of the Data Collection and the NCI Data Collections Manager agree a mutually agreed plan on how the collection will be managed and published; the Data Management Plan (DMP) enables federated governance of the collection]

Page 7:

The Research Data Storage Infrastructure

Source: https://www.rds.edu.au/

Progress on Data Ingest as of 16 October, 2015: ~43 Petabytes in 8 distributed nodes

[Map of the distributed RDS nodes; the NCI node holds 10 PBytes]

Page 8:

• Researchers are able to share, use and reuse significant collections of data that were previously either unavailable to them or difficult to access

• Researchers will be able to access the data in a consistent manner which will support a general interface as well as discipline specific access

• Researchers will be able to use the consistent interface established/funded by this project for access to data collections at participating institutions and other locations as well as data held at the Nodes

Source: https://www.rds.edu.au/project-overview

RDS (Phase 2) targeted outcomes from this infrastructure

Page 9:

Integrated World-class Scientific Computing Environment

• 10PB+ Research Data

• Data Services (THREDDS)

• Server-side analysis and visualization

• VDI: cloud-scale user desktops on data

• Web-time analytics software

Page 10:

Data Collections Approx. Capacity

CMIP5, CORDEX ~3 Pbytes

ACCESS products 2.4 Pbytes

LANDSAT, MODIS, VIIRS, AVHRR, INSAR, MERIS 1.5 Pbytes

Digital Elevation, Bathymetry, Onshore Geophysics 700 Tbytes

Seasonal Climate 700 Tbytes

Bureau of Meteorology Observations 350 Tbytes

Bureau of Meteorology Ocean-Marine 350 Tbytes

Terrestrial Ecosystem 290 Tbytes

Reanalysis products 100 Tbytes

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management Collections

http://geonetwork.nci.org.au

National Environment Research Data Collections (NERDC)

Page 11:

10+ PB of Data for Interdisciplinary Science

[Chart of collections by source (GA, CSIRO, ANU, International, National, Other): CMIP5 3 PB; Atmosphere 2.4 PB; Earth Observation 2 PB; Water/Ocean 1.5 PB; Weather 340 TB; Geophysics 300 TB; Astronomy (Optical) 200 TB; Bathymetry/DEM 100 TB; Marine Videos 10 TB]

Page 12:

10+ PB of Data for Interdisciplinary Science (repeat of the previous slide)

Page 13:

Managing 10+ PB of Data for Scalable In-situ Access

• Combined and integrated, the NCI collections are too large to move:

  • bandwidth limits the capacity to move them easily

  • the data transfers are too slow, complicated and too expensive

  • even if our data can be moved, few can afford to store 10 PB on spinning disk

• We need to change our focus to:

  • moving users to the data (for sophisticated analysis)

  • moving processing to the data

  • having online applications to process the data in-situ

  • improving the sophistication of users – with our help

• We called for a new form of system design where:

  • storage and various types of computation are co-located

  • systems are programmed and operated to allow users to interactively invoke different forms of analysis in-situ over integrated large-scale data collections

Page 14:

Connecting HPC Infrastructure for Data-intensive Science

• Highlighted the need for balanced systems to enable Data-intensive Science, including:

  • Interconnecting processes and high throughput to reduce inefficiencies

  • The need to really care about placement of data resources

  • Better communications between the nodes

  • I/O capability to match the computational power

  • Close coupling of cluster, cloud and storage

NCI’s Integrated High Performance Environment

Page 15:

Earth System Grid Federation:

Exemplar of an International Collaboratory for large scientific data and analysis

Ben Evans, Geoscience Australia, August 2015


Page 16:

‘Big Data’ vs ‘High Performance Data’

http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/big-data-meets-big-data-analytics-105777.pdf

• Big Data is a relative term where the volume, velocity and variety of data exceed an organisation's storage or compute capacity for accurate and timely decision making

• We define High Performance Data (HPD) as data that is carefully prepared, standardised and structured so that it can be used in Data-Intensive Science on HPC (Evans et al., 2015)

• Need to convert ‘Big data’ collections into HPD by

• Aggregating data into seamless high-quality data products

• Creating intelligent access to self describing data arrays

1964: 1 KB = 2 m of tape or ~20 cards

2014: a 4 GB thumb drive = ~8,000 km of tape or ~83 million cards

2014: 20 PB of modern storage = ~32 trillion metres of tape or ~320 trillion cards

Page 17:

NCI, BoM and Fujitsu Collaborative Project 2014-16
Project A: ACCESS Optimisation

• Evaluate and improve computational methods and performance for ACCESS-Opt

• NWP: UM, APS3 (Global, Regional, City)

• Seasonal Climate: ACCESS-GC2 (GloSea)

• Data Assimilation: 4D-VAR (Atmosphere), EnKF (Ocean), DART (NCAR)

• Ocean Forecasting and Research: MOM5, CICE/SIS, WW3, ROMS

• Fully Coupled Earth System Model: ACCESS-CM2, ACCESS-ESM, CMIP5/6


Next Gen and Performance Analysis of Earth Systems codes

Page 18:

Project B: Scalability of algorithms, hardware and other earth systems and geophysics codes

• Tsunami – NOAA MOST and comMIT

• Data Assimilation – NCAR DART

• Ocean - MOM6, MITGCM, MOM5(WOMBAT), SHOC

• Water Quality and BioGeochemical models – particularly for eReefs

• Hydrodynamic and ecological models - Relocatable Coastal modelling project (RECOM)

• Weather and Convection Research – non-ACCESS models (e.g., WRF)

• Groundwater

• Hydrology - ?

• Natural Hazards - Tropical Cyclone (TCRM), Volcanic ash code, ANUGA, EQRM

• Shelf Reanalysis

• Onshore/Offshore seismic data processing

• Earthquake and Seismic waves

• 3D Geophysics: Gravimetric, Magnetotelluric, AEM, Inversion (Forward and Back)

• Earth Observation Satellite data processing

• Hydrodynamics, Oil and Petroleum

• Elevation, Bathymetry, Geodesy – data conversions, grids and processing

Page 19:

Data Platforms of today need to scale down to small users

Page 20:

High-Res, Multi-Decadal, Continental-Scale Analysis

• 27 years of data from LS5 & LS7 (1987-2014)

• 25 m nominal pixel resolution

• Approx. 300,000 individual source scenes in approx. 20,000 passes

• Entire archive of 1,312,087 ARG25 tiles => 93×10^12 pixels

• Can be processed in ~3 hours

Water Detection from Space

c/- Geoscience Australia
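A rough back-of-the-envelope reading of the figures above (illustrative arithmetic only, not taken from the slides):

```python
# Aggregate pixel rate implied by processing ~93 x 10^12 pixels in ~3 hours.
pixels = 93e12          # pixels in the ARG25 archive (from the slide)
hours = 3               # quoted processing time
rate = pixels / (hours * 3600)
print(f"~{rate / 1e9:.1f} billion pixels per second")   # roughly 8.6e9 px/s
```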

Page 21:

Scaling down to the smaller users – e.g. AGDC

Do we enable individual scenes to be downloaded for locally hosted small scale analysis? Or do we facilitate small scale analysis, in-situ on data sets that are dynamically updated?

Page 22:

Introducing the National Environmental Research Data Interoperability Platform (NERDIP)

[NERDIP layered architecture diagram, approximately:]

• Workflow Engines, Virtual Laboratories (VL's), Science Gateways: VGL, AGDC VL, Climate & Weather Systems Lab, Biodiversity & Climate Change VL, VHIRL, Globe Claritas, eReefs

• Data Portals: AuScope Portal, TERN Portal, AODN/IMOS Portal, eMAST Speddexes, All Sky Virtual Observatory, ANDS/RDA Portal, Digital Bathymetry & Elevation Portal, Data.gov.au, Open Nav Surface

• Tools (Direct Access): Models (Fortran, C, C++, MPI, OpenMP); Python, R, MatLab, IDL; Visualisation (Drishti); Ferret, NCO, GDL, GDAL, GRASS, QGIS

• Services Layer (expose data models & semantics): OGC WFS, W*S, WPS, WCS, WMS, SOS; OPeNDAP; RDF, LD; fast "whole-of-library" catalogue

• Metadata Layer: netCDF-CF, HDF-EOS, ISO 19115, RIF-CS, DCAT, etc.

• Data Library Layer 1: HDF5 (MPI-enabled), HDF5 (Serial); netCDF-4 (Climate/Weather/Ocean), Libgdal (EO)

• HP Data Library Layer 2: netCDF-4 (EO), SEG-Y (Airborne Geophysics Line data), FITS, BAG, LAS (LiDAR)

• Storage: Lustre, Other Storage (options)

Page 23:

NERDIP: Enabling Multiple Ways to Interact with the Data

[NERDIP layered architecture diagram repeated, annotated: Infrastructure to Lower Barriers to Entry; Ace Users; Data Platform; Data Discovery]

Page 24:

NERDIP: Enabling Multiple Ways to Interact with the Data

[NERDIP layered architecture diagram repeated, annotated: Infrastructure to Lower Barriers to Entry; Ace Users; Data Portals; Data Platform]

Page 25:

Platforms Free Data from the “Prison of the Portals”

• Portals are for visiting, platforms are for building on

• Portals present aggregated content in a way that invites exploration, but the experience is pre-determined by a set of decisions by the builder about what is necessary, relevant and useful.

• Platforms put design decisions into the hands of users: there are innumerable ways of interacting with the data

• Platforms offer many more opportunities for innovation: new interfaces can be built, new visualisations framed, ultimately new science rapidly emerges

Tim Sherratt http://www.nla.gov.au/our-publications/staff-papers/from-portal-to-platform

Page 26:

NERDIP: Enabling Ace Users to Interact with the Data

[NERDIP layered architecture diagram repeated, annotated: Infrastructure to Lower Barriers to Entry; Data Platform; Data Discovery]

Page 27:

NERDIP: Applications Replicating Ways of Interacting with the Data

[NERDIP layered architecture diagram repeated]

Page 28:

NERDIP: Loosely coupling Applications and Data via a Services Layer

[NERDIP layered architecture diagram repeated, annotated: Infrastructure to Lower Barriers to Entry; Ace Users; Data Platform; Data Discovery; APPLICATION-FOCUSSED DEVELOPERS and DATA MANAGEMENT-FOCUSSED DEVELOPERS on either side of the Services Layer]

Page 29:

NERDIP: Infrastructure to Lower Barriers to Entry

[NERDIP layered architecture diagram repeated, annotated: Data Discovery; Ace Users; Data Platform]

Page 31:

Data size/format: potentially 9.3 PBytes in NetCDF

NetCDF:
  Size: 7717 TB (6799 TB available)
  Domains: Climate Models, Weather Models and Obs, Water Obs, Satellite Imagery, Other Imagery

Non-NetCDF:
  Size: 2579 TB (2221 TB available); 1600 TB will be converted (bathymetry, Landsat, geophysics)
  Domains: Satellite Imagery, Astronomy, Medical, Biosciences, Social sciences, Geodesy, Video, Geological Model, Hazards, Phenology

Page 32:

Straightening out the data format standards

Page 33:

NCI Compliance Checking: Data & Metadata

• Metadata: Use the Attribute Convention for Dataset Discovery (ACDD) (previously Unidata Dataset Discovery Conventions)

• Data: Based on the CF-Checker developed at the Hadley Centre for Climate Prediction and Research (UK) by Rosalyn Hatcher
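As an illustration only (not NCI's actual checker), the sketch below shows the kind of ACDD global attributes and CF variable attributes that such compliance checks look for, written with the netCDF4-python library; the file name, variable name and attribute values are hypothetical.

```python
# Hypothetical sketch: writing ACDD discovery metadata and CF variable
# attributes of the kind an ACDD/CF compliance checker inspects.
import numpy as np
from netCDF4 import Dataset

with Dataset("example_precip.nc", "w", format="NETCDF4") as nc:
    # ACDD global attributes used for dataset discovery
    nc.title = "Example daily precipitation grid"
    nc.summary = "Illustrative dataset showing ACDD/CF style metadata."
    nc.Conventions = "CF-1.6, ACDD-1.3"
    nc.keywords = "precipitation, example"
    nc.license = "CC-BY 4.0"

    nc.createDimension("time", None)
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)

    lat = nc.createVariable("lat", "f4", ("lat",))
    lat.standard_name = "latitude"
    lat.units = "degrees_north"
    lat[:] = np.linspace(-89.5, 89.5, 180)

    lon = nc.createVariable("lon", "f4", ("lon",))
    lon.standard_name = "longitude"
    lon.units = "degrees_east"
    lon[:] = np.linspace(-179.5, 179.5, 360)

    pr = nc.createVariable("pr", "f4", ("time", "lat", "lon"), zlib=True)
    pr.standard_name = "precipitation_flux"   # CF standard name
    pr.units = "kg m-2 s-1"
    pr[0, :, :] = np.zeros((180, 360), dtype="f4")
```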

Page 34:

Auditing the system for data standards compliance

Page 35:

Grid Diversity in CMIP5

Downstream communities may not wish to deal with different grids, but the modelling communities generate data appropriate to them.

[Example ocean model grid: Mercator grid in the south, tripolar grid in the north]

Page 36:

CMIP6 WGCM Infrastructure Panel recommendations

• Use netCDF4 with lossless compression as the data format for CMIP6.

• Lossless compression from zlib (settings deflate=2 and shuffle) is expected to give roughly a 2X decrease in data volumes (varies depending on data entropy or noisiness). Requires upgrading the entire toolchain (data production and consumption) to netCDF4.

• Recommends the use of standard grids for datasets where native-grid data is not strictly required. For example, the CLIVAR Ocean Model Development Panel (OMDP) may request the use of World Ocean Atlas (WOA) standard grids (1°×1°, 0.25°×0.25°) as the target grid of choice.

• No progress on adoption of standard calendars.
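As a hedged illustration of the recommended settings (the file and variable names are arbitrary, not CMIP6 conventions), zlib deflate level 2 with the shuffle filter can be enabled per variable with the netCDF4-python library:

```python
# Sketch: a netCDF4 variable using the recommended lossless compression
# (zlib deflate=2 plus shuffle). Names and sizes are illustrative only.
import numpy as np
from netCDF4 import Dataset

with Dataset("tas_example.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)
    nc.createDimension("lat", 145)
    nc.createDimension("lon", 192)

    tas = nc.createVariable(
        "tas", "f4", ("time", "lat", "lon"),
        zlib=True,       # lossless deflate compression
        complevel=2,     # deflate level 2, as recommended
        shuffle=True,    # byte-shuffle filter improves compression ratio
        chunksizes=(1, 145, 192),
    )
    tas.units = "K"
    tas[0, :, :] = 288.0 + np.random.randn(145, 192).astype("f4")
```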

Page 37:

Integrated World-class Scientific Computing Environment (repeat of the earlier slide: 10PB+ research data, data services/THREDDS, server-side analysis and visualization, VDI cloud-scale user desktops on data, web-time analytics software)

Page 38:

IO Performance Tuning Metrics

File systems tested:

• VDI:/short (NFS-CEPH), VDI:/g/data1 (NFS-Lustre), VDI:/local (local SSD) – 2 cores, 32 GB

• Raijin:/short (Lustre), Raijin:/g/data2 (Lustre) – 16 cores/node, 32 GB/node

Tuning parameters:

• Lustre: stripe count, stripe size

• MPI-IO: data sieving, collective buffer

• HDF5: chunk pattern, chunk cache size, alignment, metadata cache, compression

• General: file type, file size, transaction size, access pattern, concurrency

Local (direct) methods for accessing the data
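As an illustration of the MPI-IO knobs listed above (collective buffering and data sieving), such settings are typically passed as hints when a shared file is opened. This is a minimal sketch using mpi4py with standard ROMIO hint names; the file name and hint values are illustrative, not NCI-recommended settings.

```python
# Sketch: passing MPI-IO (ROMIO) tuning hints when opening a shared file.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")    # collective buffering on writes
info.Set("romio_ds_write", "disable")   # data sieving off for writes
info.Set("cb_buffer_size", "16777216")  # 16 MB collective buffer (example value)

fh = MPI.File.Open(comm, "hints_demo.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
buf = np.full(1024, comm.rank, dtype="f8")
fh.Write_at(comm.rank * buf.nbytes, buf)   # each rank writes its own block
fh.Close()
```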

Page 39:

Serial Write Throughput

[Bar chart: write throughput (MB/s, 0-900) for the GTIFF, HDF5, NC_CLASSIC, NC4 and NC4_CLASSIC interfaces, each with GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL variants]
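As a rough, hedged sketch of how such a serial-write comparison can be timed (this is not the NCI benchmark code; the array size, file names and the two interfaces chosen are illustrative):

```python
# Illustrative timing of a serial array write via GDAL (GeoTIFF) and via
# netCDF4-python, loosely mirroring the GTIFF vs NC4 comparison above.
import time
import numpy as np
from osgeo import gdal
from netCDF4 import Dataset

data = np.random.rand(5500, 5500).astype("f4")

t0 = time.time()
drv = gdal.GetDriverByName("GTiff")
ds = drv.Create("test.tif", data.shape[1], data.shape[0], 1, gdal.GDT_Float32)
ds.GetRasterBand(1).WriteArray(data)
ds.FlushCache()
ds = None                       # close the dataset
t_gtiff = time.time() - t0

t0 = time.time()
with Dataset("test.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("y", data.shape[0])
    nc.createDimension("x", data.shape[1])
    var = nc.createVariable("v", "f4", ("y", "x"))
    var[:] = data
t_nc4 = time.time() - t0

mb = data.nbytes / 1e6
print(f"GTiff: {mb / t_gtiff:.1f} MB/s, NC4: {mb / t_nc4:.1f} MB/s")
```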

Page 40:

Serial Read Throughput

[Bar chart: read throughput (MB/s, 0-1000) for the GTIFF, HDF5, NC_CLASSIC, NC4 and NC4_CLASSIC interfaces, each with GDAL_FILL, PURE_FILL, GDAL_NOFILL and PURE_NOFILL variants]

Page 41:

Parallel Write Throughput

[Chart: independent-write throughput (MB/s, 0-1800) vs Lustre stripe count (1, 8, 16, 32, 64, 128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1M, block size = 8G, transfer size = 32M]

Low performance when using default parameters
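A minimal sketch of the "independent write" pattern benchmarked above, using h5py with the MPI-IO driver via mpi4py. It assumes an MPI-enabled h5py build; the file name and sizes are illustrative, and Lustre striping would be set on the target directory separately (e.g. with lfs setstripe).

```python
# Sketch: independent parallel writes to a single HDF5 file (h5py + mpi4py).
import numpy as np
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
rows_per_rank = 1024

with h5py.File("parallel_write.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (nprocs * rows_per_rank, 4096), dtype="f8")
    start = rank * rows_per_rank
    # Each rank writes its own, non-overlapping slab (independent I/O).
    dset[start:start + rows_per_rank, :] = np.full(
        (rows_per_rank, 4096), rank, dtype="f8")
```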

Page 42:

Parallel Read Throughput

[Chart: independent-read throughput (MB/s, 0-7000) vs Lustre stripe count (1, 8, 16, 32, 64, 128) for HDF5, MPI-IO and POSIX; MPI size = 16, stripe size = 1M, block size = 8G, transfer size = 32M]

Low performance when using default parameters

Page 43:

A Simple Comparison of File formats and compression

Page 44:

Contiguous Access: compressed vs non-compressed

Read throughput (MB/s) for the source file (19 MB) and the normal file (121 MB):

Transfer size:    22 kB    110 kB   220 kB   440 kB   880 kB   1.21 MB  2.2 MB    12.1 MB   22 MB      121 MB
Transfer count:   5500     5x5500   10x5500  20x5500  40x5500  55x5500  100x5500  550x5500  1000x5500  5500x5500
Raijin (src):     182.39   229.19   216.08   218.45   220.58   220.74   222       203.86    192.56     189.49
VDI (src):        199.8    248.23   235.42   238.15   239.28   241.25   228.93    219.08    217.09     220.75
Raijin (nrm):     479.77   790.84   804.79   848.09   888.82   887.62   889.31    800.65    710.48     544.84
VDI (nrm):        1473.45  3965.39  4521.17  5182.22  5785.31  4972.67  5898.41   3818.48   3162.54    2066.02

Page 45:

Chunk Cache

Page 46:

HDF5 Chunked Storage

• Data is stored in chunks of predefined size

  • The two-dimensional instance may be referred to as data tiling

• The HDF5 library writes/reads the whole chunk

[Diagram: contiguous vs chunked storage layout]
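As an illustrative sketch (the file name, chunk shape and cache sizes are arbitrary, not NCI settings), h5py exposes both per-dataset chunking/compression and the per-file chunk cache that the following slide varies between 4 MB and 32 MB:

```python
# Sketch: chunked, compressed HDF5 storage and a larger chunk cache on read.
import numpy as np
import h5py

data = np.random.rand(5500, 5500).astype("f4")

# Write a chunked dataset; each 1100x1100 chunk is compressed as a unit.
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset(
        "data", data=data,
        chunks=(1100, 1100),     # chunk (tile) shape
        compression="gzip",      # zlib/deflate
        compression_opts=2,      # deflate level 2
        shuffle=True,
    )

# Read back with a 32 MB chunk cache instead of the 1 MB default, so
# repeatedly touched chunks are decompressed only once.
with h5py.File("chunked.h5", "r",
               rdcc_nbytes=32 * 1024**2, rdcc_nslots=10007) as f:
    subset = f["data"][:275, :275]   # subset read touches part of one chunk
    print(subset.shape)
```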

Page 47:

Subset Access – compression, subsetting and caches

[Chart: read throughput (MB/s, 0-400) vs deflate level (0-9) for chunk shape 1100×1100 and subset shapes 1×5500, 275×275 and 1100×1100, each with 4 MB and 32 MB chunk caches]

Page 48:

OPeNDAP: DAP2 -> NetCDF on the wire

[Diagram: a NetCDF file is served over an HTTP connection in OPeNDAP format and appears to the client as a DataArray with shape, dtype, GeoTransform, projection and metadata]
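A minimal sketch of reading a subset of a remote dataset over OPeNDAP with netCDF4-python; the endpoint URL and variable name below are placeholders, not a real NCI service address.

```python
# Sketch: OPeNDAP subset read - only the requested slice travels over the wire.
from netCDF4 import Dataset

url = "http://example.org/thredds/dodsC/some/dataset.nc"   # placeholder URL
with Dataset(url) as ds:
    var = ds.variables["tas"]          # variable name is illustrative
    subset = var[0, :10, :10]          # server returns just this slice
    print(var.shape, float(subset.mean()))
```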

Page 49:

OPeNDAP: DAP2 -> NetCDF on the wire (repeat of the previous slide)

Page 50:

Tile Map Servers – Serving Maps

[Diagram: the client (browser) requests map tiles from a WMTS server, which draws on data served by a THREDDS server (steps 1-4)]
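As a hedged illustration of the map-serving path above, a rendered image can be requested directly from a THREDDS WMS endpoint with a standard GetMap call; the URL, layer name and bounding box here are placeholders, not a real NCI service.

```python
# Sketch: fetching a rendered map image from a THREDDS WMS (ncWMS-style) endpoint.
import requests

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "example_layer",        # placeholder layer name
    "crs": "EPSG:4326",
    "bbox": "-44,112,-10,154",        # lat/lon box over Australia (1.3.0 axis order)
    "width": 512,
    "height": 512,
    "format": "image/png",
}
resp = requests.get("http://example.org/thredds/wms/some/dataset.nc", params=params)
with open("tile.png", "wb") as f:
    f.write(resp.content)
```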

Page 51:

Tile Map Servers – Serving Maps (repeat of the previous slide)

Page 52:

Examples of on-the-fly data delivery

Page 53:

Key Messages on Accessing High Performance Data

• Data collections at today's scales have to be built as shared global facilities based around national institutions.

• Domain-neutral international standards for data collections and interoperability are critical for allowing complex interactions in HP environments, both within and between HPD collections.

• No one can do it alone. No one organisation, no one group, no one country has the required resources or the expertise.

• Shared collaborative efforts such as the Research Data Alliance, the Earth System Grid Federation (ESGF), the Belmont Forum, EarthServer, the Ocean Data Interoperability Platform (ODIP), EarthCube, GEO and OneGeology are needed to realise the full potential of the new data-intensive science infrastructures.

• It now takes a 'village of partnerships' to raise an 'HPD data centre' in a Big Data World.

http://www.onegeology.org/
