NCAR Plan for Science at Scale
...and some digressions
J-F de La Beaujardière, PhD
Director, NCAR/CISL Information Systems Division
https://orcid.org/0000-0002-1001-9210
International Computing for the Atmospheric Sciences (iCAS) Symposium
2019 Sept 12
Jeff de La Beaujardiere <jeffdlb@ucar.edu>
My Focus: Data
Numerical Models
Observing Systems
Big Data
Now what?
The Big Data Problem
Huge Model Outputs
Satellite Imagery
In situ sensors
Manual sampling
Billions of files
Many formats
• NetCDF3
• NetCDF4
• GRIB
• CSV
• XLS
• TXT
• GeoTIFF
• etc
Don't Move Huge Datasets – Move the Computing
Huge Data
Subset #1
Subset #2
User #1 Computer
User #2 Computer
Internet data distribution
Huge Data
Shared Computing
User #2 Code
User #1 Code
Shared Code
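The argument for moving the computing rather than the data can be checked with simple arithmetic. The sketch below assumes a 500 TB dataset (the CESM LENS size quoted later in this deck) and a few hypothetical sustained link speeds; the function name and the bandwidth values are illustrative assumptions, not measurements.

```python
def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb terabytes over a link sustaining link_gbps gigabits/s."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # gigabits/s -> bits/s
    return seconds / 86_400              # seconds -> days


if __name__ == "__main__":
    for gbps in (0.1, 1.0, 10.0):        # hypothetical sustained rates
        print(f"{gbps:>5} Gbit/s: {transfer_days(500, gbps):8.1f} days")
```

Under these assumptions, a single user copying the full dataset over a sustained 1 Gbit/s link would wait on the order of weeks, which is the motivation for shared computing next to the data.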
Current Typical Data Organization
Data Provider A
Dataset #A1
• Data URL
• README file
• Folder hierarchy
• Filename convention
• Standard format (maybe)
CF conventions, ISO metadata, OpenDAP, OGC WxS (maybe)
Data Provider B
Dataset #B1
• Data URL
• README file
• Folder hierarchy
• Filename convention
• Standard format (maybe)
CF conventions, ISO metadata, OpenDAP, OGC WxS (maybe)
Domain-Specific Catalog offering text-based discovery
Problems:
• Tedious plumbing code for individual datasets and providers
• Difficult to do science or make decisions based on multiple datasets
NCAR Plan for Science at Scale
• Draft document (v0.5.0) attempting to address some of these challenges
• Developed at request of CISL Director & NSF Program Managers
• Proposes enhancements to NCAR infrastructure and activities in support of science by NCAR and external communities
• Also continues/improves existing CISL/ISD data management activities
• Vision, Goals, and Objectives - specific and achievable
• Thanks to numerous people for comments, including:
Anke Kamrath/CISL, Eric Nienhouse/ISD, Steven Worley/ISD,
Douglas Schuster/ISD, Seth McGinnis/ISD,
Sophie Hou/ISD, Brian Bonnlander/ISD,
Irfan Elahi/HSS, Dave Hart/USS, John Clyne/TDD, Kevin Paul/TDD,
Elizabeth Chapin/CISL, J-F Lamarque/CGD,
Matthew Long/CGD, Joseph Hamman/CGD, Cindy Bruyère/MMM, Caspar Amman/RAL,
Tyler McCandless/RAL, Tor Mohling/RAL, Mike Daniels/EOL,
Greg Stossmeier/EOL, Ethan Davis/Unidata,
Matt Mayernik/NCAR DSET, Subashree Mishra/NSF,
Sarah Ruth/NSF
https://bit.ly/2SjnXFL
Science at Scale
The ability to perform scientific analysis on "Big Data" without being constrained by storage capacity, processing power, network bandwidth, unfamiliar data formats, or insufficient software tools.
Goals of the Plan
• Data Discovery and Access: Users are able to find NCAR-hosted data at a fine level of granularity; standardized formats and services are available.
• Data Analytics: Users both at NCAR and externally can compute directly on large-volume data, using either prebuilt tools or their own code.
• Data Management: NCAR scientists can readily comply with requirements for Open Data; CISL can streamline data archiving and curation.
• Data Storage: NCAR is able to control data storage costs with appropriate performance for different usage scenarios.
• Science and Collaboration: NCAR community science is supported and enhanced by CISL's hardware and software deployments.
High-Level Concept
NCAR
Analysis-Ready Data
NCAR Compute Nodes
Analysis Tools
NCAR/NSF users
Public Cloud
Analysis-Ready Data
Cloud Computing
Analysis Tools
selected data
External users (public, industry, international)
Pangeo for Analysis-Ready Data
Jupyter for interactive access by remote systems
Cloud / HPC System
Xarray provides data structures and intuitive interface for interacting with datasets
Dask allows users to deploy clusters of compute nodes for data processing.
Distributed storage
“Analysis Ready Data” stored on globally-available distributed storage.
Slide credit: Ryan Abernathey, LDEO/Columbia U.
Zarr
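The idea behind chunked, analysis-ready storage can be sketched without any of the libraries named on this slide. The toy below is a pure-Python stand-in, not the Zarr format or API: a 1-D array is split into fixed-size chunks, each stored under its own key in a dict that plays the role of an object store, so a subset read fetches only the chunks it overlaps. The class and key names are hypothetical.

```python
import struct


class ChunkedArray:
    """Toy 1-D float array stored as independent chunks in a key/value store."""

    def __init__(self, values, chunk_size, store=None):
        self.chunk_size = chunk_size
        self.length = len(values)
        self.store = {} if store is None else store   # stand-in for an object store
        self.reads = 0                                 # number of chunk fetches so far
        for start in range(0, self.length, chunk_size):
            chunk = values[start:start + chunk_size]
            key = f"chunk/{start // chunk_size}"
            self.store[key] = struct.pack(f"<{len(chunk)}d", *chunk)

    def _load(self, index):
        """Fetch and decode one chunk (one 'GET' against the store)."""
        raw = self.store[f"chunk/{index}"]
        self.reads += 1
        return struct.unpack(f"<{len(raw) // 8}d", raw)

    def read(self, start, stop):
        """Return values[start:stop], fetching only the chunks that overlap."""
        start = max(0, start)
        stop = min(stop, self.length)
        if start >= stop:
            return []
        out = []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        for index in range(first, last + 1):
            chunk = self._load(index)
            base = index * self.chunk_size
            lo = max(start, base) - base
            hi = min(stop, base + len(chunk)) - base
            out.extend(chunk[lo:hi])
        return out


if __name__ == "__main__":
    array = ChunkedArray([float(i) for i in range(100)], chunk_size=10)
    subset = array.read(25, 42)
    print(len(subset), "values read using", array.reads, "of", len(array.store), "chunks")
```

Because each chunk is an independent object, many workers can fetch different chunks in parallel, which is the property that tools such as Dask rely on.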
Current Data Storage Architecture
GLADE (POSIX f/s)
38 PB
Cheyenne HPC
Casper DAV
Local Network
Campaign (Globus)
20 PB
HPSS tape (HSI)
100 PB
Off-prem tape (Disaster recovery)
User-initiated copy/move
User-initiated copy/move
Automated backup
Authorized Users
Possible Future Data Storage Architecture
(Note: Jeff DLB's opinion, not a decision by NCAR/CISL)
GLADE-2 (POSIX f/s)
few PB
NWSC-3 HPC
DAV Cluster
Local Network
Authorized Users
User copy or demand-driven burst
Object Store (S3 API)
5 - 100 PB
3-geo scale out
User-initiated copy
Automatic Move
Cloud Compute
Public Users
Dedicated Connection
Cloud deep archive (DR only)
Automated backup
Cloud Store (S3 API)
PB as needed
Data Commons (on-prem object storage)
• NCAR/CISL acquiring 5 PB object store
  • Western Digital X100
  • Expect to be available for testing in Oct 2019
• Motivation: POSIX filesystem not suitable for billions of files
• Uses:
  • Host archival data holdings
  • Host analytics-optimized versions of some datasets
  • Evaluate lossy compression approaches
  • Research performance, usability, and cost relative to Campaign disk & HPSS tape
Cloud Commons (off-prem object storage)
• Host selected datasets in Cloud for public access
  • NOTE: Not turning off existing data repos & access services
• First dataset: CESM LENS
  • AWS providing free S3 hosting for 100 TB of CESM LENS
  • Includes selected monthly, daily, 6-hour fields; both surface and 3D
• Planning additional NCAR datasets
• Xarray/Zarr format rather than individual NetCDF files
• Python-based analysis tools (in progress)
• Will enable:
  • Broad public access, incl. industry
  • Research on optimization for big data analysis
  • Evaluation of commercial cloud pros/cons
A Word on Costs
• When comparing Cloud vs on-prem costs, need to be honest
  • Include hardware, power, cooling, real estate
  • Include cost of staffing for procurement, operation, security
  • Include opportunity costs:
    • being stuck with same HW for 4 years
    • not benefiting from new managed services
• Goal: Don't build anything that someone else can build
  • We already outsource facility construction
  • We buy electricity from the grid instead of running our own power plant
  • We use existing WANs instead of laying our own fiber optic cables
  • We do not manufacture our own CPUs and disk drives
  • We leverage open-source & commercial software
  • We moved to Google Mail/Calendar/Docs
  • ⇒ Why build & operate our own data centers?
• Don't let outdated business models prevent us from trying to negotiate innovative contracts with cloud infrastructure vendors
CESM LENS
• Community Earth System Model (CESM) Large Ensemble
  • Kay et al. 2015 (doi:10.1175/BAMS-D-13-00255.1)
• Simulates climate from 1920-2005 using 20th century historical data, then 2006-2100 assuming RCP8.5 scenario
  • RCP8.5 = Representative Concentration Pathway 8.5 W/m2 radiative forcing by increased greenhouse gas concentration. Worst-case scenario in 5th IPCC Report.
• Complicated dataset
  • 2 spatial grids (land/atmosphere and ocean/ice)
  • 3 vertical axes (surface, 3D atmosphere, 3D ocean)
  • 3 temporal resolutions (monthly, daily, 6hr)
  • Multiple time axes (20C, RCP8.5, 3 diff. control runs)
  • 40 ensemble members (simulations with slightly diff. initial cond.)
• Somewhat difficult to access and use
  • 500 TB divided between on-line disk and near-line tape
  • ~150,000 NetCDF files in ~1,000 directories
  • Download from NCAR Climate Data Gateway
  • Compute in place on NCAR supercomputer if authorized
CESM LENS on AWS
• DOI: https://doi.org/10.26024/wt24-5j82
• 1 Zarr store for each component, frequency, experiment, and variable
  • 175 Zarr stores instead of ~30k files
  • May aggregate further into multi-variable Xarray datasets
• Currently in pre-release – finalizing documentation and Jupyter Notebook for Oct 9 announcement
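One way to picture the "one store per component, frequency, experiment, and variable" layout is as a grouping of many files under a four-part key. In the sketch below, the key pattern, the file names, and the inventory entries are all hypothetical and purely illustrative; they are not the actual AWS bucket layout.

```python
from typing import NamedTuple


class StoreId(NamedTuple):
    """The four attributes that identify one store in this sketch."""
    component: str
    frequency: str
    experiment: str
    variable: str

    def key(self) -> str:
        # Hypothetical naming pattern, for illustration only.
        return f"{self.component}/{self.frequency}/{self.experiment}-{self.variable}.zarr"


def group_files(files):
    """Map each (component, frequency, experiment, variable) to the files it absorbs."""
    stores = {}
    for entry in files:
        sid = StoreId(entry["component"], entry["frequency"],
                      entry["experiment"], entry["variable"])
        stores.setdefault(sid, []).append(entry["name"])
    return stores


# Made-up inventory: 2 variables x 3 ensemble members x 2 time slices = 12 files.
inventory = [
    {"name": f"{var}.m{member:03d}.t{part}.nc", "component": "atm",
     "frequency": "monthly", "experiment": "20C", "variable": var}
    for var in ("TS", "PRECT") for member in range(1, 4) for part in range(2)
]
stores = group_files(inventory)

if __name__ == "__main__":
    for sid, names in stores.items():
        print(sid.key(), "<-", len(names), "files")
```

Ensemble member and time become dimensions inside each store rather than parts of a file name, which is why the store count is so much smaller than the file count.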
Thanks to: Anderson Banihirwe, Chi-Fan Shih, Brian Bonnlander, Joseph Hamman, Seth McGinnis, Kevin Paul, Gary Strand, Matthew Long, Douglas Schuster, Eric Nienhouse
Kay et al. 2015, Figure 2
Figure from Jupyter Notebook
Kay et al. 2015, Figure 4
Figure from Jupyter Notebook
Need higher level of abstraction
• Jupyter Notebooks are useful but not sufficient
• Can we enable "Geodata Fabric" of information about the Earth?
• Leverage space & time coordinates as organizing framework
  • Latitude, longitude, named places, watersheds, etc
• Multidimensional virtual data collection
• Specify what data you want, location of interest, other attributes → software automatically makes it visible/available/computable
• Standardize to simplify both human analysis and machine learning applications
Data to Decisions:
• Distill huge & complex data to ~1 bit: plant crop? evacuate? build wind farm? go skiing?
• Support non-expert data users: city planners, business analysts, citizens, ...
Some users want answers, not huge datasets (... or 100s of tiny datasets)
Source: "The promise and peril of a digital ecosystem for the planet," Campbell & Jensen, UN Environment Pgm (2019). See also "The Case for a Digital Ecosystem for the Environment," UN Science Policy Business Forum (2019)
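Distilling an ensemble to roughly one bit can be as simple as asking what fraction of members cross a threshold. The sketch below is a minimal illustration of that idea; the function name, the threshold, the required fraction, and the sample values are all hypothetical and carry no scientific meaning.

```python
def decide(ensemble_values, threshold, min_fraction=0.5):
    """Return True when at least min_fraction of members exceed threshold."""
    if not ensemble_values:
        raise ValueError("need at least one ensemble member")
    exceeding = sum(1 for value in ensemble_values if value > threshold)
    return exceeding / len(ensemble_values) >= min_fraction


# Made-up snowfall forecasts (cm) from five hypothetical ensemble members.
snowfall_cm = [12.0, 3.5, 18.2, 9.9, 15.1]
go_skiing = decide(snowfall_cm, threshold=10.0, min_fraction=0.6)

if __name__ == "__main__":
    print("go skiing?", go_skiing)
```

The non-expert user sees only the yes/no answer; the ensemble, the threshold, and the data access all stay behind the function call.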
Analysis Software Stack Concept
Analysis-Ready Data
Xarray, Dask, Zarr
Canned Visualizations, OGC WMS, GIS Integration, Notebooks
Workflows, Containers, Serverless Functions, Bare-metal HPC/EC2
Education, Decisions, Papers, Research
Desired Outcomes of Science at Scale Plan
• New scientific discoveries
• Increased public use of NCAR data
• Improved capability to analyze big data
• Reduction in cost to maintain data systems
• Employee recognition for good data management
• Improved consistency in data-related proposals to NSF
• Easier compliance with Open Data requirements
Challenges and Opportunities
• Funding
  • Work to date has leveraged existing relevant projects
  • Very modest additional support FY2020
  • Need dedicated software engineers to wire everything together
• Cultural practices
  • Some people prefer familiar, less-efficient approaches
  • Cadre of early adopters at NCAR and elsewhere
• Growing community and ecosystem of tools
  • Pangeo, Python HoloViz, etc
• Many institutions facing similar Big Data problems
NCAR
Data Commons (object storage)
RDA CDG
DASH Repo
DAV Cluster(Casper)
Analysis-Ready Data
Archival Optimized
Intake Catalog
Jupyter, PyViz, GeoCAT, Workflows
NCAR Cloud
Public Cloud
Cloud Commons
Analysis-Ready Data
Jupyter
Workflows
Serverless
selected data
Intake Catalog
PyViz
EOL HAO
Containers
Questions?
NCAR Plan for Science at Scale: https://bit.ly/2SjnXFL
(draft v0.5.0)
POSIX Filesystem vs Object Storage
POSIX | Object
HPC systems, desktop/laptop | AWS, Google Docs, Facebook, Netflix, etc
Hierarchical directory structure | Object ID + user-defined label (can simulate hierarchy)
open(), read(), seek(), close() semantics (Campaign store: Globus interface) | HTTP GET, PUT, DELETE, HEAD (+optional POSIX emulation)
limited file metadata (owners, permissions, size, dates, etc) | arbitrary additional key/value metadata pairs
stateful (system keeps track of every file's open/close state) | stateless
strong consistency (ensure no other process can read until write finished) | read-after-write consistency
resize partitions to scale up | arbitrary scale up/scale out
RAID data protection | Erasure coding
Immediately replace failed disk | Fail-in-place approach
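The difference in access semantics compared above can be sketched with a toy in-memory object store offering whole-object verbs, next to an ordinary POSIX-style partial read. The class ToyObjectStore and its method names are hypothetical illustrations modeled on the HTTP verbs listed in the table; they are not any vendor's API.

```python
import os
import tempfile


class ToyObjectStore:
    """In-memory stand-in for an object store: whole-object verbs, no open/close state."""

    def __init__(self):
        self._objects = {}   # key -> (data bytes, user metadata dict)

    def put(self, key, data, **metadata):
        # PUT replaces the whole object and attaches arbitrary key/value metadata.
        self._objects[key] = (bytes(data), dict(metadata))

    def get(self, key, byte_range=None):
        # GET returns the object, or an inclusive byte range like an HTTP Range header.
        data, _ = self._objects[key]
        if byte_range is None:
            return data
        start, end = byte_range
        return data[start:end + 1]

    def head(self, key):
        # HEAD returns metadata only, without transferring the object body.
        data, metadata = self._objects[key]
        return {"size": len(data), **metadata}

    def delete(self, key):
        self._objects.pop(key, None)

    def list_keys(self, prefix=""):
        # A "/" in a key only simulates a hierarchy; keys are flat labels.
        return sorted(k for k in self._objects if k.startswith(prefix))


def posix_read(path, offset, length):
    """POSIX-style partial read: open, seek, read, close."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


if __name__ == "__main__":
    payload = b"0123456789"

    store = ToyObjectStore()
    store.put("lens/atm/example.bin", payload, units="K")
    print(store.get("lens/atm/example.bin", byte_range=(2, 4)))
    print(store.head("lens/atm/example.bin"))

    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        print(posix_read(path, offset=2, length=3))
```

Each object-store call is self-contained, so the server keeps no per-file open/close state; the POSIX read depends on a file handle whose position the system tracks between calls.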
Key Objectives for Data Discovery & Access Goal
Object Inventory & Query
Data Objects
* Icon credit: SimpleIcon from Flaticon https://www.flaticon.com/authors/simpleicon (CC BY 3.0)
Data Access Services
S3 API
Dataset Search
Key Objectives for Data Management Goal
EOL
Archival Repositories
RDA CDG DASH HAO
Data Stewardship
ISO 19115 Metadata <XML/>
DM Plan Support
Observing Sources
• Sensors
• Field Campaigns
• External Providers
Data Science Workflow
Data Ingest
Data Cleaning
Metadata Creation
Observational Data
Model Outputs
Earth System Models
Data Assimilation
ML-based Parameterization
Data Storage
Obs/Model Intercomparison
Data Services
• Product Generation
• ML Training & Analyses
• Workflow Tools
• Analysis & Visualization Tools
• Data Optimization
• Discovery, Access, Subset
• Short-term Working Copies
• Long-term Archival Curation
Storage
Compute
Egress
• S3/S3 1Z IA/Glacier/Deep Archive
• $21/TB/mo - $1/TB/mo
• 99.999999999% (11 9s) durability
• Hardware
• Power, cooling, real estate
• Staff
• Remote backup hardware/facility
• Rich & evolving CPU/GPU choices
• $0 Free tier -
• 99.999999999% (11 9s) durability
• Hardware
• Power, cooling, real estate
• Staff
• Remote backup hardware/facility
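The storage figures on this slide lend themselves to quick arithmetic. The rough sketch below uses the $21 and $1 per TB per month endpoints quoted above and the 100 TB CESM LENS holding mentioned earlier in the deck. Treating eleven nines as an annual per-object survival probability is an assumption, and the function names and the object count are hypothetical.

```python
def monthly_storage_cost(size_tb, usd_per_tb_month):
    """Monthly storage bill in USD, ignoring request and egress charges."""
    return size_tb * usd_per_tb_month


def expected_annual_losses(n_objects, durability=0.99999999999):
    """Expected number of objects lost per year.

    Assumes durability is the annual per-object survival probability
    (an interpretation made for this sketch, not a vendor guarantee).
    """
    return n_objects * (1.0 - durability)


if __name__ == "__main__":
    print("100 TB at $21/TB/mo:", monthly_storage_cost(100, 21), "USD per month")
    print("100 TB at  $1/TB/mo:", monthly_storage_cost(100, 1), "USD per month")
    # Hypothetical object count, chosen only to show the order of magnitude.
    print("expected losses per year:", expected_annual_losses(150_000))
```

The on-prem side of the comparison has no equivalent one-line formula, since hardware, power, cooling, real estate, staff, and remote backup all have to be added in.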