OPeNDAP in the Cloud Optimizing the Use of Storage Systems Provided by Cloud Computing Environments OPeNDAP James Gallagher, Nathan Potter and NOAA/NODC

OPeNDAP in the Cloud

Optimizing the Use of Storage Systems Provided by Cloud Computing Environments

OPeNDAP: James Gallagher, Nathan Potter

and

NOAA/NODC: Deirdre Byrne, Jefferson Ogata, John Relph

26 June 2013

Cloud Systems Now*

•Providers: IBM, Microsoft, Amazon, Google, Rackspace, …

•Microsoft: Azure “…handles 100 petabytes of data a day”

•Amazon: “…hundreds of thousands of users”

•Netflix: “…stopped building its own data centers in 2008;” all in Amazon by 2012

•Snapchat: 4000 pictures per second; “…never owned a computer server.” (Google cloud)

*Quentin Hardy, “Google Joins a Heavyweight Competition in Cloud Computing,” NY Times, 3 December 2013

Why use OPeNDAP?

• The OPeNDAP request is smaller and is just the data the person wants

• In cloud systems cost is a function of data transfer, in addition to data stored, so smaller, targeted requests reduce costs

[Figure: OPeNDAP request, 4% download; full dataset, 100% download]
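The idea that an OPeNDAP request carries only the data the person wants can be illustrated with a DAP2 array constraint, which selects a hyperslab as var[start:stride:stop] per dimension. The sketch below is illustrative only: the dataset URL, the variable name sst, and the 100 x 100 array shape are assumptions chosen so that the subset works out to the 4% shown on the slide.

```python
from math import prod

def hyperslab(var, ranges):
    """Build a DAP2 array constraint: var[start:stride:stop] for each dimension."""
    return var + "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in ranges)

def subset_fraction(shape, ranges):
    """Fraction of the array's elements selected (stop index is inclusive)."""
    selected = prod(len(range(start, stop + 1, stride)) for start, stride, stop in ranges)
    return selected / prod(shape)

# Hypothetical dataset URL, variable name and array shape (not from the slides).
DATASET_URL = "http://example.org/opendap/data/sst.nc"
shape = (100, 100)
ranges = [(0, 1, 19), (40, 1, 59)]  # a 20 x 20 block of cells

url = f"{DATASET_URL}.dods?{hyperslab('sst', ranges)}"
fraction = subset_fraction(shape, ranges)
print(url)
print(f"{fraction:.0%} of the full array")
```

Only the constraint after the "?" changes between a full and a targeted request, so a client can shrink the transfer without the server storing a separate subset file.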

NOAA Environmental Data Management Conceptual Cloud Architecture*

Potential locations of cloud-enabled OPeNDAP instances

*Adapted from NOAA Environmental Data Management Framework Draft v0.3, Appendix C - Dr. Jeff de La Beaujardière, NOAA Data Management Architect

Constraints

• No vendor lock-in!

• No Stovepipes! - flexible storage method

• What will be the client of 2020?

• Hierarchical/human browsable

[Figure labels: dataset; file, file, file]

Data stores: S3 and Glacier

• S3
  o Spinning disk with a flat file system
  o Designed to make web-scale computing easier

• Glacier
  o Near-line device with 4-hour (or longer) access times
  o Secure and durable storage

• EC2
  o EC2 was used to run the OPeNDAP data server
  o Linux

Using S3 as a Data Store

[Diagram: web requests for a catalog or for data reach S3 as HTTP GET and HEAD requests; S3 holds the catalog and the data and returns an XML or data file.]
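Because S3 is reached here with plain HTTP GET and HEAD requests, a client needs nothing vendor-specific to talk to it. The following is a minimal sketch of building such requests with Python's standard library, assuming a publicly readable bucket addressed in the virtual-hosted style; the bucket name and object key are hypothetical, and nothing is sent over the network unless fetch() is called.

```python
import urllib.parse
import urllib.request

def s3_request(bucket, key, method="GET"):
    """Build (but do not send) an HTTP request for an object in a public S3 bucket."""
    url = f"https://{bucket}.s3.amazonaws.com/{urllib.parse.quote(key)}"
    return urllib.request.Request(url, method=method)

def fetch(request):
    """Send the request; returns (headers, body). The body is empty for HEAD."""
    with urllib.request.urlopen(request) as response:
        return response.headers, response.read()

# Hypothetical bucket and key, for illustration only.
head = s3_request("example-bucket", "catalog/index.xml", method="HEAD")
get = s3_request("example-bucket", "catalog/index.xml")
print(head.get_method(), head.full_url)
print(get.get_method(), get.full_url)
```

A HEAD request returns only the headers (size, modification time), which is enough to decide whether a cached copy is still usable before issuing the GET.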

OPeNDAP Catalog requests

[Diagram: the user sends a catalog request to the OPeNDAP server on EC2, which keeps a catalog cache and a data cache. Catalog access retrieves an XML file from S3, and the user receives a THREDDS catalog or HTML.]

To enhance performance, data were accessed from S3 only when not already cached.

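OPeNDAP Data requests

[Diagram: the user sends a data request to the OPeNDAP server on EC2, which keeps a catalog cache and a data cache. Data access retrieves a data file from S3, and the user receives a data slice.]

To enhance performance, data were accessed from S3 only when not already cached.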
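The caching rule on these slides (go to S3 only when the object is not already cached) is a standard cache-on-miss pattern. The sketch below shows one way it could look; it is not the server's actual code. The FileCache class, the key name, and the in-memory dictionary standing in for S3 are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

class FileCache:
    """Keep local copies of remote objects; fetch only on a cache miss."""

    def __init__(self, directory, fetch):
        self.directory = Path(directory)
        self.fetch = fetch   # callable: key -> bytes (for example, an S3 GET)
        self.misses = 0      # number of times the remote store was contacted

    def get(self, key):
        local = self.directory / key
        if not local.exists():  # cache miss: retrieve from the remote store
            self.misses += 1
            local.parent.mkdir(parents=True, exist_ok=True)
            local.write_bytes(self.fetch(key))
        return local.read_bytes()

# A dictionary stands in for S3 here; the key and contents are made up.
remote = {"data/sst.nc": b"placeholder file contents"}
cache = FileCache(tempfile.mkdtemp(), remote.__getitem__)

first = cache.get("data/sst.nc")   # miss: fetched from the stand-in store
second = cache.get("data/sst.nc")  # hit: served from the local copy
print(cache.misses)
```

The same class could back both the catalog cache and the data cache, since each only needs a key and a way to fetch the object on a miss.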

Observations

• S3FS & Amazon's APIs: vendor lock-in

• XML catalogs were flexible:
  o Support both direct web and… subsetting server access

• Likely adaptable to other use-cases

• Easily support hierarchical structure

• Catalogs didn't need to be stored in S3

Glacier and Asynchronous Responses

• To use Glacier, a web service protocol must support asynchronous access! Glacier is a near-line device, not a spinning disk.

• Support via protocol is not enough: typical use cases cannot be met without caching ‘metadata’
  o To support web interfaces/clients, DAP metadata objects should be cached
  o To support smart clients, may need domain data in cache
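Asynchronous access means a request for archived data gets an immediate "not yet" answer rather than blocking for hours. The toy model below sketches that interaction under stated assumptions: the NearLineStore class, its method names, and the status strings are hypothetical, the dictionary stands in for Glacier, and time is passed in explicitly as hours so the behavior can be followed without waiting. Only the 4-hour delay comes from the slides.

```python
class NearLineStore:
    """Toy model of a near-line archive fronted by a cache (names are hypothetical)."""

    def __init__(self, archive, delay_hours=4):
        self.archive = archive          # key -> bytes, standing in for Glacier
        self.delay_hours = delay_hours  # retrieval latency from the slides
        self.cache = {}                 # key -> bytes already retrieved
        self.pending = {}               # key -> hour at which retrieval completes

    def request(self, key, now):
        """Return ("ready", data) or ("pending", ready_at) without blocking."""
        if key in self.cache:
            return "ready", self.cache[key]
        # Start a retrieval on first request; later requests reuse the same job.
        ready_at = self.pending.setdefault(key, now + self.delay_hours)
        if now >= ready_at:
            self.cache[key] = self.archive[key]
            del self.pending[key]
            return "ready", self.cache[key]
        return "pending", ready_at

store = NearLineStore({"sst_2011.nc": b"archived bytes"})
r0 = store.request("sst_2011.nc", now=0)  # starts the retrieval
r2 = store.request("sst_2011.nc", now=2)  # still waiting on the same job
r4 = store.request("sst_2011.nc", now=4)  # retrieval finished, now cached
print(r0, r2, r4)
```

A browsing client cannot wait through the pending state just to see what a dataset contains, which is why the slides call for keeping DAP metadata in the cache even when the data themselves are archived.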

Glacier Implementation

• Caching
  o Catalog
  o DAP metadata

• Support for programmatic and web clients
  o Web clients are the primary user of the DAP metadata because of their ‘click and browse’ behavior

• XML with an embedded XSL style sheet
  o Single response (XML)
  o Multiple target clients – smart and browser
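A single XML response can serve both kinds of client: a browser applies the style sheet and renders a page, while a smart client ignores the style sheet and reads the elements directly. The sketch below shows the smart-client side. The slides do not say how the style sheet was attached; this example assumes the standard xml-stylesheet processing instruction, and the element names, attributes, and style sheet path are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: element names, attributes and the XSL path are invented.
RESPONSE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/xsl/catalog.xsl"?>
<catalog name="sst">
  <dataset name="sst_2012.nc" status="cached"/>
  <dataset name="sst_2011.nc" status="archived"/>
</catalog>
"""

def list_datasets(xml_text):
    """Smart-client view: parse the XML and skip the style sheet instruction."""
    root = ET.fromstring(xml_text)
    return [(d.get("name"), d.get("status")) for d in root.iter("dataset")]

datasets = list_datasets(RESPONSE)
for name, status in datasets:
    print(name, status)
```

The default ElementTree parser drops processing instructions, so the programmatic client sees only the data, and the server does not need a second response format for browsers.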

Comparison: S3 and Glacier*

•Glacier provides “secure and durable storage”

•S3 is “designed to make web-scale computing easier”

•These graphs: A tiny part of a complex cost model. They do not include the cost to move data out of the Amazon cloud, EC2 instances, etc.

*http://calculator.s3.amazonaws.com/calc5.html
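To show why storage price alone is only part of the picture, here is a deliberately small cost sketch that adds a data-transfer term to the storage term. Every number in it is a placeholder chosen for easy arithmetic, not an Amazon price; the Prices class and the tier names are hypothetical, and real pricing has more terms (requests, retrieval fees, EC2 instances) than this model includes.

```python
from dataclasses import dataclass

@dataclass
class Prices:
    storage_per_gb_month: float  # placeholder value, not a real rate
    transfer_out_per_gb: float   # placeholder value, not a real rate

def monthly_cost(prices, stored_gb, transferred_gb):
    """Storage plus outbound transfer; all other cost terms are ignored."""
    return (prices.storage_per_gb_month * stored_gb
            + prices.transfer_out_per_gb * transferred_gb)

# Two hypothetical storage tiers with made-up prices.
online = Prices(storage_per_gb_month=0.10, transfer_out_per_gb=0.10)
nearline = Prices(storage_per_gb_month=0.01, transfer_out_per_gb=0.10)

stored_gb = 1000
full_download = monthly_cost(online, stored_gb, transferred_gb=1000)
subset_download = monthly_cost(online, stored_gb, transferred_gb=40)  # 4% of the data
nearline_subset = monthly_cost(nearline, stored_gb, transferred_gb=40)
print(full_download, subset_download, nearline_subset)
```

Even in this reduced form the result depends on how much data users pull out each month, which is why usage has to be modeled and monitored rather than read off a storage price graph.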

Summary

• OPeNDAP server with minimal changes

• Data stored in S3 and Glacier

• Solution widely applicable: Web + Smart clients

• Complexity of the cost model: combination of both S3 and Glacier likely

• Modeling & Monitoring use required
