eResearch Australia 2014 - AGDC: A Common Analytical Framework

The Australian Geoscience Data Cube (AGDC) – A Common Analytical Framework

Alex Ip (GA), Ben Evans (NCI), Leo Lymburner (GA), Simon Oliver (GA)


Page 2

The Past: Origins of the AGDC – A Quick Recap

WARNING: Recycled slides ahead! (not many)

Page 3

Why did we “roll our own” in the first place?

The situation in early 2013:

• ~0.5PB of Landsat data and derived products

• GA needed to deliver urgent, large-scale analyses

• Scientists previously had to (manually) locate and arrange individual scenes before doing any science

• Needed a means of leveraging compute power and storage at NCI

• No third-party product was guaranteed to handle the scale and temporal depth of the whole archive

Page 4

Problems with Traditional Monolithic Array Approaches

• Remote sensing data is typically both spatially and temporally sparse and irregular, unlike modelling output.

• Monolithic array approaches (e.g. x-y-t) do not work well – too many empty pixels, temporal “binning” required.

[Figure: x-y-t array diagram with axes X, Y and t]
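To make the "too many empty pixels" point concrete, here is a minimal sketch (not AGDC code) that counts how much of a dense x-y-t array would actually hold data. The tile indices and acquisition dates are made-up illustrative values.

```python
from datetime import date

# Hypothetical acquisitions: (tile_x, tile_y, acquisition_date).
# Each satellite pass covers only some tiles on any given date.
observations = {
    (148, -36, date(2010, 1, 1)),
    (149, -36, date(2010, 1, 1)),
    (148, -36, date(2010, 1, 17)),
    (150, -35, date(2010, 1, 8)),
    (150, -36, date(2010, 1, 8)),
    (149, -35, date(2010, 1, 24)),
}


def dense_fill_fraction(obs):
    """Fraction of slots in a dense x-y-t array that hold real data."""
    xs = {x for x, _, _ in obs}
    ys = {y for _, y, _ in obs}
    ts = {t for _, _, t in obs}
    slots = len(xs) * len(ys) * len(ts)  # every (x, y, t) combination is allocated
    return len(obs) / slots


fill = dense_fill_fraction(observations)
print(f"{fill:.0%} of the dense array is populated")
```

Even in this toy case three quarters of the array is empty, and every additional acquisition date adds a full layer of mostly empty slots.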

Page 5

Dice’n’Stack(‘n’Analyse)

Page 6

Where are we now? Key Features of the Current AGDC

• (Almost embarrassing) simplicity is its virtue.

• Flexibility favoured over performance.

• Implemented in Python (with NumPy/NumExpr)

• Regular, spatially-partitioned, time-stamped, multi-band 2D GeoTIFF (or netCDF) storage units (tiles)

• PostGIS relational database indexing tile files and associated metadata

• GDAL used for all geospatial and file operations

• Separate (small) metadata database can be replicated for scalability against a single copy of (large) data files on a parallel filesystem
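The idea of a small relational index pointing at tile files can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for PostGIS, and the table schema, the 1-degree tile indices and the file paths are all hypothetical rather than the actual AGDC schema.

```python
import sqlite3

# In-memory SQLite stands in for the PostGIS index; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tile (
           x_index INTEGER,   -- tile column (assumed 1-degree longitude steps)
           y_index INTEGER,   -- tile row (assumed 1-degree latitude steps)
           acquired TEXT,     -- ISO 8601 acquisition timestamp
           path TEXT          -- location of the tile file on the filesystem
       )"""
)
conn.executemany(
    "INSERT INTO tile VALUES (?, ?, ?, ?)",
    [
        (148, -36, "2010-01-01T23:50:00", "/data/LS5_148_-036_20100101.tif"),
        (148, -36, "2010-01-17T23:50:00", "/data/LS5_148_-036_20100117.tif"),
        (149, -36, "2010-01-08T23:44:00", "/data/LS7_149_-036_20100108.tif"),
        (148, -36, "2011-03-05T23:51:00", "/data/LS5_148_-036_20110305.tif"),
    ],
)


def find_tiles(conn, x_range, y_range, start, end):
    """Return paths of tiles inside the index/time window, oldest first."""
    rows = conn.execute(
        """SELECT path FROM tile
           WHERE x_index BETWEEN ? AND ?
             AND y_index BETWEEN ? AND ?
             AND acquired >= ? AND acquired < ?
           ORDER BY acquired""",
        (*x_range, *y_range, start, end),
    )
    return [path for (path,) in rows]


stack = find_tiles(conn, (148, 148), (-36, -36), "2010-01-01", "2011-01-01")
print(stack)
```

The query returns a time-ordered stack of file paths for one tile, which is the kind of lookup that replaces manually locating and arranging scenes.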

Page 7

Bottom line: IT WORKS!

KISS – Simple scales, complex fails…

Page 8

Seamless, continental-scale, multi-decadal, high-resolution data

First ever image showing pixel-based clear observation counts for the whole Australian Landsat archive (then 1998-2012)

Page 9

AGDC Application Projects Currently Supported

1. Water Observations from Space - National Flood Risk Information Project

2. The Murray-Darling Basin Vegetation Condition Assessment

3. CSIRO-GA Shallow water bathymetry using Landsat

4. CSIRO Australian Landsat MODIS Blending Infrastructure (ALMBI)

5. GeoGlam Rangelands

6. Inland water – data query test case

7. The National Carbon Accounting Scheme

8. Primary production in space and time eMAST

9. Crop mapping

10. Cal/Val site identification

11. Inter-tidal mapping

12. CRCSI Big Data for Environmental Information

13. Joint Remote Sensing Research Program Trial

14. Bare Soil for Mineral Enhanced Mapping

Page 10

Water Observations from Space (WOfS) - still our current poster child

NFRIP water detection

• 15 Years of data from LS5 & LS7 (1998-2012)

• 25m Nominal Pixel Resolution

• Approx. 133,000 individual source scenes in approx. 12,400 passes

• Entire archive of 1,312,087 ARG25 tiles => 21 x 10^12 pixels visited

• Originally 2 days at NCI (elapsed time) to compute. Now <3hrs.

Find this at www.ga.gov.au
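The quoted pixel count can be sanity-checked with a little arithmetic. The tile size of 4000 x 4000 pixels is an assumption (it is not stated on the slide), chosen because it reproduces the quoted total of roughly 21 x 10^12.

```python
TILE_COUNT = 1_312_087           # ARG25 tiles, from the slide
TILE_WIDTH = TILE_HEIGHT = 4000  # assumed pixels per tile side (not from the slide)

pixels_visited = TILE_COUNT * TILE_WIDTH * TILE_HEIGHT
print(f"{pixels_visited:.3e} pixels")  # compare with the quoted 21 x 10^12

# Speed-up from the original 2-day run to the current run of under 3 hours.
speedup_at_least = (2 * 24) / 3
print(f"at least {speedup_at_least:.0f}x faster")
```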

Page 11

Mission Accomplished! We can all go home now.

Not Quite.

http://www.theguardian.com/commentisfree/2013/may/01/george-bush-mission-accomplished-at-ten

Page 12

The Present: What we’re busy on right now

We’re not resting on our fat laurels…

Page 13

Tasks Underway (or Completed) with Current AGDC

• All code now open-sourced on GitHub (https://github.com/GeoscienceAustralia/agdc)

• Ingesting new data collections using generic ingestion framework (e.g. MODIS).

• Hardening remaining prototype code and optimising prototype DB schema.

• Developing new APIs to support specific use case patterns.

• Developing generic workflow tools to manage parallel processing (Luigi)

• Delivering basic WMS, WCS, WPS & WCPS web services

• Providing simple tools for cross-sensor interoperability (e.g. spectral matching/adjustment)

• Managing versioning and process provenance

Page 14

Standard Workflow Patterns

1. Use cases analysed

2. APIs designed

3. Generic, HPC-friendly workflow engines implemented

Page 15

Current State of the Australian Landsat Archive in the AGDC

27 Years of Landsat Data (1987-2014) processed so far*:

• 20,500 Passes in 301,400 Acquisitions

• 857,000 available datasets (all processing levels)

• 93 x 10^12 Pixels (peta-Pixels?) in all available datasets (>11x more if counting bands separately)

• ~1200 observations for some areas

• ~0.75PB data currently stored in RDSI

* Figures as at 30/9/14 rounded to nearest hundred

Page 16

…but Landsat is only ONE satellite mission.

http://animals.desktopnexus.com/wallpaper/51953/

Page 17

The Approaching Earth Observation Data Deluge…

Page 18

Looming Technical Issues

• Need to better align with global standards

• One-file-per-tile model will eventually hit file system limits.

• n-deep temporal stacks mean at least n 2D file open operations for time series (actually more like n^2) – not efficient. No contiguous time series access.

• Can’t easily leverage third-party developments with current externally-indexed 2D file format

• Need to handle non-EO data with very different characteristics
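The file-open cost of deep temporal stacks can be illustrated with a rough count. This sketch only models the lower bound of n opens mentioned above, and the alternative layout holding 100 time slices per file is a hypothetical value for comparison, not a figure from the slides.

```python
import math


def file_opens_2d(n_time_slices, n_tiles=1):
    """One 2D file per tile per time slice: every slice must be opened."""
    return n_time_slices * n_tiles


def file_opens_blocked(n_time_slices, slices_per_file, n_tiles=1):
    """Hypothetical layout holding several time slices in each file."""
    return math.ceil(n_time_slices / slices_per_file) * n_tiles


n = 1200  # roughly the deepest stacks quoted for the Landsat archive
print(file_opens_2d(n))            # opens needed with one file per tile per slice
print(file_opens_blocked(n, 100))  # opens needed with 100 slices per file
```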

Page 19

Two-pronged development approach:

AGDC collaboration between GA, CSIRO, NCI, JRSP et al formed to:

1. Support existing use cases with enhanced, operationalised version of current 2D-storage based system

2. Concurrently develop new multidimensional-storage based system and extensible data model

Page 20

The Future: What we’re designing

If it were easy, everyone would be doing it.

So let’s make it easy…

Page 21

Time is of the Essence

Change detection and other temporal analyses are proving rich sources of fresh research insights.

• We have already implemented and tested very effective temporal detection and removal of transient artefacts (e.g. cloud and cloud-shadow).

• Temporal classification of EO data provides a far more accurate method for identifying seasonally-varying vegetation (e.g. crop identification) than purely spectral methods.

• Data fusion with time-varying ancillary data (e.g. tide level, rainfall, temperature) can provide even richer insights.
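One generic way to spot transient artefacts in a per-pixel time series is to compare each observation with the temporal median. The sketch below shows that idea only; it is not the AGDC's actual detection method, and the reflectance values and the tolerance are invented for illustration.

```python
from statistics import median

# Hypothetical reflectance time series for one pixel (None = no observation).
# Bright spikes mimic cloud, dark dips mimic cloud shadow.
series = [0.21, 0.22, 0.85, 0.20, None, 0.04, 0.23, 0.22]


def temporal_median(values):
    """Median of the valid observations in a per-pixel time series."""
    valid = [v for v in values if v is not None]
    return median(valid)


def flag_transients(values, tolerance=0.1):
    """Mark observations that differ from the temporal median by more than tolerance."""
    centre = temporal_median(values)
    return [v is not None and abs(v - centre) > tolerance for v in values]


print(temporal_median(series))
print(flag_transients(series))
```

A purely spectral test on a single scene has no such reference value to compare against, which is one reason the temporal dimension is so useful here.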

Page 22

Temporal Analysis at Work: Fractional Cover (Keytah Station)

Page 23

High Performance Computing

A way of turning a compute problem into an I/O problem.

Page 24

Stacks of Time…

Many “multidimensional” array systems (including our current one) actually use 2D storage units behind the scenes.

http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters

Page 25

I really LOVE GDAL, but…the Earth is NOT Flat

GDAL (Geospatial Data Abstraction Library) is our “Swiss Army Knife” for geospatial raster file manipulation, but it has its limitations:

• Inherently 2D (multi-band)

• Not temporally-aware.

• Optimised for 2D spatial access.

• Georeferencing is largely redundant

• Native file format libraries would be heaps faster

Page 26

NetCDF – An Obvious Choice (However you slice it)

• NetCDF is the dominant standard format for multidimensional, array-based scientific data.

• Encapsulation of HDF5 in netCDF 4 provides rich multidimensional configurability

• Extensive and growing range of tools being developed for NetCDF.

• Able to leverage existing delivery mechanisms (e.g. THREDDS, Hyrax, OPeNDAP, ERDDAP, etc)

• HPC friendly – Parallel processing tools already exist

• According to Unidata: Self-describing, portable, scalable, appendable, sharable, archivable, blah blah blah…

Page 27

I Like it Chunky…

• NetCDF 4 supports a storage layout called “chunking”, in which the chunk size can be configured for each array dimension.

• Data is compressed/written and read/decompressed in chunks, resulting in vastly improved I/O performance for subsets (slices) of data across different dimensions
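The effect of chunk shape on different access patterns can be estimated by counting how many chunks a request touches. This is a back-of-the-envelope sketch, assuming requests start on a chunk boundary; the array size and chunk shapes are hypothetical.

```python
import math


def chunks_touched(request_shape, chunk_shape):
    """Number of chunks read for an aligned request of the given shape.

    Both arguments are (time, y, x) sizes; the request is assumed to start
    on a chunk boundary.
    """
    return math.prod(
        math.ceil(req / chunk) for req, chunk in zip(request_shape, chunk_shape)
    )


# Hypothetical array: 1000 time slices of 4000 x 4000 pixels.
time_series = (1000, 1, 1)      # one pixel through all time
single_image = (1, 4000, 4000)  # one full spatial slice

for chunks in [(1, 4000, 4000), (1000, 1, 1), (100, 100, 100)]:
    print(chunks, chunks_touched(time_series, chunks), chunks_touched(single_image, chunks))
```

Chunking purely by image or purely by time series makes one access pattern cheap and the other very expensive, while an intermediate chunk shape keeps both within reason.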

http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters

Page 28

Actually, I like it SUPER-Chunky!

• The current AGDC partitions the data into regular spatial extents

• We are extending this concept to temporally aggregate observations into “super-chunk” or “data-block” files with regular spatial and temporal extents

• These multidimensional files would still be indexed using a spatially-enabled relational database (PostGIS)

• All degrees of freedom for storage units need to be configured in the database.
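A data block with regular spatial and temporal extents can be addressed by integer indices, which is what a relational index would store. The following is a minimal sketch under assumed extents of 1 degree and 1 year; the function name and parameters are hypothetical.

```python
import math
from datetime import datetime


def block_key(lon, lat, when, tile_degrees=1.0, years_per_block=1):
    """Map a point observation to the (x, y, t) index of its data block.

    The 1-degree spatial extent and 1-year temporal extent are assumed
    values; in the design described above they would come from the database.
    """
    x_index = math.floor(lon / tile_degrees)
    y_index = math.floor(lat / tile_degrees)
    t_index = when.year // years_per_block
    return x_index, y_index, t_index


print(block_key(148.6, -35.3, datetime(2010, 1, 17)))
```

Because the extents are parameters rather than constants, the same lookup works for any storage unit configuration recorded in the database.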

Page 29

A truly Multidimensional Storage Subsystem

From feeling a bit flat, we’re making the jump to hyperspace…

http://blog.digilentinc.com/wp-content/uploads/2014/08/6D-array.gif

Page 30

A New Data Model will be Required

It’s only a model…

http://www.dvd.net.au/review.cgi?review_id=2769

Page 31

Data Model Wishlist – Storage Management

• Storage format agnosticism (not specifically dependent on NetCDF or any other file format)

• Generalised dimensionality (i.e. not hard-coded for x-y-t: could also manage x-y-z-t, x-y-z-λ-t or more)

• Multiple measurements (e.g. spectral bands in EO datasets) stored either as individual variables or as an additional array dimension
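The wishlist items above could be captured in a small configuration structure. This is a speculative sketch of one possible shape for such a model, not the AGDC design: all class names, field names and block sizes are invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Dimension:
    tag: str         # short name, e.g. "x", "t" or "lambda"
    block_size: int  # elements of this dimension held in one storage unit


@dataclass(frozen=True)
class StorageType:
    name: str
    file_format: str                   # e.g. "netCDF" or "GeoTIFF"; not hard-coded
    dimensions: Tuple[Dimension, ...]  # any number, in storage order
    measurements_as_dimension: bool    # False: one variable per measurement

    def block_shape(self):
        """Array shape of a single storage unit."""
        return tuple(d.block_size for d in self.dimensions)


landsat = StorageType(
    "hypothetical_landsat_xyt", "netCDF",
    (Dimension("t", 100), Dimension("y", 4000), Dimension("x", 4000)),
    measurements_as_dimension=False,
)
hyperspectral = StorageType(
    "hypothetical_xyz_lambda_t", "netCDF",
    (Dimension("t", 10), Dimension("lambda", 200), Dimension("z", 5),
     Dimension("y", 500), Dimension("x", 500)),
    measurements_as_dimension=True,
)
print(landsat.block_shape(), hyperspectral.block_shape())
```

Nothing in the structure assumes x-y-t or a particular file format, so a five-dimensional x-y-z-λ-t store is described in the same way as a Landsat stack.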

Page 32

Data Model Wishlist – Metadata Management

• Domain-neutrality (i.e. not Earth Observation specific)

• Hierarchical categorisation with management of inherited metadata attributes and values

• Generalised and extensible specification and management of metadata attributes (very tricky)

Page 33

Hierarchical Categorisation

• We need to accommodate data across diverse domains which have different categorisation schemes

• We need to map these schemes and provide linkages to controlled vocabularies

• Data will inherit metadata attributes from each parent category

|---Observation Type: Remote Sensing (RS)
|   |---Mission: Landsat Series (LS)
|   |   |---Satellite: Landsat 5 (LS5)
|   |   |   |---Sensor: Thematic Mapper (TM)
|   |   |---Satellite: Landsat 7 (LS7)
|   |       |---Sensor: Enhanced Thematic Mapper (ETM+)
|   |---Band Type: Reflectance (VSWIR)
|   |---Band Type: Thermal (TIR)
|   |---Band Type: Panchromatic (PAN)
|   |---Dataset Type: Level 1 Topographically Corrected (L1T)
|   |---Dataset Type: ARG25 - Surface-Reflectance Corrected (NBAR)
|   |---Dataset Type: Pixel Quality (PQA)
|   |---Dataset Type: Fractional Cover (FC)
|
|---Observation Type: Bathymetry (BATHY)
    |---Survey: Victorian Continental Shelf, 2013 (VCS2013)
    |   |---Vessel: Jolly Roger (JOLROG)
    |       |---Voyage: January 15, 2013 – February 15, 2013 (20130115-20130215)
    |---Instrument Type: Multi-Beam Sonar (MBS)
    |---Instrument Type: Side-Scan Sonar (SSS)
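Inheritance of metadata from parent categories can be sketched with a simple tree of nodes. The category tags come from the hierarchy above, but the class itself and the attribute names and values attached to each node are purely illustrative assumptions.

```python
class Category:
    """A node in the category tree; attribute names here are illustrative."""

    def __init__(self, level, tag, parent=None, **attributes):
        self.level = level
        self.tag = tag
        self.parent = parent
        self.attributes = attributes

    def metadata(self):
        """Merge attributes from the root down, so children override parents."""
        inherited = self.parent.metadata() if self.parent else {}
        return {**inherited, **self.attributes}

    def path(self):
        """Tags from the root to this node, e.g. 'RS/LS/LS7/ETM+'."""
        prefix = self.parent.path() + "/" if self.parent else ""
        return prefix + self.tag


rs = Category("Observation Type", "RS", platform_class="satellite")
ls = Category("Mission", "LS", parent=rs, mission_name="Landsat Series")
ls7 = Category("Satellite", "LS7", parent=ls, launch_year=1999)
etm = Category("Sensor", "ETM+", parent=ls7, sensor_name="Enhanced Thematic Mapper")

print(etm.path())
print(etm.metadata())
```

The sensor node stores only its own attribute but reports everything defined on its ancestors, which is the inheritance behaviour described in the bullets above.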

Page 34

Researcher-Focused Outcomes

• As many ways as possible to access, combine and use data

• Easy implementation of common analysis patterns with custom algorithms

• Delivery of spatio-temporal subsets via OGC-compliant web services

• Ability to leverage HPC and cloud computing resources

• Support for a wide range of gridded data across diverse domains

• Data selection, filtering and sorting on any metadata attributes

• Support for in-situ processing (WPS, WCPS) – bring the compute to the data

Page 35

Phone: +61 2 6249 9517

Web: www.ga.gov.au

Email: [email protected]

Address: Cnr Jerrabomberra Avenue and Hindmarsh Drive, Symonston ACT 2609

Postal Address: GPO Box 378, Canberra ACT 2601

Thank you!

Any Questions?
