eResearch Australia 2014 - AGDC: A Common Analytical Framework
The Australian Geoscience Data Cube (AGDC) – A Common Analytical Framework
Alex Ip (GA), Ben Evans (NCI), Leo Lymburner (GA), Simon Oliver (GA)
The Past: Origins of the AGDC – A Quick Recap
WARNING: Recycled slides ahead! (not many)
Why did we “roll our own” in the first place?
The situation in early 2013:
• ~0.5PB of Landsat data and derived products
• GA needed to deliver urgent, large-scale analyses
• Scientists previously had to (manually) locate and arrange individual scenes before doing any science
• Needed a means of leveraging compute power and storage at NCI
• No third-party product was guaranteed to handle the scale and temporal depth of the whole archive
Problems with Traditional Monolithic Array Approaches
• Remote sensing data is typically both spatially and temporally sparse and irregular, unlike modelling output.
• Monolithic array approaches (e.g. x-y-t) do not work well – too many empty pixels, temporal “binning” required.
[Figure: data array with axes x, y and t]
Dice’n’Stack(‘n’Analyse)
Where are we now? Key Features of the Current AGDC
• (Almost embarrassing) simplicity is its virtue.
• Flexibility favoured over performance.
• Implemented in Python (with NumPy/NumExpr)
• Regular, spatially-partitioned, time-stamped, multi-band 2D GeoTIFF (or netCDF) storage units (tiles)
• PostGIS relational database indexing tile files and associated metadata
• GDAL used for all geospatial and file operations
• Separate (small) metadata database can be replicated for scalability against a single copy of (large) data files on a parallel filesystem
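The tile-plus-index design above can be sketched roughly as follows. This is only an illustration: it uses an in-memory SQLite table as a stand-in for the PostGIS index, and the table layout, column names and file paths are all hypothetical rather than the real AGDC schema.

```python
import sqlite3

# Stand-in for the (small) metadata database. The real AGDC uses PostGIS;
# this table layout and these paths are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tile (
           x_index  INTEGER,  -- spatial partition column
           y_index  INTEGER,  -- spatial partition row
           acquired TEXT,     -- ISO 8601 time stamp of the observation
           product  TEXT,     -- e.g. NBAR, PQA, FC
           path     TEXT      -- location of the 2D tile file
       )"""
)
conn.executemany(
    "INSERT INTO tile VALUES (?, ?, ?, ?, ?)",
    [
        (150, -34, "2010-06-21T23:50:00", "NBAR", "/tiles/150_-034/nbar_2010-06-21.tif"),
        (150, -34, "2010-01-05T23:50:00", "NBAR", "/tiles/150_-034/nbar_2010-01-05.tif"),
        (150, -34, "2011-02-09T23:51:00", "NBAR", "/tiles/150_-034/nbar_2011-02-09.tif"),
        (151, -34, "2010-01-12T23:44:00", "NBAR", "/tiles/151_-034/nbar_2010-01-12.tif"),
    ],
)


def tile_stack(conn, x_index, y_index, product, start, end):
    """Return the tile file paths for one spatial cell, ordered by time."""
    cursor = conn.execute(
        "SELECT path FROM tile "
        "WHERE x_index = ? AND y_index = ? AND product = ? "
        "AND acquired >= ? AND acquired < ? "
        "ORDER BY acquired",
        (x_index, y_index, product, start, end),
    )
    return [row[0] for row in cursor]


# All NBAR tiles for cell (150, -34) acquired during 2010, oldest first.
stack_2010 = tile_stack(conn, 150, -34, "NBAR", "2010-01-01", "2011-01-01")
print(stack_2010)
```

The query returns only file paths; the pixel data itself stays in the tile files, which is what lets the small index be replicated independently of the large data store.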
Bottom line: IT WORKS!
KISS – Simple scales, complex fails…
Seamless, continental-scale, multi-decadal, high-resolution data
First ever image showing pixel-based clear observation counts for the whole Australian Landsat archive (then 1998-2012)
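A pixel-based clear observation count is conceptually a sum over the time axis of a stack of "clear" masks. The sketch below shows the idea on plain Python lists to stay dependency-free (the AGDC itself uses NumPy); the function name and the tiny 2 × 2 masks are made up for illustration.

```python
def clear_observation_counts(mask_stack):
    """Count, per pixel, how many time slices hold a clear observation.

    mask_stack is a list of equally sized 2D lists of booleans,
    one per time slice, where True means "clear".
    """
    rows = len(mask_stack[0])
    cols = len(mask_stack[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for mask in mask_stack:
        for r in range(rows):
            for c in range(cols):
                if mask[r][c]:
                    counts[r][c] += 1
    return counts


# Three hypothetical time slices over the same 2 x 2 pixel area.
masks = [
    [[True, False], [True, True]],
    [[True, False], [False, True]],
    [[True, True], [False, True]],
]
counts = clear_observation_counts(masks)
print(counts)
```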
AGDC Application Projects Currently Supported
1. Water Observations from Space - National Flood Risk Information Project
2. The Murray-Darling Basin Vegetation Condition Assessment
3. CSIRO-GA Shallow water bathymetry using Landsat
4. CSIRO Australian Landsat MODIS Blending Infrastructure (ALMBI)
5. GeoGlam Rangelands
6. Inland water – data query test case
7. The National Carbon Accounting Scheme
8. Primary production in space and time eMAST
9. Crop mapping
10. Cal/Val site identification
11. Inter-tidal mapping
12. CRCSI Big Data for Environmental Information
13. Joint Remote Sensing Research Program Trial
14. Bare Soil for Mineral Enhanced Mapping
Water Observations from Space (WOfS) - still our current poster child
NFRIP water detection
• 15 years of data from LS5 & LS7 (1998-2012)
• 25m nominal pixel resolution
• Approx. 133,000 individual source scenes in approx. 12,400 passes
• Entire archive of 1,312,087 ARG25 tiles => 21 × 10^12 pixels visited
• Originally 2 days at NCI (elapsed time) to compute. Now <3hrs.
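As a sanity check on the WOfS figures, the arithmetic can be reproduced as below. The tile size of 4000 × 4000 pixels is an assumption (it is not stated on the slide), chosen because it reproduces the quoted pixel count; the speed-up and throughput are lower bounds because the slide only says "<3hrs".

```python
# Figures quoted on the slide.
TILES = 1_312_087
ORIGINAL_HOURS = 2 * 24  # "2 days"
CURRENT_HOURS = 3        # upper bound, from "<3hrs"

# Assumption, not stated on the slide: each tile is 4000 x 4000 pixels.
TILE_SIDE_PIXELS = 4000

pixels_visited = TILES * TILE_SIDE_PIXELS ** 2
min_speedup = ORIGINAL_HOURS / CURRENT_HOURS
min_pixels_per_second = pixels_visited / (CURRENT_HOURS * 3600)

print(f"{pixels_visited:.3e} pixels visited")
print(f"speed-up of at least {min_speedup:.0f}x")
print(f"at least {min_pixels_per_second:.2e} pixels per second")
```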
Find this at www.ga.gov.au
Mission Accomplished! We can all go home now.
Not Quite.
http://www.theguardian.com/commentisfree/2013/may/01/george-bush-mission-accomplished-at-ten
The Present: What we’re busy on right now
We’re not resting on our fat laurels…
Tasks Underway (or Completed) with Current AGDC
• All code now open-sourced on GitHub (https://github.com/GeoscienceAustralia/agdc)
• Ingesting new data collections using generic ingestion framework (e.g. MODIS).
• Hardening remaining prototype code and optimising prototype DB schema.
• Developing new APIs to support specific use case patterns.
• Developing generic workflow tools to manage parallel processing (Luigi)
• Delivering basic WMS, WCS, WPS & WCPS web services
• Providing simple tools for cross-sensor interoperability (e.g. spectral matching/adjustment)
• Managing versioning and process provenance
Standard Workflow Patterns
1. Use cases analysed
2. APIs designed
3. Generic, HPC-friendly workflow engines implemented
Current State of the Australian Landsat Archive in the AGDC
27 Years of Landsat Data (1987-2014) processed so far*:
• 20,500 Passes in 301,400 Acquisitions
• 857,000 available datasets (all processing levels)
• 93 × 10^12 Pixels (peta-Pixels?) in all available datasets (>11x more if counting bands separately)
• ~1200 observations for some areas
• ~0.75PB data currently stored in RDSI
* Figures as at 30/9/14 rounded to nearest hundred
…but Landsat is only ONE satellite mission.
http://animals.desktopnexus.com/wallpaper/51953/
The Approaching Earth Observation Data Deluge…
Looming Technical Issues
• Need to better align with global standards
• One-file-per-tile model will eventually hit file system limits.
• n-deep temporal stacks mean at least n 2D file open operations for time series (actually more like n^2) – not efficient. No contiguous time series access.
• Can’t easily leverage third-party developments with current externally-indexed 2D file format
• Need to handle non-EO data with very different characteristics
Two-pronged development approach:
AGDC collaboration between GA, CSIRO, NCI, JRSP et al formed to:
1. Support existing use cases with enhanced, operationalised version of current 2D-storage based system
2. Concurrently develop new multidimensional-storage based system and extensible data model
The Future: What we’re designing
If it were easy, everyone would be doing it.
So let’s make it easy…
Time is of the Essence
Change detection and other temporal analyses are proving rich sources of fresh research insights.
• We have already implemented and tested very effective temporal detection and removal of transient artefacts (e.g. cloud and cloud-shadow).
• Temporal classification of EO data provides a far more accurate method for identifying seasonally-varying vegetation (e.g. crop identification) than purely spectral methods.
• Data fusion with time-varying ancillary data (e.g. tide level, rainfall, temperature) can provide even richer insights.
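The slides do not say which algorithm is used for temporal artefact removal, so the following is a stand-in rather than the AGDC method: a per-pixel temporal median, which is one simple way a time series can suppress transient outliers such as bright cloud or dark cloud-shadow. The function name and the reflectance values are invented for illustration.

```python
from statistics import median


def temporal_median(stack, nodata=None):
    """Per-pixel median through time, ignoring nodata values.

    stack is a list of equally sized 2D lists, one per time slice.
    """
    rows = len(stack[0])
    cols = len(stack[0][0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Gather this pixel's time series, skipping missing values.
            series = [layer[r][c] for layer in stack if layer[r][c] != nodata]
            out_row.append(median(series) if series else nodata)
        result.append(out_row)
    return result


# One pixel observed on five dates: the third value mimics cloud (too
# bright) and the fifth mimics cloud-shadow (too dark).
stack = [[[0.10]], [[0.11]], [[0.85]], [[0.09]], [[0.02]]]
composite = temporal_median(stack)
print(composite)
```

Neither outlier reaches the composite, because the median only depends on the middle of the sorted time series.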
Temporal Analysis at Work: Fractional Cover (Keytah Station)
High Performance Computing
A way of turning a compute problem into an I/O problem.
Stacks of Time…
Many “multidimensional” array systems (including our current one) actually use 2D storage units behind the scenes.
http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
I really LOVE GDAL, but…the Earth is NOT Flat
GDAL (Geospatial Data Abstraction Library) is our “Swiss Army Knife” for geospatial raster file manipulation, but it has its limitations:
• Inherently 2D (multi-band)
• Not temporally-aware.
• Optimised for 2D spatial access.
• Georeferencing is largely redundant
• Native file format libraries would be heaps faster
NetCDF – An Obvious Choice (However you slice it)
• NetCDF is the dominant standard format for multidimensional, array-based scientific data.
• Encapsulation of HDF5 in netCDF 4 provides rich multidimensional configurability
• Extensive and growing range of tools being developed for NetCDF.
• Able to leverage existing delivery mechanisms (e.g. THREDDS, Hyrax, OPeNDAP, ERDDAP, etc)
• HPC friendly – Parallel processing tools already exist
• According to Unidata: Self-describing, portable, scalable, appendable, sharable, archivable, blah blah blah…
I Like it Chunky…
• NetCDF 4 supports “chunked” storage, where the chunk size can be configured for each array dimension.
• Data is compressed/written and read/decompressed in chunks, resulting in vastly improved I/O performance for subsets (slices) of data across different dimensions
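Why chunk shape matters can be seen by counting how many chunks a read has to touch. The sketch below is plain arithmetic, not a netCDF API call; the array shape (1000 time slices of 4000 × 4000 pixels) and the two chunk layouts are hypothetical.

```python
from math import prod


def chunks_touched(chunk_shape, start, stop):
    """Number of chunks intersecting the half-open hyperslab [start, stop)."""
    counts = []
    for size, lo, hi in zip(chunk_shape, start, stop):
        first = lo // size
        last = (hi - 1) // size
        counts.append(last - first + 1)
    return prod(counts)


def values_read(chunk_shape, start, stop):
    """Values decompressed, since chunks are always read whole."""
    return chunks_touched(chunk_shape, start, stop) * prod(chunk_shape)


# Hypothetical (t, y, x) array of shape (1000, 4000, 4000).
time_series = ((0, 100, 100), (1000, 101, 101))  # one pixel, all times
one_image = ((0, 0, 0), (1, 4000, 4000))         # all pixels, one time

image_chunks = (1, 4000, 4000)  # one whole 2D image per chunk
cube_chunks = (100, 100, 100)   # compromise between space and time

for name, layout in (("image", image_chunks), ("cube", cube_chunks)):
    print(
        name,
        chunks_touched(layout, *time_series),
        chunks_touched(layout, *one_image),
    )
```

With whole-image chunks the single-pixel time series touches 1000 chunks (16 × 10^9 values) to recover 1000 values; with the cubic layout it touches 10 chunks, at the price of 1600 chunks for a single image.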
http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
Actually, I like it SUPER-Chunky!
• The current AGDC partitions the data into regular spatial extents
• We are extending this concept to temporally aggregate observations into “super-chunk” or “data-block” files with regular spatial and temporal extents
• These multidimensional files would still be indexed using a spatially-enabled relational database (PostGIS)
• All degrees of freedom for storage units need to be configured in the database.
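One way to picture the super-chunk idea is as a function from an observation's position and date to a block index. The sketch below assumes 1-degree spatial extents and one block per calendar year; both parameters, and the function name, are hypothetical stand-ins for whatever would actually be configured in the database.

```python
from datetime import date


def block_key(lon, lat, acquired, tile_degrees=1, years_per_block=1):
    """Map an observation to the (x, y, t) index of its data block.

    tile_degrees and years_per_block are illustrative defaults only.
    """
    x_index = int(lon // tile_degrees)
    y_index = int(lat // tile_degrees)
    t_index = acquired.year // years_per_block
    return (x_index, y_index, t_index)


# Hypothetical time series: one observation a month for three years
# at a single location.
dates = [date(year, month, 15) for year in (2010, 2011, 2012) for month in range(1, 13)]

files_2d = len(dates)  # one 2D tile file per observation
blocks = {block_key(149.1, -35.3, d) for d in dates}

print(files_2d, "file opens with 2D tiles")
print(len(blocks), "file opens with yearly data blocks")
```

The relational index then only needs one row per block rather than one per time slice, and the time series inside each block can be stored contiguously.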
A truly Multidimensional Storage Subsystem
From feeling a bit flat, we’re making the jump to hyperspace…
http://blog.digilentinc.com/wp-content/uploads/2014/08/6D-array.gif
A New Data Model will be Required
It’s only a model…
http://www.dvd.net.au/review.cgi?review_id=2769
Data Model Wishlist – Storage Management
• Storage format agnosticism (not specifically dependent on NetCDF or any other file format)
• Generalised dimensionality (i.e. not hard-coded for x-y-t: could also manage x-y-z-t, x-y-z-λ-t or more)
• Multiple measurements (e.g. spectral bands in EO datasets) stored either as individual variables or as an additional array dimension
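The last two wishlist items can be sketched as a small descriptor in which dimensions are data rather than hard-coded names, and measurements are laid out either as separate variables or along an extra array dimension. The class and field names below are hypothetical; nothing here is taken from an actual AGDC data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str  # e.g. "x", "y", "z", "t" or a wavelength axis
    size: int


@dataclass(frozen=True)
class StorageUnitType:
    name: str
    dimensions: tuple    # ordered Dimension objects, any number of them
    measurements: tuple  # e.g. spectral band names
    bands_as_dimension: bool = False

    def array_shapes(self):
        """Return {variable name: array shape} for this layout."""
        shape = tuple(d.size for d in self.dimensions)
        if self.bands_as_dimension:
            # One variable, with measurements as the leading dimension.
            return {self.name: (len(self.measurements),) + shape}
        # One variable per measurement, all sharing the same shape.
        return {m: shape for m in self.measurements}


tyx = (Dimension("t", 100), Dimension("y", 4000), Dimension("x", 4000))
bands = ("B10", "B20", "B30")

per_band = StorageUnitType("NBAR", tyx, bands)
stacked = StorageUnitType("NBAR", tyx, bands, bands_as_dimension=True)

print(per_band.array_shapes())
print(stacked.array_shapes())
```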
Data Model Wishlist – Metadata Management
• Domain-neutrality (i.e. not Earth Observation specific)
• Hierarchical categorisation with management of inherited metadata attributes and values
• Generalised and extensible specification and management of metadata attributes (very tricky)
Hierarchical Categorisation
• We need to accommodate data across diverse domains which have different categorisation schemes
• We need to map these schemes and provide linkages to controlled vocabularies
• Data will inherit metadata attributes from each parent category
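Metadata inheritance through a category hierarchy can be sketched with a parent pointer and a lookup that walks up the chain. The category names and codes below come from the Landsat example in this deck; the class itself and the attribute values attached to each level are purely illustrative.

```python
from collections import ChainMap


class Category:
    """A node in a category hierarchy that inherits parent attributes."""

    def __init__(self, level, name, code, parent=None, **attributes):
        self.level = level
        self.name = name
        self.code = code
        self.parent = parent
        self.attributes = attributes

    def lineage(self):
        """Return the chain of categories from the root down to this node."""
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return list(reversed(chain))

    def path(self):
        return "/".join(node.code for node in self.lineage())

    def metadata(self):
        """Merge attributes so that the most specific category wins."""
        maps = [node.attributes for node in reversed(self.lineage())]
        return dict(ChainMap(*maps))


# Attribute values are made up for illustration.
rs = Category("Observation Type", "Remote Sensing", "RS", gridded=True)
ls = Category("Mission", "Landsat Series", "LS", parent=rs, platform="satellite")
ls5 = Category("Satellite", "Landsat 5", "LS5", parent=ls)
tm = Category("Sensor", "Thematic Mapper", "TM", parent=ls5, bands=7)

print(tm.path())
print(tm.metadata())
```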
|---Observation Type: Remote Sensing (RS)
|   |---Mission: Landsat Series (LS)
|   |   |---Satellite: Landsat 5 (LS5)
|   |   |   |---Sensor: Thematic Mapper (TM)
|   |   |---Satellite: Landsat 7 (LS7)
|   |       |---Sensor: Enhanced Thematic Mapper (ETM+)
|   |---Band Type: Reflectance (VSWIR)
|   |---Band Type: Thermal (TIR)
|   |---Band Type: Panchromatic (PAN)
|   |---Dataset Type: Level 1 Topographically Corrected (L1T)
|   |---Dataset Type: ARG25 - Surface-Reflectance Corrected (NBAR)
|   |---Dataset Type: Pixel Quality (PQA)
|   |---Dataset Type: Fractional Cover (FC)
|
|---Observation Type: Bathymetry (BATHY)
    |---Survey: Victorian Continental Shelf, 2013 (VCS2013)
    |   |---Vessel: Jolly Roger (JOLROG)
    |   |---Voyage: January 15, 2013 – February 15, 2013 (20130115-20130215)
    |---Instrument Type: Multi-Beam Sonar (MBS)
    |---Instrument Type: Side-Scan Sonar (SSS)
Researcher-Focused Outcomes
• As many ways as possible to access, combine and use data
• Easy implementation of common analysis patterns with custom algorithms
• Delivery of spatio-temporal subsets via OGC-compliant web services
• Ability to leverage HPC and cloud computing resources
• Support for a wide range of gridded data across diverse domains
• Data selection, filtering and sorting on any metadata attributes
• Support for in-situ processing (WPS, WCPS) – bring the compute to the data
Phone: +61 2 6249 9517
Web: www.ga.gov.au
Email: [email protected]
Address: Cnr Jerrabomberra Avenue and Hindmarsh Drive, Symonston ACT 2609
Postal Address: GPO Box 378, Canberra ACT 2601
Thank you!
Any Questions?