Abstract Analysis and Visualization of Hydrologic Data and Observations Catalogs Using the OLAP Data...

Preview:

Citation preview

AbstractAbstract

Analysis and Visualization of Hydrologic Data and Analysis and Visualization of Hydrologic Data and Observations Catalogs Using the OLAP Data Cube TechnologyObservations Catalogs Using the OLAP Data Cube Technology

Ilya ZaslavskyIlya Zaslavskyaa, Matthew Rodriguez, Matthew Rodriguezaa, Bora Beran, Bora Beranbb, David Valentine, David Valentineaa, and Catharine van Ingen, and Catharine van Ingenbb

San Diego Supercomputer Center, UCSD, San Diego, CASan Diego Supercomputer Center, UCSD, San Diego, CAaa; Microsoft Research, San Francisco, CA; Microsoft Research, San Francisco, CAbb

About CUAHSI HISAbout CUAHSI HIS

One of the goals of the CUAHSI Hydrologic Information System project One of the goals of the CUAHSI Hydrologic Information System project is to create a comprehensive portrait of hydrologic observations for the is to create a comprehensive portrait of hydrologic observations for the country, integrating observations data and metadata from multiple country, integrating observations data and metadata from multiple sources, at federal, regional, and local levels. The data are made sources, at federal, regional, and local levels. The data are made available via a uniform set of web service interfaces, called CUAHSI available via a uniform set of web service interfaces, called CUAHSI WaterOneFlow services, and via online mapping applications that WaterOneFlow services, and via online mapping applications that include the Data Access System for Hydrology  (DASH), and include the Data Access System for Hydrology  (DASH), and Hydroseek. To provide rapid access to data summaries, in particular for Hydroseek. To provide rapid access to data summaries, in particular for several nation-wide data repositories including EPA STORET, USGS several nation-wide data repositories including EPA STORET, USGS NWIS, and USDA SNOTEL, we convert the observations data catalogs NWIS, and USDA SNOTEL, we convert the observations data catalogs and the databases with the harvested values, into special and the databases with the harvested values, into special representations that support high performance analysis and representations that support high performance analysis and visualization. visualization.

Construction of OLAP (Online Analytical Processing) cubes, often Construction of OLAP (Online Analytical Processing) cubes, often called data cubes, is an approach to organizing and querying large called data cubes, is an approach to organizing and querying large multi-dimensional data collections. We have applied the OLAP multi-dimensional data collections. We have applied the OLAP techniques, as implemented in Microsoft SQL Server 2005, to the techniques, as implemented in Microsoft SQL Server 2005, to the analysis of the catalogs from several agencies. analysis of the catalogs from several agencies.

The initial results reflect hydrologic data availability from a group of The initial results reflect hydrologic data availability from a group of observations data catalogs, and temporal and spatial drilldown observations data catalogs, and temporal and spatial drilldown summaries for selected datasets. The results demonstrate geography summaries for selected datasets. The results demonstrate geography and history of available data totals from USGS NWIS, EPA STORET, and history of available data totals from USGS NWIS, EPA STORET, and USDA SNOTEL repositories, and spatial and temporal dynamics of and USDA SNOTEL repositories, and spatial and temporal dynamics of available measurements for several key parameters.available measurements for several key parameters.

The Consortium of Universities for the Advancement of Hydrologic The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is an organization representing 120+ Science, Inc. (CUAHSI) is an organization representing 120+ universities in the US and abroad. As part of its mission, CUAHSI universities in the US and abroad. As part of its mission, CUAHSI supports the development of cyberinfrastructure for the hydrologic supports the development of cyberinfrastructure for the hydrologic sciences. The CUAHSI HIS (Hydrologic Information System) project is sciences. The CUAHSI HIS (Hydrologic Information System) project is a multi-year multi-institution effort focused on consistent management a multi-year multi-institution effort focused on consistent management of observations data available from several federal agencies (USGS, of observations data available from several federal agencies (USGS, EPA, USDA, NOAA, etc.) as well as published by individual EPA, USDA, NOAA, etc.) as well as published by individual investigators. investigators.

CUAHSI HIS develops service-oriented architecture for hydrologic CUAHSI HIS develops service-oriented architecture for hydrologic research and education, to enable publication, discovery, retrieval, research and education, to enable publication, discovery, retrieval, analysis and integration of hydrologic data. The project team has analysis and integration of hydrologic data. The project team has defined a common information model for organizing hydrologic defined a common information model for organizing hydrologic observation data, designed a common exchange protocol (Water observation data, designed a common exchange protocol (Water Markup Language) and developed a collection of SOAP web services Markup Language) and developed a collection of SOAP web services (WaterOneFlow services) that provide uniform access to different (WaterOneFlow services) that provide uniform access to different federal, state and local hydrologic data repositories. federal, state and local hydrologic data repositories.

This system is now implemented as a collection of Hydrologic This system is now implemented as a collection of Hydrologic Information Servers deployed at NSF-supported Hydrologic Information Servers deployed at NSF-supported Hydrologic Observatory test beds.Observatory test beds.

EPA STORETEPA STORET

ConclusionConclusionOLAP datacubes allow domain scientists to access and mine huge volumes OLAP datacubes allow domain scientists to access and mine huge volumes of observations data and catalogs simply and efficiently using familiar desktop of observations data and catalogs simply and efficiently using familiar desktop tools such as Excel PivotTables. tools such as Excel PivotTables.

We demonstrate how OLAP datacubes are constructed for several large We demonstrate how OLAP datacubes are constructed for several large observations data catalogs which contain site summary information, including observations data catalogs which contain site summary information, including site’s period of record for each parameter, or series. The datacubes are site’s period of record for each parameter, or series. The datacubes are analyzed spatially and temporally. The the summary maps and diagrams of analyzed spatially and temporally. The the summary maps and diagrams of data availability by various datacube dimensions and filters appear useful to data availability by various datacube dimensions and filters appear useful to understanding the nature of the datsets. Constructing datacubes over data understanding the nature of the datsets. Constructing datacubes over data values as well can automatically produce what are now referred to as unique values as well can automatically produce what are now referred to as unique data products such as daily and yearly statistics (Mean, Minimum, Maximum, data products such as daily and yearly statistics (Mean, Minimum, Maximum, Variance) across one or more sites filtered by quality or other attribute. Variance) across one or more sites filtered by quality or other attribute.

Future work will include construction of domain specific hierarchies (e.g. Future work will include construction of domain specific hierarchies (e.g. using HUC codes), and generation and analysis of watershed-specific cubes using HUC codes), and generation and analysis of watershed-specific cubes that integrate observations from multiple sources. that integrate observations from multiple sources.

Changes over timeChanges over time

Some physical properties by decade: Some physical properties by decade: Available Data TotalAvailable Data Total

USGS NWIS Daily ValuesUSGS NWIS Daily Values

US Map of USGS Daily Values Series US Map of USGS Daily Values Series by 1 degree latitude-longitude binsby 1 degree latitude-longitude bins

CUAHSI HIS Service Oriented Architecture: General OutlineCUAHSI HIS Service Oriented Architecture: General Outline

OLAP Technology for CatalogsOLAP Technology for CatalogsObservations data catalogs assembled within CUAHSI HIS follow standard relational Observations data catalogs assembled within CUAHSI HIS follow standard relational schema of the schema of the Observations Data ModelObservations Data Model (ODM). The ODM catalogs collections of (ODM). The ODM catalogs collections of observation series, each described by:observation series, each described by:

• what: what: observed parameter variable including metadata such as processing observed parameter variable including metadata such as processing methodology, units, quality assessmentsmethodology, units, quality assessments

• where: where: location or site where the observation was made including relevant site location or site where the observation was made including relevant site metadata such as Hydrologic Unit or vegetation classmetadata such as Hydrologic Unit or vegetation class

• when: when: start and end date of measurements and time granularity of measurementstart and end date of measurements and time granularity of measurement

OLAP (Online Analytical Processing) is an approach to rapidly analyze such large multi-OLAP (Online Analytical Processing) is an approach to rapidly analyze such large multi-dimensional databases. OLAPdimensional databases. OLAP Datacubes Datacubes organize data along organize data along dimensionsdimensions, enable , enable drilldown on drilldown on hierarchieshierarchies, and , and aggregatesaggregates such as average or minimum. Using such as average or minimum. Using SQL Server SQL Server Analysis Services, Analysis Services, an OLAP cube can be created from any database with a star schema an OLAP cube can be created from any database with a star schema such as ODM. using such as ODM. using SQL Server Analysis ServicesSQL Server Analysis Services. Simple aggregates (such as daily . Simple aggregates (such as daily averages or yearly maxima) can be pre-computed for improved query performance. averages or yearly maxima) can be pre-computed for improved query performance.

LinksLinksCUAHSI HIS: CUAHSI HIS: http://www.cuahsi.org/his/

HIS Wiki @ SDSC : HIS Wiki @ SDSC : http://river.sdsc.edu/wiki

BWC OLAP user manual: BWC OLAP user manual: http://bwc.berkeley.edu/UserManual/UserManual.htm

Building a Cube from a CUAHSI ODM database: Building a Cube from a CUAHSI ODM database: http://research.microsoft.com/~vaningen/Hydrology/BearRiver/default.htm

Using OLAP DatacubesUsing OLAP Datacubes

Bear River Sample WatershedBear River Sample Watershedstream gage comings and goingsstream gage comings and goings1958-19881958-1988

Daily Discharge Averaged Daily Discharge Averaged by Year, Month, Week by Year, Month, Week

1980-1990 1980-1990 (Top 12 Sites Only)(Top 12 Sites Only)

Annual Mean

MonthlyMean

WeeklyMean

Drilling down Drilling down time hierarchy time hierarchy gives overall gives overall sense of wet sense of wet years and years and flood eventsflood events

Datacube enables simple temporal drilldown. These Datacube enables simple temporal drilldown. These examples are for the ODM sample database from the examples are for the ODM sample database from the Little Bear River region in Utah (2 sites, 12 series, Little Bear River region in Utah (2 sites, 12 series, 593,000 observations). 593,000 observations).

The three graphs to the right plot mean annual, The three graphs to the right plot mean annual, monthly, and weekly discharge at the dozen highest monthly, and weekly discharge at the dozen highest flow gages in the watershed over the years 1980 to flow gages in the watershed over the years 1980 to 1990. The annual graph clearly shows the wet years - 1990. The annual graph clearly shows the wet years - years of highest flow. The monthly and weekly graph years of highest flow. The monthly and weekly graph show differences between years. For example, the show differences between years. For example, the high flows in 1984 occur in the middle of the year as high flows in 1984 occur in the middle of the year as the result of snow melt. The high flows in 1986 begin the result of snow melt. The high flows in 1986 begin in the early part of the year before the usual snow melt in the early part of the year before the usual snow melt continue through the summer season. continue through the summer season.

• Datacubes can be easily browsed with Excel PivotTables and drag-drop over the internet Datacubes can be easily browsed with Excel PivotTables and drag-drop over the internet • MatLab and other rich analysis client access integration comingMatLab and other rich analysis client access integration coming• Browsing can give fast overview insight into data availability and qualityBrowsing can give fast overview insight into data availability and quality

Mean Period of RecordMean Period of Record

Mapping Data AvailabilityMapping Data Availability

Different types of nutrients by decade Different types of nutrients by decade

These are examples from the USGS These are examples from the USGS National Water Information System National Water Information System datacube. The cube contains 1.4 million datacube. The cube contains 1.4 million sites and 12 million series. sites and 12 million series.

To the left are “maps” of data availability, To the left are “maps” of data availability, by one degree cells. Note higher by one degree cells. Note higher volumes of measurements accumulated volumes of measurements accumulated along the coasts, on the east coast along the coasts, on the east coast especially. Stations with the longest especially. Stations with the longest mean period of records are distributed mean period of records are distributed across the country, with peaks in the across the country, with peaks in the central parts.central parts.

To the right is dynamics in data To the right is dynamics in data availability for selected parameters. The availability for selected parameters. The top chart shows changes in top chart shows changes in measurement of nutrients over years: measurement of nutrients over years: note the drop in phosphate and increase note the drop in phosphate and increase in phosphorus and orthophosphate in phosphorus and orthophosphate measurements. The lower chart shows measurements. The lower chart shows that ‘discharge’ dropped as ‘gauge that ‘discharge’ dropped as ‘gauge height’ increased in the last two decadesheight’ increased in the last two decades

To the left, data availability for a To the left, data availability for a parameter, discharge, for a site shows parameter, discharge, for a site shows that data availability is not always that data availability is not always continuous. Sites have disappeared, and continuous. Sites have disappeared, and reappeared.reappeared.

USDA SNOTELUSDA SNOTEL

The SNOTEL datacube The SNOTEL datacube contains 810 sites, 4552 contains 810 sites, 4552 series, and 28 million series, and 28 million observations. The data observations. The data is is collected by a standard set collected by a standard set of variables, and the of variables, and the number of observations number of observations increases over time.increases over time.

The SNOTEL datacube The SNOTEL datacube also contains values, so also contains values, so you can rapidly investigate you can rapidly investigate aggregations over county, aggregations over county, states, and the entire states, and the entire dataset.dataset.

Total Available DataTotal Available Data

The EPA STORET datacube contains 273K sites and 2.7M series. The EPA STORET datacube contains 273K sites and 2.7M series.

Florida is the source of about 25% of the total records. Florida is the source of about 25% of the total records.

Water Quality Data Record Counts by Site ClassificationWater Quality Data Record Counts by Site Classification

Number of years of record Number of years of record by start decadeby start decade

About 60% of the About 60% of the water quality records water quality records are short term are short term measurements (one measurements (one year or less in year or less in duration). The duration). The starting decade of starting decade of the longer series are the longer series are to the right.to the right.

93% of the series are 93% of the series are water quality data. water quality data.

Test bed HISServers

Central HIS servers

ArcGIS

Matlab

IDL

MapWindow

Excel

Programming (Fortran, C, VB)

Desktop clients

Customizable web interface (DASH)

HTML - XML

WS

DL

-SO

AP

Modeling (OpenMI)

Global search (Hydroseek)

WaterOneFlow Web Services, WaterML

HIS LiteServers

External data providers

Deployment to test beds

Other popular online clients

ODM DataLoader

Streaming Data Loading

Ontology tagging (Hydrotagger)

WSDL and ODM registration

Data publishing

ODMTools

Server configtools

Test bed HISServers

Test bed HISServers

Test bed HISServers

Central HIS servers

Central HIS servers

ArcGIS

Matlab

IDL

MapWindow

Excel

Programming (Fortran, C, VB)

Desktop clients

Customizable web interface (DASH)

HTML - XML

WS

DL

-SO

AP

Modeling (OpenMI)

Global search (Hydroseek)

WaterOneFlow Web Services, WaterML

HIS LiteServersHIS LiteServersHIS LiteServers

External data providers

External data providers

Deployment to test beds

Other popular online clients

ODM DataLoader

Streaming Data Loading

Ontology tagging (Hydrotagger)

WSDL and ODM registration

Data publishing

ODMTools

Server configtools

Recommended