A brief history in TimeSeries data at Environment Canada. James Doyle, Project Manager & Christopher Thorne, Geomatics Data Analyst

Environment Canada's Data Management Service


Page 1: Environment Canada's Data Management Service

A brief history in TimeSeries data at Environment Canada

James Doyle, Project Manager

&

Christopher Thorne, Geomatics Data Analyst

Page 2: Environment Canada's Data Management Service

Environment Canada’s Data Management Program (2011 – Present)

Projects:

1. Data Governance and Architecture (Data Stewardship Model & Standards)

2. Data Catalogue (supporting Open Data and Federal Geospatial Platform)

3. Data Access and Sharing

4. Data Consolidation

5. Data Integration

Page 3: Environment Canada's Data Management Service

EC Subject Area Model

Page 4: Environment Canada's Data Management Service

Hunting for a standard - XML Architecture

North American Profile of ISO 19115 (ISO/TS 19139)

Geography Markup Language 3.2 (ISO 19136)

Observations and Measurements 2.0 (ISO 19156)

SWE Common Data Model 2.0

WaterML 2.0 Part 1 - Timeseries

TimeSeriesML

• WMO/NOAA and EC want WaterML 2.0 Part 1 rebranded

• IMD is participating in the OGC TimeSeriesML SWG

Page 5: Environment Canada's Data Management Service

What does the standard look like? The Anatomy of COMP

COMP Logical Data Model

Provides a simple, stable, logical layer used for:

• User interfaces
• Data resource modularization

Common Observation and Measurement Profile

A common XML exchange profile for time series data that is 100% compliant with the OGC international standards:

• wml2: WaterML 2.0 Part 1 – Timeseries
• om: Observations & Measurements
• swe: Sensor Web Enablement Common Data Model
• gml: Geography Markup Language

Page 6: Environment Canada's Data Management Service

What does COMP offer EC and its partners?

XML Data Exchange

A common XML exchange profile for time series data that is 100% compliant with OGC international standards – no local extensions

COMP Viewer

When you open an online COMP XML file in your browser, the Viewer tracks down all the external references and presents you with a complete picture of the metadata and data as an HTML report in the official language of your choice, with outlining for easy navigation

COMP Data Point Utilities

To extract data values into tabular formats for consumption by your analytical software

Value Added Tools

• GIS Mapping
• Data Visualization
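The data point extraction idea on this slide (pulling time/value pairs out of a COMP file into a table) can be sketched in a few lines. This is only an illustration, not the actual COMP Data Point Utilities: the sample document is a hypothetical, heavily trimmed WaterML 2.0 fragment, and the function names `extract_points` and `to_csv` are made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace URI published by OGC for WaterML 2.0 Part 1.
WML2 = "http://www.opengis.net/waterml/2.0"
NS = {"wml2": WML2}

# Hypothetical, trimmed fragment; a real COMP file carries far more metadata.
SAMPLE = """<wml2:MeasurementTimeseries xmlns:wml2="http://www.opengis.net/waterml/2.0">
  <wml2:point>
    <wml2:MeasurementTVP>
      <wml2:time>2010-01-01T00:00:00Z</wml2:time>
      <wml2:value>1.25</wml2:value>
    </wml2:MeasurementTVP>
  </wml2:point>
  <wml2:point>
    <wml2:MeasurementTVP>
      <wml2:time>2010-01-02T00:00:00Z</wml2:time>
      <wml2:value>1.5</wml2:value>
    </wml2:MeasurementTVP>
  </wml2:point>
</wml2:MeasurementTimeseries>"""


def extract_points(xml_text):
    """Return a list of (time, value) tuples from every MeasurementTVP element."""
    root = ET.fromstring(xml_text)
    rows = []
    for tvp in root.iter(f"{{{WML2}}}MeasurementTVP"):
        time = tvp.findtext("wml2:time", namespaces=NS)
        value = tvp.findtext("wml2:value", namespaces=NS)
        rows.append((time, float(value)))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["time", "value"])
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(extract_points(SAMPLE))` yields a small table that spreadsheet or statistics software can read directly.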

Page 7: Environment Canada's Data Management Service

Example of COMP Use of SKOS Taxonomies

Simple Knowledge Organization System

COMP XML

This XML fragment references a name and a unit of measure in SKOS taxonomies

SKOS Taxonomies

Define these terms in English and French (en-CA, fr-CA)

COMP Viewer

Looks up these SKOS references and resolves them in English and French

Taxonomies: EC ISO-NAP, NatChem (Air Quality), Substance, Unit of Measure, WaterML2, Species, Bio-organism, Water Quality, Water Quantity, Meteorology, Ice Service, Wildlife Service, ?

Page 8: Environment Canada's Data Management Service

COMP in Action

http://www.ec.gc.ca/data_donnees/comp

COMP XML File

1. Browser uses COMP XSLT

2. When the user clicks on the file, it asks the browser to render the XML using the COMP Viewer instead of its own default XSLT script (see 2nd line of syntax)

3. Selecting a download option invokes the Download Service

4. The Download Service pipes the output back to the browser, invoking its standard file download facilities

Components: COMP Files, SKOS Taxonomies, COMP Viewer XSLT, Download Service, Data Point Extraction Scripts
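The "2nd line of syntax" mentioned on this slide is presumably an `xml-stylesheet` processing instruction, which is the standard way for an XML file to tell a browser which XSLT to apply. A minimal sketch of reading that instruction follows; the sample document and the `comp-viewer.xsl` file name are hypothetical placeholders, not the real COMP Viewer location.

```python
import re
from xml.dom import minidom

# Hypothetical document: the stylesheet href and the root element are placeholders.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="comp-viewer.xsl"?>\n'
    "<root/>"
)


def stylesheet_href(xml_text):
    """Return the href of the first xml-stylesheet instruction, or None."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-stylesheet":
            # The instruction body is plain text with attribute-like pairs.
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                return match.group(1)
    return None
```

A file without the instruction returns `None`, in which case a browser would fall back to its own default rendering of the XML.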

Page 9: Environment Canada's Data Management Service

Setting up Pilot Project

Which EC monitoring program will be our guinea pig?

Weather Monitoring

Water Quality & Availability Monitoring

Air Quality Monitoring

Emissions Sources (Air, Water, Land)

Species & Habitat Monitoring

…etc.

Pick me

Pick me!

Page 10: Environment Canada's Data Management Service

Selecting Program Observations (Input Dataset)

Data Input:

• The National Atmospheric Chemistry Database (NatChem)
• NARSTO Quality Science Center of the U.S. Oak Ridge Laboratory
• Accessory COMP-specific XLS data entry templates, for data not found or not easily accessible within the source data

Output Data:

• OGC WaterML 2.0 - Timeseries (XML) observations data linking to reference master data
• Reference data: monitoring sites, instrument procedures, parameters (data types), bilingual terms & lookup lists

Page 11: Environment Canada's Data Management Service

Who is going to migrate the data?

“No problem, Chris will do it!”

(Correction: Chris + FME)

Page 12: Environment Canada's Data Management Service

What does the Input Data Look Like?

NatChem holds 100s of these NARSTO files, organized by study or monitoring network across Canada (100s of sites)

~35 years of data at each location/region

~500 instrument and sampling measurement procedures

Time series data can be logged in days, hours, or minutes

NARSTO files are plain text (TXT/CSV)

With some (incomplete) accessory program reference data (CSV, XLS)

All stored on a file share drive.

Page 13: Environment Canada's Data Management Service

Input file – header info*

[Figure: annotated NARSTO file header. Callouts: File Begins; File Description/Name; File Abstract / Versioning Info; Contacts; NARSTO Variables; n…]

Page 14: Environment Canada's Data Management Service

Input file – Monitoring site information

[Figure: annotated NARSTO site table. Callouts: Table Info; Table Schema/Metadata (uom); Site Location(s); n…]

Page 15: Environment Canada's Data Management Service

Input file – Observation data & metadata

[Figure: annotated NARSTO observation table. Callouts: Observation Table Name & Notes; Table Schema/Metadata; Column Metadata (instrument/sampling procedures); Time Series by Site Observations (data point records)]

Page 16: Environment Canada's Data Management Service

Project Planning

ETL

Data Request: by time, by location, by substance…

Web Services (controlled, user-driven quality data products)

(centralization & cleaning within DB)

Master Data Recast

(conversion & migration transactions)

Reporting: Data Profiling, QA/QC, Internal Business Needs, Data Process Logs

Resource Intensity (time & resource)

Quality of Data

Page 17: Environment Canada's Data Management Service

Task Breakdown

1. Parse NARSTO-formatted CSV sources and load them into an MS SQL database.

2. Reference Data

i. Develop data profiling & reporting methods to QA/QC the reference data between submitted observation files.

ii. Centralize program master reference data: bilingual definitions, contacts, sites, variables (procedures), and observations.

iii. Map reference data to OGC WaterML 2.0: conversion, storage, and publishing processes.

3. Time Series Data

i. Create a physical data model within MSSQL for storage and also for the TimeSeries XML output.

ii. Join/link reference data to 34 years of observations (semantic web relationships).

iii. Produce, validate & publish to the online COMP Viewer.
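The profiling step in task 2 (checking reference data for consistency between submitted files) might look something like the sketch below. It is an assumption-laden illustration: the project used MS SQL, while this uses Python's built-in `sqlite3` as a stand-in, and the `site` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3


def find_conflicting_sites(conn):
    """Return (site_id, variant_count) for sites whose coordinates differ between files."""
    query = """
        SELECT site_id,
               COUNT(DISTINCT latitude || ',' || longitude) AS variants
        FROM site
        GROUP BY site_id
        HAVING COUNT(DISTINCT latitude || ',' || longitude) > 1
        ORDER BY site_id
    """
    return conn.execute(query).fetchall()


# Hypothetical staging table: one row per site as reported by each source file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site (source_file TEXT, site_id TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO site VALUES (?, ?, ?, ?)",
    [
        ("file_a.csv", "S001", 45.40, -75.70),
        ("file_b.csv", "S001", 45.40, -75.70),
        ("file_a.csv", "S002", 49.25, -123.10),
        ("file_b.csv", "S002", 49.30, -123.10),  # disagrees with file_a
    ],
)
```

Here `find_conflicting_sites(conn)` flags only `S002`, the kind of finding that would go into a QA/QC report for the program to resolve.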

Page 18: Environment Canada's Data Management Service

Data Publication System Architecture

ETL

Data Request

COMP Viewer & Conversion

Web Services (controlled, user-driven quality data products)

(prepping & cleaning within DB)

Master Data Recast

(conversion & migration transactions)

Reporting: Data Profiling, QA/QC, Internal Business Needs, Data Process Logs

COMP WaterML 2.0 XML

Data Sources …n

XLS COMP Templates (data entry) + SME (NatChem)

Page 19: Environment Canada's Data Management Service

NARSTO Parser using FME

Reader: TEXTLINE (line by line)

Transformers: StringSearcher, StringReplacer, AttributeSplitter, ListExploder, ListSearcher, AttributeTrimmer, AttributeRemover, NARSTOFileMetadata (custom)

Writer: MSSQL tables -> file header, observation, site, and lookup tables of NARSTO information

The FME workbench was HUGE!

Mostly due to the complexity of the NARSTO custom structure.

Lists were my friend.

Able to perform batch imports on folders!

Once run, able to query and validate across files within MSSQL.
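For readers without FME, the line-by-line parsing approach can be approximated in plain Python. The sketch below is not the FME workspace and does not implement the NARSTO specification: the marker names (`*TABLE NAME`, `*TABLE COLUMN NAME`, `*TABLE BEGINS`, `*TABLE ENDS`) and the sample content are simplified assumptions chosen for illustration.

```python
import csv

# Hypothetical, simplified sectioned file; not a real NARSTO layout.
SAMPLE = """*FILE NAME,example_file
*TABLE NAME,sites
*TABLE COLUMN NAME,site_id,latitude
*TABLE BEGINS
S001,45.4
S002,49.25
*TABLE ENDS
"""


def parse_sectioned_text(lines):
    """Split marker-driven text into header metadata and named tables of row dicts."""
    header = {}
    tables = {}
    current = None   # name of the table currently being read
    columns = []     # column names declared for the current table
    in_table = False
    for fields in csv.reader(lines):
        if not fields:
            continue
        key = fields[0].strip()
        if key == "*TABLE NAME":
            current = fields[1].strip()
            tables[current] = []
        elif key == "*TABLE COLUMN NAME":
            columns = [f.strip() for f in fields[1:]]
        elif key == "*TABLE BEGINS":
            in_table = True
        elif key == "*TABLE ENDS":
            in_table = False
        elif in_table:
            values = [f.strip() for f in fields]
            tables[current].append(dict(zip(columns, values)))
        elif key.startswith("*"):
            # Any other marker line is kept as file-level metadata.
            header[key.lstrip("*")] = [f.strip() for f in fields[1:]]
    return header, tables
```

Running `parse_sectioned_text(SAMPLE.splitlines())` returns the metadata and one `sites` table; looping the same function over a folder would mimic the batch import, with each table then written to the database.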

Page 20: Environment Canada's Data Management Service

Database System View

START

1. NARSTO Files (CSV)
2. Query data content across imported TXT files
3. Create tables: sites & observations
4. Data Consolidation & Assessment
5a. Clean Reference Values
5b. Upload Reference COMP Templates (terms, contacts)
6a. Join Reference Values
6b. Join Reference Files
7. Store XML
8. COMP Viewer

A. Custom File Parser & Batch File Importer
B. List Values Parsing / Table Schema Extraction
C. Create QA/QC Tables (reports)
D. Data Value Consolidation & Assessment
E. Reference Data Creation
F. Build & Map XML
G. Publish XML to Website

Create tables: sites, file header, variable name, lookup tables, & observations

XML

FINISH

Page 21: Environment Canada's Data Management Service

Data Quality Feedback Loop

ETL

Data Request

COMP Viewer (Internal)

Web Services (controlled, user-driven quality data products)

(prepping & cleaning within DB)

Master Data Recast

(conversion & migration transactions)

Reporting: Data Profiling, QA/QC, Internal Business Needs

COMP WaterML 2.0 XML

Data Sources …n

XLS COMP Templates (data entry) + SME

Program - QA/QC

Data Quality Improvement process feedback loop…

Page 22: Environment Canada's Data Management Service

Remember This?

COMP Logical Model (WaterML 2.0)

Page 23: Environment Canada's Data Management Service

Mapping tables to WaterML and storing the result.

n….
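Mapping table rows onto WaterML elements can be illustrated with a small sketch. The project did this mapping in FME; the Python below is only a stand-in, the function name `build_timeseries` is hypothetical, and the output is a bare skeleton rather than a schema-valid COMP document (real files need identifiers, metadata, and the surrounding observation elements).

```python
import xml.etree.ElementTree as ET

# Namespace URI published by OGC for WaterML 2.0 Part 1.
WML2 = "http://www.opengis.net/waterml/2.0"
ET.register_namespace("wml2", WML2)


def build_timeseries(rows):
    """Turn (time, value) rows into a skeletal wml2:MeasurementTimeseries string."""
    series = ET.Element(f"{{{WML2}}}MeasurementTimeseries")
    for time, value in rows:
        point = ET.SubElement(series, f"{{{WML2}}}point")
        tvp = ET.SubElement(point, f"{{{WML2}}}MeasurementTVP")
        ET.SubElement(tvp, f"{{{WML2}}}time").text = time
        ET.SubElement(tvp, f"{{{WML2}}}value").text = str(value)
    return ET.tostring(series, encoding="unicode")
```

Each database row becomes one `wml2:point`, so the resulting XML text could be stored back in a table column and published as is.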

Page 24: Environment Canada's Data Management Service

Semantic Web Data Uniform Resource Identifiers (URIs):

<om:name
    xlink:href="../def/natchem/1-0/natchem-skos.rdf#ObservationType"
    xlink:title="Category Parameter"
    owns="false"
    xlink:type="simple"
/>

Links to semantic values:

</skos:Concept>
<skos:Concept rdf:about="http://intranet.ec.gc.ca/donnees-data/comp/def/natchem/1-0/natchem-skos.rdf#ObservationType">
    <skos:prefLabel xml:lang="en-CA">Observation type</skos:prefLabel>
    <skos:prefLabel xml:lang="fr-CA">Type d'observation</skos:prefLabel>
    <skos:inScheme rdf:resource="http://intranet.ec.gc.ca/donnees-data/comp/def/natchem/1-0/natchem-skos.rdf" />
</skos:Concept>
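Resolving a SKOS reference into a label in the reader's language, as the COMP Viewer is described as doing, can be sketched as follows. This is not the Viewer's XSLT: the `rdf:RDF` wrapper around the concept is an assumption about the taxonomy file, and `pref_label` is a hypothetical helper name. The concept itself is the one shown on the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

CONCEPT_URI = (
    "http://intranet.ec.gc.ca/donnees-data/comp/def/natchem/1-0/"
    "natchem-skos.rdf#ObservationType"
)

# The rdf:RDF wrapper is assumed; the concept content is taken from the slide.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{NS['rdf']}" xmlns:skos="{NS['skos']}">
  <skos:Concept rdf:about="{CONCEPT_URI}">
    <skos:prefLabel xml:lang="en-CA">Observation type</skos:prefLabel>
    <skos:prefLabel xml:lang="fr-CA">Type d'observation</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def pref_label(rdf_text, concept_uri, lang):
    """Return the preferred label of a concept in the given language, or None."""
    root = ET.fromstring(rdf_text)
    for concept in root.findall("skos:Concept", NS):
        if concept.get(RDF_ABOUT) != concept_uri:
            continue
        for label in concept.findall("skos:prefLabel", NS):
            if label.get(XML_LANG) == lang:
                return label.text
    return None
```

Asking for `"en-CA"` or `"fr-CA"` returns the matching label; any other language tag returns `None`, leaving the fallback policy to the caller.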

Page 25: Environment Canada's Data Management Service

Unexpected Challenges: Converting Tabular Values to Semantic Web Data

Due to the source data complexity and huge volumes of descriptive reference data, the transformations required:

Lots of StringSearchers and StringReplacers to swap tabular values for their URI reference locations on the web.

Lots of FeatureMergers (>100) due to source data complexity.

Semantic web values mean dealing with relative vs. absolute URI paths.

Where do all these values go within WaterML 2.0 logical components? XMLTemplater was a big help!

Across many workbenches (~20 .fmw files).

Overall, lots of time and effort went into reworking the data and transformations, and into facilitation with the program to ensure quality: about 6 months of effort.
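The relative vs. absolute URI issue can be made concrete with the href from the earlier slide. In this sketch the base location of the data file (`.../comp/data/example.xml`) is a hypothetical assumption; only the relative href and the resulting SKOS address come from the slides.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical location of a COMP data file; the directory name is an assumption.
BASE = "http://intranet.ec.gc.ca/donnees-data/comp/data/example.xml"
# Relative reference as it appears in the om:name element on the earlier slide.
HREF = "../def/natchem/1-0/natchem-skos.rdf#ObservationType"


def resolve(base, href):
    """Resolve href against base; return (absolute URI, document URI, fragment)."""
    absolute = urljoin(base, href)
    document, fragment = urldefrag(absolute)
    return absolute, document, fragment
```

With these inputs the relative href resolves to the same absolute address used in the `rdf:about` attribute of the SKOS concept, which is the match a viewer needs in order to look the term up. Moving the data file to a different directory depth would break that match, which is one reason relative paths need care.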

Page 26: Environment Canada's Data Management Service

Benefits of Using FME

The FME Workspace transformation diagram helps communicate required areas of improvement back to data owners.

Similar to a data model diagram, it can demonstrate data transformation complexities and issues.

Once workbenches are set up, programs can run them as new or updated data comes in.

Improved overall data quality management and reporting.

Supports all data consumers' needs for air quality data, now and in the future.

Page 27: Environment Canada's Data Management Service

Next Steps…

API: WFS Service

Query

Response: COMP XML Payload

Audience: EC, GOC, International

Built-in Functionality: COMP Viewer, Data Point Downloads

Data Warehouse

Query Dimensions: Temporal extent, Spatial extent, Sites, Variables, Techniques

XML-Relational Hybrid: Indexed SQL Tables pointing to XML CLOBs

Query-specific collections of COMP components are assembled on the fly for the API

FME Server

Page 28: Environment Canada's Data Management Service

Thank You!

Questions?

For more information:

James Doyle - [email protected]

Christopher Thorne – [email protected]