Geospatial ETL with Stetl


DESCRIPTION

Stetl, Streaming ETL, is a toolkit for the transformation (ETL) of geospatial data. Stetl builds on existing ETL tools such as GDAL/OGR and XSLT. Stetl processing is driven from a configuration (.ini) file. Stetl is written in Python and is particularly well suited to processing GML. Several INSPIRE transformations have been successfully performed with Stetl. This is an introductory presentation given at the OSGeo Bolsena Codesprint on June 4, 2013. Find more info, downloads and documentation on Stetl at http://stetl.org


Geospatial ETL with Stetl: “Taming Your Rich GML”

Just van den Broecke
OSGeo Bolsena Codesprint 2013, Bolsena, Italy
June 4, 2013
www.justobjects.nl

About Me

Independent Open Source Geospatial Professional
Secretary, OSGeo Dutch Local Chapter
Member of the Dutch OpenGeoGroep

Just van den Broecke
just@justobjects.nl
www.justobjects.nl

OSGeo - Bolsena - 2010

BOLSENA 2012

ALLES VORBEI? (All over?)

We have a Problem

The Rich GML Problem

Rich GML = Complex Mess

INSPIRE
Dutch National Datasets
AFIS-ALKIS-ATKIS

“Semi GML” e.g. Dutch Addresses & Buildings (BAG)

The Streetname!

Application Schema GML e.g. INSPIRE Addresses

Complex Model

Transformations

100+ MB GML Files

Millions of Objects

10s of Millions of <Elements>

Multiple Transformation Steps

Solution is Spatial ETL

A.K.A.

Thank You for your

Attention!

But what about... FOSS? ... Stetl?

FOSS ETL - Lower Level

Each Powerful by Itself

ogr2ogr

FOSS ETL - High Level

FOSS ETL - DIY ? (No!)

FOSS ETL - How to Combine?

[Figure: ogr2ogr + (logo) + (logo) = ?]

Example - 2011 INSPIRE-FOSS

http://inspire.kademo.nl/doc/design-etl.html

Good ideas, but hard to scale and reuse. Need a framework.

FOSS ETL - Add Python to Equation

[Figure: ( ogr2ogr + (logo) + (logo) ) = ?]

[Figure: ( ogr2ogr + (logo) + (logo) ) = Stetl]

Stetl = Simple, Streaming, Spatial, Speedy ETL

Process Chain

[Diagram: gml passing through Input → Filter → Filter → Output]

Stetl concepts

Speed: Streaming

[Diagram: gml streaming through Input → Filter → Output]
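The streaming idea can be illustrated with a minimal sketch that uses only the Python standard library. It is not Stetl's own reader; the function name `stream_elements`, the helper `city_names` and the sample data are assumptions made for illustration. Each `featureMember` element is handed to the caller as soon as its closing tag is parsed and is cleared afterwards, so a large GML file need not be held in memory in full.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, local_name):
    """Yield each element whose local tag name matches, then release it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}local"; compare the local part only.
        if elem.tag.rsplit("}", 1)[-1] == local_name:
            yield elem
            # Drop the subtree once the consumer is done with it.
            elem.clear()


# Hypothetical sample input, standing in for a 100+ MB GML file.
SAMPLE = b"""<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember><city><name>Amsterdam</name></city></gml:featureMember>
  <gml:featureMember><city><name>Bolsena</name></city></gml:featureMember>
</gml:FeatureCollection>"""


def city_names(data):
    """Collect the city name from every streamed featureMember."""
    members = stream_elements(io.BytesIO(data), "featureMember")
    return [member.findtext("city/name") for member in members]


names = city_names(SAMPLE)
```

The generator only resumes, and clears the element, after the consumer has finished with it, which is why the name can still be read inside the list comprehension.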

Speed: Going Native

[Diagram: Input → Filter → Output (gml); sETL components make calls to native C libs/progs such as ogr2ogr]
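"Going native" can be sketched as a Python component that delegates the heavy lifting to an external program. The snippet below is an illustration only, not Stetl's implementation: the helper names, the file path and the connection string are hypothetical. It builds an ogr2ogr command line using the documented `-f` (output format) and `-append` options and runs it only if the executable is installed.

```python
import shutil
import subprocess


def build_ogr2ogr_command(gml_path, pg_conn, extra_args=()):
    """Return the argument list for loading a GML file into PostGIS."""
    # ogr2ogr expects: [options] <destination> <source>
    return ["ogr2ogr", "-f", "PostgreSQL", "PG:" + pg_conn, gml_path, *extra_args]


def run_native(cmd):
    """Run the command if its executable is on PATH; return exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


# Hypothetical input file and connection parameters.
cmd = build_ogr2ogr_command(
    "input/top10nl.gml", "dbname=gis host=localhost", ["-append"]
)
```

Calling `run_native(cmd)` would hand the actual parsing and loading to the native GDAL/OGR code, while Python only orchestrates.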

Example: GML to PostGIS

[Diagram: gml → XML Reader → Splitter → ogr2ogr]

Example: INSPIRE Model Transform

[Diagram: ogr2ogr → XSLT → Writer (gml)]

Example: deegree Store

[Diagram: ogr2ogr → XSLT → deegree Writer]

Process Chain - How?

[Diagram: Input → Filters → Output]

Example: XML to Shape

The Source

The XSLT Script

XSLT Transform to GML

[Diagram: XMLInput → XSLTFilter → ogr2ogrOutput]

The SETL Chain Config File

[Diagram: Process Chain: Reader → XSLT → ogr2ogr]

Example: XsltFilter Python

    from util import Util, etree
    from filter import Filter
    from packet import FORMAT

    log = Util.get_log("xsltfilter")

    class XsltFilter(Filter):
        # Constructor
        def __init__(self, configdict, section):
            Filter.__init__(self, configdict, section, consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

            self.xslt_file_path = self.cfg.get('script')
            self.xslt_file = open(self.xslt_file_path, 'r')
            # Parse XSLT file only once
            self.xslt_doc = etree.parse(self.xslt_file)
            self.xslt_obj = etree.XSLT(self.xslt_doc)
            self.xslt_file.close()

        def invoke(self, packet):
            if packet.data is None:
                return packet
            return self.transform(packet)

        def transform(self, packet):
            packet.data = self.xslt_obj(packet.data)
            log.info("XSLT Transform OK")
            return packet

Example Components

    Input        Filters            Output
    ----------   ----------------   ----------
    XMLFile      XSLT               GMLFile
    ogr2gml      GMLSplitter        gml2ogr
    LineStream   XMLValidator       WFS-T
    deegree*     FeatureExtractor   deegree*
    YourInput    YourFilter         YourOutput

Your Own Components

Step 1 - Define Class

    # Imports and logger as in the XsltFilter example
    from util import Util
    from filter import Filter
    from packet import FORMAT

    log = Util.get_log("myfilter")

    class MyFilter(Filter):
        # Constructor
        def __init__(self, configdict, section):
            Filter.__init__(self, configdict, section, consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

        def invoke(self, packet):
            log.info("CALLING MyFilter OK!!!!")
            return packet

Step 2 - Config Class

    [etl]
    chains = input_xml_file|my_filter|output_std

    [input_xml_file]
    class = inputs.fileinput.XmlFileInput
    file_path = input/cities.xml

    # My custom component
    [my_filter]
    class = my.myfilter.MyFilter

    [output_std]
    class = outputs.standardoutput.StandardXmlOutput
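How such a config file might be turned into a chain of components can be sketched with the standard `configparser` module. This is an assumption about the general approach, not Stetl's actual loader; `read_chains` is a hypothetical helper. It splits the `chains` value on `,` (separate chains) and `|` (components within a chain) and looks up the `class` of each named section.

```python
import configparser

CONFIG_TEXT = """
[etl]
chains = input_xml_file|my_filter|output_std

[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml

# My custom component
[my_filter]
class = my.myfilter.MyFilter

[output_std]
class = outputs.standardoutput.StandardXmlOutput
"""


def read_chains(config_text):
    """Return a list of chains; each chain is a list of (section, class_path)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    chains = []
    for chain_str in parser.get("etl", "chains").split(","):
        sections = [name.strip() for name in chain_str.split("|")]
        chains.append([(s, parser.get(s, "class")) for s in sections])
    return chains


chains = read_chains(CONFIG_TEXT)
```

A real loader would go one step further and import each class path (for example with `importlib`) before calling its constructor with the config and the section name.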

Data Structures

✴ Components exchange Packets
✴ Packet contains data and status
✴ Data formats:
    xml_line_stream
    etree_doc
    etree_feature_array
    xml_doc_as_string
    any
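The packet idea can be sketched in a few lines of plain Python. The classes below are hypothetical stand-ins, not Stetl's own `Packet` or component classes, and the `end_of_stream` flag is an assumed example of the "status" a packet might carry. Each component exposes `invoke(packet)` and returns the (possibly modified) packet for the next component.

```python
class Packet:
    """Carries data plus a status flag between components."""

    def __init__(self, data=None):
        self.data = data
        self.end_of_stream = False  # assumed status flag, for illustration


class UpperCaseFilter:
    """Hypothetical filter: upper-cases string data, passes empty packets on."""

    def invoke(self, packet):
        if packet.data is None:
            return packet
        packet.data = packet.data.upper()
        return packet


class ListOutput:
    """Hypothetical output: collects data and notes when the stream ends."""

    def __init__(self):
        self.items = []
        self.closed = False

    def invoke(self, packet):
        if packet.data is not None:
            self.items.append(packet.data)
        if packet.end_of_stream:
            self.closed = True
        return packet


def run_chain(records, components):
    """Push one packet per record through every component, in order."""
    last = len(records) - 1
    for i, record in enumerate(records):
        packet = Packet(record)
        packet.end_of_stream = (i == last)
        for component in components:
            packet = component.invoke(packet)


sink = ListOutput()
run_chain(["amsterdam", "bolsena"], [UpperCaseFilter(), sink])
```

Because every component shares the same `invoke` signature, filters can be added, removed or reordered without touching the others.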

deegree Integration

✴ Input
    DeegreeBlobstoreInput
✴ Output
    DeegreeBlobstoreOutput
    DeegreeFSLoaderOutput
    WFSTOutput

Cases

✴ INSPIRE Download Services
    publish to deegree store (WFS)
    GML files (for Atom Feed)
✴ National GML Datasets
    GML to PostGIS (Top10NL, BGT)

Top10NL Extract

    [etl]
    chains = input_sql_pre|schema_name_filter|output_postgres,
             input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr,
             input_sql_post|schema_name_filter|output_postgres

    # Pre SQL file inputs to be executed
    [input_sql_pre]
    class = inputs.fileinput.StringFileInput
    file_path = sql/drop-tables.sql,sql/create-schema.sql

    # Post SQL file inputs to be executed
    [input_sql_post]
    class = inputs.fileinput.StringFileInput
    file_path = sql/delete-duplicates.sql

    # Generic filter to substitute Python-format string values like {schema} in string
    [schema_name_filter]
    class = filters.stringfilter.StringSubstitutionFilter
    # format args {schema} is schema name
    format_args = schema:{schema}

    [output_postgres]
    class = outputs.dboutput.PostgresDbOutput
    database = {database}
    host = {host}
    port = {port}
    user = {user}
    password = {password}
    schema = {schema}

    # The source input file(s) from dir and produce gml:featureMember elements
    [input_big_gml_files]
    class = inputs.fileinput.XmlElementStreamerFileInput
    file_path = {gml_files}
    element_tags = featureMember
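The `{schema}` placeholders in this config rely on Python-format string substitution. A minimal sketch of that mechanism follows; it is not the `StringSubstitutionFilter` source. The helper names, the sample SQL and the space-separated `name:value` layout for multiple arguments are assumptions for illustration.

```python
def parse_format_args(spec):
    """Turn 'schema:top10nl' (assumed 'name:value' pairs) into a dict."""
    return dict(pair.split(":", 1) for pair in spec.split())


def substitute(text, format_args):
    """Replace Python-format placeholders such as {schema} in the text."""
    return text.format(**format_args)


# Hypothetical SQL, standing in for the contents of sql/create-schema.sql.
sql = "DROP SCHEMA IF EXISTS {schema} CASCADE; CREATE SCHEMA {schema};"
args = parse_format_args("schema:top10nl")
result = substitute(sql, args)
```

One caveat of `str.format` is that literal braces in the input must be doubled (`{{` and `}}`), which matters if the SQL itself contains braces.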

Case: INSPIRE DL Services - Dutch Addresses

[Diagram labels: Source <GML> (Dutch Addresses + Buildings); NLExtract; Stetl; deegree blobstore; deegree WFS; INSPIRE <GML> (INSPIRE Addresses); Atom Feed]
