Taming Rich GML with Stetl
A Lightweight Python Framework for Geospatial ETL
Just van den Broecke
FOSS4G Nottingham 2013
Sept 21, 2013
www.justobjects.nl
1
About Me
Independent Open Source Geospatial Professional
Secretary, OSGeo Dutch Local Chapter
Member of the Dutch OpenGeoGroep
Just van den Broecke
[email protected]
www.justobjects.nl
2
We have a Problem
3
The Rich GML Problem
4
Rich GML = Complex Mess
5
INSPIRE
Dutch National Datasets
Germany: AFIS-ALKIS-ATKIS
UK: OS MasterMap
6
“Semi GML” e.g. Dutch Addresses & Buildings (BAG)
Arbitrary Nesting
7
The Street Name!
A Street Element in an INSPIRE Annex I Address..
8
Complex Model Transformations
9
100+ MB GML Files
10
11
Millions of Objects
12
10s of Millions of <Elements>
13
Multiple Transformation Steps
14
Solution is Spatial ETL
15
But How?
16
FOSS ETL - DIY? Maybe
17
FOSS ETL - High Level
18
FOSS ETL - Lower Level
Each powerful individually but cannot do the entire ETL
ogr2ogr
19
FOSS ETL - How to Combine?
[Figure: tool logos + ogr2ogr = ?]
20
Example - 2011 INSPIRE-FOSS
http://inspire.kademo.nl/doc/design-etl.html
Good ideas, but hard to scale and reuse. Need a framework.
21
FOSS ETL - Add Python to Equation
[Figure: Python ( tool logos + ogr2ogr ) = ?]
22
[Figure: Python ( tool logos + ogr2ogr ) = Stetl]
23
Stetl =
Simple, Streaming, Spatial, Speedy
ETL
24
GML1
GML2
Stetl
From Barrels of GML to Maps
25
26
Stetl Concepts
27
Process Chain
Stetl concepts
Source → Input → Filter → Filter → Output → Target
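The chain idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `Packet`, `ListInput`, `UpperFilter`, `SuffixFilter`, `CollectOutput` and `run_chain` are hypothetical stand-ins, not Stetl's actual classes. An input produces packets, each filter transforms them in turn, and an output consumes them.

```python
class Packet:
    """Hypothetical container passed between components."""
    def __init__(self, data=None):
        self.data = data


class ListInput:
    """Illustrative input: yields one packet per list item."""
    def __init__(self, items):
        self.items = items

    def read(self):
        for item in self.items:
            yield Packet(item)


class UpperFilter:
    """Illustrative filter: upper-cases the packet data."""
    def invoke(self, packet):
        packet.data = packet.data.upper()
        return packet


class SuffixFilter:
    """Illustrative filter: appends a fixed suffix to the packet data."""
    def __init__(self, suffix):
        self.suffix = suffix

    def invoke(self, packet):
        packet.data = packet.data + self.suffix
        return packet


class CollectOutput:
    """Illustrative output: collects packet data in a list."""
    def __init__(self):
        self.written = []

    def write(self, packet):
        self.written.append(packet.data)
        return packet


def run_chain(source, filters, target):
    """Push every packet from the input through all filters to the output."""
    for packet in source.read():
        for component in filters:
            packet = component.invoke(packet)
        target.write(packet)


out = CollectOutput()
run_chain(ListInput(["gml1", "gml2"]), [UpperFilter(), SuffixFilter(".xml")], out)
print(out.written)
```

Each component only sees the packet handed to it, which is what allows filters to be swapped or reordered without touching the input or output.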
28
Process Chain
Input → Filter → Filter → Output
gml
29
Example: GML to PostGIS
gml → Reader → ogr2ogr
30
Example: INSPIRE Model Transform
ogr2ogr → XSLT → Writer → gml
Simple Features → Complex Features
31
Example: deegree Store
ogr2ogr → XSLT → deegree Writer
Or via WFS-T
32
Process Chain - How?
Input Filters Output
33
Example: XML to Shape
XMLInput
XSLTFilter
ogr2ogrOutput
34
Example: XML to Shape
The Source
35
Example: XML to Shape
XMLInput
36
Example: XML to Shape
XMLInput
XSLTFilter
37
Example: XML to Shape
Prepare XSLT Script
38
Example: XML to Shape
XSLT GML Output
39
Example: XML to Shape
XMLInput
XSLTFilter
ogr2ogrOutput
40
Example: XML to Shape
The Stetl Config File
Process Chain
XMLInput
XSLTFilter
ogr2ogrOutput
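The config file on this slide is not reproduced in the text, so the following is only a sketch of how such a file might be read. The `[etl] chains = a|b|c` layout and the `XmlFileInput` class path appear elsewhere in this talk. The section names, the `filters.xsltfilter.XsltFilter` and `outputs.ogroutput.Ogr2OgrOutput` class paths, and the `script` value are assumptions made for illustration.

```python
import configparser

# Assumed example config: section names, filter/output class paths and the
# script file are hypothetical; only the overall layout follows the talk.
CONFIG_TEXT = """
[etl]
chains = input_xml_file|transformer_xslt|output_ogr_shape

[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml

[transformer_xslt]
class = filters.xsltfilter.XsltFilter
script = cities2gml.xsl

[output_ogr_shape]
class = outputs.ogroutput.Ogr2OgrOutput
"""


def parse_chains(text):
    """Return a list of chains, each a list of (section, class path) pairs."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    chains = []
    # Chains are comma-separated; components within a chain are pipe-separated.
    for chain_str in parser.get("etl", "chains").split(","):
        names = [name.strip() for name in chain_str.split("|") if name.strip()]
        chains.append([(name, parser.get(name, "class")) for name in names])
    return chains


for name, class_path in parse_chains(CONFIG_TEXT)[0]:
    print(name, "->", class_path)
```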
41
Running Stetl
stetl -c etl.cfg
42
Result Shapefile viewed in QGIS
43
Installing Stetl
via PyPI
Deps
• GDAL + Python bindings
• lxml (XML processing)
• psycopg2 (Postgres)
sudo pip install stetl
44
Speed: Streaming
Input Filter Output
gml
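Streaming can be sketched with the standard library alone. The example assumes a tiny in-memory GML-like document and a hypothetical helper `stream_elements`. It uses `xml.etree.ElementTree.iterparse` rather than whatever parser Stetl uses internally, and hands each `featureMember` downstream as soon as it is complete instead of loading the whole file.

```python
import io
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Small made-up sample standing in for a 100+ MB GML file.
SAMPLE = f"""<gml:FeatureCollection xmlns:gml="{GML_NS}">
  <gml:featureMember><city name="Amsterdam"/></gml:featureMember>
  <gml:featureMember><city name="Nottingham"/></gml:featureMember>
</gml:FeatureCollection>""".encode("utf-8")


def stream_elements(source, tag):
    """Yield each element with the given local tag name once it is fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}localname"; compare the local part only.
        if elem.tag.rsplit("}", 1)[-1] == tag:
            yield elem
            # Release the subtree that was already handed downstream.
            elem.clear()


names = [member[0].get("name")
         for member in stream_elements(io.BytesIO(SAMPLE), "featureMember")]
print(names)
```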
45
Speed: Going Native
Input → Filter → Output
gml
Stetl calls native C libs/programs (e.g. ogr2ogr)
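One way to picture "going native" is a thin Python wrapper that assembles a command line and hands the heavy lifting to an external program. The sketch below is an assumption about the approach, not Stetl's code. `ogr2ogr_command` and `run_native` are hypothetical helpers, the file paths are made up, and only the basic `ogr2ogr -f <driver> <dst> <src>` form is used.

```python
import shutil
import subprocess


def ogr2ogr_command(src, dst, driver="ESRI Shapefile"):
    """Build an ogr2ogr argument list: ogr2ogr -f <driver> <dst> <src>."""
    return ["ogr2ogr", "-f", driver, dst, src]


def run_native(cmd):
    """Run the external program if installed; return its exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None  # tool not on PATH, nothing is executed
    return subprocess.run(cmd, check=False).returncode


# Hypothetical paths; the command is only built and printed here, not run.
cmd = ogr2ogr_command("output/cities.gml", "output/cities.shp")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with driver names that contain spaces.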
46
Example Components
Input        Filters            Output
XMLFile      XSLT               GMLFile
ogr2ogr      XMLAssembler       ogr2ogr
LineStream   XMLValidator       WFS-T
deegree*     FeatureExtractor   deegree*
YourInput    YourFilter         YourOutput
47
Example: XsltFilter Python

from util import Util, etree
from filter import Filter
from packet import FORMAT

log = Util.get_log("xsltfilter")


class XsltFilter(Filter):
    # Constructor
    def __init__(self, configdict, section):
        Filter.__init__(self, configdict, section,
                        consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

        self.xslt_file_path = self.cfg.get('script')
        self.xslt_file = open(self.xslt_file_path, 'r')
        # Parse XSLT file only once
        self.xslt_doc = etree.parse(self.xslt_file)
        self.xslt_obj = etree.XSLT(self.xslt_doc)
        self.xslt_file.close()

    def invoke(self, packet):
        if packet.data is None:
            return packet
        return self.transform(packet)

    def transform(self, packet):
        packet.data = self.xslt_obj(packet.data)
        log.info("XSLT Transform OK")
        return packet
48
Your Own Components

Step 1 - Define Class

class MyFilter(Filter):
    # Constructor
    def __init__(self, configdict, section):
        Filter.__init__(self, configdict, section,
                        consumes=FORMAT.etree_doc, produces=FORMAT.etree_doc)

    def invoke(self, packet):
        log.info("CALLING MyFilter OK!!!!")
        return packet

Step 2 - Config Class

[etl]
chains = input_xml_file|my_filter|output_std

[input_xml_file]
class = inputs.fileinput.XmlFileInput
file_path = input/cities.xml

# My custom component
[my_filter]
class = my.myfilter.MyFilter

[output_std]
class = outputs.standardoutput.StandardXmlOutput
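A config value such as `class = my.myfilter.MyFilter` has to be turned into a Python class at run time. How Stetl does this is not shown in the talk. The sketch below shows one common way with `importlib`, using a hypothetical helper `load_class` and a standard-library class so that it runs anywhere.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class; in a config like the one above
# the value would be something like my.myfilter.MyFilter.
cls = load_class("collections.OrderedDict")
print(cls.__name__)
```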
49
Data Structures
• Components exchange Packets
• Packet contains data and status
• Data formats, e.g.:
  xml_line_stream
  etree_doc
  etree_element (feature)
  etree_element_array
  string
  any
  ..
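The format names suggest that neighbouring components must agree on what they exchange. A hedged sketch of such a check follows. The format names come from the slide, while the `Packet` fields, the `compatible` rule (with `any` matching everything) and `check_chain` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Format names taken from the slide; the checking logic is an assumption.
FORMATS = {"xml_line_stream", "etree_doc", "etree_element",
           "etree_element_array", "string", "any"}


@dataclass
class Packet:
    """Hypothetical packet: payload plus simple status information."""
    data: object = None
    format: str = "any"
    end_of_stream: bool = False


def compatible(produces, consumes):
    """True if an upstream 'produces' format can feed a downstream 'consumes'."""
    if produces not in FORMATS or consumes not in FORMATS:
        raise ValueError("unknown format")
    return "any" in (produces, consumes) or produces == consumes


def check_chain(components):
    """components: list of (name, consumes, produces) tuples in chain order."""
    problems = []
    for (up, _, produces), (down, consumes, _) in zip(components, components[1:]):
        if not compatible(produces, consumes):
            problems.append(f"{up} -> {down}: {produces} != {consumes}")
    return problems


chain = [
    ("input_xml_file", "any", "etree_doc"),
    ("xslt_filter", "etree_doc", "etree_doc"),
    ("output_std", "string", "any"),  # deliberately mismatched for the demo
]
packet = Packet(data="<cities/>", format="string")
print(check_chain(chain))
```

Checking formats when the chain is assembled reports a mismatch before any data is read, rather than halfway through a large file.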
50
deegree Integration
• Input
  DeegreeBlobstoreInput
• Output
  DeegreeBlobstoreOutput
  DeegreeFSLoaderOutput
  WFSTOutput
51
Cases - The Netherlands
• INSPIRE Download Services
  publish to deegree store (WFS)
  generate GML files (for Atom Feed)
• National GML Datasets
  GML to PostGIS (Top10NL, BGT)
52
Top10NL Extract

[etl]
chains = input_sql_pre|schema_name_filter|output_postgres,
         input_big_gml_files|xml_assembler|transformer_xslt|output_ogr2ogr,
         input_sql_post|schema_name_filter|output_postgres

# Pre SQL file inputs to be executed
[input_sql_pre]
class = inputs.fileinput.StringFileInput
file_path = sql/drop-tables.sql,sql/create-schema.sql

# Post SQL file inputs to be executed
[input_sql_post]
class = inputs.fileinput.StringFileInput
file_path = sql/delete-duplicates.sql

# Generic filter to substitute Python-format string values like {schema} in string
[schema_name_filter]
class = filters.stringfilter.StringSubstitutionFilter
# format args {schema} is schema name
format_args = schema:{schema}

[output_postgres]
class = outputs.dboutput.PostgresDbOutput
database = {database}
host = {host}
port = {port}
user = {user}
password = {password}
schema = {schema}

# The source input file(s) from dir and produce gml:featureMember elements
[input_big_gml_files]
class = inputs.fileinput.XmlElementStreamerFileInput
file_path = {gml_files}
element_tags = featureMember

Parameter Substitution
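The `{schema}`-style placeholders are ordinary Python format fields, so the substitution idea can be sketched with `str.format`. The values below are hypothetical and `substitute` is an illustrative helper. How Stetl itself receives the values is not shown here.

```python
# Section copied from the config above; placeholders are Python format fields.
CONFIG_TEMPLATE = """[output_postgres]
class = outputs.dboutput.PostgresDbOutput
database = {database}
host = {host}
port = {port}
user = {user}
password = {password}
schema = {schema}
"""

# Hypothetical values; a real run would supply its own.
ARGS = {
    "database": "top10nl",
    "host": "localhost",
    "port": "5432",
    "user": "etl",
    "password": "secret",
    "schema": "public",
}


def substitute(template, values):
    """Fill every {name} placeholder; a missing name raises KeyError."""
    return template.format(**values)


config_text = substitute(CONFIG_TEMPLATE, ARGS)
print(config_text)
```

Because a missing value raises `KeyError`, a forgotten parameter fails early instead of producing a config with a literal `{schema}` in it.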
53
Top10NL+BAG (Dutch Topo + Buildings)
54
BGT - Dutch Large Scale Topo
55
Case: INSPIRE DL Services - Dutch Addresses
[Diagram labels: Source <GML>; Dutch Addresses + Buildings; NLExtract; Stetl; deegree blobstore; deegree; WFS; INSPIRE Addresses; INSPIRE <GML>; Atom Feed]
56
Project Status - Sept 21, 2013
• v1.0.4 installable via PyPI
• Documentation on www.stetl.org
• Real-world transforms done
• Seeking feedback, support and contributors
57
Rich GML Problem Solved?
58
Thank You!
www.stetl.org
github.com/justb4/stetl
59