Slovak Public Procurement Announcements - Extraction, Transformation and Loading


    [email protected] www.knowerce.sk

Slovak Public Procurement Announcements

Extraction, Transformation and Loading Process
July 2010

    knowerce


    Document information

Creator: Knowerce, s.r.o., Vavilovova 16, 851 01 Bratislava

[email protected], www.knowerce.sk

Author: Štefan Urbánek, [email protected]

Date of creation: 20.7.2010

Document revision: 2

1. Document Restrictions

Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".


    2. Introduction

This document describes the extraction, transformation and loading process for public procurement documents in Slovakia. The objective of the VVO project was to transform unstructured public procurement announcement documents into structured form.

    Source code: http://github.com/Stiivi/vvo-etl

    Data source URL: http://www.e-vestnik.sk/

    Application using the data: http://vestnik.transparency.sk

[Diagram: unstructured HTML source transformed into raw open data]


    3. Overview

3.1. The Process

Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are listed in the table below.

The reasons for creating several jobs instead of a single monolithic processing script are mainly: better maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in the future.

If a part of the chain fails, it is not necessary to run the whole chain again, only the part of the chain from the failed step onwards. This lowers the processing load and the network load on the source servers. For example, if cleansing fails, it is not necessary to download the files again.

In addition to the processing jobs, there are three required, but independent, jobs:

Job                  Type     Description
Download             core     Download HTML documents from the source
Parse                core     Parse HTML documents into structured form
Load source          core     Load structured form into a database table
Cleanse              core     Cleanse data, fix values, map corrections
Create cube          core     Create analytical structure: fact table and dimensions
Create search index  core     Create search index for full-text searching with support for Slovak/ASCII searching
Regis Extraction     support  Extract list of all Slovak organisations
Geography loading    support  Load data from the Slovak post office about regional break-down
CPV loading          support  Load CPV (Common Procurement Vocabulary) data

[Diagram: job chain: Download → Parse → Load source → Cleanse → Create cube → Create search index, grouped into Source Extraction, Transformation and Analytical Transformation; the supporting jobs Regis Extraction, Geography Loading and CPV Loading feed into the chain]


    4. Jobs

    4.1. Download

Inputs: HTML documents stored on the public procurement website

Outputs: HTML files stored locally

Configuration: public procurement web site root, path to the bulletin index, document encoding

Options: incremental mode (default), full mode (download all announcements)

At the site root one can find a paginated list of bulletins:

    http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania

By following a bulletin link, one gets a list of announcement types:

    http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania

[Diagram: Download job: raw sources → HTML files]


By clicking on a link with the desired public procurement type (procurement results), the list is expanded and we get a list of all announcements within the bulletin:

    http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07

Situation:

- no data API provided by the website
- no single list of all public procurements, only paginated browsing of bulletins
- no proper HTML id attributes and no unambiguous class attributes
- layout by tables

    Process

1. Download and parse the document index at the specified site root, get the number of pages
2. Download and parse all bulletin list pages; the output is the name and URL of each bulletin
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate a list of bulletins to be downloaded (all if a full download is requested)
4. Download all announcements found on each bulletin page and save them into the download directory
5. Store the list of downloaded bulletins
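For illustration, a minimal Ruby sketch of this incremental download, using open-uri and Nokogiri; the link-matching patterns, working directory layout and bulletin state file are assumptions, not the actual vvo_download implementation:

    require 'open-uri'
    require 'nokogiri'
    require 'fileutils'
    require 'yaml'

    SITE_ROOT    = 'http://www.e-vestnik.sk/'
    DOWNLOAD_DIR = '/var/lib/vvo-files/html'             # assumed layout of the working directory
    STATE_FILE   = '/var/lib/vvo-files/bulletins.yml'    # list of already downloaded bulletins

    downloaded = File.exist?(STATE_FILE) ? YAML.load_file(STATE_FILE) : []

    # 1.-2. download and parse the bulletin index; bulletins are recognised by their URL pattern
    index     = Nokogiri::HTML(open(URI.join(SITE_ROOT, 'EVestnik/Vsetky_vydania')).read)
    bulletins = index.css('a').select { |a| a['href'].to_s.include?('Vestnik?date=') }

    # 3. keep only bulletins that have not been downloaded yet (incremental mode)
    new_bulletins = bulletins.reject { |a| downloaded.include?(a.text.strip) }

    # 4. download every announcement linked from each new bulletin page
    FileUtils.mkdir_p(DOWNLOAD_DIR)
    new_bulletins.each do |bulletin|
      page = Nokogiri::HTML(open(URI.join(SITE_ROOT, bulletin['href'])).read)
      page.css('a').each do |link|
        next unless link['href'].to_s =~ %r{EVestnik/Detail/(\d+)}
        html = open(URI.join(SITE_ROOT, link['href'])).read
        File.write(File.join(DOWNLOAD_DIR, "#{$1}.html"), html)
      end
      downloaded << bulletin.text.strip
    end

    # 5. store the list of downloaded bulletins for the next run
    File.write(STATE_FILE, downloaded.to_yaml)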

    4.2. Parse

[Diagram: Parse job: HTML files → YAML files]


    Inputs: HTML documents with announcements, stored locally

Outputs: YAML structured files with parsed fields, one YAML file per announcement

Configuration: none

    Options: none

Situation:

- very messy HTML structure
- ambiguous class attributes, mis-use of class attributes
- no usable id attributes
- heavy table layout with nested tables; level 3 nesting is common (table in table in table)
- sometimes broken layout, causing many parsing exceptions
- values not reliably indexable by referencing row numbers
- non-consistent table layout, which might or might not contain tbody

Document example:

    http://www.e-vestnik.sk/EVestnik/Detail/16563

    Example of layout with emphasised contrast CSS for better layout visibility:


Example of broken layout, where the cyan values in the left column were supposed to be in the right column:

Example of element nesting within a document with 24 levels of nesting:

html > body > #page > #container > #main > #innerMain > div >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td > span.hodnota

Having a situation like the one described above makes parsing of public procurement documents tricky.

Rough document structure (as seen by a user/human):

- document title
- basic announcement information
- parts of the announcement
- each part of the announcement contains sections
- each section contains a list of information pieces (I would not call them key-value pairs, as they are not)


    Process

The whole document was parsed as an HTML document tree.

    Strategies used:

- Unicode regular expression matching
- element references by element index (unstable, but sufficient for most cases) - instead of using proper id/class attributes (which were missing), we used the index of the element that we wanted to parse
- because the structure was not consistent, sometimes searching for elements was necessary instead of directly referencing by path, which made processing a little bit slower

1. read basic announcement information: date, announcement number, type
2. find the table with document parts and split the HTML document into subtrees, one for each part
3. parse each part
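A rough sketch of steps 1 and 2 with Nokogiri follows; since the announcement body exposes no usable id or class attributes, the tables are addressed by position, and the concrete indexes below are only illustrative:

    require 'nokogiri'

    doc  = Nokogiri::HTML(File.read('16563.html'))
    main = doc.at_xpath('//div[@id="innerMain"]')

    # 1. basic announcement information sits in the first table (assumed position)
    tables       = main.xpath('.//table')
    header_table = tables.first

    # 2. the document body table holds the parts; its rows either carry a part
    #    title or a nested table with the part body
    parts = []
    body_table = tables[1]                                   # assumed position
    body_table.xpath('./tbody/tr | ./tr').each do |row|
      part_body = row.at_xpath('.//table')
      if part_body.nil?
        parts << { :title => row.text.strip, :rows => [] }   # row with a part title
      elsif parts.any?
        parts.last[:rows] = part_body.xpath('.//tr')         # row with the part body
      end
    end

    # 3. each part in `parts` is then parsed separately (see Part parsing below)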

Part parsing:

The main body of the document is a table containing cells which contain an optional part title and a part body in the form of a table. The part body table contains anonymous rows with section contents in two columns. The left column is used mostly for padding and might contain a section number. The right column contains the information to be extracted. What the parts and sections look like is depicted in the following picture:

[Diagram: part and section layout. A part consists of a part title cell followed by a part body table; more parts follow in the same pattern. Rows of the part body have two columns: a row with a number in the left column carries the section title in the right column, while rows with an empty left column carry cells with content.]


It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken into a list of table rows and the rows were parsed sequentially, as on a tape:

1. prepare a section structure
2. get the next row
3. if the left column contains a value, then it is the beginning of the next section:
   3.1. process the previous section, if there is any
   3.2. prepare a new section structure
   3.3. save the next section name into the section structure
4. if the left column is empty, then:
   4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
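The same row scan, sketched in Ruby (the two-cell row layout and the field names are assumptions):

    # Sequentially scan the rows of one part body, as on a tape. Each row is a
    # Nokogiri <tr> node with a left cell (section number or padding) and a
    # right cell (section title or content).
    def split_into_sections(rows)
      sections = []
      current  = nil

      rows.each do |row|
        cells = row.xpath('./td')
        next if cells.size < 2
        left, right = cells[0].text.strip, cells[1]

        if left.empty?
          # 4. left column empty - add the right column to the current section rows
          current[:rows] << right if current
        else
          # 3. left column holds a value - a new section begins
          sections << current if current                # 3.1. store the previous section
          current = { :number => left, :title => right.text.strip, :rows => [] }
        end
      end

      sections << current if current
      sections
    end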

    Section parsing:

After parsing the parts, the section structure contains the section title, the section number and a list of rows (cells from the right column of the part table). The rows are processed sequentially as well. Each set of section rows was parsed into field/value pairs using Unicode regexp matching. Because the naming of values was not consistent, multiple values/matches or more complex regular expressions had to be used. The value keys had different wordings or used different words to describe the same value.

Examples of section rows:

[Figure: example section rows, shown as the rendered document and the corresponding HTML]
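For illustration only, a sketch of such Unicode regexp matching over section rows; the field names and label variants below are invented for the example and are not the actual patterns used by the job:

    # -*- coding: utf-8 -*-
    # Each field may be labelled with several different wordings in the source,
    # so one field is matched by an alternation of label variants.
    FIELD_PATTERNS = {
      :procurer_name => /(?:Názov organizácie|Úradný názov)\s*[:.]?\s*(.+)/,
      :ico           => /IČO\s*[:.]?\s*([\d\s]+)/
    }

    def parse_section_rows(rows)
      values = {}
      rows.each do |row|
        text = row.text.strip
        FIELD_PATTERNS.each do |field, pattern|
          match = pattern.match(text)
          values[field] ||= match[1].strip if match
        end
      end
      values
    end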



Part V. contained a list of contracts and required separate parsing.

No heavy data cleansing is performed, only fixing of numerical values and trimming of text strings.

    Issues

- fields with currency amounts came in many forms (see the sketch after this list):
  - one amount (expected) or two amounts (expected and final)
  - single amount or from-to range
  - with or without currency
  - with or without a VAT-included flag
  - with or without VAT rate
- not all contacts had field name prefixes (such as name:, phone:); field order was used in that case (not 100% reliable)
- empty/bogus HTML nodes, sometimes preventing proper parsing
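A hedged sketch of normalising the currency amount variants listed above; the wordings ("od ... do", "bez DPH") and the number formats are assumptions about the source text:

    # Parse one currency field: single amount or from-to range, optional
    # currency, optional "bez DPH" (without VAT) flag. Patterns are illustrative.
    def parse_amount(text)
      number = /(\d[\d\s]*(?:,\d+)?)/               # "1 234,56" - spaces as thousand separators
      result = { :with_vat => !text.match(/bez\s+DPH/i) }
      result[:currency] = text[/\b(EUR|SKK)\b/, 1]

      if (m = /od\s*#{number}\s*do\s*#{number}/i.match(text))   # from-to range
        result[:amount_from] = to_number(m[1])
        result[:amount_to]   = to_number(m[2])
      elsif (m = number.match(text))                             # single amount
        result[:amount] = to_number(m[1])
      end
      result
    end

    def to_number(string)
      string.gsub(/\s/, '').sub(',', '.').to_f
    end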

    4.3. Load Source

Inputs: YAML structured files with parsed fields, one YAML file per announcement

Outputs: populated staging database table with contracts

Configuration: none

    Options: default mode (just load data), create mode (create DB structures)

    Process

Simple mapping of the structured files into a DB table:

load the structured file and, for each contract: insert a contract record into the table
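A minimal sketch of this step with the sequel gem; the connection string, directory and column names are assumptions (the staging table name sta_vvo_vysledky appears in the consolidation diagram in the next section):

    require 'yaml'
    require 'sequel'

    DB        = Sequel.connect('postgres://localhost/vvo')    # illustrative connection
    contracts = DB[:sta_vvo_vysledky]                         # staging table with contracts

    Dir.glob('/var/lib/vvo-files/yaml/*.yml').each do |path|
      announcement = YAML.load_file(path)

      # one announcement may contain several contracts - insert one row per contract
      Array(announcement['contracts']).each do |contract|
        contracts.insert(
          :announcement_number => announcement['number'],
          :supplier_name       => contract['supplier_name'],
          :supplier_ico        => contract['supplier_ico'],
          :amount              => contract['amount']
        )
      end
    end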

[Diagram: Load source job: YAML files → contracts table (staging)]


The table contains mostly unprocessed raw text values, with numeric values only for currency amounts. The content of the table mostly matches the information from the source documents.

    4.4. Cleanse

    Inputs: populated staging database table with contracts

    Outputs: cleaned staging data with consolidated suppliers

Configuration: none

    Options: default mode (just load data), create mode (create DB structures)

Process

The goal of this job is to cleanse the data taken from the source and consolidate them. More specifically:

- cleanse the organisation number (ICO) format (without validity checking)
- coalesce values of short enumerations
- consolidate date formats
- add procurer additions into the procurers table
- consolidate suppliers and add additions into the suppliers table

    Suppliers Consolidation

    Requirements:

- a table with suppliers that might contain more information than is present in the REGIS database
- the possibility to automatically correct errors in source documents, such as invalid IDs
- collect all unknown IDs in a separate table for further correction

The presence and validity of the organisation identification number (ICO) in the source does not meet quality requirements. There are cases when an ICO does not match any organisation in the organisation database. For those cases a mapping table is created where one can specify a mapping of invalid company identifications to valid ones. There are two ways of corrective mapping:

map an organisation directly within a specific announcement:

[announcement, organisation ID] → [correct organisation ID]

[Diagram: Cleanse job: the contracts table (staging), the "unknown" suppliers map and REGIS (SK organisations) feed the cleansing, producing staging clean data with fields of appropriate type and format]


    map unknown organisations:

[country, organisation ID, organisation name] → [correct organisation ID]

    The process is depicted in the following image:

1. Try to find unknown suppliers
2. Coalesce the supplier name: use the org. id from the suppliers table if found, otherwise take it from the suppliers table via the mapping
3. Append newly found suppliers

The reason for having a separate suppliers table is that it might be extended with more necessary information than is provided by the organisations database REGIS.

[Diagram: supplier consolidation over the tables sta_vvo_vysledky, sta_regis, sta_suppliers, map_suppliers and tmp_coalesced_suppliers_sk: step 1 finds unknown suppliers, step 2 coalesces suppliers, step 3 appends new suppliers; the depicted branch covers Slovak organisations (Slovensko)]
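A hedged Sequel sketch of the three steps, using the table names from the diagram; the column names and the name of the unknown-suppliers table are assumptions:

    require 'sequel'

    DB = Sequel.connect('postgres://localhost/vvo')            # illustrative connection

    known  = DB[:sta_suppliers].select(:ico)
    mapped = DB[:map_suppliers].select(:source_ico)

    # 1. find supplier identifications that are neither known nor already mapped
    unknown = DB[:sta_vvo_vysledky].
                exclude(:supplier_ico => known).
                exclude(:supplier_ico => mapped).
                select(:supplier_ico, :supplier_name).distinct
    unknown.each { |row| DB[:map_suppliers_unknown].insert(row) }   # collected for manual correction

    # 2. coalesce: use the supplier ICO directly when known, otherwise follow the mapping
    coalesced = DB[:sta_vvo_vysledky].
                  left_join(:map_suppliers, :source_ico => :supplier_ico).
                  select(Sequel.function(:coalesce, :target_ico, :supplier_ico).as(:ico))

    # 3. append organisations from REGIS that appear as suppliers but are not yet
    #    present in the suppliers table
    DB[:sta_regis].
      where(:ico => coalesced).
      exclude(:ico => DB[:sta_suppliers].select(:ico)).
      each { |org| DB[:sta_suppliers].insert(:ico => org[:ico], :name => org[:name]) }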


    4.5. Create Cube

    Inputs: cleaned staging data

    Outputs: fact table, dimension tables, analytical model description

Configuration: none

    Options: default mode (just load data), create mode (create DB structures)

    This step creates and loads all structures for analytical processing:

- fact table - a fact is one contract
- dimensions:
  - supplier
  - procurer
  - process type
  - contract type
  - evaluation type
  - account sector
  - supplier geography

    Process

1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values - if there are values in the source data that are not found in the dimensions, mark them as unknown and add them into the dimension tables as new value additions
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown fields

    Create Fact Table

The fact table is created simply by transforming the cleansed data and joining it with the prepared dimension tables.
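A hedged sketch of what this join can look like, expressed as SQL executed through Sequel; the table and column names are illustrative, not the actual datamart schema:

    require 'sequel'

    DB = Sequel.connect('postgres://localhost/vvo')    # illustrative connection

    # Build the fact table by joining cleansed contracts with the prepared
    # dimension tables and keeping the dimension keys together with the measures.
    DB.run <<-SQL
      CREATE TABLE vvo_data.ft_contracts AS
      SELECT c.id     AS contract_id,
             ds.id    AS supplier_key,
             dp.id    AS procurer_key,
             dct.id   AS contract_type_key,
             c.amount AS amount
      FROM vvo_staging.sta_vvo_vysledky c
      LEFT JOIN vvo_data.dm_supplier      ds  ON ds.ico   = c.supplier_ico
      LEFT JOIN vvo_data.dm_procurer      dp  ON dp.ico   = c.procurer_ico
      LEFT JOIN vvo_data.dm_contract_type dct ON dct.code = c.contract_type;
    SQL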

[Diagram: Create cube job: staging clean data → fact table, dimension tables and analytical model description]


    4.6. Create search index

    Inputs: dimension tables

    Outputs: Sphinx search index

Configuration: none

    Options: none

This step creates an index of dimension values at searchable levels and indexes them with the Sphinx full-text search indexer. The index is created using Slovak character mapping, to be able to have search queries in plain ASCII (without carons and accents).

The analytical model is a multidimensional cube in a star schema [1] with hierarchical dimensions that have multiple levels. It would not be sufficient to create a full-text search index for each table, as we need to know at what level the searched field was found. For this purpose a dimension index table is created.

The dimension index contains the fields:

- dimension
- dimension key (reference to the dimension row - the whole dimension point)
- level (for example: county, region or country in geography)
- level key - value of the level key attribute (for example: county code)
- indexed field name
- indexed field value

Sphinx indexes the dimension index table.

Usage example for the search query Bystri*: there are several cities called Bystrica, such as Banská Bystrica, however there is also a region called Banskobystrický that will match the same query, and we want to get both results - the higher level (region) and the detailed level (city).
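An illustrative sketch of generating such dimension index records for the geography dimension; the table layout and column names are assumptions:

    require 'sequel'

    DB = Sequel.connect('postgres://localhost/vvo')   # illustrative connection

    # One dimension row yields one index record per searchable level, all pointing
    # back to the same dimension key, so a full-text match also tells us the level.
    LEVELS = [:county, :region, :country]

    DB[:dm_geography].each do |row|
      LEVELS.each do |level|
        DB[:dimension_index].insert(
          :dimension     => 'geography',
          :dimension_key => row[:id],
          :level         => level.to_s,
          :level_key     => row[:"#{level}_code"],
          :field_name    => "#{level}_name",
          :field_value   => row[:"#{level}_name"]
        )
      end
    end

    # A Sphinx query for "Bystri*" then matches both "Banská Bystrica" (detailed
    # level) and the "Banskobystrický" region (higher level), and the level
    # column distinguishes the two results.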

4.7. Regis Download

Inputs: documents at the website of the Statistical Office of the Slovak Republic

    Outputs: table with list of organisations in Slovakia

Configuration: source URL, document ID range, number of concurrent processing threads

    Options: incremental download (default), full reload

[Diagram: Create search index job: dimension tables → dimension index → search index]


[1] Fact table joined with dimension tables with no deeper references. All tables are joined to the fact table directly; there are no chained joins (FT - T1 - T2).


    Process

Documents are downloaded sequentially by document ID from the source URL. The downloading is done in batches of 50k documents (configurable) and in 20 parallel threads (configurable).

Despite the documents being labeled as HTML, they contain no valid HTML code and can be considered text documents with HTML tags. The downloaded documents are stripped of HTML tags and then parsed with regular expressions as plain-text documents.
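A rough sketch of the batched, multi-threaded download and plain-text parsing; the source URL and the field patterns are assumptions, only the batch size and thread count follow the description above:

    require 'open-uri'
    require 'thread'

    SOURCE_URL = 'http://www.statistics.sk/...'   # document URL pattern, configured in the real job
    BATCH_SIZE = 50_000                           # configurable
    THREADS    = 20                               # configurable

    def fetch_organisation(doc_id)
      raw  = open("#{SOURCE_URL}#{doc_id}").read
      text = raw.gsub(/<[^>]*>/, ' ')             # not valid HTML - strip tags, keep plain text
      {
        :ico  => text[/IČO\s*:?\s*(\d+)/, 1],     # illustrative field patterns
        :name => text[/Názov\s*:?\s*(\S.*?)\s{2,}/, 1]
      }
    rescue OpenURI::HTTPError
      nil
    end

    def download_batch(first_id)
      queue = Queue.new
      (first_id...(first_id + BATCH_SIZE)).each { |id| queue << id }

      THREADS.times.map do
        Thread.new do
          until queue.empty?
            id  = queue.pop(true) rescue break
            org = fetch_organisation(id)
            # insert org into the sta_regis staging table here (omitted)
          end
        end
      end.each(&:join)
    end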

The process of downloading and processing all documents takes 2 hours on average, therefore it is advised to run the process on a weekly basis.

4.8. Geography Loading

Inputs: list of municipalities and counties from the Slovak Post Office

    Outputs: single de-normalised table with hierarchical geography information about Slovakia

Configuration: none

    Options: none

    Process

Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and region names, into a single de-normalised table.

4.9. CPV Loading

Inputs: multilingual wide CPV code table

    Outputs: single de-normalised table with hierarchical CPV structure

Configuration: none

    Options: none

    Process

The Common Procurement Vocabulary (CPV) code table provided by EU institutions is in a linear structure with tree-structure properties. This table is transformed into a de-normalised table with the tree hierarchy levels in multiple columns.
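A sketch of deriving the hierarchy levels of one CPV code from its digit prefixes, assuming the standard CPV code structure (8 digits plus a check digit); the actual level derivation and column names used by the job may differ:

    # Derive hierarchy levels (division, group, class, category) for one CPV
    # code; each level is the code with the finer digits zeroed out.
    def cpv_levels(code)
      digits = code[0, 8]                       # "03110000-5" -> "03110000"
      {
        :code     => code,
        :division => digits[0, 2] + '000000',
        :group    => digits[0, 3] + '00000',
        :class    => digits[0, 4] + '0000',
        :category => digits[0, 5] + '000'
      }
    end

    # cpv_levels("03110000-5")
    # => {:code=>"03110000-5", :division=>"03000000", :group=>"03100000",
    #     :class=>"03110000", :category=>"03110000"}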


    5. Data

    Overview:

[Diagram: overview: source documents → Source Mirror → Staging Data → Datamart]

    There are three data stores:

- source mirror - on a file system
- staging data - database schema
- datamart - database schema

More detailed view:

[Diagram: detailed view. The Source Mirror holds the source documents: HTML files (download) and YAML files (parse). The Staging Data schema holds source contract data (load source), staging contract data (cleanse), staging data lists, mappings and temporary tables. The Datamart holds the contracts cube: fact table, dimensions and the logical model (metadata), produced by create cube.]


5.1. Source Mirror

The source mirror contains the downloaded original documents and a parsed, structured version of the documents in YAML format. If the source becomes unavailable and it is desired to parse the files again (more attributes gathered, different parsing method, bug fix), it can be done on the locally stored files.

Documents are not parsed directly into the database. Reasons:

- required YAML text file storage
- structured documents can be processed with other tools without any database server connection

5.2. Staging Data

Structured files are loaded into the database, into the staging data datastore (preferably a separate schema). The files are loaded without any, or with only very minor, transformations. The table should be a 1:1 copy of the structured files.

    The staging data store contains:

- lists/enumerations, for example ISO country region subdivisions
- copies of various sources or preprocessed datasets, such as geography from the SK post office, registry of organisations (REGIS)
- staging data for procurers and suppliers - might contain more information than provided by the registry of organisations (REGIS)
- maps for mapping source values to desired values, coalescing and unifying:
  - map of unknown organisations - maps unknown org. names and org. codes onto existing organisations
  - map of region names - REGIS uses different region naming than the official post office region registry
  - map of reference codes - maps fulltext values, such as names of procurement types, onto short codes (identifiers) that will be used as keys; also unifies similar names into the same code
- temporary tables - tables used during the transformation process that are created only for the purpose of a single transformation run (for example: coalesced suppliers according to REGIS, mapped unknown organisations and existing registered organisations)

Some tables are appended with new data during the transformation process. New data are added into:

- the map of unknown organisations - for further fixing
- new known organisations - for further updating with additional information

5.3. Datamart Datastore

The Datamart Datastore, a separate database schema, contains the final data ready for analysis and reporting. Structures in the schema are:

- logical model - metadata description of the OLAP cube for contracts (Brewery framework objects)
- dimension tables - tables with dimension values (hierarchical)
- fact table - cleansed table with procurement contracts, joinable with dimensions


The dimension tables together with the fact table in this schema form a snowflake schema [2].

Brewery OLAP uses the structures in the datamart datastore to denormalize the snowflake schema into a wide fact table suitable for analysis, aggregation and reporting. That means that the end-user - the analyst - does not have to know about the physical structures behind the procurement contracts. He has only one logical fact table where one row is one fact, that is, one contract. The logical metadata enables the analyst to perform analysis on a multidimensional hierarchical structure.


[2] http://en.wikipedia.org/wiki/Snowflake_schema


    6. Search Index

One of the requirements for the public procurement portal was to be able to search through the data by many different fields. The nature of the final data is:

- many fields, described by metadata - we should not rely on a fixed data structure
- hierarchical structure - we need to know at what level the value that we are searching for can be found

Example of a search query: chemical. The word chemical might be contained in the subject type, however at different levels: division, category, subcategory. We have to know the exact level where the word appeared. If the word chemical is found at the division level, we want to report at the division level; if the word is found at the category level, we want to aggregate at the category level, etc.

The Sphinx search engine can create one index for a table for a known set of fields. While searching, we do not know in which field the value was found, only the document number (row). To make searching in multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The final table that is being indexed contains:

- string value of the indexed searchable field
- dimension of the field (cpv, organisation, region, ...)
- dimension level of the field (division/category/subcategory, region/county, ...)
- level key of the indexed field
- some index document id that will be returned by Sphinx


    7. Installation

7.1. Software Requirements

- PostgreSQL database server
- ruby 1.9 (does not work with version 1.8)
- gems: sequel, data-mapper, nokogiri
- Sphinx
- Brewery from http://github.com/Stiivi/brewery/

    7.2. Preparation

I. create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/vvo-files
II. initialize and configure Brewery (see the Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data

7.3. ETL Database Initialisation

To initialize the ETL database schema, run the Brewery ETL tool:

    etl initialize

This will create all necessary system tables. If you try to initialise a schema which already contains ETL system tables, you will get an error message. This prevents you from overwriting existing data. To recreate the schema and start with empty tables, execute the initialize command with the --force flag:

    etl --force initialize


    8. Running ETL Jobs

    8.1. Launching

    Manual Launching

Jobs are run by simply launching the etl tool:

    etl run job_name

To manually run all daily jobs, you might use the following script:

#!/bin/bash
#
DEBUG='--debug'

etl $DEBUG run vvo_download
etl $DEBUG run vvo_parse
etl $DEBUG run vvo_load_source
etl $DEBUG run vvo_cleanse
etl $DEBUG run vvo_create_cube
etl $DEBUG run vvo_search_index

If a job fails, you only have to re-run the chain from the failed job onwards.

To do a full download, instead of an incremental one, run:

etl run vvo_download all
