
in partnership with

Title: S-DWH Modular Workflow

WP: 3 Deliverable: 3.2

Version: 6.0 - Final Date: October 2013

Authors: Allan Randlepp, Antonio Laureti Palma, Francesco Altarocca, Valerij Žavoronok, Pedro Cunha

NSI: Statistics Estonia, Istat, Istat, Statistics Lithuania, INE Portugal

ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics


S-DWH Modular Workflow

Version 1.0, February 25, 2013: Allan Randlepp
Version 2.0, February 27, 2013: Allan Randlepp
Version 3.0, March 1, 2013: Antonio Laureti Palma
Version 4.0, March 4, 2013: Allan Randlepp
Version 4.1, June 17, 2013: Valerij Žavoronok
Version 4.2, October 12, 2013: Francesco Altarocca
Version 5.0, October 10, 2013: Pedro Cunha
Version 6.0, October 15, 2013: Antonio Laureti Palma (Final Version)


Index

1 Introduction
2 Statistical production models
   2.1 Stovepipe model
   2.2 Integrated model
   2.3 Warehouse approach
3 Integrated Warehouse model
   3.1 Technical platform integration
   3.2 Process integration
   3.3 Warehouse – reuse of data
4 S-DWH as layered modular system
   4.1 Layered architecture
   4.2 Layered approach of a full active S-DWH
   4.3 Source layer
   4.4 Integration layer
   4.5 Interpretation layer
   4.6 Access layer
5 Workflow scenarios
   5.1 Scenario 1: full linear end-to-end workflow
   5.2 Scenario 2: Monitoring collection
   5.3 Scenario 3: Evaluating new data source
   5.4 Scenario 4: Re-using data for new standard output
   5.5 Scenario 5: re-using data for complex custom query
   5.6 Generic workflow suitable for reuse of components
   5.7 A simple statistical process
   5.8 CORE services and reuse of components
6 Conclusion
References


1. Introduction

A statistical system is a complex system of data collection, data processing, statistical analysis, etc. The following figure (from Sundgren (2004)) shows a statistical system as a precisely defined, man-made system that measures external reality. The planning and control system in the figure corresponds to phases 1–3 and 9 in GSBPM notation, and the statistical production system in the figure corresponds to phases 4–8.

This is a general, standardized view of a statistical system; it could represent one survey, a whole statistical office or even an international organization. How such a system is built up and organized in real life varies greatly. Some implementations of statistical systems have worked quite well so far, others less so. The local environments of statistical systems differ slightly, but the big changes in the environment are increasingly global. It no longer matters how well a system has performed so far: some global changes in the environment are so large that every system has to adapt and change.


This paper presents the strengths and weaknesses of the main statistical production models, based on W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009), “Terminology Relating To The Implementation Of The Vision On The Production Method Of EU Statistics”. This is followed by a proposal on how to combine the integrated production model with the warehouse approach. This corresponds to a metadata-driven data warehouse, which is well-suited to supporting the management of modules in generic workflows. Such a modular approach can reduce the “time to market”, i.e. the length of time from a product being conceived until its availability for use. Next, an overview is given of how the layered architecture of a statistical warehouse gives modularity to the statistical system as a whole. In order to suggest a possible roadmap towards process optimization and cost reduction, we introduce a simple description of a generic workflow, which links the business model with the information system.


2. Statistical production models

2.1 Stovepipe model

Today's prevalent production model in statistical systems is the stovepipe model. It is the outcome of a historic process in which statistics in individual domains have developed independently. In the stovepipe model a statistical action or survey is independent from other actions in almost every phase of the statistical production value chain.

Advantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)):

1. The production processes are best adapted to the corresponding products.

2. It is flexible in that it can adapt quickly to relatively minor changes in the underlying phenomena that the data describe.

3. It is under the control of the domain manager and it results in a low-risk business architecture, as a problem in one of the production processes should normally not affect the rest of the production.


Disadvantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)):

1. First, it may impose an unnecessary burden on respondents when the collection of data is conducted in an uncoordinated manner and respondents are asked for the same information more than once.

2. Second, the stovepipe model is not well adapted to collect data on phenomena that cover multiple dimensions, such as globalization, sustainability or climate change.

3. Last but not least, this way of production is inefficient and costly, as it does not make use of standardization between areas and collaboration between the Member States. Redundancies and duplication of work, be it in development, in production or in dissemination processes are unavoidable in the stovepipe model.

The stovepipe model is the dominant model in the ESS; it is reproduced and augmented at the Eurostat level as well, where it is called the augmented stovepipe model.

2.2 Integrated model

The integrated model is a new and innovative way of producing statistics. It is based on the combination of various data sources. The integration can be horizontal or vertical.

1. “Horizontal integration across statistical domains at the level of National Statistical Institutes and Eurostat. Horizontal integration means that European statistics are no longer produced domain by domain and source by source but in an integrated fashion, combining the individual characteristics of different domains/sources in the process of compiling statistics at an early stage, for example households or business surveys.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

2. “Vertical integration covering both the national and EU levels. Vertical integration should be understood as the smooth and synchronized operation of information flows at national and ESS levels, free of obstacles from the sources (respondents or administration) to the final product (data or metadata). Vertical integration consists of two elements: joint structures, tools and processes and the so-called European approach to statistics.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))



The integrated model is designed to avoid the disadvantages of the stovepipe model (burden on respondents, unsuitability for surveying multi-dimensional phenomena, inefficiencies and high costs). “By integrating data sets and combining data from different sources (including administrative sources) the various disadvantages of the stovepipe model could be avoided. This new approach would improve efficiency by elimination of unnecessary variation and duplication of work and create free capacities for upcoming information needs.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

Moving from the stovepipe model to the integrated model is not an easy task at all. In his answer to the UNSC about the draft of the paper “Guidelines on Integrated Economic Statistics”, W. Radermacher writes: “To go from a conceptually integrated system such as the SNA to a practically integrated system is a long term project and will demand integration in the production of primary statistics. This is the priority objective that Eurostat has given to the European Statistical System through its 2009 Communication to the European Parliament and the European Council on the production method of the EU statistics ("a vision for the new decade").”

The Sponsorship on Standardisation, a strategic task force in the European Statistical System, has compared the traditional and the integrated approach to statistical production. It concludes that “in the current situation, it is clearly shown that there are high level risks and low level opportunities” and that “the full integration situation is more balanced than the current situation, and the most interesting point is that risks are mitigated and opportunities exploded” (The Sponsorship on Standardisation (2013)). It therefore seems strategically wise to move away from stovepipes and partly integrated statistical systems toward fully integrated statistical production systems.


2.3 Warehouse approach

In addition to the stovepipe model, the augmented stovepipe model and the integrated model, W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) also describe the warehouse approach: “The warehouse approach provides the means to store data once, but use it for multiple purposes. A data warehouse treats information as a reusable asset. Its underlying data model is not specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented design, the underlying repository design is modelled based on data inter-relationships that are fundamental to the organisation across processes.”

Conceptual model of data warehousing in the ESS (European Statistical System)

(W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))

“Based on this approach statistics for specific domains should not be produced independently from each other, but as integrated parts of comprehensive production systems, called data warehouses. A data warehouse can be defined as a central repository (or "storehouse") for data collected via various channels.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))


3. Integrated Warehouse model

The Integrated Warehouse model combines the integrated model and the warehouse approach into one model. To have an integrated, warehouse-centric statistical production system, different statistical domains should have more methodological consistency and share common tools and a distributed architecture. First we look at integration, then at the warehouse, and then we combine both into one model.

3.1 Technical platform integration

Let's look at a classical production system and try to find the key integration points where statistical activities meet. A classical stovepipe statistical system looks like this:

Let's begin integrating the platform from the end of the production system. Each well-integrated statistical production system has main dissemination databases, where all detailed statistics are published: one for in-house use and another for public use. To produce rich and integrated output, especially cross-domain output, we need a warehouse where data are stored once but can be used for multiple purposes. Such a warehouse should sit between the Process and Analyse phases. And of course there should be a raw database.

Depending on the specific tools used or other circumstances, one may have more than one raw database, warehouse or dissemination database, but fewer is better. For example, Statistics Estonia has three integrated raw databases. The first is a web-based tool for collecting data from enterprises. The second is a data collection system for social surveys. And the third one is for administrative and other data sources.

But this is not all; let's look at the planning and design phases. Descriptions of all statistical actions, all classifications in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on: all these meta-objects should be collected into one metadata repository during the design and build phases, and the needs of clients should be stored in a central CRM database.


These are the main integration points at the database level, but this is nothing new or revolutionary. In addition, software tools could be shared between statistical actions. How many data collection systems do we need? How many processing or dissemination tools do we need, both at the local and the international level? Do we need different processing software for every statistical action or for every statistical office? This kind of database- and software-level technological integration is important and is not an easy task, but it is not good enough. We must go deeper into the processes and find ways to standardize sub-processes and methods. One way to go deeper into the process is to look at the variables in each statistical activity.

3.2 Process integration

“Integration should address all stages of the production process, from design of the collection system to the compilation and dissemination of data.” (W. Radermacher (2011)) At present, however, each statistical action designs its sample and questionnaires according to its own needs, uses variations of classifications as needed, selects its data sources according to the needs of the action, etc.

In the statistical system there are a number of statistical actions, and each action collects some input variables and produces some output variables. One way to find common ground between different statistical actions and sources is to focus on variables, especially input variables, because data collection and processing are the most costly phases of statistical production. Standardizing these phases gives the fastest and biggest savings. Output variables will be standardized under the SDMX initiative.

Statistical actions should collect unique input variables, not just the rows and columns of tables in a questionnaire. Each input variable should be collected and processed once in each time period. This should be done so that the outcome, the input variable in the warehouse, can be used to produce various different outputs. This variable-centric focus triggers changes in almost all phases of the statistical production process. Samples, questionnaires, processing rules, imputation methods, data sources, etc. must be designed and built in compliance with standardized input variables, not according to the needs of one specific statistical action.

A variable-based statistical production system reduces the administrative burden, lowers the cost of data collection and processing, and makes it possible to produce richer statistical output faster. Of course, this holds within the boundaries of the standardized design. If there is a need for a special survey, one can design one's own sample, questionnaire, etc., but then this is a separate project with its own price tag. Producing regular statistics this way, however, is not reasonable.

3.3 Warehouse – reuse of data

To organize the reuse of already collected and processed data in the statistical production system, the boundaries of statistical actions must be removed. What remains if statistical actions are removed? Statistical actions are collections of input and output variables, processing methods, etc. When we talk about data and reuse, we are interested in variables, samples or the estimation frame, and the timing of surveys.

The following figure represents a typical scenario with two surveys and one administrative data source. Survey 1 collects two input variables, A and B, with questionnaires and may use the variable B' from the administrative source. Survey 1 analyses variables A and B*, where B* is either B from the questionnaire or B' imputed from the administrative source. Survey 2 collects variables C and D and analyses B', C* and D.

This is a statistical action based on the stovepipe model. In this case it is hard to re-use data in the interpretation layer, because the imputation choices for B* and C* in the integration layer are made “locally” and there is a large choice of similar variables in the interpretation layer, such as B* and B'. Also, the samples of Survey 1 and Survey 2 may not be coherent, which means that a third survey wanting to analyse variables A, B' and D in the interpretation layer, without collecting them again, has a problem of coherence and sampling.

To solve the problem we should invest some time and effort in planning and preparing Surveys 1 and 2, so that they are coherent within a unique integrated, variable- and sampling-centric warehouse.

In addition to analysing data and generating output cubes, the interpretation layer can be used for accessing the production data. In the interpretation layer statisticians can plan and prepare Surveys 1 and 2 by coordinating surveys and archives in a common evaluation frame and defining unique variables. The information gained during this phase is the basis for developing and tuning the regular production processes in the integration layer.

This means that a coherent approach can be used if statisticians plan their actions following a logical hierarchy of variable estimation in a common frame. What IT must then support is an adequate environment for designing this strategy.

Then, according to a common strategy, Surveys 1 and 2, which collect data with questionnaires and one administrative data source, again serve as examples. But this time the design-phase decisions, such as questionnaire design, sample selection and imputation method, are made “globally”, in view of the interests of all three surveys. In this way, integration of processes gives us reusable data in the warehouse. Our warehouse now contains each variable only once, making it much easier to reuse and manage our valuable data.


Another way of reusing data already in the warehouse is to calculate new variables. The following figure illustrates the scenario where a new variable E is calculated from variables C* and D, which are already loaded into the warehouse.

This means that data can be moved back from the warehouse to the integration layer. Warehouse data can be used in the integration layer for multiple purposes; calculating new variables is only one example.

An integrated, variable-based warehouse opens the way to new subsequent statistical actions that do not have to collect and process data and can produce statistics directly from the warehouse. By skipping the collection and processing phases, one can produce new statistics and analyses much faster and more cheaply than with a classical survey.

Designing and building a statistical production system according to the integrated warehouse model initially takes more time and effort than building the stovepipe model. But the maintenance costs of an integrated warehouse system should be lower, and new products, which can be produced faster and more cheaply to meet changing needs, should soon compensate for the initial investment.


4. S-DWH as layered modular system

4.1 Layered architecture

In a generic S-DWH system we identify four functional layers in which we group functionalities. The ground level corresponds to the area where the external sources come in and are interfaced, while the top of the stack is where produced data are published to external users or systems. In the intermediate layers we manage the ETL functions for the DWH, in which coherence analysis, data mining, the design of possible new strategies and data re-use are carried out.

Specifically, from the top to the bottom of the architectural stack, we define:

IV the access layer, for the final presentation, dissemination and delivery of the information sought;

III the interpretation and data analysis layer, which is specifically for statisticians and enables data analysis, data mining and support for designing production processes or re-using data;

II the integration layer, where all operational activities needed for any statistical production process are carried out;

I the source layer, the level in which we locate all the activities related to storing and managing internal or external data sources.

The S-DWH layers are in a specific order and data pass through the layers without skipping any of them. It is impossible to use data directly from another layer: if data are needed, they have to be moved to the layer where they are needed, and they can only be moved between neighbouring layers.
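The following minimal Python sketch, an illustration added here and not part of the original architecture description, expresses this rule: a data movement is allowed only between adjacent layers of the stack.

```python
# Illustrative sketch of the "neighbouring layers only" rule:
# data may move one step up or down the S-DWH stack, never skip a layer.

LAYERS = ["source", "integration", "interpretation", "access"]

def can_move(from_layer: str, to_layer: str) -> bool:
    """Return True if a data movement respects the layered architecture."""
    i, j = LAYERS.index(from_layer), LAYERS.index(to_layer)
    return abs(i - j) == 1  # only adjacent layers may exchange data

if __name__ == "__main__":
    assert can_move("source", "integration")
    assert can_move("interpretation", "integration")   # data may also move back down
    assert not can_move("source", "interpretation")    # skipping a layer is not allowed
```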


4.2 Layered approach of a full active S-DWH

The layered architecture reflects a conceptual organization in which we will consider the first two levels as pure statistical operational infrastructures, functional for acquiring, storing, editing and validating data, and the last two layers as the effective data warehouse, i.e. levels in which data are accessible for data analysis.

These reflect two different IT environments: an operational one, where we support semi-automatic computer interaction systems, and an analytical one, the warehouse, where we maximize free human interaction.


4.3 Source layer

The Source layer is the gathering point for all data that is going to be stored in the Data warehouse. Input to the Source layer is data from both internal and external sources. Internal data is mainly data from surveys carried out by the NSI, but it can also be data from maintenance programs used for manipulating data in the Data warehouse. External data means administrative data, i.e. data collected by someone else, originally for some other purpose.

The structure of data in the Source layer depends on how the data are collected and on the designs of the various data collection processes, which are direct and internal to the NSI. The specifications of the collection processes and their output, the data stored in the Source layer, have to be thoroughly described. Vital information is the name, meaning, definition and description of each collected variable. The collection process itself must also be described, for example the source of a collected item, when it was collected and how.

When data enter the source layer from an external source, or administrative archive, the data and the related metadata must be checked in terms of completeness and coherence.

From a data structure point of view, external data are stored with the same data structure in which they arrive. The integration towards the integration layer should then be realized by mapping each source variable to a target variable, i.e. a variable internal to the S-DWH.

A mapping is a graphic or conceptual representation of relationships within the data, i.e. the process of creating data element mappings between two distinct data models.

The common and original practice of mapping is the effective interpretation of an administrative archive in terms of S-DWH definitions and meanings.

Data mapping involves combining data residing in different sources and providing users with a unified view of these data. These systems are formally defined as a triple <T,S,M> where T is the target schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the target schemas.

Queries over the data mapping system also assert the data linking between elements in the sources and the business register units.
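As an illustration of the <T,S,M> formalization, the hedged Python sketch below maps records from two hypothetical administrative archives onto an invented target schema; all archive, schema and variable names are assumptions, not taken from any real register.

```python
# Illustrative sketch of the <T, S, M> triple: S is a set of source schemas,
# T is the target schema internal to the S-DWH, M maps source variables to target variables.

# S: heterogeneous source schemas, as delivered by two (hypothetical) administrative archives
source_schemas = {
    "tax_register": ["firm_id", "turnover_2012", "nace_code"],
    "social_security": ["company", "employees_avg"],
}

# T: target schema, i.e. the variables internal to the S-DWH
target_schema = ["UNIT_ID", "TURNOVER", "NACE", "EMPLOYMENT"]

# M: the mapping between source and target variables
mapping = {
    ("tax_register", "firm_id"): "UNIT_ID",
    ("tax_register", "turnover_2012"): "TURNOVER",
    ("tax_register", "nace_code"): "NACE",
    ("social_security", "company"): "UNIT_ID",
    ("social_security", "employees_avg"): "EMPLOYMENT",
}

def to_target(source: str, record: dict) -> dict:
    """Translate one source record into the target schema using the mapping M."""
    return {mapping[(source, k)]: v for k, v in record.items() if (source, k) in mapping}

print(to_target("tax_register", {"firm_id": "EE123", "turnover_2012": 1.5e6, "nace_code": "C10"}))
```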

Figure: data mapping of ADMIN DATA to the target schema through the metadata of the source layer.


The internal sources do not need mapping, since their data collection processes are defined in the S-DWH during the design phase using internal definitions.


4.4 Integration layer

From the Source layer, data are loaded into the Integration layer. This represents an operational system used to process the day-to-day transactions of an organization; such systems are designed to process transactions efficiently and with integrity. The process of translating data from source systems and transforming them into useful content in the data warehouse is commonly called ETL (Extract, Transform, Load):

- In the Extract step, data are moved from the Source layer and made accessible in the Integration layer for further processing.

- In the Transformation step, the operational activities usually associated with the typical statistical production process are carried out.

- In the Load step, as soon as a variable has been processed in a way that makes it useful in the context of the data warehouse, it is loaded into the Interpretation layer and the Access layer.

Examples of activities carried out during the Transformation step are:

• Find, and if possible, correct incorrect data;
• Transform data to formats matching standard formats in the data warehouse;
• Classify and code;
• Derive new values;
• Combine data from multiple sources;
• Clean data, that is, for example, correct misspellings, remove duplicates and handle missing values.

To accomplish the different tasks in the transformation of new data into useful output, data already in the data warehouse are used to support the work. Examples of such usage are using existing data together with new data to derive a new value, or using old data as a base for imputation.
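The short Python sketch below illustrates two of the transformation activities named above, removing duplicates and imputing a missing value from data already in the warehouse; the record layout, field names and values are invented for the example.

```python
# Hedged sketch of two Transformation activities: de-duplication and imputation
# using a previous-period value already stored in the warehouse. Names are invented.

raw_records = [
    {"unit": "A01", "turnover": 120.0},
    {"unit": "A01", "turnover": 120.0},   # duplicate reported twice
    {"unit": "B02", "turnover": None},    # missing value
]

# Historical values already in the warehouse, used here as an imputation base
warehouse_previous = {"B02": 95.0}

def transform(records):
    seen, cleaned = set(), []
    for r in records:
        key = r["unit"]
        if key in seen:                    # remove duplicates
            continue
        seen.add(key)
        if r["turnover"] is None:          # impute from last period's warehouse value
            r = {**r, "turnover": warehouse_previous.get(key)}
        cleaned.append(r)
    return cleaned

print(transform(raw_records))
```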

Each variable in the data warehouse may be used for several different purposes in any number of specified outputs. As soon as a variable is processed in the Integration layer in a way that makes it useful in the context of data warehouse output, it has to be loaded into the Interpretation layer and the Access layer.


The Integration layer is an area for processing data; this is realized by operators specialized in ETL functionalities. Since the focus of the Integration layer is on processing rather than search and analysis, data in the Integration layer should be stored in a generalized, normalized structure optimized for OLTP (Online Transaction Processing, a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval). In such a structure all data are stored in a similar way, independently of the domain or topic, and each fact is stored in only one place, which makes it easier to maintain consistent data.

It is well known that these databases are very powerful for data manipulation such as inserting, updating and deleting, but are very ineffective when we need to analyse and deal with a large amount of data. Another constraint in the use of OLTP systems is their complexity: users must have great expertise to manipulate them and it is not easy to understand all of their intricacy.

Some OLTP characteristics are:

Source of data: operational data.
Purpose of data: to control and run fundamental business tasks.
Processing speed: typically very fast.
Database design: highly normalized, with many tables.
Backup and recovery: back up religiously; operational data are critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
Age of data: current.
Queries: relatively standardized and simple queries, returning relatively few records.
Database operations: insert, delete and update.
What the data reveal: a snapshot of on-going business processes.


During the ETL processes a variable will likely appear in several versions. Every time a value is corrected or changed for some other reason, the old value should not be erased; instead a new version of that variable should be stored. This mechanism ensures that all items in the database can be followed over time.
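A minimal sketch of this versioning rule, assuming a simple in-memory store, might look as follows; the unit and variable names are illustrative only.

```python
# Illustrative versioning sketch: corrected values are never overwritten,
# a new version is appended instead, so the history of each item can be followed.

from datetime import datetime, timezone

versions: dict[tuple[str, str], list[dict]] = {}   # (unit, variable) -> list of versions

def store(unit: str, variable: str, value, reason: str = "initial load"):
    """Append a new version of a variable instead of erasing the old value."""
    versions.setdefault((unit, variable), []).append(
        {"value": value, "reason": reason, "stored_at": datetime.now(timezone.utc)}
    )

def current(unit: str, variable: str):
    """The latest version is the one used for further processing."""
    return versions[(unit, variable)][-1]["value"]

store("EE123", "TURNOVER", 1500)                     # value loaded from the source layer
store("EE123", "TURNOVER", 1550, reason="editing")   # corrected during editing
print(current("EE123", "TURNOVER"), len(versions[("EE123", "TURNOVER")]))  # 1550, 2 versions
```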

4.5 Interpretation layer

This layer contains all collected data, processed and structured to be optimized for analysis and as a base for the output planned by the NSI. The Interpretation layer is specially designed for statistical experts and is built to support data manipulation in big, complex search operations. Typical activities in the Interpretation layer are:

• Basic analysis;
• Correlation and multivariate analysis;
• Hypothesis testing, simulation and forecasting;
• Data mining;
• Design of new statistical strategies;
• Design of data cubes for the Access layer.

Its underlying data model is not specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented design, the repository design is modelled based on data inter-relationships that are fundamental to the organization across processes.


Data warehousing has become an important strategy for integrating heterogeneous information sources in organizations, and for enabling their analysis and improving their quality.

The Interpretation layer will contain micro data, elementary observed facts, aggregations and calculated values, but it will also still contain all data at the finest granular level in order to be able to cover all possible queries and joins. A fine granularity is also a condition for managing changes in required output over time.

Besides the actual data warehouse content, the Interpretation layer may contain temporary data structures and databases created and used by the different on-going analysis projects carried out by statistics specialists.

The ETL process in the integration layer continuously creates metadata regarding the variables and the process itself, which are stored as a part of the data warehouse.

Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing system (OLTP) database.

OnLine Analytical Processing (OLAP):

• Subject orientated;
• Designed to provide real-time analysis;
• Data are historical;
• Highly de-normalized.

OLAP structures are multi-dimensional and are optimised for processing very complex, ad-hoc read queries in real time.

Some OLAP characteristics are:

Source of data: consolidated data; OLAP data come from the various OLTP databases.
Purpose of data: to help with planning, problem solving and decision support.
Processing speed: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Database design: typically de-normalized, with fewer tables; use of star schemas.
Backup and recovery: regular backups.
Age of data: historical.
Queries: often complex queries involving aggregations.
Database operations: read.
What the data reveal: multi-dimensional views of various kinds of statistical activities.


In this layer a specific type of OLAP should be used: ROLAP (Relational Online Analytical Processing), which uses specific analytical tools on a relational dimensional data model that is easy to understand and does not require pre-computation and storage of the information.

In a relational database, the fact tables of the Interpretation layer should be organized in a dimensional structure to support data analysis in an intuitive and efficient way. Dimensional models are generally structured with fact tables and their associated dimensions. Facts are generally numeric, and dimensions are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts, such as the number of products moved and the price paid for the products, and into dimensions, such as order date, customer name and product number.

A fact table consists of measurements, metrics or facts of a statistical topic. Fact tables in the DWH are organized in a dimensional model, built on a star-like schema, with dimensions surrounding the fact table. In the S-DWH, fact tables are defined at a high level of granularity, with information organized in columns distinguished into dimensions, classifications and measures. Dimensions are the descriptions of the fact table. Typically dimensions are nouns like date, class of employment, territory, NACE, etc. and can have hierarchies; for example, the date dimension could contain data such as year, month and weekday.
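To make the fact and dimension vocabulary concrete, here is a small, purely illustrative Python sketch of a star-like structure, with a fact table surrounded by a date and a NACE dimension; all table contents are invented.

```python
# Illustrative star-schema sketch: a fact table holds measures plus keys into dimension tables.

date_dim = {1: {"year": 2012, "month": "Jan"}, 2: {"year": 2012, "month": "Feb"}}
nace_dim = {"C10": {"section": "C", "description": "Manufacture of food products"}}

# Fact table: measures (turnover, employment) plus keys into the dimensions
fact_table = [
    {"date_key": 1, "nace_key": "C10", "turnover": 120.0, "employment": 14},
    {"date_key": 2, "nace_key": "C10", "turnover": 135.0, "employment": 15},
]

# A typical analytical query: total turnover by year for NACE section "C"
totals = {}
for row in fact_table:
    if nace_dim[row["nace_key"]]["section"] == "C":
        year = date_dim[row["date_key"]]["year"]
        totals[year] = totals.get(year, 0.0) + row["turnover"]
print(totals)   # {2012: 255.0}
```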

The definition of a star schema would be realized by dynamic ad hoc queries from the integration layer, driven by the proper metadata, in order to realize, generally, a data transposition query. With a dynamic approach, any expert user should be able to define their own analysis context, starting from an already existing materialized data mart or from a virtual or temporary environment derived from the data structure of the integration layer. This method allows users to automatically build permanent or temporary data marts as a function of their needs, leaving them free to test any possible new strategy.

Figure 1 – Star-schema


A key advantage of a dimensional approach is that the data warehouse is easy to use and operations on data are very quick. In general, dimensional structures are easy to understand for business users, because the structures are divided into measurements/facts and context/dimensions related to the organization’s business processes.

A dimension is sometimes referred to as an axis for analysis. Time, Location and Product are the classic dimensions.

A dimension is a structural attribute of a cube that is a list of members, all of which are of a similar type in the user's perception of the data. For example, all months, quarters, years, etc., make up a time dimension; likewise all cities, regions, countries, etc., make up a geography dimension.

A dimension table is one of the set of companion tables to a fact table and normally contains attributes (or fields) used to constrain and group data when performing data warehousing queries.

Dimensions correspond to the "branches" of a star schema.

The positions of a dimension are organised according to a series of cascading one-to-many relationships. This way of organizing data is comparable to a logical tree, where each member has only one parent but a variable number of children.

For example the positions of the Time dimension might be months, but also days, periods or years.

Dimensions can have hierarchies, which are classified into levels. All the positions for a level correspond to a unique classification. For example, in a "Time" dimension, level one stands for days, level two for months and level three for years.

Dimension hierarchies can be balanced, unbalanced or ragged. In balanced hierarchies, the branches of the hierarchy all descend to the same level, with each member's parent being at the level immediately above the member. In unbalanced hierarchies, the branches of the hierarchy do not all reach the same level, but each member's parent does belong to the level immediately above it. In ragged hierarchies, the parent member of at least one member of a dimension is not in the level immediately above the member. Like unbalanced hierarchies, the branches of ragged hierarchies can descend to different levels.

Usually, unbalanced and ragged hierarchies must be transformed into balanced hierarchies.

Figure 2: Balanced Hierarchy


Figure 3: Unbalanced Hierarchy

Figure 4: Ragged Hierarchy
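The following Python sketch illustrates, on an invented Time hierarchy, how one might check whether a hierarchy is balanced, i.e. whether every branch descends to the same level; it is an illustration only, not a prescribed algorithm.

```python
# Illustrative check for a balanced hierarchy: all branches must reach the same depth.

time_hierarchy = {                       # parent -> children
    "2012": ["2012-Q1", "2012-Q2"],
    "2012-Q1": ["2012-01", "2012-02", "2012-03"],
    "2012-Q2": [],                       # this branch stops early: the hierarchy is unbalanced
}

def leaf_depths(node, depth=1):
    children = time_hierarchy.get(node, [])
    if not children:
        return [depth]
    return [d for child in children for d in leaf_depths(child, depth + 1)]

depths = leaf_depths("2012")
print("balanced" if len(set(depths)) == 1 else "unbalanced", depths)   # unbalanced [3, 3, 3, 2]
```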


4.6 Access layer

The Access layer is the layer for the final presentation, dissemination and delivery of information. This layer is used by a wide range of users and computer instruments. The data are optimized to be presented and compiled effectively. Data may be presented in data cubes and in different formats specialized to support different tools and software. Generally the data structures are optimized for MOLAP (Multidimensional Online Analytical Processing), which uses specific analytical tools on a multidimensional data model.

A multidimensional structure is defined as “a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data”. The structure is broken into cubes, and the cubes are able to store and access data within the confines of each cube. “Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions”. Even when data are manipulated they remain easy to access and continue to constitute a compact database format. The data still remain interrelated. The multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications.


Analytical databases use these structures because of their ability to deliver answers to complex business queries swiftly. Data can be viewed from different angles, which gives a broader perspective on a problem than other models.

Some Data Marts might need to be refreshed from the Data Warehouse daily, whereas some user groups might want refreshes only monthly.

Each Data Mart can contain different combinations of tables, columns and rows from the Statistical Data Warehouse.  For example, a statistician or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. The analysts might need to see all details about data, whereas data such as "salary" or "address" might not be appropriate for a Data Mart that focuses on Trade.
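The sketch below illustrates such a restricted Data Mart, derived from the warehouse by selecting rows (current calendar year only) and dropping columns (such as "salary") that are not appropriate for a Trade-focused mart; all field names and values are invented.

```python
# Illustrative sketch of a data mart derived from the warehouse by
# restricting rows (current year) and dropping columns (e.g. "salary").

warehouse = [
    {"unit": "A01", "year": 2012, "exports": 40.0, "salary": 18.0},
    {"unit": "A01", "year": 2013, "exports": 55.0, "salary": 19.0},
    {"unit": "B02", "year": 2013, "exports": 12.0, "salary": 11.0},
]

def build_trade_mart(rows, current_year=2013, drop=("salary",)):
    """Trade data mart: a restructured and restricted copy of warehouse data."""
    return [
        {k: v for k, v in row.items() if k not in drop}
        for row in rows
        if row["year"] == current_year
    ]

print(build_trade_mart(warehouse))
```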

Three basic types of data marts are dependent, independent, and hybrid. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data or both. Hybrid data marts can draw data from operational systems or data warehouses.

The Data Marts in the ideal information system architecture of a full active S-DWH are dependent data marts: data in the data warehouse are aggregated, restructured and summarized when they pass into the dependent data mart. The architecture of a dependent data mart is as follows:

Figure 5 – Dependent data mart
Figure 6 – Independent data mart


There are benefits of building a dependent data mart:

Performance: when the performance of the data warehouse becomes an issue, building one or two dependent data marts can solve the problem, because the data processing is performed outside the data warehouse.

Security: by putting data outside the data warehouse in dependent data marts, each department owns its data and has complete control over it.


5. Workflow scenarios

The metadata-driven system of an S-DWH is well-suited to supporting the management of modules in generic workflows. This modular approach can reduce the “time to market”, i.e. the length of time it takes from a product being conceived until its availability for use. In order to suggest a possible roadmap towards process optimization and cost reduction, in this section we introduce a simple description of a generic workflow, which links the business model with the information system.

A layered architecture, modular tools and a variable-based warehouse are a powerful combination that can be used for different scenarios. Here are some examples of workflows that the S-DWH supports.

5.1 Scenario 1: full linear end-to-end workflow

To publish data in the access layer, raw data need to be collected into the raw database in the source layer, then extracted into the integration layer for processing, then loaded into the warehouse in the interpretation layer; after that, someone can calculate statistics or carry out an analysis and publish it in the access layer.


5.2 Scenario 2: Monitoring collection

Sometimes it is necessary to monitor the collection process and analyse the raw data during collection. The raw data are then extracted from the collection raw database and processed in the integration layer, so that they can be easily analysed with the specific tools used for operational activities, or loaded into the interpretation layer, where they can be freely analysed. This process is repeated as often as needed, for example once a day, once a week or hourly.

5.3 Scenario 3: Evaluating new data source

When we receive a dataset from a new data source, it should be evaluated by statisticians. The dataset is loaded by the integration layer from the source to the interpretation layer, where statisticians can carry out their source evaluation or, following changes in administrative regulations, define new variables or a process update for an existing production process. From a technical perspective, this workflow is the same as described in scenario 2. It is interesting to note that such an update must be included in the coherent S-DWH through proper metadata.


5.4 Scenario 4: Re-using data for new standard output

Statisticians can analyse data already prepared in the integration layer, compile new products and load them to the access layer. If the S-DWH is built correctly and correct metadata are provided, then compiling new products using already collected and prepared data should be easy and the preferred way of working.


5.5 Scenario 5: re-using data for complex custom query

This is a variation of scenario 4: instead of generating new standard output from the data warehouse, a statistician can carry out an ad-hoc analysis using data already collected and prepared in the warehouse and prepare a custom query for a customer.

5.6 Generic workflow suitable for reuse of components

A workflow identifies a collection of actions, operations and procedures with a predetermined order. Each activity starts, or may start, only if all the activities that precede it in the order are accomplished. Typically, workflows model or represent a process; this involves the way in which activities have to be completed in order to carry out the process.

There are many ways to describe a workflow. In this document the Directed Acyclic Graph (DAG) is used to facilitate immediate interpretation. According to the DAG definition, an activity is represented by a node and a dependence by an arrow (figure below).

A node represents a well-defined activity (input, processing mechanism and output), while an arrow represents a dependence relationship. The figure above depicts a simple workflow with two explicit dependences (node B depends on node A and node C depends on node B) and the transitive dependence (node C depends on node A). The semantics of the term “dependence” is: if activity B depends on A, then B cannot start until A is complete.
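The DAG semantics described above can be illustrated with a few lines of Python; the sketch below uses the standard-library graphlib module (Python 3.9+) to run the activities of the A-B-C example in an order that respects every dependence.

```python
# Minimal sketch of DAG-based workflow execution: an activity may start only
# when all activities it depends on are complete.

from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Arrows of the example workflow: B depends on A, C depends on B
dependencies = {"B": {"A"}, "C": {"B"}}

def run(node):
    print(f"running activity {node}")

for activity in TopologicalSorter(dependencies).static_order():
    run(activity)        # prints A, B, C: an order that respects all dependences
```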

5.7 A simple statistical process

This paragraph gives some examples of the concepts introduced above. The first one represents a simple statistical process from a high-level perspective. A generic statistical process, in accordance with the Generic Statistical Business Process Model, can be subdivided into nine phases: specify need, design, build, collect, process, analyse, disseminate, archive and evaluate. Each of them can be broken down into sub-processes. For instance, the Collect phase is divided into: select sample, set up collection, run collection and finalize collection.

Therefore, a generic workflow is:

where every phase has to end before the next one can start.

Clearly not all phases and processes in the GSBPM have to be used: it depends on the purpose and the characteristics of the survey.

This is an example of a high-level point of view and therefore does not show the intrinsic complexity of a statistical survey, because it hides single processes and because every phase is sequential.

Sometimes a process in a subsequent phase could start even though all the previous phases have not completely ended. This leads to a more complex web of relationships between single processes.

In more depth, the next example focuses on the Process phase of statistical production. The Process step comprises several activities. Adapting the same modelling approach used before, this section shows a few examples.

Looking at the Process phase in more detail, there are sub-processes. These elementary tasks are the finest-grained elements of the GSBPM.


We will try to sub-divide the sub-processes into elementary tasks in order to create a conceptual layer closer to the IT infrastructure. With this aim we will focus on “Review, validate, edit” and we will describe a possible generic sub-task implementation in what follows.

Let's take a sample of five statistical units (represented in the following diagram by three triangles and two circles) each containing the values from three variables (V1, V2 and V3) which have to be edited (checked and corrected). Every elementary task has to edit a sub-group of variables. Therefore a unit entering a task is processed and leaves the task with all that task's variables edited.

We will consider a workflow composed of six activities (tasks): S, a starting activity, F, a finishing activity, and S1, S2, S3, S4, data-editing activities. Suppose also that each type of unit needs a different activity path, where triangle-shaped units need a more articulated treatment of variables V1 and V2. For this purpose a “filter” element F is introduced (the diamond in the diagram), which diverts each unit to the correct part of the workflow. It is important to note that only V1 and V2 are processed differently, because in task S4 the two branches re-join.

During the workflow, all the variables are inspected task by task and, when necessary, transformed into a coherent state. Therefore each task contributes to the set of coherent variables. Note that every path in the workflow meets the same set of variables. This incremental approach ensures that at the end of the workflow every unit has its variables edited. The table below shows some interesting attributes of the tasks.


Task: S. Input: all units. Output: all units. Purpose: dummy task. Module: -. Data source: TAB_L_I_START. Data target: TAB_L_II_TARGET.

Task: S1. Input: circle units. Output: circle units (V1, V2 corrected). Purpose: edit and correct V1 and V2. Modules: EC_V1(CU, P1), EC_V2(CU, P2). Data source/target: TAB_L_II_TARGET.

Task: S2. Input: triangle units. Output: triangle units (V1 corrected). Purpose: edit and correct V1. Module: EC_V1(TU, P11). Data source/target: TAB_L_II_TARGET.

Task: S3. Input: triangle units (V1 corrected). Output: triangle units (V1, V2 corrected). Purpose: edit and correct V2. Module: EC_V2(TU, P22). Data source/target: TAB_L_II_TARGET.

Task: S4. Input: all units (V1, V2 corrected). Output: all units (all variables corrected). Purpose: edit and correct V3. Module: EC_V3(U, P3). Data source/target: TAB_L_II_TARGET.

Task: F. Input: all units. Output: all units. Purpose: dummy task. Module: -. Data source: TAB_L_II_TARGET. Data target: TAB_L_III_FINAL.

The columns in the table above provide useful elements for the building and definition of modular objects. These objects could be employed in an applicative framework where data structures and interfaces are shared in a common infrastructure.

The task column identifies the sub-activities in the workflow: the subscript, when present, corresponds to different sub-activities.

Input and output columns identify the statistical information units that must be processed and produced respectively by each sub-activity. A simple textual description of the responsibility of each sub-activity or task is given in the purpose column.

The module column shows the function needed to fulfil the purpose. As in the table above, we could label each module with a prefix representing a specific sub-process EC function (Edit and Correct) and a suffix indicating the variable to work with. The first parameter in the function indicates the unit to treat (CU stands for circle unit, TU for triangle unit); the second parameter indicates the procedure, e.g. a threshold, a constant or a software component.

Structuring modules in such a way could enable the reuse of components. The example in the table above shows the activity S1 as a combination of EC_V1 and EC_V2, where EC_V1 is used by S1 and also by S2, and EC_V2 is used by S1 and also by S3. Moreover, because the work on each variable is similar, a single function could be considered as a skeleton containing a modular system, in order to reduce building time and maximize re-usability.
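The hedged Python sketch below illustrates this reuse: the same EC_V1 and EC_V2 functions are shared by several tasks, with the unit type and a parameter passed in, and a filter routes circle and triangle units along different branches that re-join in S4. The thresholds, defaults and unit values are invented placeholders for the P1, P2, P11, P22 and P3 parameters.

```python
# Illustrative sketch of reusable edit-and-correct (EC) modules shared by workflow tasks.

def EC_V1(unit, threshold):
    """Edit and correct variable V1 (here: cap it at a hypothetical threshold)."""
    unit["V1"] = min(unit["V1"], threshold)
    return unit

def EC_V2(unit, threshold):
    """Edit and correct variable V2."""
    unit["V2"] = min(unit["V2"], threshold)
    return unit

def EC_V3(unit, default):
    """Edit and correct variable V3 (impute a default when missing)."""
    unit["V3"] = unit.get("V3", default)
    return unit

def S1(unit):                     # circle units: EC_V1 and EC_V2 with parameters P1, P2
    return EC_V2(EC_V1(unit, threshold=100), threshold=50)

def S2(unit):                     # triangle units: EC_V1 with parameter P11
    return EC_V1(unit, threshold=80)

def S3(unit):                     # triangle units: EC_V2 with parameter P22
    return EC_V2(unit, threshold=40)

def S4(unit):                     # all units: EC_V3 with parameter P3
    return EC_V3(unit, default=0)

def workflow(unit):
    """The filter routes each unit to the correct branch; the branches re-join in S4."""
    if unit["shape"] == "circle":
        unit = S1(unit)
    else:                         # triangle units get the more articulated treatment
        unit = S3(S2(unit))
    return S4(unit)

sample = [
    {"shape": "triangle", "V1": 120, "V2": 60},
    {"shape": "circle", "V1": 90, "V2": 70, "V3": 5},
]
print([workflow(u) for u in sample])
```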

Lastly, the data source and target columns indicate references to data structures necessary to manage each step of the activity in the workflow.


5.8 CORE services and reuse of components

There are three main groups of workflows in the S-DWH. One workflow updates data in the warehouse, the second updates the in-house dissemination database and the third updates the public dissemination database.

These three automated data flows are quite independent of each other. Flow 1 is the biggest and most complex. It extracts raw data from the source layer, processes them in the integration layer and loads them into the interpretation layer. In the other direction, it brings cleansed data back to the source layer for pre-filling questionnaires, prepares sample data for collection systems, etc. Let's name this flow the processing flow.

Flows 2 and 3 are very similar: both generate standard output to a dissemination database. One updates data in the in-house dissemination database and the other in the public database. Both are unidirectional flows. Let's call Flow 2 generate cube and Flow 3 publish cube. In this context a cube is a multidimensional table, for example a .Stat or PC-Axis table.

Processing flows should be built around input variables, or groups of input variables, to feed the variable-based warehouse. Generate-cube and publish-cube flows are built around cubes, i.e. each flow generates or publishes one cube.
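
As an illustration of this organisation only (a sketch with hypothetical registry and function names, not part of the S-DWH specification), the flows could be registered per variable or per cube and executed as chains of modular steps:

```python
# Hypothetical registries: processing flows are keyed by input variable (or variable group),
# cube flows by cube name plus destination database (in-house or public).
processing_flows = {}
cube_flows = {}

def register_processing_flow(variable, steps):
    """A processing flow feeds the variable-based warehouse: extract raw data from the
    source layer, process it in the integration layer, load it into the interpretation layer."""
    processing_flows[variable] = steps

def register_cube_flow(cube, steps, public=False):
    """Generate-cube and publish-cube flows differ only in the target dissemination database."""
    cube_flows[(cube, "public" if public else "in-house")] = steps

def run_flow(steps, data):
    """Execute a flow as a simple chain of modular steps."""
    for step in steps:
        data = step(data)
    return data
```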


There are many software tools available to build these modular flows. The S-DWH's layered architecture itself makes it possible to use different platforms and software in separate layers, i.e. to re-use components already available in-house or internationally. In addition, different software can be used inside the same layer to build up one particular flow. Problems arise when we try to use these different modules and data formats together.

This is where CORE services come in. If they are used to move data between S-DWH layers, and also within a layer between different sub-tasks (e.g. edit, impute), it becomes easier to use software provided by the statistical community, or to re-use self-developed components, to build flows for different purposes.

In general, CORE (COmmon Reference Environment) is an environment supporting the definition of statistical processes and their automated execution. CORE processes are designed in a standard way, starting from available services; specifically, a process definition is provided in terms of abstract statistical services that can be mapped to specific IT tools. CORE goes in the direction of fostering the sharing of tools among NSIs: a tool developed by a specific NSI can be wrapped according to CORE principles and thus easily integrated within a statistical process of another NSI. Moreover, having a single environment for the execution of all statistical processes provides a high level of automation and complete reproducibility of process execution.

NSIs produce official statistics with very similar goals, hence several activities related to the production of statistics are common. Nevertheless, such activities are currently carried out independently, without relying on shared solutions. Sharing a common architecture would reduce the costs caused by duplicated activities and, at the same time, improve the quality of the produced statistics through the adoption of standardized solutions.

The main principles underlying the CORA design are:

Platform Independence. NSIs use various platforms (hardware, operating systems, database management systems, statistical software, etc.), hence an architecture is bound to fail if it endeavours to impose standards at a technical level. Moreover, platform independence allows statistical processes to be modelled at a "conceptual level", so that they do not need to be modified when the implementation of a service changes.

Service Orientation. The vision is that the production of statistics takes place through services calling other services. Hence services are the modular building blocks of the architecture. By having clear communication interfaces, services implement principles of modern software engineering like encapsulation and modularity.

Layered Approach. According to this principle, some services are rich and are positioned at the top of the statistical process; for instance, a publishing service requires the output of all sorts of services positioned earlier in the process, such as data collection and information storage. The ambition of this model is to bridge the whole range of layers, from collection to publication, by describing all layers in terms of services delivered to the layer above, in such a way that each layer depends only on the layer immediately below it.
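
To illustrate the last two principles, here is a minimal Python sketch (not CORA's actual interfaces): each service exposes a single, clear method, and each layer consumes only the service of the layer immediately below it.

```python
from abc import ABC, abstractmethod

class Service(ABC):
    """A service is a modular building block with one clear communication interface."""
    @abstractmethod
    def run(self, payload):
        ...

class CollectService(Service):
    def run(self, payload):
        return {"raw": payload}                      # lowest layer: data collection

class StoreService(Service):
    def __init__(self, collect: Service):
        self.collect = collect                       # depends only on the layer below
    def run(self, payload):
        return {"stored": self.collect.run(payload)}

class PublishService(Service):
    """A 'rich' service at the top of the process, consuming the output of the lower layers."""
    def __init__(self, store: Service):
        self.store = store
    def run(self, payload):
        return {"published": self.store.run(payload)}

# Usage: the publishing service only needs to know about the service directly below it.
result = PublishService(StoreService(CollectService())).run("questionnaire data")
```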



For us it is very important to be able to make transitions and mappings between different models and approaches. Unfortunately, mapping a CORE process to a business model is not directly possible: the CORE model is an information model, and there is no way to map a business model onto an information model directly. The two models are about different things; they can only be connected if this connection is itself a part of the models.

The CORE information model was designed with such a mapping in mind. Within this model, a statistical service is an object, and one of its attributes is a reference to its GSBPM process. Considering the GSBPM as a business model, any mapping of the CORE model onto a business model has to go through this reference to the GSBPM.
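
As an illustration only, and not the normative CORE schema, such a service can be pictured as an object whose attributes include the GSBPM reference through which any mapping to a business model must pass; all names and the identifier format below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CoreStatisticalService:
    """Illustrative service object in the spirit of the CORE information model."""
    name: str            # e.g. "EditAndCorrectV1" (hypothetical)
    gsbpm_process: str   # reference to the corresponding GSBPM (sub-)process
    input_format: str    # data format expected by the wrapped tool
    output_format: str   # data format produced by the wrapped tool

# Any mapping of the CORE model onto a business model goes through `gsbpm_process`.
service = CoreStatisticalService("EditAndCorrectV1", "GSBPM 5.3", "CSV", "CSV")
```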

Different services are usually implemented with different tools, which expect different data formats, so service interactions require conversions, and conversions are expensive. By using CORE services for these interactions, the number of conversions can be reduced noticeably.

In a general sense, an integration API makes it possible to wrap a tool in order to make it CORE-compliant, i.e. a CORE executable service. A CORE service is composed of an inner part, which is the tool to be wrapped, and of input and output integration APIs. Such APIs transform data between the CORE model and the tool-specific format.

As anticipated, CORE mappings are designed for classes of tools and hence integration APIs should support the admitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE and CORE-to-Relational, etc.

Basically, the integration API consists of a set of transformation components. Each transformation component corresponds to a specific data format, and the principal elements of its design are specific mapping files, description files and transform operations.

In order to provide an input to a tool (the inner part of a CORE service), the Transform-from-CORE operation is invoked. Conversely, the output of the tool is converted by the Transform-to-CORE operation. A transformation must be launched for each single input or output file.

In this way, components can be reused easily and efficiently.
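
As a minimal sketch, assuming a wrapped tool that reads and writes CSV text and representing CORE-model data simply as a list of dictionaries, the two operations and the wrapping could look as follows; the function names echo Transform-from-CORE and Transform-to-CORE but are otherwise illustrative.

```python
import csv
import io

def transform_from_core(core_records, fieldnames):
    """Convert CORE-model records (here: a list of dicts) into the CSV text the wrapped tool expects."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(core_records)
    return buffer.getvalue()

def transform_to_core(csv_text):
    """Convert the wrapped tool's CSV output back into CORE-model records."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def run_core_service(tool, core_input, fieldnames):
    """Execute a CORE service: input integration API -> inner tool -> output integration API."""
    tool_input = transform_from_core(core_input, fieldnames)
    tool_output = tool(tool_input)   # the inner part: any tool that consumes and produces CSV text
    return transform_to_core(tool_output)
```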


6 Conclusion

Today, the prevalent model for producing statistics is the stovepipe model, but there are also the integrated model and the warehouse approach. In this paper the integrated model and the warehouse approach were brought together. Integration can be looked at from three main viewpoints:

1. Technical integration – integrating IT platforms and software tools.

2. Process integration – integrating statistical processes like survey design, sample selection, data processing and so on.

3. Data integration – data is stored once, but used for multiple purposes.

When we put these three integration aspects together, we get the S-DWH, which is built on integrated technology, uses integrated processes to produce statistics, and reuses data efficiently.

We also made recommendations about the data model of each layer:

The source layer does not have a specific data model, but mapping assistance is needed when external data are used.

In the integration layer, for ETL functionality and processing, data should be stored in a generalized, normalized structure optimized for OLTP, where all data are stored in a similar structure independently of the domain or topic and each fact is stored in only one place, which makes it easier to keep the data consistent.

The interpretation layer contains all collected data, processed and structured to be optimized for analysis and to serve as the basis for the outputs planned by the NSI. Since it is specially designed for statistical experts and built to support the data manipulation of large, complex search operations, OLAP (OnLine Analytical Processing) with a star-schema design is recommended.

In the access layer, data marts can be built in which data may be presented as data cubes in different formats, specialized to support different tools and software.

In an S-DWH, the information is organized using a defined data model, which enables a structured modular approach. This is because an S-DWH is a metadata-driven system, which can also be easily extended to manage operational tasks.

The main advantage of the workflow approach given here resides in the decomposition and articulation of complex activities into elementary modules. These modules can be reused, reducing the effort and cost of implementing statistical processes.


References

B. Sundgren (2010a) “The Systems Approach to Official Statistics”, Official Statistics in Honour of Daniel Thorburn, pp. 225–260. Available at: https://sites.google.com/site/bosundgren/my-life/Thorburnbokkap18Sundgren.pdf?attredirects=0

W. Radermacher (2011) “Global consultation on the draft Guidelines on Integrated Economic Statistics”.

UNSC (2012) “Guidelines on Integrated Economic Statistics”. Available at: http://unstats.un.org/unsd/statcom/doc12/RD-IntegratedEcoStats.pdf

W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009) “Terminology Relating To The Implementation Of The Vision On The Production Method Of EU Statistics”. Available at: http://ec.europa.eu/eurostat/ramon/coded_files/TERMS-IN-STATISTICS_version_4-0.pdf

European Union, Communication from the Commission to the European Parliament and the Council on the production method of EU statistics: a vision for the next decade, COM(2009) 404 final. Available at: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=COM:2009:0404:FIN:EN:PDF

ESSnet CORE (COmmon Reference Environment). Available at: http://www.cros-portal.eu/content/core-0