Recipes 12 of Data Warehouse and Business Intelligence - How to think agile


A Data Warehouse and Business Intelligence project is long and complex work that takes many months, often years, to see the light, especially when we are talking about an Enterprise Data Warehouse. Indeed, I think we should stop calling it a project and call it a process. But it is not just any process: it is the process that transforms data into knowledge, knowledge into prediction, and prediction into action.

Take, for example, the world of CRM (Customer Relationship Management). The raw customer data, which come from different systems, are transformed into greater knowledge of the customers and their preferences. From this knowledge we can predict their future behaviour. Knowing the future allows us to act, changing or adapting the business strategy. This is what the Data Warehouse and Business Intelligence process makes possible.

You can well imagine how essential this process is for any company that wants to compete in the global market. Unfortunately, what scares investors most is time. The time factor is crucial: in today's busy life nobody wants to wait, and everybody looks for shortcuts to get the desired results in the shortest possible time. That is why we talk about the Agile Data Warehouse.

Introduction

Agile Data Warehouse and Business Intelligence

What the Agile Data Warehouse is and what it is not

Let us say first what it should not be. It should not be a commercial product or a solution sold by some company. It should not be a database, or a different design of the logical and physical structure. It must be a methodology, conceived as a design philosophy, to be applied to the entire life cycle of the process.


Build

Ideally, and quite simply, we can divide the Data Warehouse process into three main phases. I stress simply, because behind these phases there are several design steps that we know well (requirements gathering, analysis, programming, etc.).

[Figure: the three phases of the process: Build, Test, Maintenance and iterative evolution]

• Build: all the activity that leads up to the test phase.
• Test: all the verification and control activity, before and after the deployment to production, ending with the acceptance of the system by the end users.
• Maintenance and iterative evolution: all the activity relating to the management and growth of the Data Warehouse.

To implement an Agile Data Warehouse process successfully, we have to be "agile" in each of these components.

We need to be agile in the Build phase. There is little to explain here; the need is easily understood. We must try to minimize the time taken by the ETL process, which historically is the most time-consuming phase.


Test

We need to be agile in the Test phase. This step is critical because it is the phase in which end users start to see the data and begin to evaluate the result. Being agile here means giving fast response times to end users. Look out: I am not talking about the response time of a report or of a query on the Data Warehouse (I take that for granted), but about the time we need to find the causes of faults and problems. Let me explain.

As stated at the beginning, we have to be agile across the whole life cycle of the Data Warehouse. Many of you will think that "agile" only means reaching the deployment into production quickly; in practice, accelerating the ETL process as much as possible in order to give end users the Data Marts for their analysis. But this is only part of the story.

In my opinion, the most important moment in which we must be "agile" is AFTER the build phase has been completed. The real success of the Data Warehouse will depend on how quickly we can answer the questions of the end users and their objections to the displayed data; on how quickly we can identify problems in the loading process, knowing where they occur and why; and on how fast we are in solving them.

Maintenance and iterative evolution

Finally, we need to be agile in the maintenance and iterative evolution. This means that we have to respond quickly to requests to modify the system and, above all, to evolve it. Do not forget that it is a process. Do not forget that, on top of an initial Data Warehouse, new dimensions of analysis and new Data Marts will be added little by little over time. Most likely we will also need to add new information to the dimensions and facts already built.


I hope it is now clear what we want to achieve when we speak of an Agile Data Warehouse. But the essential point is how to reach these objectives. As mentioned above, you do not need a product, only a good methodology. Here is some personal advice based on my experience. We can act on various aspects, many of which I have already discussed on my blog or on my Slideshare.

Agile in the Build - Naming Convention

I never tire of emphasizing the importance of setting a precise naming convention for all the objects of the project. We must do this at the start, before creating any kind of data structure. It gives us clear and simplified management of all the logical and physical components (tables, sequences, views, files, documents, etc.) that make up the Data Warehouse. And not only that: following a specific naming convention allows us to build configuration, creation and control mechanisms very quickly.
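As an illustration of how a convention makes control mechanisms quick to build, here is a minimal Python sketch. The convention itself (`<area>_<subject>_<type>`, with the area codes STA, DIM, FAC and the type codes TAB, SEQ, VIW) is a hypothetical example invented for this sketch, not the one proposed in the author's naming convention recipes.

```python
import re

# Hypothetical convention (an assumption for this sketch): <area>_<subject>_<type>
#   area: STA (staging), DIM (dimension), FAC (fact)
#   type: TAB (table), SEQ (sequence), VIW (view)
NAME_PATTERN = re.compile(
    r"^(STA|DIM|FAC)_([A-Z0-9]+(?:_[A-Z0-9]+)*)_(TAB|SEQ|VIW)$"
)

def parse_name(name):
    """Split an object name into its parts, or return None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    area, subject, kind = match.groups()
    return {"area": area, "subject": subject, "type": kind}

def non_compliant(names):
    """Return the object names that do not follow the convention."""
    return [name for name in names if parse_name(name) is None]
```

Because every name can be parsed, a control script can find all staging tables or all sequences without a hand-maintained list, and `non_compliant` can be run against the database catalogue to spot objects created outside the rules.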

Agile in the Build - Reduction of the computing chain

Another point to consider is the modelling philosophy of the Data Warehouse. Indeed, it is probably the first thing to consider. I will not enter the historical debate between the two approaches, Inmon versus Kimball. Both are valid, with their strengths and weaknesses. But if we speak of "agile", for me the choice of the Kimball philosophy is crucial. Anything that can shorten the computational and structural chain of the ETL process is undoubtedly an important factor. I think that having an ODS (Operational Data Store), which is basically a historicised duplicate of almost all the structures already present in the Staging Area, placed before the structures dedicated to analysis, is an activity that costs time and money.


Agile in the Build - Simplification of data types

Another way to be "agile" follows from the general rule of always thinking in a simplified way. We need to reduce the data types (in the database sense) used in the Data Warehouse to a minimum. An RDBMS such as Oracle, and the same goes for the other vendors, has more than 30 different data types (NUMBER of various kinds, CHAR, VARCHAR, DATE, etc.): we cannot think of having this variety of types inside the Data Warehouse. It brings too many complications in their handling and conversion.

Think of the simplicity of the source files: except for some special cases, they are all text files. Whether they have fixed-length records or field terminators, they are always streams of data that you can easily open with any text editor. The ultimate in simplicity. My advice is to keep this simplicity almost intact inside the Data Warehouse by using only two data types:

• Numeric - only to represent amounts, quantities, percentages, etc.
• Alphanumeric - for all other data.

We can use the DATE data type only for technical fields, such as insertion date, last update date, etc. Even if the data representing codes, indicators, flags, etc. are numeric in the source systems, we must treat them as alphanumeric inside the Data Warehouse. Transform all the data that represent dates into alphanumeric values in the standard format YYYYMMDD.
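The two-type rule can be sketched as three small conversion functions in Python. The function names and the default source date format (`%d/%m/%Y`) are assumptions made for this example; a real feed would declare its own format.

```python
from datetime import date, datetime
from decimal import Decimal

def to_dwh_date(value, source_format="%d/%m/%Y"):
    """Return a business date as an alphanumeric YYYYMMDD string.

    source_format is an assumed default for text input.
    """
    if isinstance(value, (date, datetime)):
        return value.strftime("%Y%m%d")
    return datetime.strptime(value, source_format).strftime("%Y%m%d")

def to_dwh_code(value):
    """Codes, flags and indicators become alphanumeric, even if numeric at the source."""
    return str(value).strip()

def to_dwh_amount(value):
    """Amounts, quantities and percentages stay numeric (exact decimal)."""
    return Decimal(str(value))
```

For example, `to_dwh_date("31/12/2015")` gives `"20151231"` and `to_dwh_code(7)` gives `"7"`. A side benefit of YYYYMMDD is that the strings sort in chronological order.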

Agile in the Build - Sequentiality

We must try to think, and in 90% of cases it is possible, of every component of the process as connected to the next, so that their sequential execution leads to the final loading of the Data Warehouse.


Mind you, I am not saying that you cannot work in parallel. But identifying which components are so completely independent of each other that they can run in parallel is not an easy task, not to mention everything needed to synchronize them. Parallelism also requires specific hardware configurations and specific database settings to actually obtain a performance boost which, I speak from experience, is not guaranteed.

Certainly, the dimension tables may be loaded in parallel (if there are no logical connections between them), but in an "agile" world we must try to think in a simple and sequential way. Do not forget that the ETL process is inherently sequential. You cannot load a level 1 Data Mart before you have loaded those of level 0. You cannot load a level 0 Data Mart if you have not loaded the dimensions, which in turn cannot be loaded unless you have loaded the Staging Area tables, and so on.
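The sequential chain can be sketched in a few lines of Python. The step functions are empty placeholders and all names are hypothetical; the point is only that each step runs after the previous one has finished, and that the chain stops at the first failure.

```python
def run_sequentially(steps):
    """Run (name, function) steps in order; stop at the first failure.

    Returns the names of the completed steps and an error message (or None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            return completed, f"{name}: {error}"
        completed.append(name)
    return completed, None

# Placeholder steps (assumptions): real ones would call the loading procedures.
def load_staging_area(): pass
def load_dimensions(): pass
def load_data_marts_level_0(): pass
def load_data_marts_level_1(): pass

ETL_CHAIN = [
    ("staging area", load_staging_area),
    ("dimensions", load_dimensions),
    ("data marts level 0", load_data_marts_level_0),
    ("data marts level 1", load_data_marts_level_1),
]
```

Calling `run_sequentially(ETL_CHAIN)` returns the four step names and `None`. If the dimension load raised an error, the data marts would never be attempted, which is exactly the dependency described in the text.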

Agile in the Build - Reduction of the external tools

Whether to use a tool for the implementation of the Data Warehouse, and which one, is a design choice that depends on many factors. Each company has its own rules and, above all, its own budget. If you have plenty of money available to buy the tools (and, especially, plenty of time to learn how to use them), there is no problem. If your budget is low, my advice is to use the smallest possible number of tools. We often tend to look for specific tools to do specific jobs such as reconciliation, process control, quality control, job scheduling, etc. Do not forget that each of them has its own structures, which then need to communicate with all the other structures, increasing the complexity of the entire system. In my opinion it is better to invest much more in a very good knowledge of the programming language of the database, a good editor and a good interface to access the database. These three elements will save us a lot of time.


Agile in the Test – Configuration and log

To be agile at this stage, we have to build a very accurate control architecture. I have already written a lot about how to report system faults automatically and how to keep control of the modules of an ETL process. My advice is to always have this magical pair of structures (tables): configuration and log. At minimum:

• Configuration tables of the Staging Area - Logging tables of the Staging Area loading.
• Configuration tables of the dimensions - Logging tables of the dimensions loading.
• Configuration tables of the facts - Logging tables of the facts loading.
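A minimal sketch of the configuration/log pair for the Staging Area, using Python's built-in `sqlite3` module so that it runs anywhere. The table names, column names and status values are assumptions made for this example, not the structures used in the author's Oracle recipes.

```python
import sqlite3

def create_control_tables(conn):
    """Create one configuration table and one log table (hypothetical layout)."""
    conn.executescript("""
        CREATE TABLE sta_config (
            table_name  TEXT PRIMARY KEY,
            source_file TEXT NOT NULL,
            enabled     TEXT NOT NULL DEFAULT 'Y'
        );
        CREATE TABLE sta_log (
            table_name  TEXT NOT NULL,
            logged_at   TEXT NOT NULL,
            status      TEXT NOT NULL,
            rows_loaded INTEGER,
            message     TEXT
        );
    """)

def log_load(conn, table_name, status, rows_loaded=None, message=None):
    """Record the outcome of one loading run."""
    conn.execute(
        "INSERT INTO sta_log VALUES (?, datetime('now'), ?, ?, ?)",
        (table_name, status, rows_loaded, message),
    )

def failed_loads(conn):
    """Answer the first question asked when something goes wrong: what failed, and why."""
    cursor = conn.execute(
        "SELECT table_name, message FROM sta_log "
        "WHERE status = 'ERROR' ORDER BY rowid"
    )
    return cursor.fetchall()
```

The same pair would be repeated for dimensions and facts. With the log in a table, "where did it fail and why" becomes a single query instead of a search through program output.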

Agile in the Test – Data Lineage

Having a Data Lineage structure means being able to follow the whole path of a piece of information, as seen by the end user, back to the origin of the data. It is complicated and not always possible (think of calculated data), but it is essential to prove the correctness of the loading process. To put it simply, we must be able to prove that a problem was already present in the feeding source. So you need some metadata tables to manage the data lineage.
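In its simplest form, lineage metadata is a mapping from each target column to the column it is loaded from. The Python sketch below walks that mapping back to the source; all object names are invented for the example, and a real implementation would keep the mapping in metadata tables rather than in a dictionary.

```python
# Hypothetical lineage metadata: target column -> column it is loaded from.
LINEAGE = {
    "DM_SALES.AMOUNT": "FAC_SALES.AMOUNT",
    "FAC_SALES.AMOUNT": "STA_ORDERS.AMOUNT",
    "STA_ORDERS.AMOUNT": "orders.txt:field 7",
}

def trace_to_source(column, lineage):
    """Follow the lineage from a reported column back to its origin."""
    path = [column]
    while path[-1] in lineage:
        parent = lineage[path[-1]]
        if parent in path:
            raise ValueError(f"cycle in lineage metadata at {parent}")
        path.append(parent)
    return path
```

`trace_to_source("DM_SALES.AMOUNT", LINEAGE)` returns the four hops from the Data Mart column down to the field in the source file, which is the path to check when a user disputes a figure.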

Agile in the Maintenance and Iterative evolution – Modularity (and uncertainty)

To be agile at this stage we have to be modular. It is uncertainty that forces us to be modular. Uncertainty not in the sense that we are allowed to be unsure about how to proceed, but in the sense of being aware that anything can change. Let me explain. In a Data Warehouse process, it is rare for all the logic to be well defined from the start.


We should not necessarily think of shortcomings in the analysis (which sometimes do exist) or of errors in the requirements gathering. The problem is that the logic evolves as the work progresses. I think it is a natural process, linked to the complexity of the system, which we have to live with, without drama. The source systems provide data that may not be exactly what the analysis expected, in both size and content. This is often discovered later, when the data begin to be analyzed (and therefore after they have been loaded). Business users change their minds, and sometimes the business strategy changes. It turns out, later, that some other data not foreseen by the analysis were also needed. Users want to make comparisons with other data that were not foreseen, and so on.

There is a very eloquent saying about the needs of end users: "I will know it when I see it", that is, I'll know what I want when I see it. Absolutely true. This requires us to modify the programs continuously to meet the new design requirements. Logic (and programs) to add, to change, to remove; logic to be added now that will be removed in two months. In short, anyone with a little experience will certainly have faced these situations.

To limit the consequences of uncertainty, the principle of modularity is essential. That is why every business need must correspond to a single processing unit, however simple or complex it is. If I load a Staging Area table, there must be a module that does it, and does only that. If I have to run a reconciliation check between three key performance indicators, there must be a module that does it, and does just that. When it turns out that there are four KPIs to check, we will add new modules. If I have to add the calculation of the price of a derivative financial product, there must be a module that does it; it does not matter if I have that module developed by a programmer who lives in another part of the world. The important thing is how immediately I can plug it into the system. In this way I do not pretend to eliminate uncertainty, but with modularity I manage it better.
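One possible way to get "one business need, one processing unit" is a small module registry, sketched here in Python. The registry, the decorator and the two example modules are all hypothetical names for this illustration.

```python
MODULES = {}

def module(name):
    """Register a processing unit under a unique name."""
    def register(func):
        if name in MODULES:
            raise ValueError(f"module {name!r} already registered")
        MODULES[name] = func
        return func
    return register

@module("load_sta_customer")
def load_sta_customer(context):
    # Does one thing only: "loads" the staging table (here, counts the rows).
    context["sta_customer_rows"] = len(context["customer_file"])

@module("check_customer_count")
def check_customer_count(context):
    # Does one thing only: reconciles the loaded rows with the expected count.
    context["customer_count_ok"] = (
        context["sta_customer_rows"] == context["expected_rows"]
    )

def run(names, context):
    """Execute the named modules in the given order."""
    for name in names:
        MODULES[name](context)
    return context
```

Adding a new check means writing one more decorated function and adding its name to the run list; removing a requirement in two months means deleting one name. No existing module has to be edited.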


Agile in the Maintenance and Iterative evolution – Separation between business and infrastructure

The last tip is a clear separation between the business and the infrastructure. You have already seen it in action in some of my previous articles. The simple messaging and control techniques shown there are independent of the context. They are infrastructure, not business. Whether the business behind the Data Warehouse is finance, automotive or large retail chains does not affect the use of those techniques in any way. We must use configuration and log tables that are completely independent of the context in which they work. This allows us, for example, to add a new Data Mart while focusing exclusively on the business related to that Data Mart, reusing the infrastructure software for process monitoring.
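A minimal Python sketch of this separation: the monitoring code is written once, as a decorator, and knows nothing about the business functions it wraps. The in-memory `RUN_LOG` list stands in for a real log table, and both business functions are invented for the example.

```python
import functools

RUN_LOG = []  # stand-in for a log table (an assumption of this sketch)

def monitored(func):
    """Infrastructure: record the outcome of any processing unit."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as error:
            RUN_LOG.append((func.__name__, "ERROR", str(error)))
            raise
        RUN_LOG.append((func.__name__, "OK", ""))
        return result
    return wrapper

# Business: two unrelated domains reuse the same monitoring unchanged.
@monitored
def load_retail_sales(amounts):
    return sum(amounts)

@monitored
def load_derivative_prices(prices):
    if not prices:
        raise ValueError("no prices received")
    return max(prices)
```

A new Data Mart only needs its own business function; decorating it with `monitored` is enough to bring it under the existing process control.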


[Figure: summary of the agile techniques by phase.
Build: Naming Convention, Reduction of the computing chain, Simplification of data types, Sequentiality, Reduction of the external tools.
Test: Configuration and log, Data Lineage.
Maintenance and Iterative evolution: Modularity (and uncertainty), Separation between business and infrastructure.]

Conclusion

Being agile in a Data Warehouse and Business Intelligence process (or project) is possible. You just have to be guided by a sound methodology, which I have tried to summarize in the points described above.

References

http://www.slideshare.net/jackbim/recipe-9-techniques-to-control-the-processing-units-in-the-etl-process

http://www.slideshare.net/jackbim/recipes-6-of-data-warehouse-naming-convention-techniques

http://www.slideshare.net/jackbim/recipes-8-the-naming-convention-part-2

http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-1

http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2
