
Data, Data, Everywhere, and Not a Byte in Synch: ETL in the Real World – Top 10 ETL Design Tips

A Methodology for Designing Reliable and Flexible ETL

Will Martin


ABSTRACT

Any organization with even a small IT team probably expends some effort moving data from one system to another, usually doing something to the data while moving it. This is often referred to as ETL, for Extract, Transform, and Load. When a business manager asks someone on his team to get business data from somewhere and build a report, that manager is asking for ETL: for example, he may want someone to “download” (Extract) the latest sales data from operations, clean it up and organize it by sales team (Transform it), and put it into an Excel file (Load it) so he can review it. ETL is ubiquitous, and as many IT organizations will know, business users always want the data “yesterday”.

While many commercial ETL tools exist, it is our experience that the majority of organizations ultimately custom-build most or all of their major ETL systems, usually for good reasons related to the cost and limitations of such commercial systems, or to the desire to avoid establishing a dependency on an external toolset. Understanding this, herein we discuss a repeatable and generalized methodology for designing ETL processes in a way that enhances reliability and flexibility, thereby decreasing overall costs while increasing productivity and value to business owners.


Data, Data, Everywhere: What is ETL, and Who Uses It?

ETL stands for “Extract, Transform, and Load”: Extracting data from one or more source data systems, Transforming that data into a form that is suitable for whatever it is you are going to do with it, and Loading it into its final destination system, where it can be accessed and used by your business customers.

ETL could be something as simple as loading the contents of a CD-ROM onto a computer once a quarter, sorting the data by date or geography, and saving the result in a spreadsheet. Or it could be as complex as downloading terabytes of daily sales transaction data from Linux dataservers distributed worldwide, downloading customer data from several Windows servers at headquarters, downloading monthly inventory data from tape drives, applying complex business rules and combining the data into a single data warehouse on yet another dataserver, and finally running some SQL queries from the data warehouse to produce several customer-specific datamarts in a user-access database, and doing all of this automatically every night.

The reason ETL is so common is that no one data system can do everything. Order-taking systems are designed to record customer orders, so the data they gather is formatted to suit that system. Inventory systems are designed to manage inventory. And even if an organization uses a vendor's integrated system which automates the communications between these OLTP systems, there is probably a separate reporting system. Or maybe there are legacy systems that are still online for parts of the business. For whatever particular reasons, most businesses need to get multiple systems to “talk to” each other, and these systems are often quite different in data structure and format. Someone who tried to produce a management report spanning multiple systems by downloading information separately from each system and manually combining it into a usable output would face a daunting task, particularly if it had to be done regularly.

Instead, what most organizations do is create an ETL process that will extract the relevant data from all the relevant systems, transform the various datasets into a common coherent set of information which includes applied business rules, and then load that data into a report, reporting database, or other output system for consumption. This process will usually be automated, or at least repeated.


Fig 1. The need for ETL


Building a repeatable ETL process will probably prove much more cost effective than manually looking up data and building an output database over and over. However, if an ETL process is poorly designed or implemented, simply running and maintaining it will also prove costly. There is also always a productivity cost when information does not move freely within an organization – when one system cannot get data from another system, or when management cannot see the latest reports, the business suffers. Finally, businesses are constantly changing, resulting in new business rules and new data structures, and any data-centric process needs to be flexible enough to adapt to these changes.

Therefore, well-designed ETL processes are essential in any information-driven organization.

In this paper we discuss a methodology for designing and building ETL systems that are robust, reliable, and flexible, all features that usually result in cost-effective systems. We do not discuss specific software tools or hardware and environment choices – the "technology stack" – as these choices are typically driven by many factors which have little to do with ETL. But the methodology described here does not depend on the use of any specific software or hardware; it can be applied in any system environment, precisely because it is a repeatable design strategy rather than a specific tool.


Top 10 ETL Design Tips

Let's get right to it: here's our list of things to do that will result in a great ETL system:

1. Design your system in three logical "layers": an extract layer, a warehouse layer, and a publish layer, all persistent.

2. Make each layer operate independently, so that you can refresh one without affecting the others.

3. Design your extract to touch and depend on source systems as little as possible and as infrequently as possible. Get in and get out, and don't alter data while you extract it: extract it, then alter it.

4. If a source system is unreliable, treat it that way and design for fault tolerance.

5. Make your load steps within a layer modular so that each step can be run by itself.

6. Make every step of your ETL process non-destructively repeatable, so that if something happens, you can just run that step again as many times as you want without creating data problems like data duplication or repeated summing.

7. Break up long-running processes. Long-running processes are a barrier to testing, discourage change, and create their own failure modes.

8. Code your business rules once and only once, in the warehouse layer. Don't duplicate code just because you produce multiple reports involving the same data.

9. Be wary of custom-coded incremental loads. Use existing tools or avoid incremental loads altogether if practical.

10. Design with both performance AND maintainability in mind. Faster is usually better, but not always.

The discussion that follows will demonstrate the validity of these recommendations.


What Makes “Good” ETL?

A good ETL system is:

• Reliable. Business users need to know the data will be there when they expect it.

• Fast ENOUGH (as opposed to as fast as possible). There’s always a time requirement.

• Flexible. Business rules constantly change, and systems are upgraded and replaced. The ETL system will have to change as a result.

Let’s discuss these factors in more detail.

Reliable: A reliable system does what it’s supposed to when it’s supposed to do it, and the end users get their data when they expect it. But what features make an ETL process reliable?

It is fast and only connects to external systems for the minimum time needed, in order to minimize vulnerability to external system outages, primarily source system outages. Source data systems have their own lives, and they may have outages, scheduled and otherwise, and may have variable reliability. We want to “get in and get out” quickly, so that we don’t have to worry any more than absolutely necessary about source systems going offline and interrupting our ETL process.

The overall process is broken up into sub-processes that individually run in manageably short time periods. Ultimately, your process will take as long as it takes, limited by hardware, both yours and your source systems’, and by data volume. But you can break long-running processes into multiple sub-processes. This gives you visibility to how the overall process is going along the way. It also provides many options for recovery in the case of a failure: if you have a 10-hour process that fails after 9 hours, you can re-start at, for example, step 7 of 10 rather than just re-starting the entire 10-hour process from the beginning.

It is modular, with process steps being separated and independent. This pays off in many ways: ease of testing, ease of modification, fault recovery. When each process is non-destructively repeatable, the payoff is multiplied.

It is non-destructively repeatable. THIS IS EXTREMELY IMPORTANT. This means that we can run our ETL process over and over without creating problems in our final data. If multiple runs of a process will corrupt final data, via repeated summing or duplication of data etc., then we will have a big problem if that process ever fails partway through. We want to design processes so that if we’re not sure they ran correctly and to completion, we simply run them again. ASSUME that you WILL need to run loads repeatedly, and consider this a design requirement.


It is easy to test. If it’s easy to test, you’ll test it often and more thoroughly, and it will be more reliable. If it is fast, repeatable, and modular, it’s easy to test.

It includes custom logging. If you design your own automated logging, which can be very simple to do, you will be able to log whatever you want: step-by-step completion, step timing, custom error messages and diagnostic information.

It is fault-tolerant. If there is a problem, you want to have access to information as to what happened, and options as to how you recover. Custom logging and modular, repeatable design lead to fault-tolerance.

Fast ENOUGH:

The overall process completes with time to spare. If your customers need the data by 8AM and your process finishes at 7:55AM every day, you have no room for contingencies.

Beyond this, do not be obsessed with performance. In IT, there is often an over-emphasis on performance, with every process being designed to be as fast as possible. This often comes at a price in terms of up-front design cost, lost flexibility, lost reliability, and increased maintenance costs. Yes, customers always want the data yesterday. But when they REALLY get upset is when the data doesn’t come at all, or when it’s not right.

Flexible:

The system is designed so that it can readily change in response to:

• Business rule changes.

• Changes in source systems: data structures, hardware, data volumes.

• Hardware changes on your systems.


Architecture: How Do We Meet These Goals?

We have found that a three-layer system provides an ideal environment to achieve the goals we’ve identified. Each layer contains data tables with different purposes. These layers are logical layers, not necessarily physical layers, with that decision, like many others, based on hardware and performance concerns.


Fig 2. A Three-Layer ETL System


The three-layer ETL system is composed of an extract layer, a warehouse layer, and the final publish layer.

The Extract Layer

Goal: someplace to put and hold the extracted data where we control it. This layer serves to hold the data we’ve extracted from our source systems, largely un-altered (filtered down to just what we need, but otherwise unchanged). The data here is persistent – now that we have the data, WE control it; not from a business sense, but tactically, for the purposes of our processing.


Fig 3. Sample Extract Layer Data Tables


But why store the data ourselves? Why not just pull it from the source systems when we need it?

• You may want to run your post-extract process multiple times for a variety of reasons: testing, diagnosing problems, business rule iterations, and you don’t want to worry about what goes on in source systems between runs. This is particularly useful for testing and debugging: for repeated test runs, you need to know that the input data hasn’t changed.

• Source systems go up and down, and update, on their own schedule, sometimes unpredictably. You may load data once per week, but you may make a business rule change on Wednesday and want to see the prior Monday’s data with the new business rules.

• You don’t want to traverse the network every time you want to use some source system data.

• Once in the extract layer, the data is static. You never have to wonder whether an apparent change is due to a change in the source data or due to something in your process. This is extremely useful in production, but also during development (including enhancements) and testing.

• Finally, your source system administrators will appreciate you minimizing the impact on their systems by only accessing them when needed. This is a win-win situation, and it will ultimately be to your benefit to have happy source system personnel.

The “Warehouse” Layer

Goal: A place that combines all the source data in a standardized form (as much as is practical) that is easy for you to work with when building the publish layer.

This is where you start applying business rules. In particular, you apply as many of your customers’ business rules as you can, but only those rules that apply to ALL of your customers (you may have multiple customer sets, each with some different business rules).

The warehouse layer is also a great place to check and address data quality issues if needed.


Fig 4. Sample Warehouse Layer Data Tables


The end result is a “warehouse” of data, possibly from different source systems, which is self-consistent, fairly normalized, and in a format that is convenient for YOU to work with. This effectively becomes the source system for all of your final data, and serves as the “one truth” represented by the unified data coming from disparate systems.

The Publish Layer

Goal: the data is in its final form for user consumption. The data could be organized in any number of ways: one (or several) star schemas, a set of relational tables, a set of flat file feeds; whatever is most useful to the end users. Business rules that are specific to particular user groups are applied only to the data that will be accessed by those groups. Any given user group may have visibility to some or all of the data, as appropriate for that group. But the publish layer is the only place any users will typically have access to any system data.


Fig 5. Sample Publish Layer Data Tables


Fig 6. A Three-Layer ETL System

More Detail

The Extract Layer

The extract layer typically looks like a database with a bunch of data tables, with one or more tables from each source system. The data in each table generally looks much as it did in the source system. Except for filtering down to only the rows and columns we actually need, and maybe joining several source tables into one extracted table, also to reduce data volume, it’s usually best not to alter the source data much during extract. There are several benefits to minimizing data alteration during extract:

• Minimal data manipulation often means maximum performance.

• Any change you make to the data is a change you’re stuck with.

• When validating your final data, it can be very convenient to have raw source data available.

• Extract code complexity is minimized.

• Source system data can be inconvenient to work with or may require “clean-up”, and it’s usually easiest to do this somewhere outside the actual extract process.

The primary reason to have a dedicated extract layer is to de-couple your system from the source systems, so that they can go online and offline, refresh, update, or otherwise change without affecting your system. By extracting the information you need from the source systems and storing it persistently in your environment, you essentially establish your own source data system: your extract layer.

Another important reason to have a dedicated extract layer is that it makes your extracted data static and persistent – after you extract your data, source systems can go offline, but the data you extracted from that system is still available to you, and never changes. By making the extracted data persistent, we do not mean permanently persistent, in which case you’d be duplicating the function of the source systems. We mean persistent between extracts: we’re creating a persistent snapshot of source system data, not a mirror of the source data.

Having the extracted data remain static has several significant benefits. Because the data only changes when you run a new extract (as opposed to when a source system changes), there will never be any question of whether changes further downstream (in the warehouse or publish layers) are the result of source changes or ETL system processing. If no extract has been run, then any changes are the result of ETL system processing. This knowledge is crucial during testing and development, including after production go-live. Also, an un-changing extract layer provides a stable dataset to use for performance testing and for iterative testing of downstream processes.


How often do we refresh the “snapshot”? Exactly as often as suits our needs. If our customers need monthly data refreshes, we refresh our extract layer monthly. (But as we will discuss below, because our warehouse and publish layers will be independent of our extract layer, if we want to refresh the extract layer more frequently, or repeatedly for testing, we can do so without affecting the other layers).

The processes that refresh the extract layer stand alone from our other processes. This principle holds regardless of the specific technologies and tools used to extract data: no matter what tools or techniques you use, make the extract process modular and repeatable. We want to be able to run and re-run the extract whenever we want, without necessarily affecting anything else in our system: make it non-destructively repeatable.

By designing the extract layer to be non-destructively repeatable, and by creating a persistent extract layer, we remove ourselves from dependence on the source systems as much as possible – once we extract the data, it’s under our control.

The Extract Layer Load Process

Loading information into the extract layer consists of accessing the data in the source system(s), filtering it down to what you need, and loading it into your environment. The actual processes and tools you use will depend on the source systems, your own system, and your preferences and skill sets. But from a design perspective, the key is to make your processes modular and non-destructively repeatable.

Making your processes modular means building a set of independent sub-processes, or modules. A module could be a script, stored procedure, class, a subroutine; whatever makes sense in your operating environment. A good practice is to have one module for each set of extracted data, meaning one module for each XT_ table. So if you have 15 XT_ tables in your extract layer database, you will have 15 modules, perhaps 15 stored procedures, in your code base. You will probably want to encapsulate those 15 modules into one “parent” calling module, but that is a detail left to preference, and it doesn’t matter as long as the underlying modules exist.
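To make this concrete, here is a minimal sketch of one such module and its parent wrapper, written as T-SQL-style stored procedures. The names XT_Sales_LineItem_Load and XT_Sales_Tables_Load follow the conventions used later in this paper; the source object SRC_Sales_LineItem, the column list, and the date filter are our own assumptions, and the exact syntax will vary by platform:

    -- One load module per XT_ table; the module owns its target table completely.
    CREATE PROCEDURE XT_Sales_LineItem_Load
    AS
    BEGIN
        -- Pre-emptive drop lives inside this module, so the module is self-contained and repeatable.
        IF OBJECT_ID('XT_Sales_LineItem') IS NOT NULL
            DROP TABLE XT_Sales_LineItem;

        -- Pull only the rows and columns we actually need
        -- (SRC_Sales_LineItem stands in for however the source table is exposed in your environment).
        SELECT OrderID, PartNumber, Quantity, SaleDate
        INTO XT_Sales_LineItem
        FROM SRC_Sales_LineItem
        WHERE SaleDate >= DATEADD(MONTH, -12, GETDATE());
    END;
    GO

    -- A "parent" wrapper that simply calls each underlying module in turn.
    CREATE PROCEDURE XT_Sales_Tables_Load
    AS
    BEGIN
        EXEC XT_Sales_LineItem_Load;
        EXEC XT_Sales_Order_Header_Load;
        -- ...one EXEC per XT_ table...
    END;
    GO

Because each module carries everything needed to refresh its own table, any one of them can be run by itself, in any order, as many times as needed.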

Making your processes non-destructively repeatable means designing each module, including parent modules, so that it can be run by itself, as many times as desired, without any negative effect on the resulting extract layer data.

A very simple but illuminating example of non-destructively repeatable design involves table drops. Consider a system with 15 extracted data tables, where each table is dropped and re-created during every system refresh. A very common design would be to create a process that drops all 15 tables, and then one or more processes to re-create all the tables. The problem with this design becomes clear in the case when a subset of the 15 tables needs to be refreshed, for whatever reason: testing, a partial-load failure, a change to just a few tables. In that case, in order to refresh one table, it first needs to be dropped. But the normal process drops all 15 tables: the table drop process is unnecessarily destructive. While the affected tables can always be dropped manually, that means someone needs to write a script to do that – perhaps in a production environment, when no system expert is available.

This inconvenient situation is circumvented by simply designing each table-refresh module to include the drop of that specific table, rather than having a separate process that drops all of the tables. With this design, dropping the table is simply part of the refresh process for that table, and it becomes something that does not need to be a consideration: it takes care of itself. Now you can refresh one table, several tables, or all tables, as often as desired and in whatever order is convenient (ignoring relationship-based data dependencies for the moment), without any concern about the drop process. This becomes extremely valuable during testing, enhancements, and when dealing with any system anomalies.

For processes that load tables, non-destructive repeatability often means that rather than inserting extracted data into an existing table, you either drop and re-create the entire table, or you truncate the table before inserting data. This statement implies a preference for full-extract systems, which extract the entire dataset from the source systems every time, rather than incremental or “delta-load” systems, which only extract changed or new data. It may seem odd to prefer a design that extracts the same data over and over again, but the benefit is reliability and simplicity. Unless your source system provides well-implemented tools to identify changed data records and thus support incremental extracts, extracting only new or changed data means explicitly designing processes to detect that data, and that means added code. Similarly, you will need to add processes to ensure that extracted data is not duplicated via some problem in identifying new data correctly, or a problem of inserting new data more than once. These processes add additional code and introduce failure points that have potentially severe consequences. Therefore we recommend full extracts whenever possible, and incremental loads only when truly necessary or when well-established tools are already available to you to facilitate reliable incremental loads.

This is not to say that custom incremental loading schemes are never desirable. In cases where data volumes are too large for full extracts and no incremental load tools are at hand, you will need to design your own. In this case, you still must design your extract processes to be non-destructively repeatable. In modern database systems, this generally can be achieved by using explicit transaction control (commit/rollback). Other solutions involve developing ways to precisely identify the data records you are extracting in each run, and adding a step to delete any such data from your extract-layer tables (it may be there from a previous, incomplete extract) and then insert it again. Identification and management of incremental data is never trivial, so again, we recommend avoiding incremental loads when practical.
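Here is a minimal sketch of such a custom incremental load, T-SQL-style, with hypothetical table names and batch dates. The batch’s rows are deleted and then re-inserted inside one transaction, so the step can be repeated safely even after a partial failure:

    -- Identify the batch window precisely (values here are placeholders).
    DECLARE @batch_start DATE = '2024-06-01';
    DECLARE @batch_end   DATE = '2024-07-01';

    BEGIN TRANSACTION;

        -- Remove anything left behind by a previous, possibly incomplete, run of this same batch...
        DELETE FROM XT_Sales_LineItem
        WHERE SaleDate >= @batch_start AND SaleDate < @batch_end;

        -- ...then insert the batch again, so re-running the step cannot duplicate data.
        INSERT INTO XT_Sales_LineItem (OrderID, PartNumber, Quantity, SaleDate)
        SELECT OrderID, PartNumber, Quantity, SaleDate
        FROM SRC_Sales_LineItem
        WHERE SaleDate >= @batch_start AND SaleDate < @batch_end;

    COMMIT TRANSACTION;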

In any case, the key is to explicitly design your load processes to be repeatable. Any design that requires a load to be run once and only once (per day, or month, or whatever) will be problematic in the long term. Whether you expect it or not, it is almost certain that your load processes will be repeated within an extract cycle for one reason or another, and you don’t want that to be a problem in and of itself.


Finally, it is important that the extract load processes be distinct from warehouse and publish load processes. Without a doubt there will be many times you will want to run the extract process without affecting the warehouse or publish layers, for testing or in the case of anomalies.

The Warehouse Layer

The term “data warehouse” is used frequently, and like the similar term “datamart” it can have several meanings. A data warehouse can be a “star-schema” de-normalized dimensional database; it can be a set of “datamarts”, each of which can be a star-schema system or a normalized system; or it can be a single normalized database. This last case is what we’re talking about. What makes our database a “warehouse” is that it combines data from disparate systems into a single unified database.

By “unified”, we simply mean that no piece of information in the warehouse layer contradicts another piece of information, and also that no piece of information in the warehouse layer is duplicated elsewhere in the warehouse layer.

Data Unification in the Warehouse Layer

It might seem obvious that no self-contradictory information would exist in the warehouse layer, but such may not be the case in the extract layer. This is because the extract layer potentially contains data from several source systems that may not “talk to each other”. For example, the inventory system and the sales system both record information about products; however, being two different systems, they may use different identifiers for the same product. In Figure 3, the order detail table from the Oracle sales system, XT_Sales_Lineitem, identifies a product by its PartNumber.

On the other hand, the inventory system from vendor XYZ, in table XT_Inventory_Product, identifies a product by its ProductID. Because both systems serve the same organization, they’re both talking about the same product. But because they come from different vendors, they may use different keys with different formats. While our end users don’t care about identifier formats, which after all will probably not show up in any UI, we do care, and we will want to use only one or the other in our system. We make that unifying choice in our warehouse layer, where we simply create the new identifier ItemID, which represents a product whether the context is sales, inventory, or anything else.

If, as in this hypothetical case, the same product has two identifiers in the source systems, a mapping of these two identifiers will already exist somewhere, or there will be a means of readily determining it. The organization would have encountered this potential problem long ago and will have dealt with it as a necessity, probably by establishing a PartNumber-to-ProductID mapping, or defining a de-facto mapping by enforcing a rule like uniqueness of product name, which readily leads to a mapping of products in one system to the same products in another system. We simply exploit that mapping in our warehouse layer, using it to determine a unified ItemID for each product.


The mapping of PartNumber or ProductID to ItemID is represented in the ProductMaster table in the warehouse layer. We construct this table using the known relationship between inventory ProductID’s and sales PartNumbers, and use it to determine a unique ItemID for every product.
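A minimal sketch of that construction might look like the following (T-SQL-style). The mapping extract XT_Part_Product_Map and the use of a ROW_NUMBER surrogate key are our own assumptions for illustration:

    -- Build ProductMaster once, assigning a single unified ItemID per product.
    SELECT
        ROW_NUMBER() OVER (ORDER BY m.ProductID) AS ItemID,   -- new unified key (could also be an IDENTITY column)
        m.PartNumber,                                         -- key used by the sales system
        m.ProductID,                                          -- key used by the inventory system
        p.ProductName
    INTO ProductMaster
    FROM XT_Part_Product_Map AS m                             -- the organization's existing PartNumber-to-ProductID mapping
    JOIN XT_Inventory_Product AS p ON p.ProductID = m.ProductID;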

Occasionally, source systems will contain duplicate information. For example, both the sales system and the CRM system have the concept of an “account”, so they may both have master account lists. If so, these master account lists probably list many, or most, of the same accounts. We may need to extract both lists from the source systems, but in our warehouse layer we want only one. If, as is usually the case, there are discrepancies between the two account lists in two source systems – perhaps the same account’s address differs from one system to the next - we will need to reconcile these differences. This is not to say that we are in the business of data quality; however, if we are asked to produce data and reports containing accounts, we have to have a single definition of what each account is. The problem of HOW we determine which of two pieces of duplicated information is “correct” is to be solved by the source system owners. Our task is simply to implement the “correct” choice. The end result is that, in our warehouse layer, we have only one definition for each data element, and nothing is duplicated. In data modeling terms, this (among other things) means our warehouse layer is somewhere close to Third Normal Form.
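For example, if the source system owners have decided that the CRM address is the “correct” one whenever the two systems disagree, implementing that choice while loading the warehouse account table might look like the sketch below (table names, column names, and the shared join key are all assumptions):

    -- Both systems carry an account master; the owners' decision is simply implemented here.
    SELECT
        s.AccountID,
        s.AccountName,
        COALESCE(c.Address, s.Address) AS Address   -- CRM address wins when present; otherwise fall back to sales
    INTO Account
    FROM XT_Sales_Account AS s
    LEFT JOIN XT_CRM_Account AS c ON c.AccountID = s.AccountID;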

The construction of this unified self-consistent data layer certainly seems to be a worthy goal. It represents our “one version of the truth” for the information in the organization. Nothing is duplicated, nothing is contradictory or ambiguous. But surely this will be the case in our publish layer as well; it will contain only one version of the truth. So why do we need a separate warehouse layer, or put differently, why can’t our warehouse layer serve as the publish layer?

The answer is that a distinct warehouse layer gives us an ideal environment in which to apply the bulk of our business rules. As we will see, for a variety of reasons we may duplicate substantial amounts of data and data structures in our publish layer; the publish layer may be somewhat “messy”. That’s fine, and is driven by our end users’ needs. But we don’t want to duplicate the implementation of business rules. We want, wherever possible, to code each business rule only once. And while the publish layer may be intentionally “messy”, with overlapping datasets and strange formatting, we will want to have a clean, clear unified dataset somewhere, and we will: the warehouse layer.

Data Quality in the Warehouse Layer

Because the data in the warehouse layer is self-consistent and organized compared to the extract layer, and compact and normalized compared to the publish layer, the warehouse layer is a great place to identify and address data quality issues. For example, if customer names contain leading spaces, stripping them out here is easier than in the extract or publish layers because in both of those cases the same customer name field may exist in multiple tables. In the extract layer it may even exist in several different datatypes. But the biggest reason to apply data quality fixes in the warehouse layer is so that you don't have to repeatedly apply the same fix multiple times in the publish layer, every time the field shows up on a published table, report, or UI.
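For instance, a one-time fix for those leading spaces, applied as a warehouse-load step, might look like this sketch (assuming a warehouse Account table with an AccountName column):

    -- Strip stray spaces once, here, so the fix never has to be repeated
    -- in every publish-layer table, report, or UI that shows the name.
    UPDATE Account
    SET AccountName = LTRIM(RTRIM(AccountName))
    WHERE AccountName <> LTRIM(RTRIM(AccountName));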


Business Rules in the Warehouse Layer

The warehouse layer is where we apply most of our business rules. This is primarily for two reasons. First, the data in the warehouse layer is easy to work with: it is fairly normalized, it has consistent identifying keys, there is no duplicated data, and there is no data inconsistency. Second, when a business rule is coded one time in the warehouse layer, its effects automatically cascade downstream into the publish layer, where the affected data may be used in several places.

It is also true that business rule cascading would occur if we implemented business rules in the extract layer. But for the reasons given in the extract layer notes above, that is usually not the best approach.

In the warehouse layer, we apply all of the business rules that apply to ALL of our various end user groups. For example, all of our user groups may agree that Central America is treated as a single geography, and sales made in Honduras are simply added to sales made in Nicaragua and the rest of Central America. Given this fact, we would probably want to go ahead and apply this rule to our warehouse data, adding all the sales up and creating a “master” geography of Central America in our sales tables, grouping all our Honduras and Nicaragua accounts into a set of Central America accounts in our CRM tables, etc., even though that data is broken out more granularly in the source sales and CRM systems.

But wait, you say – if you do that, you lose visibility to the more granular data available to us in the source systems: why throw that away? The answer is: so that we don’t have to code that Central-America-Grouping rule repeatedly in our publish layer, once for every reporting object that refers to geography. We may have three sales reporting tables and four CRM reporting tables in the publish layer, each of which refers to geography. In that case, if we didn’t implement this rule in the warehouse layer, we’d need to implement it seven times in the publish layer.

But wait, you say – what if they change the rule? Not a big problem: because we minimize the alteration of data in the extract layer, we still have the more granular data if we need it. And because we coded the Central-America-Grouping rule once, in the warehouse layer, if it changes we only have to change it once (as opposed to changing it seven times in the publish layer).
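A minimal sketch of coding that rule once, in the warehouse layer, is to express the grouping as data that everything downstream inherits (T-SQL-style; the table and column names, and the assumption that XT_Sales_Lineitem carries Country and SaleAmount, are ours):

    -- The grouping rule lives in one place: a mapping table.
    CREATE TABLE GeographyMap (
        SourceCountry   VARCHAR(50) NOT NULL PRIMARY KEY,
        MasterGeography VARCHAR(50) NOT NULL
    );

    INSERT INTO GeographyMap (SourceCountry, MasterGeography)
    VALUES ('Honduras',  'Central America'),
           ('Nicaragua', 'Central America');
           -- ...one row per country in the region...

    -- Warehouse sales are then loaded at the "master" geography level.
    SELECT g.MasterGeography,
           SUM(s.SaleAmount) AS SaleAmount
    INTO Sales_By_Geography
    FROM XT_Sales_Lineitem AS s
    JOIN GeographyMap AS g ON g.SourceCountry = s.Country
    GROUP BY g.MasterGeography;

If the rule changes, only the mapping table changes; none of the seven publish-layer objects needs to be touched.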

The warehouse layer embodies your “cleanest”, most tightly-organized database, with all of your major business rules applied. You will want to be very careful with the format of the data here and use your best database-design and modeling skills. Every step you take to create clean, consistent data in the warehouse layer will pay off later by making everything you do with this data easier: implementing and altering business rules, adding new complementary datasets, adding new functionality. But more importantly, when you have built a clean, logical, and clear warehouse layer, it becomes an ideal data source for the publish layer, and the publish layer is what your customers are paying you to build, and where they’ll derive their business value.

The Warehouse Layer Load Process


The load process for the warehouse layer is similar to that for the extract layer, except that the data source is now the extract layer and you will be implementing business rules as you load. That will surely mean that the underlying code is more complex, and you may have extra processes beyond the one-module-per-target-table design of the extract layer, for example processes that update warehouse data based on calculations that can’t be done until the entire warehouse base dataset is loaded.
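A post-load step of that kind might look like the following sketch (T-SQL-style; the summary table and its columns are hypothetical), which can only run once the full base dataset is in place:

    -- Compute each product's share of total sales after all base rows are loaded.
    UPDATE p
    SET    SalesShare = p.ProductSales / NULLIF(t.TotalSales, 0)
    FROM   Product_Sales_Summary AS p
    CROSS JOIN (SELECT SUM(ProductSales) AS TotalSales
                FROM Product_Sales_Summary) AS t;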

As with the extract layer, it is critical that the warehouse load processes be modular, repeatable, and distinct from the publish layer load processes.

The Publish Layer

The publish layer is the final product, the goal of the entire endeavor. This layer contains all the data your customers want to see, arranged the way they want to see it, and it’s really the only thing your customers care about (and what they’re paying for). We call it the publish layer because typically it is the only layer that is visible to your customers.

The publish layer will often consist of a number of data tables, each formatted specifically to suit one or several customers or reporting processes. It could also include or consist of a set of exported flat files, or any other kind of data-driven objects. The format doesn’t really matter, and it is precisely because we’ve built the warehouse layer the way we did that we can readily supply data in whatever format our customers need.

Like the extract layer, the publish layer may be “un-tidy” in some ways; it may be highly de-normalized or embody redundancies if several groups want similar but not identical datasets. Again, this would be because particular customers need the data to be a particular way. The publish layer could even contain data inconsistencies, as long as they are the intentional result of customer requirements. For instance, customer group A might have one name for a certain type of CRM lead, while customer group B might call it something else, or may even ignore that type of lead. Both situations are completely legitimate, and need to be supported equally. (The organization may want to consider reaching a compromise on an issue like this, but beyond perhaps pointing it out, that is not our concern).

Publish Layer Load Processes

The publish layer load process is very much like that of the warehouse layer, with the warehouse layer now being the data source. The formatting of the data will be different, because while the warehouse layer data is highly normalized, the publish layer data will likely not be.

Regardless, key design factors remain the same: modularity, repeatability, and independence from the other layers.


Key Qualities of the Completed System

• The system consists of three logical layers: extract layer, warehouse layer, and publish layer, with the publish layer being the layer that is exposed to the users, the consumers of the data.

• Data flowing into the extract layer is largely unchanged compared to the source systems, aside from filtering and grouping/joining.

• Extracted data is persistent between system refreshes.

• The extract load, warehouse load, and publish layer loads each can run independently of the others.

• Individual load processes don't take hours and hours to complete. If they do, break them into multiple processes.

• Load processes are non-destructive and repeatable – a given process can be repeated as often as desired without causing data quality problems (duplication etc.). This way, if an error occurs, or if you’re not sure for whatever reason that a process completed (or that it started), just run that single process again.

• Drop and re-create target tables whenever possible. Use “delta-load” designs only if truly necessary or if tools already exist to facilitate it without adding custom code.

• Delete datasets before insert processes. If a module inserts North America sales data, delete all the existing North America sales data automatically before the insert, as part of the same module.

• Any time you create an object during processing, code a pre-emptive drop of that object into the same module that creates the object (rather than having a separate script that drops all objects, for instance).

• Unreliable source systems are separated from reliable source systems and treated as non-essential. (This is discussed in step 6 of the roadmap below.)

• The warehouse layer is the “cleanest”, most self-consistent data you can create. Use your best data modeling and normalization here.

• Business rules that apply globally are applied in the warehouse layer. This avoids duplication of business-rule code and maximizes data consistency.

• Business rules that apply to only some users are applied in the publish layer.

• Each business rule is coded in only one place. Implement rules once and only once – in a single module that is called by other modules, or in a view or table that is accessed by other tables.

• The publish layer is formatted for end-customer usability, not data-modeling ideals. Data may be de-normalized and repeated, but this is not a big problem because the complex relationships – the business rules – are implemented in the warehouse layer, so repeating data in the publish layer, where needed for user convenience, is not difficult.


Roadmap to Building the System

In the preceding sections, we have described our ETL system. Here we describe how to actually build such a system, step by step.

1. Gather business requirements.

As with any major development project, requirements-gathering is one of the most critical steps in the entire project. While we will not attempt to describe good requirements-gathering practices here, we will list some requirements-gathering goals that are particular to an ETL project:

• Make sure you have a good specification of the end result your customer is looking for, whether it is an online database for ad-hoc queries, a set of online reports, a set of flat files, or a dump to disk or tape.

• Determine what data elements you need. This information comes from your customer, who may be able to essentially list all the data elements and tell you where they come from (which source system(s)), or may instead show you a set of reports or other end-user interfaces.

• Determine where the data comes from, i.e. which source systems do you need to access.

2. Analyze your source systems.

• Determine a contact for each source system, a person who represents that system’s operational team. Gain online access to each source system.

• Learn some key things about each system: what its uptime/downtime schedule is, and what its refresh schedule is. In particular, find out how robust and reliable it is by asking some key users (some of your customers will probably be key users). For instance, systems based on flat-file uploads are notoriously unreliable, both in terms of downtime and particularly data quality. The reliability of each source system will help you make important decisions about how you design your extract processes.

3. Analyze your source data

Like requirements-gathering, data analysis is a broad topic. Some analysis tasks particular to ETL include:

• Learn how the data is structured in your source systems. Run some SQL queries or otherwise access the source data and find the inherent data relationships and business rules. You can do this using tests that are based on the actual data you will be extracting, so that you will essentially be prototyping some extract code while you investigate.

• Learn about data volume in the source systems. If you’ll be extracting 12 months’ worth of sales data, is that 100MB of data or 100TB?

• Understand how the data volume is expected to change. It may be unwise to design an extract process that works well for 1GB but crumbles at 100GB if 100GB is expected a year from now.

4. At this point, you are actively working on the project. It is worth remembering that while you are working on extract and warehouse layers, the only thing your customers think they care about is the publish layer. They really care about the extract and warehouse layers too, because without those the publish layer won’t be very flexible or reliable, but that won’t be clear to them unless they have personal ETL experience. Describe the publish layer, in business terms, to your customers. Describe the extract process and the application of business rules, but do not emphasize the extract or warehouse layers themselves because they will not be visible to the customer and the customer should not have to worry about them (and probably won’t see the need to pay for them).

5. Determine your physical architecture.

Determine hardware: servers, network, etc. (if it has not been determined for you) and where you will put things. Remember that this ETL system design is a logical design, not a physical design, and layers can co-exist on servers or in database instances if desired. You can put all three layers on a single server without any inherent problem. However, you may well choose not to do so for performance or security reasons.

6. Start to build the extract layer:

• Start by “pulling” some data from a reliable source system. Start with a dataset that looks like one of the datasets you’ll need in the final system. Start by filtering down to a small data volume, using time or geography filtering or something similar, and prototype your extract code using small-volume test runs.

• Make your extract datasets modular; that is, for each dataset you extract, like sales data vs. inventory data vs. CRM data, and CRM accounts data vs. CRM leads data vs. CRM contacts data, have a dedicated code module (subroutine, class, stored procedure, script). This makes development and testing easier, and for improved manageability you can always ultimately create parent “wrapper” procedures that call many single procedures. We typically have a single code procedure for each extract layer data table, named something like XT_Sales_LineItem_Load. Then we would have a wrapper procedure called XT_Sales_Tables_Load that would call this XT_Sales_LineItem_Load, XT_Sales_Order_Header_Load, and the other sales procedures. Again, this makes for easy development and testing, scales well as the system grows, and is self-documenting.

• Build in custom logging. The tools and O/S you are using will doubtless have some sort of logging built in, but we have found that writing a simple procedure that inserts a record into a central log table, and then calling that procedure as needed from each module is easy and invaluable. This procedure can be called at the beginning and end of every module, and should include automatically adding a timestamp to each log entry. With this simple step, we will now have a central, persistent log in which we can see every step that is initiated and when it started and finished (if it finished – just as importantly, we can see that a step never finished). There is no reason that this log cannot be made available to processes system-wide – it does not have to be particular to the extract layer. Over time the value of this log only increases – for instance, by building a SQL query against the log table, you can see how an individual process step’s completion time has changed over the past year as source data volume changes. At a minimum, your log table should include the following:

• Event (text): a free-text column.
• Event_ts: date/time column, automatically populated with the current date/time.
• Several generic text columns, like “note1”, “note2”, “note3”. These can contain different information depending on the context under which an event was logged.
• Several generic numeric columns. (A minimal sketch of such a log table and logging procedure appears at the end of this step.)

• As you increase the amount of data you are extracting, start some performance testing to identify where your performance bottlenecks are likely to be. Gradually open up your filtering until you have full-data-volume test runs. It is very important to avoid very-long-running single processes, as they are difficult to manage and are prime sources of system unreliability. Whenever you have a single process that runs for more than an hour or two, strongly consider breaking it into several smaller processes. For example, if, as a simple result of data volume, it takes five hours to run XT_Sales_LineItem_Load, break this process down into several sub-processes, with each one loading a subset of the data. You can divide data by time, geography, product category, groups of orders, whatever. Even if in the end it still takes five hours to load all the Sales LineItem data, you will be much better off, and your system much more flexible and reliable, if this really happens in five one-hour steps.

• Segregate unreliable source systems. As you start to build your extract process, design it so that unreliable source systems will not cause your entire extract to fail if they have an outage. You are already building a modular extract process – make it even more robust by “fencing off” any process that depends on an unreliable source system. There are several ways to do this:

i. Make each extract step that depends on an unreliable system a “non-essential” step by enclosing it in error-handling routines that are set to allow the overall process to continue if the step fails.


ii. Run all such steps as a completely separate process.

The approach to take with unreliable source systems is to ASSUME that they WILL fail from time to time, and design accordingly.

It may seem that there is no such thing as a non-essential step in a data extract. In the larger sense this is probably true, or else you wouldn’t be attempting to extract that data in the first place. But it often is the case that some data is more essential than other data, particularly when refresh frequency is considered. For instance, while your customer may need that returned-product list, he may be able to get by with that list being a week old, while that probably won’t be true with the orders-in-process list. In that case, you won’t want to stop your daily refresh simply because the returned-product flat-file download failed. In addition, you probably will find that the more critical a set of data is, the more reliable its supporting source system will be, and vice versa. Don’t let obsolete systems that contain secondary data bring down your entire system. Remember too that because everything in the extract layer is persistent between refreshes, when an extract fails, the affected data will still be available from the prior extract. So for example in the case of the failed or skipped returned-product list, the week-old returned-product list data is still in the extract layer, so you can continue to use that until a more recent list is extracted, while all your other data is up-to-date in the meantime.
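Here is the minimal logging sketch promised above (T-SQL-style; the table name ETL_Log, the procedure name Log_Event, and the column sizes are our own assumptions):

    -- The central log table: one row per logged event, timestamped automatically.
    CREATE TABLE ETL_Log (
        Log_ID   INT IDENTITY(1,1) PRIMARY KEY,
        Event    VARCHAR(200)  NOT NULL,                  -- free-text event description
        Event_ts DATETIME      NOT NULL DEFAULT GETDATE(),-- current date/time, added automatically
        Note1    VARCHAR(200)  NULL,                      -- generic context columns
        Note2    VARCHAR(200)  NULL,
        Note3    VARCHAR(200)  NULL,
        Num1     NUMERIC(18,4) NULL,                      -- generic numeric columns
        Num2     NUMERIC(18,4) NULL
    );
    GO

    -- The simple procedure every module calls at its start and finish.
    CREATE PROCEDURE Log_Event
        @Event VARCHAR(200),
        @Note1 VARCHAR(200)  = NULL,
        @Num1  NUMERIC(18,4) = NULL
    AS
    BEGIN
        INSERT INTO ETL_Log (Event, Note1, Num1)
        VALUES (@Event, @Note1, @Num1);
    END;
    GO

    -- Typical usage inside a load module:
    -- EXEC Log_Event 'XT_Sales_LineItem_Load: start';
    -- ...load work...
    -- EXEC Log_Event 'XT_Sales_LineItem_Load: finish';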

7. Start to build the warehouse layer.

To some extent, this can be done in parallel with building the extract layer: as soon as you have enough extract data to build any of your warehouse tables, you can do so. Remember that even the completed system will doubtless continue to evolve and expand, so the concept of a “finished” layer doesn’t really apply anyway.

a. Apply business rules. As discussed above, the data in the warehouse layer will reflect the majority of the business rules you will apply. You will generally apply these rules via some kind of SQL or other DML code that runs as you load your warehouse data objects.

i. When a business rule results in changes to source data, consider also keeping the original data in another field. For example, if a business rule states that orders with negative product quantities (representing returns) have the Quantity field set to 0, consider keeping the original quantity also, in a field called Orig_Quantity, even if the end users say they will never want to see this value. This can be very useful when apparent data anomalies occur, when debugging, or for ad-hoc analysis in production. You can choose not to expose this information in the publish layer if desired. (A minimal sketch of this pattern appears at the end of this step.)

b. As with the extract layer, use a modular design.


c. Use your highest level of care and sound data modeling practices in designing the warehouse layer. The warehouse layer is ultimately the heart of what you are producing, even though it will not be directly visible to your customers.
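As noted in 7.a.i above, a minimal sketch of applying a rule while preserving the original value might look like this (T-SQL-style; the Sales_Detail table is assumed, and the ProductMaster join follows the warehouse-layer discussion earlier):

    -- Apply the quantity rule once, keeping the original value alongside it.
    SELECT s.OrderID,
           pm.ItemID,
           CASE WHEN s.Quantity < 0 THEN 0 ELSE s.Quantity END AS Quantity,  -- business rule applied
           s.Quantity                                           AS Orig_Quantity  -- kept for debugging and ad-hoc analysis
    INTO Sales_Detail
    FROM XT_Sales_LineItem AS s
    JOIN ProductMaster     AS pm ON pm.PartNumber = s.PartNumber;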

8. Build publish layer objects as needed.

There is no need to have a detailed strategic plan devoted to building the publish layer. Instead, build it directly according to customer needs. There can be little doubt that it will change greatly over time, even if the underlying warehouse and source data doesn’t change a lot. This is because customers are constantly finding new ways to use information. In fact, any system that exposes new information invariably ends up generating a new set of customer requests as a direct result.

a. Saying that there is no need for a grand strategic plan for the publish layer is not to say that it should be designed and built without care. But rather than spending a lot of up-front analysis time and effort in, for instance, minimizing the storage space required by the publish layer, we recommend applying your effort towards making the publish layer flexible and easy to manage:

i. Avoid code duplication. For example, it will often be the case that several customers will need similar, but not identical, data tables. In that situation, rather than building two similar tables from scratch, you can often build one table and then build the second table as a select-and-insert from the first, or as a view (if performance is sufficient). (See the sketch after this list.)

ii. Use user-friendly object and field names. Make your dataset self-documenting.

iii. Anticipate performance issues and design accordingly. For instance, build indexes on key data fields, even though users may not know to ask for this.
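As an example of point i, a minimal sketch of defining one customer group’s dataset as a view over another’s, rather than re-building it, might look like this (table, column, and region names are assumptions):

    -- Group B's dataset is defined on top of group A's table; no second build, no duplicated rules.
    CREATE VIEW REP_Sales_GroupB AS
    SELECT Region, ItemID, SaleMonth, SaleAmount
    FROM   REP_Sales_GroupA
    WHERE  Region IN ('EMEA', 'APAC');   -- group B only sees its own regions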

Summary of Steps to Build

1. Gather business requirements.

2. Analyze source systems.
a. Who is the contact?
b. What is the downtime/refresh schedule?
c. How reliable is this system?

3. Analyze source data.
a. Run some SQL/downloads – you’re prototyping at the same time.

4. Describe the publish layer, the structure of which will be driven directly by customer requirements, to your customer; set expectations.

5. Determine your physical architecture.

6. Start building the extract layer.

a. Extract and hold source data, largely “as-is”.


b. Build custom logging.
c. Performance-test as you go.
d. Segregate unreliable source systems.

7. Start building the warehouse layer.
a. Pay the most attention to data modeling and design here.

8. Build publish layer objects as needed.
a. The publish layer doesn’t have to be pretty. It just has to be clear to your customers and easy for them to use. If it contains enough data to be confusing, separate it into subsets visible to specific customers.

Best Practices

• Use transparent and consistent naming conventions. For example, if you have a procedure that loads a data table, call it <table_name>_load.

• Write your own simple logging – this is as simple as a procedure that inserts a row into a log table (with timestamp), which is called by each of your other procedures.

• Don’t sacrifice code readability, maintainability, or simplicity for performance unless you really need to. If a process runs in 2 minutes and runs only once or twice, the fact that it could run twice as fast is often not relevant.

• Don’t duplicate code. If you find yourself cutting and pasting code from one module to another, you probably should do a small redesign and add a grouping table, a work table, or write a procedure to encapsulate the duplicated code. For example, if you always group 10 products into a super-group, implement that as a grouping table you can join to, rather than doing the grouping explicitly in SQL whenever those products are invoked.

• Don’t duplicate code. I just wanted to say that again. The real reason not to duplicate code is this: when your user changes his mind about a rule, you’ll need to change your code. You don’t want to make the same change in 10 places. With rare exceptions, if you have to duplicate code, you’ve designed something wrong.


Putting it all Together

Use Cases: Why Does This Work?

Let’s discuss some scenarios that demonstrate how the features of this ETL design can be exploited for greatly enhanced productivity.

Asynchronous business rule development

Consider a system used for reporting monthly sales progress. Data will be extracted from source systems, processed, and presented to users once per month.

Despite this schedule, there will undoubtedly be many reasons that you will want to run processing during the month, between monthly refreshes. For instance, your users may want to apply new or changed business rules for next month. You will need to implement and test these rules in the warehouse layer.

Because the warehouse layer load is independent from the publish layer load, you can code the new rules in the warehouse load code and re-load the warehouse, now reflecting the new rules, without affecting the publish layer or your users. Because the warehouse load is repeatable and non-destructive, you can repeat the warehouse load process as often as you like for testing and debugging. And because the extract layer data is static and persistent, you don’t need to worry about source data changing between repeated warehouse loads – you are not running the source extract. This is a great benefit during testing. You can even roll back the changed warehouse load code and re-load the warehouse with production, un-changed data whenever you want.

Monthly refreshes with asynchronous extract performance testing

Similarly, because you can run the source extract without affecting the warehouse or publish layers, you can change your extract code and run the extract repeatedly for performance testing. In this case, unless you create a backup of the original extracted data, you will not be able to restore the extract layer to its pre-change state, but this doesn’t matter to your users, because the extract layer can be changed (or dropped) without affecting the publish layer.
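If the simple custom logging described earlier is in place, timing these repeated extract runs takes one query. A minimal sketch (ETL_Log is the hypothetical log table from the earlier example, and this assumes one run per module in the period examined):

-- Approximate elapsed time per module for runs logged today.
SELECT ModuleName,
       MIN(LogTime) AS StartTime,
       MAX(LogTime) AS EndTime,
       DATEDIFF(SECOND, MIN(LogTime), MAX(LogTime)) AS ElapsedSeconds
FROM   ETL_Log
WHERE  LogTime >= CAST(GETDATE() AS DATE)
GROUP  BY ModuleName
ORDER  BY ElapsedSeconds DESC;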

Varying source system refreshes (partial refreshes)

Suppose one, but not all, of your source systems refreshes mid-month, providing some data your users want to see before next month’s refresh. Because your extract is modular and each step is non-destructively repeatable, you can extract new data from only that particular source system without running the entire extract process. Also, because your extracted data is static and persistent, you can then run the entire warehouse load process (or part of it – it also is modular) and then the entire publish load process, to refresh the publish layer with the new data. In this case, even though you ran the complete warehouse and publish layer load processes, the only change will result from the single source system you re-extracted. Finally, because all of your load processes are non-destructively repeatable, this mid-month refresh won’t cause any data problems in any of the un-changed data.
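A hedged sketch of what the mid-month run itself might look like (the module names follow the naming convention suggested earlier and are illustrative):

-- Mid-month partial refresh.
EXEC XT_Inventory_Load;   -- re-extract only the source system that refreshed
EXEC SALES_Load;          -- full warehouse load: modular, non-destructive, repeatable
EXEC REP_Sales_Load;      -- full publish load

The unchanged extract tables feed the warehouse load exactly as before, so only the data from the re-extracted source system changes downstream.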

Building one XT_ table from another – the rest of the process is untouched

Suppose your system reports on sales data by week, by product. You have one XT_Sales_Product table in the extract layer with product sales aggregated by week, one corresponding SALES_ table in the warehouse layer, built with a set of complex business rules, and several REP_ tables in the publish layer for the same data, organized in different ways. A customer now needs data for several products to be expanded from weekly to daily, and also broken out by product category, which you were not previously extracting. Finally, the customer wants to apply some business rules that apply to these products only. The customer wants several new REP_ tables representing this new daily-product-category information with the new rules. All your existing weekly REP_ tables need to be maintained unchanged.

Given that this change requires making changes all the way “upstream” to the extract layer, and because of the existing complex business rules involved in organizing the sales data (in the warehouse layer), this might seem a very disruptive change. However, design features of the system can be exploited.

Because you will be extracting new source data (daily data and product category), you need to do something in the extract layer. You could simply create a new extract process for this data. But noting that it would be essentially the same as the existing sales-by-product extract, just at a lower granularity and with a new attribute, you realize you’d be duplicating some code and extracting the same data twice.

Running some performance tests in the extract layer (which won’t affect other layers), you verify that you can afford to extract ALL product data by category and at a daily level within your extract time window. (It may seem overkill to extract this data for all products. But a) fast-enough is good enough, even if it’s not as fast as possible, and b) the customer may well later want to add to his product-list request – something that will not surprise anyone – so you might need the data after all.)

So the strategy becomes the following:

1. Alter the XT_Sales_Product_Load module so that it extracts daily data and includes category data, and loads it into a table called XT_Sales_Product_Daily.

2. Rename the module to XT_Sales_Product_Daily_Load.

3. Build a new module, XT_Sales_Product_Load (replacing the original module), that sums out the daily and category data to produce the original XT_Sales_Product table just as it was being created before (see the sketch after this list).

4. At this point, your existing system is completely intact again: while the extract of sales data has changed, the XT_Sales_Product table is unchanged. That means that the warehouse load that depends on XT_Sales_Product doesn’t have to change, and subsequent publish layer load processes also are unchanged. The only difference is that there is a new table in the extract layer, XT_Sales_Product_Daily. (You can even stop here and postpone actually using the new daily data until a later date.) Note that despite the existence of a new XT_ table that is very similar to another XT_ table, you have no duplicated extract code: the second table is created with a simple summing of the first table, rather than by its own extract.

5. Now you can continue building new processes to supply the requested new REP_ tables. The first step would probably be to build a warehouse table containing daily by-category sales data, only for the specific products required. The source for that warehouse table would be XT_Sales_Product_Daily, filtered by product, and with the new business rules, which apply only to those products, implemented. The REP_ tables will follow.
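As referenced in step 3, a hedged sketch of the replacement module (this assumes XT_Sales_Product_Daily carries a SalesWeek column alongside its daily and category detail; all names are illustrative, and Log_Write is the hypothetical logging procedure from earlier):

-- XT_Sales_Product_Load, rebuilt: no second extract from the source system,
-- just an aggregation of the new daily/category table back to the original
-- weekly-by-product shape.
CREATE PROCEDURE XT_Sales_Product_Load
AS
BEGIN
    EXEC Log_Write 'XT_Sales_Product_Load', 'Start';

    TRUNCATE TABLE XT_Sales_Product;

    INSERT INTO XT_Sales_Product (ProductID, SalesWeek, SalesAmount)
    SELECT ProductID, SalesWeek, SUM(SalesAmount)   -- day and category summed out
    FROM   XT_Sales_Product_Daily
    GROUP  BY ProductID, SalesWeek;

    EXEC Log_Write 'XT_Sales_Product_Load', 'End';
END;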

The key point to note here is that you have significantly altered the source extract of sales data, yet you could make this change entirely within the extract layer and leave everything else untouched. Because of this, it would be entirely possible to make this change in the production system, while it is live: upon completion of the last step of building the new publish layer objects, those objects would simply appear in the publish layer, with everything else untouched. You probably would not choose to do this in the live production system, but the fact that it could be done points out how flexible this three-layer modular design is.

This particular solution, almost identical to one we actually implemented recently in a live system, relied on being able to extract daily by-category data for all products. If that were not the case, the solution would be a little different, but equally practical:

1. Build a new extract process, XT_Sales_Product_Daily_Load, that extracts daily by-category data for the particular products requested. Because weekly data is already being extracted for all products, there is some duplicated data here. But because it is impractical to extract daily data for all products, this is necessary – the customer needs this data.

a. Because the extract process is modular with each dataset being extracted independently, you can build and test this new dataset extract without running the overall extract process.

2. Continue with step 4 above.

Occasionally sacrificing performance to improve maintainability

Imagine that you have to code a process that a) determines the list of a specific subset of the multiple ProductID's that attach to an OrderID and b) populates a single field representing a concatenated list of those ProductID's. This ProductList field will then have a one-to-one relationship with OrderID and will be a single field on each Order record. Given the one-to-many relationship between an OrderID and its ProductID's, determining the list of products is not difficult, nor is concatenating the fields. However, considering that the number of products on an order is unknown and that there may be business rules that determine which products are to be included in the list, the resulting code to determine the list could be more than a few lines, and may include nested IF's, CASE statements, or other control logic. Finally, if the list determination code occurs as part of a SELECT list in a SQL statement, the resulting SQL statement may be somewhat complicated. Of course, this SQL may already have other complicated elements unrelated to product lists. If, as is often the case, the business rules for the list determination change over time, the result could be that changing this code becomes a fairly involved process which also requires regression testing to identify unintended side effects.

One way to simplify the SQL in this case would be to use an inline function in the SELECT list:

SELECT OrderID
     , FN_Product_List_Get(OrderID)
     , …
FROM   Orders
…
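As a hedged sketch only (T-SQL-style syntax; the OrderDetail table and the inclusion rule are illustrative assumptions), FN_Product_List_Get might be implemented with the cursor-based concatenation described below:

CREATE FUNCTION FN_Product_List_Get (@OrderID INT)
RETURNS VARCHAR(1000)
AS
BEGIN
    DECLARE @ProductID INT;
    DECLARE @List VARCHAR(1000) = '';

    -- The WHERE clause is where the product-inclusion business rules live,
    -- and where future rule changes would be made.
    DECLARE product_cursor CURSOR FOR
        SELECT ProductID
        FROM   OrderDetail
        WHERE  OrderID = @OrderID;   -- plus any inclusion rules

    OPEN product_cursor;
    FETCH NEXT FROM product_cursor INTO @ProductID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @List = @List +
                    CASE WHEN @List = '' THEN '' ELSE ',' END +
                    CAST(@ProductID AS VARCHAR(20));
        FETCH NEXT FROM product_cursor INTO @ProductID;
    END;
    CLOSE product_cursor;
    DEALLOCATE product_cursor;

    RETURN @List;
END;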

Now the SQL statement is compact and clean, and changes to the product list rules can be implemented in the function, without anyone touching the calling code. Going further, since the number of products in each list is unknown, the concatenation code must be written in a flexible way. A clean, easy-to-code, easy-to-maintain solution might be to use a cursor loop to successively select each ProductID and concatenate it to the previous ones. So now we have an inline function that employs a cursor. The problem? Both inline functions and cursors are notoriously slow in many environments, and the combination of the two is likely to exhibit very poor performance. Many developers would recognize this and would not employ this solution.

But would this actually be a poor design? Not necessarily. While it is true that a cursor called by an inline function will almost surely return data much slower than other solutions, the added calculation time may not matter much. What if the data volume for this particular subset of the data is small and will remain small? If the cursor/function solution completes in 2 minutes, what does it matter that another solution would complete in 15 seconds, if that faster solution requires a day of testing (not to mention a developer who is familiar with the code, and who may have left the project 3 months ago) every time a small change in list rules is required? What if this process only runs once per quarter, or it runs every night but you have a 4-hour processing window and your entire ETL process only takes 30 minutes to run? In these cases, that extra 2 minutes of processing time just doesn’t matter, and is well worth the benefit of having code that is easy to maintain and change.

Another example of balancing performance against maintainability is the use of physical work tables rather than temporary work tables ("temp tables") that exist only in memory. (By "work tables", we mean data tables that are populated during processing to make subsequent processing easier, but that are not part of the final dataset being created.) Temp tables are often much faster to access than physical tables. However, they only exist during execution of the process that creates them. Physical tables persist until they are dropped upon the next ETL run. This can be extremely useful during development, and also in production when anomalies occur: depending on how they are used, work tables can contain a wealth of information about how data looked during different phases of processing. This information can be extraordinarily helpful when one is trying to trace data changes through the processing path, and it is all lost if it exists only in temp tables that are deleted once processing is complete.

One strategy is to use physical work tables during development and, if the work tables require noticeable storage space, add a block of code which drops all the work tables before the code is transitioned to production. This block of code can easily be commented in and out depending on the situation, or the option to keep or drop work tables can be a parameter in production (see the sketch below). Alternatively, when development and unit testing are nearing completion, physical work tables can easily be converted to temp tables in the code if a significant performance gain will be realized. As above, this sort of strategy only makes sense if the resulting performance loss (if any) is small in the context of your overall needs. But if the loss is small, the benefit of having physical work tables available, even in production, can be great, so it is not always wise to simply use temp tables all the time for performance and to save disk space.

This sort of compromise is a tough call: performance is extremely important, and we do not advocate anything like a "build now and tune later" approach. However, it does pay to remember that the fastest code is not always the best code. Keep this in mind and design a system that enhances both performance and maintainability.
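As a hedged sketch of the keep-or-drop parameter mentioned above (all names are illustrative, and this assumes a dialect that supports DROP TABLE IF EXISTS):

-- End-of-run cleanup controlled by a parameter rather than by editing code.
CREATE PROCEDURE ETL_Cleanup
    @KeepWorkTables BIT = 1     -- 1 = keep work tables for troubleshooting
AS
BEGIN
    IF @KeepWorkTables = 0
    BEGIN
        DROP TABLE IF EXISTS WRK_Sales_Staging;
        DROP TABLE IF EXISTS WRK_Product_List;
    END;
END;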


Conclusion

In this discussion, we examined what the most important features of an ETL system are and how to design one that possesses those features. We have seen that by methodically building a three-layer system and using modular, non-destructively repeatable load processes, we can implement a system that is reliable, flexible, and manageable, and therefore cost effective. Finally, we noted some real-world examples of ETL challenges and demonstrated how the features of our preferred design can be exploited to overcome these challenges with a minimum of disruption.

We know that this methodology is effective, because it has been developed over time in the course of many implementations in diverse customer environments. It can be effective in your environment as well.