
Data Migration White Paper


London - New York - Dubai - Hong Kong - Mumbai 2011

Technical Data Migration in Six Steps
by Wael O. Elrifai, July 2011


Abstract

Data Migration is a field that is often dismissed as a simple step between retiring an existing platform and bringing a new one to life. The reality is that we are constantly migrating data between systems, implementing new applications, setting up co-existence scenarios, and sharing information in ways that could be thought of as data migration.

The actual task of data migration is a complex one, requiring experience and a detailed understanding of the steps to be taken, both in terms of business process and technical implementation. This includes a clear delineation of responsibility between IT and the business, since far too often the job of manipulating data is relegated to IT, well outside its core competency. Enabling the migration requires a team experienced in abstracting the technical details so that the business team can implement its rules for migration without ever writing a single line of code, which remains a task for IT.

This document focuses on technical architecture rather than process. We describe an architecture that we’ve found to be ideal in dozens of large data migration scenarios at organizations ranging from global Fortune 500 companies to local small businesses. The architecture is portable between software toolsets; we’ve implemented it in hand-coded systems as well as in open source and best-of-breed ETL/DQ solutions.

This document is aimed at anyone who is about to embark on a data migration task, or who is trying to rescue an existing project that is “off-plan”. To gain the most from it, you need only a rudimentary understanding of the terminology of data handling, terms such as ETL (Extract, Transform, and Load) and Reference Data (e.g. ISO currency codes or lists of valid postcodes).


Introduction

The Peak solution for data migration is a six-stage process designed to isolate elemental data changes, hierarchy changes, referential integrity checks, duplicate checks, and metadata capture into distinct steps, so that each can be easily tracked and reversed and a complete lineage exists for audit purposes. The diagram describes the process at a high level; a more detailed description follows. Please bear in mind that a “stage” refers both to the staging area and to the technical ETL/DQ process that runs between the previous stage and that stage. This document is by no means a complete description of the systems and processes required, but it is a good indication of the technical elements and requirements of a good data migration system.

This architecture is not tied to a particular ETL tool, but Peak does recommend the use of tools to aid re-usability and visibility of process. Some tools are better suited to the “pivot” tasks required at certain points, though all major commercial and open source ETL tools can provide the necessary functionality. No data is deleted at any point within the stages; any data that does not pass a particular stage is excluded via the error logging mechanism, with a “warning” or “error” flag attached to the record by stage and Surrogate ID (SID).
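As a concrete illustration, the following Python sketch shows the shape such a per-record audit/error entry might take. The field names are hypothetical, not a prescribed Peak schema.

    # A minimal sketch of the per-record audit/error entry described above.
    # All field names (sid, stage, severity, rule, detail) are illustrative.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class AuditEntry:
        sid: int                # surrogate ID assigned in Stage 1
        stage: int              # stage at which the record was flagged
        severity: str           # "warning" or "error"
        rule: str               # identifier of the rule that fired
        detail: str             # human-readable explanation
        logged_at: datetime = None

        def __post_init__(self):
            if self.logged_at is None:
                self.logged_at = datetime.now(timezone.utc)

    # Records flagged "error" are excluded from the next stage but never
    # deleted; the entry preserves lineage back to the source via the SID.
    entry = AuditEntry(sid=10452, stage=4, severity="error",
                       rule="country_code_lookup",
                       detail="No 2-character mapping for 'XKX'")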

Stage 1: Data is read directly from source systems in the native code page and put into database staging areas in a format and ER model that exactly matches the source systems, whether they are ERPs, bespoke systems, flat files, or semi-structured data. Some metadata is added to aid in tracking of records: a surrogate ID (SID) as well as a timestamp, source identifier, and other metadata as appropriate for the system.
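By way of illustration, a minimal Python sketch of this Stage 1 decoration follows; the row structure and metadata column names are assumptions.

    # Rows are copied verbatim from the source and decorated with a surrogate
    # ID, load timestamp, and source-system tag. Names are hypothetical.
    import itertools
    from datetime import datetime, timezone

    sid_counter = itertools.count(start=1)

    def stage1_rows(source_rows, source_name):
        """Yield source rows unchanged, plus tracking metadata."""
        for row in source_rows:
            staged = dict(row)                    # keep the native structure
            staged["_sid"] = next(sid_counter)    # surrogate ID
            staged["_loaded_at"] = datetime.now(timezone.utc)
            staged["_source"] = source_name       # e.g. an ERP name or file name
            yield staged

    # Example with an in-memory stand-in for a source extract:
    rows = [{"cust_no": "C001", "name": "Acme"}, {"cust_no": "C002", "name": "Brix"}]
    for staged in stage1_rows(rows, source_name="ERP_EU"):
        print(staged)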

Stage 2: Data is moved from the Stage 1 area to the Stage 2 area. Only the subset of columns considered necessary in the data map to the target systems is moved, including any fields needed to perform calculations but not necessarily destined to be loaded to the target, as well as all metadata. SIDs are maintained to provide easy back-tracking. Any filtering rules that remove data known to be irrelevant to the migration are applied. Duplicates are removed as deemed appropriate (e.g. duplicate customer master data might be removed while transactions are persisted). Data is still maintained in the entity-relationship model of the source systems, less the columns that have been deemed extraneous. All actions are logged in the audit and error data model, with error data being excluded from the target.
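A minimal sketch of this filtering and duplicate handling follows; the column set, filter rule, and duplicate key are illustrative, not part of any prescribed implementation.

    # Keep only mapped columns, apply exclusion filters, and drop duplicates
    # where the business has defined a duplicate key. Flagged rows are logged,
    # never deleted.
    MAPPED_COLUMNS = {"_sid", "cust_no", "name", "country"}

    def stage2(rows, duplicate_key, audit_log):
        seen = set()
        for row in rows:
            slim = {k: v for k, v in row.items() if k in MAPPED_COLUMNS}
            # Hypothetical filter: "ZZ" marks test data known to be irrelevant.
            if slim.get("country") == "ZZ":
                audit_log.append((slim["_sid"], 2, "error", "irrelevant_country"))
                continue
            key = tuple(slim.get(k) for k in duplicate_key)
            if key in seen:
                audit_log.append((slim["_sid"], 2, "warning", "duplicate"))
                continue
            seen.add(key)
            yield slim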

Stage 3: Data is mapped from the old entity-relationship model to the new one; metadata is transferred, SIDs are maintained, and all actions are logged, while error records are excluded from the target. It is important to note that no data translations or calculations have taken place yet. The data is in the target ER model but not necessarily in the target data formats: all data types are defined as VARCHAR (maximum length) to avoid rejecting any source data.
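A minimal sketch of this remapping under the VARCHAR-everything rule; the column mapping is hypothetical.

    # Source columns are renamed into the target ER model, but every value is
    # carried as text so no source data is rejected at this stage.
    COLUMN_MAP = {            # source column -> target column
        "cust_no": "customer_id",
        "name":    "customer_name",
        "country": "country_code",
    }

    def stage3(rows):
        for row in rows:
            target = {"_sid": row["_sid"]}        # SIDs survive every stage
            for src, dst in COLUMN_MAP.items():
                value = row.get(src)
                # Cast everything to text; type enforcement waits until Stage 4.
                target[dst] = "" if value is None else str(value)
            yield target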

Stage 4: Data is translated to meet target system rules and implement functional requirements, and is checked against target system reference data. All metadata and SIDs are maintained, and all actions are logged, capturing both audit and error information. A second duplicate check is performed here to account for the potential for “new” duplicates created by data translation; of course, the concept of a duplicate is one requiring both a business and a technical definition. The target data model, including data types, has now been put in place, and a column-by-column check is enforced to ensure that no data violates the datatype constraints in the target system. Error records are excluded from the target and logged.
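To illustrate the column-by-column check, a minimal Python sketch follows; the constraint table is an assumption for the example.

    # Each value must fit the target datatype before it may proceed. Only a
    # max-length rule is shown; real constraints would also cover numeric,
    # date, and nullability rules.
    TARGET_TYPES = {
        "customer_id":   {"max_len": 10},
        "customer_name": {"max_len": 35},
        "country_code":  {"max_len": 2},
    }

    def violates(column, value):
        rule = TARGET_TYPES[column]
        return value is not None and len(value) > rule["max_len"]

    def stage4_typecheck(rows, audit_log):
        for row in rows:
            bad = [c for c in TARGET_TYPES if violates(c, row.get(c))]
            if bad:
                audit_log.append((row["_sid"], 4, "error",
                                  "type_violation:" + ",".join(bad)))
                continue                          # excluded, never deleted
            yield row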

“Dictionary” translations, those where a given input from a given source system translates to a particular output (e.g. “Source System A” uses a 3-character country code while “Target System X” uses a 2-character one), have the code abstracted so that the translation is stored in a value mapping table. These tables can be easily maintained by the business users via a spreadsheet that is loaded into the value mapping table. This ensures that the business retains responsibility for the data translation while the development team is responsible for correctly coding the abstract rule.
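A minimal sketch of how such a value mapping table might be loaded and applied; the file layout and column names are assumptions, not a prescribed format.

    # The business maintains a spreadsheet exported to CSV; the code applies
    # it blindly. Unmapped values are logged as errors rather than guessed at.
    import csv

    def load_value_map(path):
        """Read (source_system, source_value, target_value) rows into a dict."""
        mapping = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                mapping[(row["source_system"], row["source_value"])] = row["target_value"]
        return mapping

    def translate(value_map, system, value, sid, audit_log):
        # e.g. value_map[("Source System A", "GBR")] == "GB"
        try:
            return value_map[(system, value)]
        except KeyError:
            audit_log.append((sid, 4, "error", f"no_mapping:{system}:{value}"))
            return None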

Stage 5: Referential integrity checks are performed to ensure that the data structures are completely ready to be written to the target system. Audit and error data is logged and metadata is maintained. At this point the data should be error free: any “child” records that did not pass error checks have been excluded, and any parent records that did not pass error checks have also been excluded along with their related child records, so that orphan records cannot be written to the target system. Alteration of the parent-child exclusion rules is possible but heavily dependent on the nature of the target systems (e.g. SAP will not allow creation of a parent-child hierarchy if the parent and child legal entities don’t already exist in the system).
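A minimal sketch of the orphan-exclusion logic, assuming a simple customer/order parent-child relationship (the keys are illustrative).

    # A child record may only proceed if its parent survived all earlier
    # checks; children of excluded parents are excluded in turn.
    def stage5(parents, children, audit_log):
        valid_parent_ids = {p["customer_id"] for p in parents}
        surviving_children = []
        for child in children:                    # e.g. orders referencing customers
            if child["customer_id"] not in valid_parent_ids:
                audit_log.append((child["_sid"], 5, "error", "orphaned_child"))
                continue
            surviving_children.append(child)
        return parents, surviving_children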

Stage 6: This stage involves invoking any APIs necessary and passing data in the structure required by those methods. It also includes writing data directly in cases requiring a direct write rather than API, IDoc, or similar calls.

Reconciliation: The Stage 6 reconciliation is performed by reading the data structures back from the target system and comparing them against the Stage 5 data, performing logical DELETEs against exact matches.
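A minimal sketch of this comparison, assuming business columns are compared and metadata columns (here prefixed with an underscore, by convention of these sketches) are ignored.

    # Rows read back from the target are compared to Stage 5; exact matches
    # are logically deleted (flagged), never physically removed.
    def reconcile(stage5_rows, target_rows, audit_log):
        def key(row):
            # Compare on business columns only; metadata differs by design.
            return tuple(sorted((k, v) for k, v in row.items()
                                if not k.startswith("_")))

        loaded = {key(r) for r in target_rows}
        for row in stage5_rows:
            if key(row) in loaded:
                row["_reconciled"] = True         # logical DELETE from the work set
            else:
                audit_log.append((row["_sid"], 6, "error", "not_found_in_target"))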

Fault, fix, and follow-through: Using our internally developed audit and error data model along with the technical steps detailed here provides a single view of all excluded data, including why and where it was excluded, and also exposes a simple interface in which a business user can change the data (with further audit information logged, of course). This system can be built to log exclusions based on minor differences (e.g. variations of ‘-’ vs. ‘/’ in dates) or to indicate only major failures (e.g. variations in transaction values). Ultimately the system allows for “fixing” data in situ and “re-migrating” it to the new system.

This system is meant to support test data migrations so that all rules can be verified and enforced, with the goal of zero errors in the stages by the time of the final cutover.


Added value

The occasion of a data migration often involves work that can be reused to support other initiatives, bringing new value to the enterprise.

Master Data Management
As part of the data and application migration, it is often necessary to capture and validate various “master data”, including:

- Customer data
- Product data
- Reference data
- Value mappings for translation: the “dictionary translations” described in Stage 4.

Having developed the data model and captured the requisite information, the possibility of implementing a full-featured Master Data Hub can be considered. Master data could be created, maintained, or simply exposed via this central repository, with all systems synchronized in real-time or batch modes to this Single Version of Truth.

Co-existence
At the time of writing the translations and transformations of data and hierarchies in the migration work stream, “reverse functions” are written for those transformations that are easily reversed (e.g. VAT-added calculations can be easily reversed). Transformations that are not easily reversible (e.g. concatenations) are “reversed” by writing the rule as though it were to support a migration from the target system(s) to the source system(s), with the added benefit of being able to examine the audit trails programmatically. With the object-oriented writing of all functions, they are easily re-used to support a co-existence scenario in which legacy and new systems replicate data between one another on a real-time or batch basis, as appropriate. These transformations can be configured to be called via an Enterprise Service Bus (ESB) or other methods, as required.
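A minimal sketch of such a reversible pair, using an illustrative 20% VAT rate.

    # Each easily invertible transformation is written as a pair, so the same
    # library supports migration in one direction and replication back in the
    # other. The rate and rounding rule are assumptions for the example.
    VAT_RATE = 0.20

    def add_vat(net: float) -> float:
        """Forward rule: source stores net amounts, target stores gross."""
        return round(net * (1 + VAT_RATE), 2)

    def remove_vat(gross: float) -> float:
        """Reverse rule, used when replicating target data back to the source."""
        return round(gross / (1 + VAT_RATE), 2)

    assert remove_vat(add_vat(100.00)) == 100.00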

Archiving
During any data migration, business requirements should be captured around the availability of data: which data should be available under which SLAs in order to ensure that the operational system continues to operate at maximum efficiency. A typical plan might envision segregating former customers to a secondary system outside of the main physical database (for performance reasons) 12 months after they terminate a client relationship. The same could be true of older transactions and other data. The ideal technical system would have the ability to treat data in an “object oriented” form, archiving off such “objects” as a customer, automatically capturing all information related to that customer and moving it to the secondary system. The data can still be interrogated but is no longer a cause of performance degradation as data volumes expand. Use of this tactic ensures that the sizing of databases, and therefore software and hardware requirements, can be easily predicted; it varies only insofar as operational behaviors vary or the client base expands, at which point additional hardware or software may become necessary. This function can also support a full Information Lifecycle Management system that includes creation of data for system testing, with features such as object movement (with referential integrity maintained) and data masking for security. Additional work required to support this would include creation of a security model that defines how datasets can be created, who can view unmasked data (at the branch, column, customer, etc. level), and how datasets can be extracted or moved.

There are existing software products that can best leverage the work that will have already been performed to support the Information Lifecycle Management system.
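As an illustration of the object-style archiving described above, a minimal Python sketch follows; the record layout and the 12-month rule are assumptions for the example.

    # Archiving a customer gathers every related record so referential
    # integrity survives the move to the secondary system.
    from datetime import datetime, timedelta, timezone

    def archive_candidates(customers, orders, cutoff_months=12):
        cutoff = datetime.now(timezone.utc) - timedelta(days=30 * cutoff_months)
        for cust in customers:
            if cust["terminated_at"] and cust["terminated_at"] < cutoff:
                related = [o for o in orders
                           if o["customer_id"] == cust["customer_id"]]
                # The whole "object" (customer plus related rows) moves together.
                yield {"customer": cust, "orders": related}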


Getting Started

Our experience shows that if a well-defined business process is implemented along with the technical steps described above, you can significantly reduce the cost of your data migration and eliminate many sources of risk, especially when that migration is repetitive in nature (e.g. multiple countries migrated independently to a new system over a number of months or years). A typical cost curve levels out at around 30% of the initial cost per migration after the second migration. Furthermore, you can leverage the artifacts created to gain new value, as described in “Added value” above.

You can contact me directly via email at [email protected] or via our website at www.peakconsulting.eu to learn more about how Peak Consulting can help you successfully complete your data migration or other “big data” initiative.


Diagram 1: Data Migration Architecture Diagram