
Data Warehouse Testing

White Paper Authored for: 13th International Conference, QAI

Author 1: Prasuna Potteti

Date: 11-Sep-2011

Email: [email protected]

Deloitte Consulting India Private Limited

Divyasree Technopolis, 124, Yemlur P.O

Off Old Airport Road, Bengaluru 560037


Abstract

Testing is an essential part of the design life-cycle of any software product. Data warehouse testing is particularly important because users need to trust the quality of the information they access. Drivers include the rise in enterprise mergers and acquisitions, compliance regulations, increased management focus on data, and data-driven decision making. A data warehouse is a large collection of data that management uses to make strategic decisions.

This paper looks at the different strategies for testing a data warehouse application and suggests approaches that can be beneficial while testing the ETL process in a DW. A data warehouse is a critical business application, and defects in it result in business losses that are hard to quantify. Here, we walk you through some of the basic phases and strategies to minimize defects.

The focus is on the different components of data warehouse architecture and its design, and on aligning the test strategy accordingly. Data storage has become cheaper and easier, and data-driven decisions have proved to be more accurate. In this context, testing data warehouse implementations is of the utmost significance: organizational decisions depend on enterprise data, so that data must be of the highest quality. Successful data warehousing helps investors, business leaders and project managers alike.

Addressed Audience:

Test Practitioners and Engineers, Software and Test Managers, QA Managers and

Development Managers as well as other professionals interested in building and

delivering better software.

Objectives:

1. Provide an overview of data warehouse testing

2. Describe the data warehouse life cycle and testing activities for DW projects

3. Outline test data needs for DW projects

4. Address challenges in DW testing, such as voluminous data and heterogeneous sources

5. Highlight business case studies of successful DW testing and implementations


Table of Contents

1. Introduction to Data Warehouse

2. Data Warehouse Testing Approach

3. Methodological Framework

4. A Timeline for Testing

5. Types of Data Warehouse Testing

6. ETL Testing Check Points

7. Key Hot Points in ETL Testing

8. Database Testing vs. DW Testing

9. Challenges in DW Testing

10. Tools: Data Warehouse Testing

11. Case Study: The Client, The Challenge, The Solution, The Benefit

12. Conclusion and Lessons Learnt

Author Bio

Appendix


1. Introduction to Data Warehouse

Extract, Transform, Load (ETL) is the process that enables a business to consolidate its data while moving it from place to place, i.e. moving data from source systems into the data warehouse. The data can arrive from any source.

Extract – the process of reading data from a source file or system.

Transform – the process of converting the extracted data from its previous form into the form needed so that it can be stored in the desired database. Transformation occurs by applying rules, lookup tables, etc.

Load – the process of writing data into the target database.

ETL testing mainly deals with how, when, where and what data we carry into our data warehouse, from which the final reports are generated. ETL testing therefore spreads across each and every stage of the data flow in the warehouse, from the source databases to the final target data warehouse.
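The three steps can be illustrated with a toy pipeline in Python (a minimal sketch; the field names, the lookup table and the in-memory "warehouse" list are stand-ins for real source files and target tables):

```python
def extract(source_rows):
    """Extract: read raw records from a source (a list stands in for a file or table)."""
    return list(source_rows)

def transform(rows):
    """Transform: apply rules and a lookup table, as described above."""
    lookup = {"IN": "India", "US": "United States"}  # illustrative lookup table
    return [
        {"name": r["name"].strip().title(),
         "country": lookup.get(r["country"], "Unknown")}
        for r in rows
    ]

def load(rows, target):
    """Load: write the transformed records into the target store."""
    target.extend(rows)
    return len(rows)

source = [{"name": " prasuna ", "country": "IN"}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
```

The point of keeping the three stages as separate functions is that each can then be validated independently, which is exactly what the stage-by-stage testing approach below relies on.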

[Figure: High-level DW architecture – operational data sources feed a staging area, which loads the data warehouse; dimensional data then drives reporting (Report 1, Report 2, Report 3).]


2. Data Warehouse Testing Approach

There are at least two approaches that can be considered for data validation:

Approach 1: Data moves directly from source to data warehouse. This approach validates that data in the source is moved to the data warehouse according to business rules.

Approach 2: This approach validates data movement through each step of the ETL and then into the data warehouse. Data is validated at each step:

o Source data stores to staging tables

o If an ODS is used, staging tables to the ODS

o Staging tables or ODS to the data warehouse

The approach to follow depends on the timeline. Approach 1 takes less time to script and execute, but it is not recommended, because if issues are uncovered it is more difficult and time-consuming to determine their origin. If time and resources are available, it is always advisable to follow Approach 2, which helps determine where and when issues arise. By validating each step of the ETL, we confirm that data is moved and transformed as expected at each stage.

Testing in this approach focuses on the following points:

Extraction of source data

Population of staging tables

Transformation of data
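Approach 2 can be automated as a reconciliation check run after each hop (source to staging, staging to ODS, ODS to warehouse). A minimal sketch, with illustrative row sets and key names:

```python
def reconcile(upstream, downstream, key):
    """Compare record counts and unique key values between two adjacent stages.

    Equal counts plus empty 'missing_keys' means the hop moved data completely."""
    up_keys = {row[key] for row in upstream}
    down_keys = {row[key] for row in downstream}
    return {
        "count_match": len(upstream) == len(downstream),
        "missing_keys": up_keys - down_keys,       # lost between the stages
        "unexpected_keys": down_keys - up_keys,    # appeared from nowhere
    }

source = [{"id": 1}, {"id": 2}, {"id": 3}]
staging = [{"id": 1}, {"id": 2}, {"id": 3}]
warehouse = [{"id": 1}, {"id": 2}]                 # one record lost during the load

hop1 = reconcile(source, staging, "id")            # source -> staging: clean
hop2 = reconcile(staging, warehouse, "id")         # staging -> warehouse: defect found
```

Because the check runs per hop, a failure immediately identifies which stage dropped the data, which is precisely the advantage of Approach 2 over end-to-end comparison.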


3. Methodological Framework

Below are the stages involved in the ETL process:

Requirements elicitation: Requirements are elicited from users and represented in either an informal or a formal way.

Analysis and reconciliation: Data sources are inspected, normalized, and

integrated to obtain a reconciled schema.

Conceptual design: A conceptual schema for the data mart is designed

considering both user requirements and data available in the reconciled

schema.

Logical design: A logical schema for the data mart is obtained by properly

translating the conceptual schema.

Data staging design: ETL procedures are designed considering the source

schema, the reconciled schema and the data mart logical schema.

Physical design: This includes index selection, schema fragmentation, and all

other issues related to physical allocation.

Implementation: This includes implementation of ETL procedures and creation

of front-end reports.


4. A Timeline for Testing

From a methodological point of view, the three main phases of testing are:

Create a test plan. The test plan describes the tests that must be performed

and their expected coverage of the system requirements.

Prepare test cases. Test cases enable the implementation of the test plan by detailing the testing steps together with their expected results. The reference databases for testing should be prepared during this phase, and a wide, comprehensive set of representative workloads should be defined.

Execute tests. A test execution log tracks each test along with its results.

5. Types of Data warehouse Testing

Below are the types of testing suitable for data warehouse projects.

Unit testing: Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:

a) All inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space, and all tables used during the ETL are present with the necessary privileges.

b) The ETL routines give the expected results:

i. All transformation logic works as designed from source to target

ii. Boundary conditions are satisfied, e.g. date fields with leap-year dates

iii. Surrogate keys have been generated properly

iv. NULL values have been populated where expected

v. Rejects have occurred where expected, and the reject log is created with sufficient detail

vi. Error recovery methods work as intended

vii. Auditing is done properly

c) The data loaded into the target is complete:

i. All source data that is expected to be loaded into the target actually gets loaded; compare counts between source and target and use data profiling tools

ii. All fields are loaded with their full contents, i.e. no data field is truncated during transformation

iii. No duplicates are loaded

iv. Aggregations take place in the target properly

v. Data integrity constraints are properly taken care of
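Several of the completeness checks in (c) – counts, duplicates, field widths – lend themselves to small assertion helpers a developer can run after each unit load. A sketch, with made-up column widths and table contents:

```python
def check_load(source_rows, target_rows, key, max_widths):
    """Return a list of defect descriptions for count, duplicate-key and
    column-width checks on a loaded target table."""
    defects = []
    if len(source_rows) != len(target_rows):
        defects.append("count mismatch: %d source vs %d target"
                       % (len(source_rows), len(target_rows)))
    keys = [r[key] for r in target_rows]
    if len(keys) != len(set(keys)):
        defects.append("duplicate keys loaded")
    for r in target_rows:
        for field, width in max_widths.items():
            if len(str(r[field])) > width:
                defects.append("field %s exceeds declared width %d" % (field, width))
    return defects

source = [{"id": 1}, {"id": 2}]
target = [{"id": 1, "city": "Bengaluru"},   # duplicate key, oversized value
          {"id": 1, "city": "Hyderabad"}]
defects = check_load(source, target, "id", {"city": 8})
```

An empty defect list from such a helper gives the developer quick, repeatable evidence for points (c)(i)–(iii) before handing the module over to QA.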

System testing: Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is based directly upon it. Here we test the functionality of the application, and the testing is mostly black-box. The major challenge is the preparation of test data: wherever possible, use production-like data; you may also use data generation tools, or customized tools of your own, to create test data. We must test all possible combinations of input and specifically check the errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.

The QA team must test for:


i. Data completeness: match source-to-target counts

ii. Data aggregations: match aggregated data against staging tables and/or the ODS

iii. Granularity of data is as per specifications

iv. Error logs and audit tables are generated and populated properly

v. Notifications to IT and/or business are generated in the proper format

Regression testing: A DW application is not a one-time solution; it is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made. Generally this is done by running all functional tests for the existing code whenever a new piece of code is introduced. However, a better strategy is to preserve earlier test input data and result sets and run the same tests again; the new results can then be compared against the older ones to confirm proper functionality.
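The preserve-and-replay strategy can be as simple as snapshotting the previous run's output in a stable order and diffing the new run against it. A sketch (the transformation here is a stand-in for the real ETL logic under test):

```python
import json

def snapshot(rows):
    """Serialize a result set in a stable order so two runs compare byte-for-byte."""
    return json.dumps(sorted(rows, key=lambda r: r["id"]), sort_keys=True)

def run_etl(rows):
    # Stand-in for the transformation whose behavior must not regress.
    return [{"id": r["id"], "total": r["amount"] * 2} for r in rows]

inputs = [{"id": 2, "amount": 5}, {"id": 1, "amount": 3}]
baseline = snapshot(run_etl(inputs))   # preserved result set from the previous release
current = snapshot(run_etl(inputs))    # re-run after the enhancement
regression_free = (baseline == current)
```

In practice the baseline would be written to disk at release time rather than recomputed; the key idea is the stable serialization, without which spurious ordering differences would mask real regressions.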

Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and ensure data integrity across the flow. Our test strategy should include testing for:

i. The sequence of jobs to be executed, with job dependencies and scheduling

ii. Re-startability of jobs in case of failures

iii. Generation of error logs

iv. Cleanup scripts for the environment, including the database

This activity is a combined responsibility, and the participation of experts from all related applications is a must in order to avoid misinterpretation of results.

Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges of whether the application works as they expect. However, business users may not have deep ETL knowledge, so the development and test teams should be ready to answer questions about how the ETL process populates the data. The test team must have sufficient business knowledge to translate the results into business terms. The load windows, the refresh period for the DW, and the views created should also be signed off by the users.

Performance testing: In addition to the above tests, a DW must necessarily go through a performance testing phase. Any DW application is designed to be scalable and robust, so when it goes into the production environment it should not cause performance problems. Here we must test the system with huge volumes of data, and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts, and others who can review and validate your code for optimization.


6. ETL Testing check points

The following is a checklist of testing standards that should be followed to ensure complete and exhaustive testing of the ETL process in a data-warehousing project. The overall testing process is divided into testing for data completeness, data quality, data transformation and metadata; the points to take care of in each are as follows:

Data Completeness: One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:

Comparing record counts between the source database, the staging tables and the data loaded into the target DW during full-load testing.

Comparing unique values of key fields between the source database, the staging database and the target DW.

Populating the full contents of each field to validate that no truncation occurs.

Testing the boundaries of each field to find any database limitations.

Data Transformation: Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic.

One typical method is to pick some sample records and "stare and compare" to validate data transformations manually.

Create a spreadsheet of input data and expected results and validate these against the output of the test scripts.

During incremental testing, create complete test data that includes all scenarios that can ever occur at the source stage.

Set up data scenarios that test referential integrity between tables.

Validate parent-to-child relationships in the data, and set up data scenarios that test how orphaned child records are handled.
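The spreadsheet-of-expected-results method reduces to pairing each input record with its expected output and diffing against what the transformation actually produces. A sketch (the rule, field names and cases are illustrative; in practice the cases would be exported from the spreadsheet, e.g. as CSV):

```python
# Each case pairs an input record with its expected transformed record,
# normally maintained in a spreadsheet by the testers.
cases = [
    ({"amount": "100", "currency": "USD"}, {"amount_usd": 100.0}),
    ({"amount": "250", "currency": "USD"}, {"amount_usd": 250.0}),
]

def transform(record):
    # Stand-in for the transformation logic under test.
    return {"amount_usd": float(record["amount"])}

# Any case where actual != expected is collected as a failure for triage.
failures = [(inp, expected, transform(inp))
            for inp, expected in cases
            if transform(inp) != expected]
```

Unlike "stare and compare", this scales to thousands of cases and produces a precise failure list, which is why the spreadsheet method is usually preferred once the transformation logic stabilizes.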

Data Quality: Data quality deals with "how the ETL system handles staging-table data rejection, substitution, correction and notification without modifying data." Examples of rules to test:

Reject the record if a certain decimal field has nonnumeric data.

Substitute null if a certain decimal field has nonnumeric data.

Compare loaded values against the values in a lookup table.

Determine and test the exact points where data should be rejected and where it should be sent for error processing.
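These reject/substitute rules can be expressed as a small cleansing pass that routes bad records to an error stream with a reason, rather than silently altering data. A sketch with illustrative field names and an illustrative lookup table:

```python
def cleanse(rows, lookup):
    """Split rows into loaded records and rejected records with reasons."""
    loaded, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])          # rule: decimal field must be numeric
        except (TypeError, ValueError):
            rejected.append({**row, "reason": "nonnumeric amount"})
            continue
        if row["code"] not in lookup:              # rule: value must exist in lookup table
            rejected.append({**row, "reason": "unknown code"})
            continue
        loaded.append({"amount": amount, "code": lookup[row["code"]]})
    return loaded, rejected

lookup = {"A": "Active"}
rows = [{"amount": "10.5", "code": "A"},
        {"amount": "abc", "code": "A"},            # violates the numeric rule
        {"amount": "7", "code": "Z"}]              # violates the lookup rule
loaded, rejected = cleanse(rows, lookup)
```

The tester's job is then to verify that each rule fires at exactly the intended point and that every rejected record carries enough detail for error processing.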


7. Key Hot Points in ETL Testing

There are several levels of testing that can be performed during data warehouse testing, and they should be defined as part of the testing strategy for the different phases of testing. Some examples include:

Constraint Testing: During constraint testing, the objective is to validate unique

constraints, primary keys, foreign keys, indexes, and relationships. The test script

should include these validation points. Some ETL processes can be developed to

validate constraints during the loading of the warehouse. If the decision is made to

add constraint validation to the ETL process, the ETL code must validate all business

rules and relational data requirements.

Source to Target Counts: The objective of the count test scripts is to determine if

the record counts in the source match the record counts in the target. Some ETL

processes are capable of capturing record count information such as records read,

records written, records in error, etc.

Source to Target Data Validation: No ETL process is smart enough to perform source-to-target, field-to-field validation on its own. This piece of the testing cycle is the most labor-intensive and requires the most thorough analysis of the data. There is a variety of tests that can be performed during source-to-target validation; the following are best practices:

Threshold testing

Field-to-field testing

Initialization

Transformation and Business Rules: Tests to verify all possible outcomes of the transformation rules, default values and straight moves, as specified in the business specification document. As a special mention, boundary conditions must be tested on the business rules.

Batch Sequence & Dependency Testing: ETLs in a DW are essentially a sequence of processes that execute in a particular order. Dependencies exist among the various processes, and these are critical to maintaining the integrity of the data. Executing the sequences in the wrong order may result in inaccurate data in the warehouse. The testing process must include at least two iterations of the end-to-end execution of the whole batch sequence, and the data must be checked for integrity during this testing. The most common errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable), and rejected records.

Job Restart Testing: In a real production environment, ETL jobs/processes fail for a number of reasons (for example, database-related failures or connectivity failures), and a job can fail when only partly executed. A good design always allows the jobs to be restarted from the point of failure. Although this is more of a design suggestion, it is recommended that every ETL job be built and tested for restart capability.
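Restart capability is usually tested by failing a job mid-run and verifying that a re-run resumes from the last checkpoint, neither reprocessing nor skipping records. A toy checkpointed loader (the batch contents and checkpoint scheme are illustrative; real jobs persist the checkpoint in a control table):

```python
def run_job(batches, state, fail_after=None):
    """Process batches from the saved checkpoint onward.

    fail_after simulates an outage: the job raises before processing
    the batch at that index."""
    start = state.get("checkpoint", 0)
    for i in range(start, len(batches)):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated failure")
        state.setdefault("loaded", []).extend(batches[i])
        state["checkpoint"] = i + 1      # persist progress after each batch

batches = [[1, 2], [3, 4], [5, 6]]
state = {}
try:
    run_job(batches, state, fail_after=1)   # loads batch 0, then "fails"
except RuntimeError:
    pass
run_job(batches, state)                     # restart: resumes from the checkpoint
```

The assertion to make after such a test is that every record was loaded exactly once, which is precisely what a restartable design must guarantee.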


Error Handling: A script that fails during data validation may still confirm, through process validation, that the ETL process is working as designed. During process validation the testing team works to identify additional data-cleansing needs, as well as consistent error patterns that could possibly be averted by modifying the ETL code. Whether to take the time to modify the ETL process is a decision for the project manager, the development lead and the business integrator. It is the responsibility of the validation team to identify any and all records that seem suspect.

Once a record has been both data- and process-validated and the script has passed, the ETL process is functioning correctly. Conversely, if suspect records identified and documented during data validation are not supported through process validation, the ETL process is not functioning correctly and the development team needs to become involved in finding the appropriate solution. For example, suppose that during the execution of the source-to-target count scripts suspect counts are identified (there are fewer records in the target table than in the source table). The 'missing' records should have been captured by the error process and should appear in the error log. If those records do not appear in the error log, the ETL process is not functioning correctly and the development team needs to become involved.
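The missing-record check in that example amounts to reconciling the source-minus-target gap against the error log: every key absent from the target must be accounted for by a logged rejection. A sketch with illustrative key lists:

```python
def unexplained_gap(source_keys, target_keys, error_log_keys):
    """Return keys that are missing from the target but have no error-log entry.

    A non-empty result means the ETL process is silently dropping data and
    the development team needs to get involved."""
    missing = set(source_keys) - set(target_keys)
    return missing - set(error_log_keys)

# Record 3 was properly rejected and logged; record 4 vanished without a trace.
gap = unexplained_gap(source_keys=[1, 2, 3, 4],
                      target_keys=[1, 2],
                      error_log_keys=[3])
```

Run after every load, this one check distinguishes an expected rejection (logged, explainable) from a genuine ETL defect (missing and unlogged).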

Negative Testing: Negative testing checks whether the application fails where it should fail, using invalid inputs and out-of-boundary scenarios, and checks how the application behaves in those cases.


8. Database Testing VS. DW Testing

The difference between a database and a data warehouse is not just data volume. ETL is the building block of a data warehouse, so data warehouse testing should be aligned with the data modeling underlying the warehouse, with specific test strategies designed for the extraction, transformation and loading modules.

Database Testing:

Smaller in scale

Usually tests data at the source rather than through a GUI

Usually homogeneous data

Normalized data

CRUD operations

Consistent data

Data Warehouse Testing:

Large scale, voluminous data

Includes several phases, with the extraction, transformation and loading mechanisms being the major ones

Heterogeneous data involved

De-normalized data

Usually read-only operations

Temporal data inconsistency


9. Challenges in DW Testing

Voluminous data from heterogeneous sources.

Data quality not assured at the source.

Difficult to estimate: often only the volume is known, with no accurate picture of the quality of the underlying data.

Business knowledge: organization-wide enterprise data knowledge may not be feasible.

100% data verification is not feasible. In such cases, the extraction, transformation and loading components must be thoroughly tested to ensure all types of data behave as expected.

Very high cost of quality: any defect slippage carries a significantly high cost.

Heterogeneous data is updated asynchronously.

Transaction-level traceability is difficult to attain in a data warehouse.

10. Tools: Data Warehouse Testing

There are no standard guidelines on the tools that can be used for data warehouse testing. The majority of teams go with the tool that was used for the data warehouse implementation. A drawback of this approach is redundancy: the same transformation logic has to be applied for both the DWH implementation and its testing.

Tool selection also depends on the test strategy, e.g. exhaustive verification, sampling or aggregation. The reusability and scalability of the test suite being developed are also very important factors to consider.


11. Case Study

The Client

Our client is an investment advisory firm managing assets for institutional and private clients worldwide, valued at $42.4 billion in assets, of which $34.2 billion is institutional/private client assets.

The Challenge

Our client generated all financial information and reports via manual calculations

delivering their final reports in Excel format. This did not satisfy customer needs and

they threatened to take their business to competitors where they could get reports in

industry standard format. The challenge: Implement a Financial Reporting Solution

within a year. To achieve this, our client needed to build an infrastructure and a data

warehouse that would store data to generate ad hoc and canned reports in a timely

manner. But, the biggest challenge was how to validate the reports and ensure that

all the calculations were accurate. The solution was to find a testing partner with extensive experience validating financial reports. This is exactly what SQA Solution offered: experienced financial reporting QA engineers with a strong financial background and the technical skills to deliver a high-quality, bullet-proof reporting solution.

The Solution

The SQA Solution team began by assessing the work and offered a Free Rapid

Assessment to understand the scope of work, schedule, and the resourcing needs.

We assembled a team of eight: one QA Lead and seven Senior QA Engineers. Our QA

lead was responsible for the overall Test strategy, Test Planning, Daily Status

reporting, and day-to-day team management.

Our team assessed reporting requirements, data sources, and data target. We also

reviewed source to target maps and came up with 800+ test cases that focused on

ensuring:

Data Completeness – ensuring all expected data is loaded.

Data Transformation – ensuring all data is transformed correctly according to

business rules and/or design specifications.

Data Quality – Ensuring ETL application correctly rejects, substitutes default

values, corrects or ignores and reports invalid data.

Performance & Scalability – Ensuring data loads and queries perform within

expected time frames and that the technical architecture is scalable.

Reporting UI Testing – verifying the reports' user interface

Data Calculations

Integration Testing

Compatibility Testing

User-acceptance Testing

Regression Testing


The Benefit

A high-quality data warehouse and reporting solution helped the client retain their existing customers and acquire new ones.

Achieved optimal performance for complex reports.

Delivered a great user experience and met the project deadlines.

12. Conclusion and Lessons Learnt

In this paper we proposed a comprehensive approach that adapts and extends the testing methodologies proposed for general-purpose software to the peculiarities of data warehouse projects. Our proposal builds on a set of tips and suggestions coming from our direct experience on real projects, as well as from interviews we conducted with data warehouse practitioners. As a result, a set of relevant testing activities was identified, classified, and framed within a reference design methodology.

In order to try out our approach on a case study, we are currently supporting a professional design team engaged in a large data warehouse project, which will help us better focus on relevant issues such as test coverage and test documentation. In particular, to better validate our approach and understand its impact, we will apply it to one of two data marts developed in parallel, so as to assess, on the one hand, the extra effort due to comprehensive testing and, on the other, the savings in post-deployment error correction and the gains in data and design quality.



Author Bio

Prasuna Potteti

| System Integration | Deloitte Consulting

Prasuna Potteti is a senior consultant with core competencies in software testing, data warehouse testing, test data management and testability. Familiar with many test tools, she helps teams develop test plans and test cases and execute them. She has helped develop automated functional tests, as well as performance tests, for packaged applications and custom development alike.

Prasuna has been actively involved in competency building in the testing COE (Center of Excellence) at Deloitte.

Appendix

DW – Data Warehouse

ETL – Extract, Transform & Load

ODS – Operational Data Store

QA – Quality Assurance

IT – Information Technology