Data Warehouse Best Practices
April 9, 2009
White Paper
Copyright 2009 Intrasphere Page 2 of 23 Confidential
Table of Contents
Introduction
Planning
    Readiness Assessment
        Corporate Sponsorship
        Appropriate Budget, Time and Resource Commitments
        User Demand
        IT Readiness
    Project Planning
        Realistic Timelines and Scope
        Phased Approach
        Communication Channels and Issue Tracking
        Data Governance Committee
Analysis
    Data Analysis
        Data Profiling
    System Analysis
        Capacity Planning
        Tool Assessment
Design
    Data Modeling
        Dimensional versus Relational
        Data Mapping
        Data Marts
    System Design
        Modular Design
        A/B Data Environments
Development
    Development Environment
        Multiple Dev Databases
        Architecture Team
        Golden Copy
        Full Sets of Source Data
Test
    Test Environment
        Concurrent Testing
        Regression Testing
        Automated Tools
Deployment
    Production Build
        Deployment Checklist
        Initial System Burn-in
Summary
About Intrasphere
For More Information
Introduction
Data warehouses are large, expensive systems. They typically take years to fully implement, require the
efforts of large teams, and often fail to deliver on their initial promise for technical, procedural and other
reasons. However, successful data warehouses that result in true business value and competitive
advantage can be built if the correct approach and best practices are followed.
This white paper serves as a practical best practices guide for Data Warehouse initiatives. It draws
upon the authors' years of experience designing and building Data Warehouses in the Pharmaceutical and
Life Sciences industry.
This document is organized around the end-to-end process that includes the following key phases:
Planning
Analysis
Design
Development
Test
Deployment
Planning
Data warehouses require extensive planning in order to succeed. The following are some best
practice activities that help ensure the appropriate level of planning is done before building the system.
Readiness Assessment
In the planning phase, it is important to honestly assess the readiness of the organization for the
implementation of a data warehouse. A data warehouse readiness assessment is important to identify
areas of potential failure. There are several published methods for executing this assessment, such as
those found in Ralph Kimball's The Data Warehouse Lifecycle Toolkit. The important factors to consider
usually cover the following basic areas:
Corporate Sponsorship
Appropriate Budget, Time and Resource Commitments
User Demand
IT Readiness
Corporate Sponsorship
Corporate Sponsorship is a key factor to consider in assessing the readiness of the organization.
Successful data warehouses have strong senior sponsorship from the leadership of the organization.
Ideally, the senior sponsor will be a respected visionary with the clout to influence budgets and convince
others of the importance of the data warehouse. Strong sponsorship will help:
Get corporate buy-in, generating acceptance of the data warehouse within the organization
Allocate resources and budget, ensuring the data warehouse has the funding and support it
needs to be built
Assist with removing any roadblocks that may come up during the building and deployment of the
data warehouse
Bridge departments to garner cooperation across departmental lines
Create and share a vision or mission statement that will convince the company as a whole of the
importance of the data warehouse
If the senior sponsor is not fully committed to the expense, time, and effort of the data warehouse, it may
be difficult to get others to fully support the effort, especially if timelines or budgets run over. While it is
possible to build a data warehouse without strong sponsorship, it is much more difficult and risky. This
step in the readiness assessment serves to identify or recruit the person(s) that will act as the sponsor(s),
and to gauge their commitment to championing the data warehouse effort.
Appropriate Budget, Time and Resource Commitments
Data warehouses require large investments in time, resources and money. They are usually implemented
via multi-year projects, and often have large teams both building and supporting them. A key to the
success of building a data warehouse is to make sure that the budget, time and resource needs are met.
It is critical to set realistic expectations and to gain commitments before starting. It is not unusual for a
data warehousing project to cost 50%-100% more than originally estimated, or to take twice as long to
complete. In addition to initial commitments on budget, time and resources, it is wise to set aside
contingency amounts.
Many times, it is easier to break the effort into several phases, and procure the budget and resource
commitments based on smaller efforts. Care should be taken to ensure that the areas of the data
warehouse that will produce the highest ROI receive the highest priority. This will help to prove the value
of the data warehouse and to acquire future funds and resources. This step in the assessment is to
understand how difficult it will be to raise the funds and get resources committed to the effort of
building the data warehouse.
User Demand
A data warehouse needs to meet the demands of its users, or it will not be adopted by the user community;
the proposed value of the data warehouse will not be realized, and the project will be deemed a failure.
It is very important to make sure that the user community is open to changing its business operations to
include using the data warehouse. The key here is to get the user community eager to get involved and
excited in the potential of the data warehouse. Not only will the data warehouse be built to better answer
the types of questions the users want to ask, but it will be better accepted by the user community when
deployed. This step in the assessment is to interview a few users and ensure that there is a need, and
that if you build it, they will come.
IT Readiness
A data warehouse will usually be built and supported by a company's Information Technology
department. It is critical to evaluate the technical abilities of the IT department to host the data
warehouse. Several factors should be investigated to assess whether or not the IT department is ready to
support the effort:
Ability to acquire, deploy and host the necessary hardware
Ability to acquire the software licenses necessary
Necessary resources and budgets to host the system
Technical skills and experience with the hardware and software platforms chosen to implement
the data warehouse (database, ETL, business intelligence tools, etc.)
Technical skills to restart the system in case of failure
Technical skills to back up and restore the system as needed
Technical skills to rebuild the system in case of disaster
If the IT department is already supporting data warehouses, it is a good idea to plan to use similar
platforms if possible. This assessment step is critical in identifying any gaps in the support the system will
need once built. These gaps, if any, should be addressed in the initial data warehouse project.
Project Planning
There are several activities conducted during the project planning phase of the data warehouse project
that can help ensure success. While most are typical of any project, the size and complexity of a data
warehouse project can make them especially critical. Creating a realistic project timeline, ensuring clear
communication channels, setting up rigorous scope and change controls, issue tracking and escalation,
and frequent status checks are all important in any project, but critical to the success of a data
warehouse. In addition, a data governance board should be established to quickly resolve any data
issues found, or escalate them to proper resolution as fast as possible.
Realistic Timelines and Scope
It is highly recommended that the project timeline be driven by a bottom-up approach. The scope for the
initial release (in a multi-phased approach) should be clearly identified. Each task needed to accomplish
the scoped result should be assigned an estimated effort, named resources (if possible), and identified
constraints. It is very risky to time-box a data warehouse project to meet a specific deadline. Usually,
attempting delivery with a top-down timeline will result in a very limited project scope.
Phased Approach
A phased approach is highly recommended. Data warehouses are easily chunked into work efforts
based on either sets of data source systems or sets of data marts addressing specific related business
function needs. A well designed data warehouse needs to be able to add new source systems and new
business intelligence tools throughout its lifetime. This property lends itself to phased development and
deployment. Phased efforts usually take longer and can cost more in the long run than all-or-nothing
efforts, but are inherently much less risky, and users can gain access to and use the system much earlier
(which means the return on the investment is realized earlier as well).
Another major benefit of a phased approach is that it is more flexible to users' needs. Users typically
develop new ideas and requirements once they start accessing a data warehouse. A phased approach
can respond quickly and increase the value of the system to these users.
Finally, a phased approach allows teams to include bug fixes or resolved data issues that may not have
been caught during testing. This will increase the quality of the data warehouse, and increase its value to
the users.
Communication Channels and Issue Tracking
A clear organization chart of the project and clear, articulated communication channels should be
established to ensure that issues are raised and dealt with in a timely manner. Data warehouses typically
uncover many unexpected problems, and problem resolution is key to ensuring the project runs smoothly
and is not excessively delayed by unresolved issues. Issues should be tracked, and the status of the
issues should be reviewed frequently to ensure progress is being made. Utilizing an issue tracking tool
can assist in making sure issues are visible to the entire team, but should not be the sole channel for
communicating issues. When issues are raised, they should be acknowledged before being assumed to
be assigned.
A weekly status meeting is effective for tracking major issues. However, due to the number of teams and
the complexity of the data warehousing development and testing effort, it is recommended that each team
have frequent team meetings in addition to project status meetings.
Test team members should also be paired with development team members to report bugs directly. This
will usually allow for better communication between tester and developer, and will help speed the overall
resolution of issues found during testing.
Data Governance Committee
Data warehouse projects uncover many hidden data issues in operational databases. Due to the high
number of data issues found during initial development of a data warehouse, it is highly recommended
to assemble a Data Governance Committee. The role of the committee will be to:
Resolve or seek resolution on data issues
Publish a set of master data by identifying official sources for data lists (such as product lists)
Identify methods for fixing data errors
Describe the definitions of data elements across multiple source systems
Enforce data standards
The Data Governance Committee should be staffed by knowledgeable data stewards from the business
users, source systems and the data warehouse. Its members may also perform some tasks in the data
warehouse development effort.
Analysis
The analysis phase of a data warehousing project typically takes longer than other system development
projects. This is because of the effort required to profile and analyze the data from multiple source
systems. There are several activities that are best practices during this phase. They include:
Data profiling
Capacity planning
Tool assessment
Data Analysis
Data Profiling
Data profiling is a critical component of building a data warehouse. It consists of investigating the raw
data in the source systems, looking for data patterns, limits, string lengths, distinct values, constraint
violations, etc. It is the first pass at identifying potential data issues, and is also the analysis step that will
generate requirements for the designs of the data models and the data mappings. This step is typically
done by a data analyst using SQL editors or data profiling tools.
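The profiling checks described above reduce to a handful of aggregate queries. The following is a minimal sketch (table name, column name, and sample data are all hypothetical, chosen only for illustration) that computes row count, distinct values, null count, and maximum string length for one column:

```python
import sqlite3

def profile_column(conn, table, column):
    """Compute basic profile metrics for one column: row count,
    distinct values, null count, and the longest string length."""
    cur = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"MAX(LENGTH({column})) FROM {table}"
    )
    total, distinct, nulls, max_len = cur.fetchone()
    return {"rows": total, "distinct": distinct,
            "nulls": nulls, "max_length": max_len}

# Hypothetical source table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("Aspirin",), ("Ibuprofen",), ("Aspirin",), (None,)])
profile = profile_column(conn, "products", "product_name")
```

In practice the same queries would be run against every candidate source column, and the results fed into the data model and data mapping designs.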
Most often, this step is done in parallel with the data modeling and data mapping efforts, usually by the same
team. This approach allows the team to focus on specific areas, modeling, mapping, and assessing the
data quality all at once. Even though the data modeling and data mapping activities reside in the Design
Phase, the reality is that these three activities lend themselves to iterative execution very well, and are
usually done together. In fact, it is not uncommon for these iterative activities to continue into the
development phase, since data issues and anomalies are sometimes not encountered until then.
System Analysis
Capacity Planning
A capacity plan is a critical component for any data warehouse. It is the guide to growing the system over
time. Capacity plans for data warehouses consider the following:
Initial data storage size requirements of the data warehouse
Incremental data growth due to ETL migrations
Number of users (usually identified as named users, active users and concurrent users)
Estimated processing requirements based on concurrent user queries and other processes
ETL schedule requirements
User access time (up-time) requirements
Archiving and Partitioning
The inputs into the initial capacity plan should be gathered during the analysis phase, since data profiling
can give a clear understanding of data size and growth expectations. These will be used to ensure
appropriate data storage capacity is available, and that the ETL servers can process the required data
within the ETL batch window.
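The sizing inputs above reduce to simple arithmetic. The sketch below illustrates the idea with assumed figures (500 GB initial size, 40 GB of monthly ETL growth, a 30% index and temp-space overhead, 12 million rows per night against a 3 million row/hour ETL throughput, and a six-hour batch window); none of these numbers come from the paper:

```python
def projected_storage_gb(initial_gb, monthly_growth_gb, months, index_overhead=0.3):
    """Linear growth projection with a fixed fractional overhead for
    indexes and temp space (assumed value, tune per platform)."""
    raw = initial_gb + monthly_growth_gb * months
    return raw * (1 + index_overhead)

def fits_batch_window(rows_per_night, rows_per_hour_throughput, window_hours):
    """True if the nightly ETL volume can be processed inside the window."""
    return rows_per_night / rows_per_hour_throughput <= window_hours

storage_yr1 = projected_storage_gb(500, 40, 12)      # ~1.27 TB after year one
ok = fits_batch_window(12_000_000, 3_000_000, 6)     # 4 hours of ETL in a 6-hour window
```

The same two calculations, rerun periodically with measured rather than estimated inputs, are what keeps the capacity plan a living document.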
A capacity plan should be a living document, periodically measuring the initial growth and user pattern
estimates against actual values. This will ensure that performance and functionality are not limited by
data warehouse growth.
Tool Assessment
Data warehouses require a number of specialized software tools. Even the simplest data warehouses will
have database software, ETL software, and some business intelligence or reporting software. During the
analysis phase, it is critical to identify the tools to be used to implement the data warehouse, especially if
they will require the assigned development resources to be trained in their use. It is essential to choose
the tools that will not only fit the current vision, but the future expectations of the data warehouse. Scaling,
support, and product upgrades should be considered. Time should be set aside to allow for vendor
demos and bake-offs. Time should be allowed for the vendors to build demos based on specific
requirements as this can help identify the fit of the tools.
The foundational software components of a data warehouse are typically very expensive, and require a
good deal of technical experience and competence to ensure a successful implementation. This causes
most companies to standardize on specific tool vendors. If that is the case, a tool assessment is still
recommended to ensure that gaps in functionality are identified so that alternate solutions can be
designed.
Design
The design phase of a data warehouse is typically longer than that of other system projects. This is due to
several factors, including the iterative nature of data modeling and data mapping, the design of complex
technical infrastructure, and the interactions with the several source systems. This section of best
practices will identify some key considerations during the design phase of a data warehouse.
Data Modeling
The most critical task for any data warehouse is the data model. Deciding the appropriate model is key to
the success and performance of the data warehouse, as well as to the types and diversity of queries
supported. There are also several best practices based on or related to the data model. Data mapping
describes the movement of data from the source systems into the target data models. Data marts are
views or sub-components of the data warehouse built to support a specific business process or group of
related processes for a specific business functional group. A/B data switches are a mechanism used to
maintain separation between ETL and end users, allowing both access to the data concurrently.
Dimensional versus Relational
There are two approaches to data modeling for a data warehouse, each with strengths in various
scenarios. They are the Dimensional and the Relational models. Dimensional models (also called star
schemas) are especially effective for data that is aggregated at different levels, and for building cube
databases for drilling on various data attributes. Dimensional models retrieve large amounts of data more
efficiently and quickly than Relational models. Relational models are effective for pinpointing individual
records quickly. Most data warehouses use dimensional models for the data marts (where users interact
with the data). The back office staging areas usually contain both dimensional and relational tables,
depending on the needs of the ETL and other back office processing.
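A minimal star schema makes the contrast concrete. The sketch below (illustrative table and column names, tiny sample data) builds one fact table joined to two dimensions and runs the kind of aggregate-by-dimension query that dimensional models serve well:

```python
import sqlite3

# Minimal star schema: one fact table keyed to two dimension tables.
# All names and data are illustrative, not taken from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          calendar_date TEXT, quarter TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER,
                          units INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Aspirin", "OTC"), (2, "Ibuprofen", "OTC")])
conn.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                 [(20090101, "2009-01-01", "Q1"), (20090401, "2009-04-01", "Q2")])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                 [(1, 20090101, 10, 50.0), (2, 20090101, 5, 40.0),
                  (1, 20090401, 8, 40.0)])

# A typical dimensional query: roll the fact up to the quarter level.
rows = conn.execute("""
SELECT d.quarter, SUM(f.units)
FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.quarter ORDER BY d.quarter
""").fetchall()
```

A relational (normalized) model would answer the same question, but the star shape keeps the join paths short and predictable, which is what makes large aggregations fast.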
The data modeling effort in a data warehouse takes roughly four times as long as for an operational or
transactional system. It is an iterative effort, and becomes increasingly (some say exponentially) more
complicated with each additional source system conformed. It is critical to have an experienced data
warehouse data modeler work with the project team and data governance board to accomplish this
critical task.
Data Mapping
Data mapping is an essential part of designing the ETL of the data warehouse. It identifies each source
data element, any transformations or processing routines applied, and the target data element into which
the data is loaded. Two main tips for data mapping are to use a single spreadsheet worksheet for each
table, and to map target to source. When mapping target to source, it is easier to ensure that all required
data fields in the target data model are addressed. Mapping source to target can include many
unnecessary data elements (in the source systems) that will confuse and distract.
The data mapping is also used by the development team for implementing the ETL. All transformations,
data quality checks, and cross references must be included in the data mapping document. It is also not
uncommon for the final revisions of the data mapping to be made during or shortly after the ETL
development, to reflect any modifications made during this phase.
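A target-to-source mapping can be represented very simply; the sketch below mirrors the "one worksheet per table" tip with one list per target table. Every column, source path, and transformation rule here is hypothetical:

```python
# Target-to-source mapping for one hypothetical target table.
# One such structure per table mirrors "one worksheet per table".
DIM_PRODUCT_MAPPING = [
    # (target_column, source_system.table.column, transformation rule)
    ("product_key",  None,                   "surrogate key, generated by ETL"),
    ("product_name", "erp.products.prod_nm", "TRIM, UPPER"),
    ("category",     "erp.products.cat_cd",  "lookup against category cross-reference"),
]

def unmapped_targets(mapping):
    """Mapping target-to-source makes gaps visible: every target column
    must have either a source element or an explicit rule."""
    return [target for target, source, rule in mapping
            if source is None and not rule]

gaps = unmapped_targets(DIM_PRODUCT_MAPPING)  # empty: every column is accounted for
```

The same check run source-to-target cannot detect a forgotten target column, which is the practical argument for mapping in the target-to-source direction.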
Data Marts
Data marts are the views or tables of data that the users interact with. Data marts are tailored to answer
specific sets of user queries in order to ensure the best possible performance. This means that data
marts are typically limited to a single business process or a group of related business processes for a
single business functional group. This allows some variations on data modeling approaches and physical
implementation approaches for each data mart with performance tuning in mind. As mentioned above,
data marts can be either views or tables, or even files, depending on query performance requirements.
Often, data marts are materialized views that are refreshed with each ETL batch execution. However,
some data marts are frozen points in time (for example, quarterly data) in order to ensure the best possible
performance. The bottom line is that a data mart is an individually tailored set of data that best serves the
users accessing it for the types of questions they need answered. A good data warehouse will have many
data marts specific to identified needs, rather than just piling all the data together to let the users do what
they want.
System Design
Modular Design
One critical design best practice in data warehousing is to ensure that the design of all the system
components is as modular as possible. Data warehouses are complicated systems that interact with
many other transactional and operational systems. Over the lifespan of a typical data warehouse, source
systems will be retired and replaced by new sources. In addition, new data marts will be required to ask
and answer new business questions. New business intelligence tools will be leveraged to analyze and
model the results. This means that several times during the lifespan of a data warehouse, certain parts of
the warehouse will be redesigned, replaced, or retired. In order to minimize the impact to the system
during these changes, a modular design is critical.
The design should balance the desire for reuse with the expectation of replacement. For example,
designs should abstract the ETL staging area to allow for a data source to be replaced and minimally
impact the reports leveraging the data. There are several data modeling techniques that can help
minimize the impact of change (such as slowly changing dimensions). Other heuristics include:
ETL should be done in several legs (acquire from source, process and transform data, integrate
with other data, load into data marts)
Data marts should be used to group specific related business functional areas
Data models should support changes with minimum impact
One business function per code module
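The "several legs" heuristic above can be sketched as a chain of single-purpose functions, so that replacing a source system only touches the acquire step. This is an illustrative toy with invented field names, not a real ETL framework:

```python
# Each ETL "leg" is a separate, swappable function with one responsibility.

def acquire(source_rows):
    """Leg 1: land raw data from the source unchanged."""
    return list(source_rows)

def transform(rows):
    """Leg 2: apply cleansing rules (here, trim and upper-case a name)."""
    return [{**r, "name": r["name"].strip().upper()} for r in rows]

def integrate(rows, cross_ref):
    """Leg 3: conform against shared reference data."""
    return [{**r, "category": cross_ref.get(r["name"], "UNKNOWN")} for r in rows]

def load(rows, mart):
    """Leg 4: publish into the data mart."""
    mart.extend(rows)
    return mart

mart = []
raw = acquire([{"name": " aspirin "}])
load(integrate(transform(raw), {"ASPIRIN": "OTC"}), mart)
```

Because each leg has a single input and output contract, retiring a source or adding a new mart replaces one function rather than rewriting the pipeline.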
A/B Data Environments
There are a few strategies for minimizing the impact of the ETL on end users. Typically, the ETL
processes run during off hours, and users are either discouraged from or even denied access to the data
during these times. An A/B switch is a set of identical tables, one available for user access while the other
is being updated by the ETL. This can be an especially effective strategy for ETL processes that run more
than once a day. The concept can be implemented in a number of ways, but the idea is fairly simple. The
ETL updates one set of tables while the users access the other. Then, when the ETL is finished, they
switch (usually by dropping and recreating synonyms). This strategy also has the benefit of allowing the
users to continue to have access to the data in the event of an ETL failure.
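The switch mechanics can be demonstrated with any database. SQLite has no synonyms, so the sketch below uses a view as the switchable pointer instead; the pattern (load the idle side, then repoint) is the same, and all table names are invented:

```python
import sqlite3

# A/B switch sketch: users always query `sales`, a view that points at
# whichever of the two identical tables is currently live.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_a (amount REAL);
CREATE TABLE sales_b (amount REAL);
INSERT INTO sales_a VALUES (100.0);
CREATE VIEW sales AS SELECT * FROM sales_a;   -- A side is live
""")

def switch_to(conn, side):
    """After the ETL finishes loading the idle side, repoint the view.
    With synonym support this would be a synonym drop/recreate instead."""
    conn.executescript(
        f"DROP VIEW sales; CREATE VIEW sales AS SELECT * FROM sales_{side};"
    )

# The ETL loads the idle B side while users still read A, then we switch.
conn.execute("INSERT INTO sales_b VALUES (250.0)")
switch_to(conn, "b")
live_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

If the load into the idle side fails, the switch is simply not performed, so users keep reading the last good copy, which is the failure-tolerance benefit noted above.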
Development
The development phase of a data warehouse typically involves multiple sub-teams building various
components of the system. There may be one or more ETL teams, a database team, and one or more BI
system development teams (for the end user reports and query tools). Careful planning is required to
prevent these resources from stepping on each other's toes during component development. Typically,
data warehouses will have multiple development environments to allow each team to focus on its
components, without negatively impacting other teams. This adds complexity, however, when the
underlying architecture components change, since these changes must be tracked and published to the
many development environments. These complications can be minimized by maintaining clear
communication channels and architecture oversight, usually in the form of the architecture team.
Development Environment
Multiple Dev Databases
As mentioned above, having several development environments will prevent teams from impacting each
other. An example would be the ETL team running component tests changing the data while the report
development team is writing SQL queries. The goal of this best practice activity is to give each team a
development environment that will minimize the impact of such situations. Depending on the design
structure of the data models, this can be a single database schema within a development instance, or
may be a copy of the entire database itself in separate database instances. The design of the
development environments will depend on balancing available hardware, software, and support resources
with the needs of the development teams.
Architecture Team
An architecture team is required to mitigate the potential dangers of variations in the architecture and data
models of multiple development environments. The role of the architecture team is to:
Control changes to the underlying architecture of the data warehouse
Brainstorm to resolve architectural issues
Represent each development team with regards to impacts
Communicate across development teams
Plan inter-team activities
Changes to the logical and physical database structures are especially important, since they will impact
all teams. The architecture team should consist of members of each development team to review and
assess the impact of proposed modifications with regards to the team they are representing. They also
represent the communication channel back to the team.
In addition to controlling changes, the architecture team should be a conduit for each development team
to communicate to the other teams. Sharing of issues and resolutions, enhancement ideas, status, and
coordination of cross team activities (such as the periodic refresh of the dev environments) should be part
of the discussions of the architecture team.
Golden Copy
One of the key activities in minimizing the potential impact of multiple development environments is the
periodic refresh of those environments with the latest version of the official development environment
footprint. This official version is often called the golden copy. It is also the version of the development
environment that will be promoted to the test environments when the time comes to do so. The golden
copy should be maintained by one or more members of the architecture team. Whenever changes to the
underlying architecture are made, and these changes pass component (unit) testing, they should be
integrated into the golden copy. Then, the golden copy should be distributed to the various development
environments to allow team members to test for impacts to their components and to adjust their code if
necessary. Each team will then submit tested versions of its components for inclusion into the golden
copy, where it can be distributed to the other teams in the next periodic refresh. In addition, the goldencopy should maintain a full set of source data.
Full Sets of Source Data
Acquiring full, up-to-date copies of the data sources is key to successfully building and testing data
warehouse components. It allows realistic component testing, is essential for performance testing, and is
vital for preventing data issues that cause errors in the ETL and Business Intelligence queries. In addition,
it avoids the need to create test data, which can be extremely time-consuming. It may
be necessary to update the copies with newer copies if a source system is altered (upgraded) or if the
overall data changes significantly.
Any manufactured data created by the development or test teams should not be stored in the golden copy
of the source data, unless all teams agree.
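One possible trigger for such an update is simple volume drift between the live source and the stored copies. The sketch below compares per-table row counts; the 10% threshold and the table names used in the example are illustrative only.

```python
def needs_refresh(source_counts: dict[str, int],
                  copy_counts: dict[str, int],
                  drift_threshold: float = 0.10) -> list[str]:
    """Return the tables whose copied row counts have drifted more than
    `drift_threshold` (e.g. 10%) from the live source, signalling that
    the stored copy of the source data should be re-acquired.
    """
    stale = []
    for table, live in source_counts.items():
        copied = copy_counts.get(table, 0)
        if live == 0:
            continue  # empty source table: nothing meaningful to compare
        if abs(live - copied) / live > drift_threshold:
            stale.append(table)
    return stale
```

Row counts are a coarse signal; a real check might also compare key ranges or checksums, but the principle of refreshing on significant drift is the same.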
Test
The testing phase of a data warehouse usually takes significantly longer than the testing of an operational
system. This is due in large part to the need to run multiple ETL cycles in order to fully test a data
warehouse release. If the initial data load will be implemented through the ETL processes, testing the first
full run of the ETL can take several days. Similar to development, having a few test environments can
accelerate the process by allowing some tests to be run concurrently. In addition, the utilization of
automated testing tools can significantly decrease the time it takes to implement regression tests.
Test Environment
Concurrent Testing
Similar to the best practice of having several development environments, several test environments will
allow for concurrent testing. The following list is an example of the types of concurrent tests that can be
run if there are multiple test environments (database instances and schemas):
Initial ETL Load/Performance Test (tests the performance of the first run of ETL)
Incremental ETL Tests (both performance and functional tests)
Business Intelligence Data Mart Tests (tests the business intelligence components)
User Acceptance Tests
Typically, the initial load of an ETL process can take from several hours to several days, depending on
the amount of data processed. For example, an ETL run loading 10 years of data will handle roughly 520
times the data a weekly incremental ETL run will process (10 years is about 520 weeks). Performance
tuning changes require that these tests be run a few times, which alone is why performance and load
testing is usually done in a separate
environment.
The incremental ETL testing will require several cycles to ensure data conditions are thoroughly tested
(e.g., testing inserts, updates, and deletes to source data). Since these tests are focused on testing
changes to the data, combining these tests with Business Intelligence report and query tests can extend
the timeline of both. Typically, a specific ETL process is run to create the data mart environments for the
Business Intelligence testing processes/environment; these data marts are then left untouched by the
ETL until the next cycle.
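The insert/update/delete cycles above can be exercised with a toy harness. Here SQLite stands in for both the source system and the warehouse, and the `customer`/`dim_customer` schema is purely illustrative; a real incremental ETL would detect changed rows via change-data-capture or timestamps rather than fully re-syncing the table.

```python
import sqlite3

def make_dbs():
    """Create illustrative in-memory source and warehouse databases."""
    src = sqlite3.connect(":memory:")
    wh = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    wh.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
    return src, wh

def run_incremental_etl(src, wh):
    """Toy ETL cycle: fully re-sync dim_customer from the source table.

    A full rebuild keeps the harness simple; the point is that each call
    represents one ETL cycle whose effect on the warehouse can be asserted.
    """
    rows = src.execute("SELECT id, name FROM customer").fetchall()
    wh.execute("DELETE FROM dim_customer")
    wh.executemany("INSERT INTO dim_customer (id, name) VALUES (?, ?)", rows)
    wh.commit()
```

A test plan would run one such cycle per data condition (insert, update, delete) and assert the warehouse state after each, exactly the multi-cycle pattern described above.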
While multiple environments can speed up the testing cycles, care must be taken to ensure that all
environments are rebuilt with the latest golden copy for each cycle. The golden copy will include the fixes
to all bugs found in the previous cycle, and each environment will need regression testing to ensure the
fixes did not break something already tested.
Finally, it should be noted that the challenges, complexities, and costs of maintaining several test
environments have encouraged some projects to forgo the time savings and adopt a single-threaded,
serial approach to testing.
Regression Testing
Regression testing is a key component of data warehouse development. In the first release, regression
testing is used to ensure code stability by testing bug fixes for negative impacts to components that have
already passed testing. While this is important for the first release, it is critical for the ongoing lifecycle of
the data warehouse. Subsequent releases will include new source systems, new business intelligence
tools, and/or new data marts with new ETL driven data processing. While regression testing is not unique
to data warehousing, developing a good regression test plan is essential to ensuring that multiple future
releases are deployed smoothly with little impact to the users. A best practice with regards to regression
testing is the use of automated testing tools.
Automated Tools
There are several factors in choosing and using automated testing tools for a data warehouse. It should
be noted that in most scenarios, the initial implementation of automated testing is time consuming. The
activities involved in setting up the tools, creating the test scripts, and debugging test cycles, as well as
training testers, can add significant time to the test planning phase of the project. However, over the life of
a data warehouse, this initial investment will pay off. There are some key features that these tools should
possess:
Test script repository
Test script version control
Ability to organize and combine test scripts
Thorough result reporting
Ability to continue processing after errors are encountered
Ability to initiate ETL processes
Ability to query database
Ability to interact with Business Intelligence tools
Intuitive user interface
Issue tracking
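As a sketch of two of these features, continuing after errors and thorough result reporting, the harness below runs named test scripts and records every outcome rather than stopping at the first failure. The test scripts here are plain Python callables for illustration; a commercial tool would pull them from a script repository with version control.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    name: str
    passed: bool
    error: str = ""

def run_regression_suite(tests: dict[str, Callable[[], None]]) -> list[TestResult]:
    """Run every named test script, continuing past failures so a single
    broken ETL step does not mask the remaining results.
    """
    results = []
    for name, test in tests.items():
        try:
            test()
            results.append(TestResult(name, True))
        except Exception as exc:  # record the failure and keep going
            results.append(TestResult(name, False, str(exc)))
    return results
```

The resulting list supports the "thorough result reporting" requirement: every script's pass/fail status and error detail is available for the test report, not just the first failure encountered.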
It may be necessary to use several automated test tools. For example, performance and load testing may
require a different tool than Business Intelligence report testing. Using a suite of vendor-related tools is
common, and may have the benefit of leveraging a single repository for test scripts and issue tracking.
Deployment
Deploying a data warehouse also usually takes longer than deploying an operational system. This is
especially true for the first release. The initial data load of the system can take several days. In addition,
the system is usually given a few days to cycle through the ETL to ensure all system integration points are
correct and no issues arise. The deployment package should include the golden copy and checklists of
steps to deploy the system, set up scheduled batch jobs, modify permissions and create user accounts on
both source and target databases, and implement other necessary preparatory actions.
Production Build
Deployment Checklist
The deployment checklist is an essential tool to ensure a smooth deployment. The checklist should be
developed by cataloguing all activities required to build the test environments. During the final test cycle,
the test environment should be built from scratch using the deployment checklist in order to test the
checklist. The checklist is usually a spreadsheet or series of spreadsheets that contain the following
information:
Activity name
Assigned resource
Step-by-step detailed activity instructions
Golden copy filenames or component names
Hardware and software component identifiers (such as server ids and schema names)
Required parameters (including login ids, etc.)
Communication directions (notification of completion, errors, etc.)
Execution or scheduling of initial processes (such as kicking off the ETL)
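For illustration, a checklist row can be modeled as structured data so that the deployment coordinator can track open activities programmatically. The field names and sample activities below are hypothetical; a real checklist would carry all of the columns listed above.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    activity: str          # activity name
    assigned_to: str       # assigned resource
    instructions: str      # step-by-step detailed instructions
    done: bool = False

def remaining(checklist: list[ChecklistItem]) -> list[str]:
    """Return the activities still open, in deployment order, so the
    coordinator can see at a glance what is left to execute.
    """
    return [item.activity for item in checklist if not item.done]
```

Keeping the checklist in a machine-readable form also makes it easy to reuse verbatim when the final test environment is built from scratch to validate the checklist itself.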
Including passwords in the deployment checklist can be a security risk. It is recommended that any
passwords needed to execute a deployment checklist step be supplied outside of the checklist in a
secure manner. Another alternative is to utilize temporary passwords during the deployment, with a final
checklist step to change them.
The checklist should be owned by the architecture team. While it is recommended that one person
coordinate the deployment effort through monitoring the checklist activities, it is vital to ensure that more
than one person be fully knowledgeable on the details of the checklist activities.
Initial System Burn-In
A system burn-in period is recommended after the initial deployment of the data warehouse. During this
burn-in period, the ETL is run for a few days to ensure there are no unexpected problems, such as a
scheduled backup interfering with the ETL. A few users should be granted access to ensure there are no
problems with the user-facing business intelligence applications and report data. This is a highly
recommended best practice, because during the first release, something will go wrong. A week of
burn-in is typical.
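A sketch of a burn-in summary check: given per-run ETL log entries, it reports whether the burn-in period was clean. The log entry format here is assumed for illustration, not prescribed by any particular ETL tool.

```python
def burn_in_report(etl_log: list[dict]) -> dict:
    """Summarize a burn-in period from per-run ETL log entries.

    Each entry is assumed (hypothetically) to look like
    {"day": 1, "status": "ok" | "failed", "detail": "..."}.
    """
    failures = [e for e in etl_log if e["status"] != "ok"]
    return {
        "runs": len(etl_log),
        "failures": len(failures),
        "clean": not failures,          # True only if every run succeeded
        "details": [e["detail"] for e in failures],
    }
```

A failure such as a scheduled backup locking a staging table would surface in `details`, giving the team a concrete list of integration problems to resolve before opening the system to all users.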
Summary
Data warehouse projects are similar to many other system development projects, and many of the
standard project methodology best practices apply. However, data warehouses are typically much larger
and much more complex than other types of systems, and this complexity brings challenges and risks.
The underlying theme of many of the best practices listed in this document is to reduce the complexity
and size of the activities into smaller, more manageable chunks. Using a modular approach to component
design, a phased approach to scoping, and an iterative approach to analysis and testing can help
accomplish this goal. A methodical, piece-by-piece approach to building a data warehouse is much more
manageable than an all-or-nothing approach.
About Intrasphere
Intrasphere Technologies, Inc. (www.intrasphere.com) is a consulting firm focused on the Life Sciences
industry. We provide comprehensive, business-focused services that help companies achieve
meaningful results. Our professionals leverage strategic acumen, deep industry knowledge, and proven
project execution abilities to deliver superior service that builds true business value.
Our strategy, business process and technology services are developed to specifically address areas that
are most important to our clients, including Drug Safety, Business Intelligence, Enterprise Content
Management, Compliance and IT Management, to name a few.
We understand the unique nature of the Life Sciences working environment and our clients' need to reduce
costs, drive business processes and speed-to-market, while satisfying regulatory mandates.
Some of the world's leading global companies, including Pfizer Inc. (NYSE: PFE), Johnson & Johnson
(NYSE: JNJ), Novartis (NYSE: NVS), Eli Lilly (NYSE: LLY), Vertex Pharmaceuticals (Nasdaq: VRTX) and
HarperCollins Publishers (NWS), among others, look to Intrasphere as their trusted solutions partner.
Founded in 1996, Intrasphere is headquartered in New York City with operations in Europe and Asia.
Intrasphere has been recognized nationally for performance by industry-leading organizations such as
Deloitte & Touche, Crain's New York Business and Inc. Magazine.
For More Information
Jim Brown
Intrasphere Technologies
(212)
[email protected]
Locations
North America:
Corporate Headquarters
New York City
Intrasphere Technologies, Inc.
100 Broadway, 10th Floor
New York, NY 10005
ph: +1 (212) 937-8200
fax: +1 (212) 937-8298
Europe:
United Kingdom
4th Floor
Brook House
229-243 Shepherds Bush Road
Hammersmith London, W6 7NL
ph: +44 (0) 208 834 3700
fax: +44 (0) 208 834 3701
Asia:
India
Block 2-A, DLF Corporate Park
DLF City, Phase III
Gurgaon, Haryana 122002
ph: +91 (0124) 4168200
fax: +91 (0124) 4168201