Chapter 12 - Data Warehousing and Online Analytical Processing


    A data warehouse is often used as the basis for a decision-support system (also referred to from an analytical perspective as a business intelligence system). It is designed to overcome some of the problems encountered when an organization attempts to perform strategic analysis using the same database that is used to perform online transaction processing (OLTP).

    A typical OLTP system is characterized by having large numbers of concurrent users actively adding and modifying data. The database represents the state of a particular business function at a specific point in time, such as an airline reservation system. However, the large volume of data maintained in many OLTP systems can overwhelm an organization. As databases grow larger with more complex data, response time can deteriorate quickly due to competition for available resources. A typical OLTP system has many users adding new data to the database while fewer users generate reports from the database. As the volume of data increases, reports take longer to generate.

    As organizations collect increasing volumes of data by using OLTP database systems, the need to analyze data becomes more acute. Typically, OLTP systems are designed specifically to manage transaction processing and minimize disk storage requirements by a series of related, normalized tables. However, when users need to analyze their data, a myriad of problems often prevents the data from being used:

    Users may not understand the complex relationships among the tables, and therefore cannot generate ad hoc queries.

    Application databases may be segmented across multiple servers, making it difficult for users to find the

    tables in the first place.

    Security restrictions may prevent users from accessing the detail data they need.

    Database administrators prohibit ad hoc querying of OLTP systems, to prevent analytical users from running queries that could slow down the performance of mission-critical production databases.

    By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization can improve response time for reports and queries. Yet a schema optimized for OLTP is often not flexible enough for decision-support applications, largely due to the volume of data involved and the complexity of normalized relational tables.

    For example, each regional sales manager in a company may wish to produce a monthly summary of the sales

    per region. Because the reporting server contains data at the same level of detail as the OLTP system, the

    entire month's data is summarized each time the report is generated. The result is longer-running queries that

    lower user satisfaction.

    Additionally, many organizations store data in multiple heterogeneous database systems. Reporting is more

    difficult because data is not only stored in different places, but in different formats.

    Data warehousing and online analytical processing (OLAP) provide solutions to these problems. Data warehousing is an approach to storing data in which heterogeneous data sources (typically from multiple OLTP databases) are migrated to a separate homogenous data store. Data warehouses provide these benefits to

    analytical users:

    Data is organized to facilitate analytical queries rather than transaction processing.

    Differences among data structures across multiple heterogeneous databases can be resolved.

    Data transformation rules can be applied to validate and consolidate data when data is moved from the OLTP database into the data warehouse.

    Security and performance issues can be resolved without requiring changes in the production systems.

    Sometimes organizations maintain smaller, more topic-oriented data stores called data marts. In contrast to a data warehouse, which typically encapsulates all of an enterprise's analytical data, a data mart is typically a

    3/6/2011 Chapter 12 - Data Warehousing and Onl

    microsoft.com//cc917548(printer).aspx 1/


    account.

    Data should be stored in the data warehouse in a single, acceptable format agreed to by business analysts,

    despite variations in the external operational sources. This allows data from across the organization, such as

    legacy data on mainframes, data in spreadsheets, or even data from the Internet, to be consolidated in the

    data warehouse, and effectively cross-referenced, giving the analysts a better understanding of the business.

    Subject-oriented Data

    Operational data sources across an organization tend to hold a large amount of data about a variety of

    business-related functions, such as customer records, product information, and so on. However, most of this

    information is also interspersed with data that has no relevance to business or executive reporting, and is

    organized in a way that makes querying the data awkward. The data warehouse organizes only the key business information from operational sources so that it is available for business analysis.

    Historical Data

    Data in OLTP systems correctly represents the current value at any moment in time. For example, an order-

    entry application always shows the current value of stock inventory; it does not show the inventory at some time in the past. Querying the stock inventory a moment later may return a different response. However, data

    stored in a data warehouse is accurate as of some past point in time because the data stored represents

    historical information.

    The data stored in a data warehouse typically represents data over a long period of time; perhaps up to ten

    years or more. OLTP systems often contain only current data, because maintaining large volumes of data used to represent ten years of information in an OLTP system can affect performance. In effect, the data

    warehouse stores snapshots of a business's operational data generated over a long period of time. It is

    accurate for a specific moment in time and cannot change. This contrasts with an OLTP system where data is always accurate and can be updated when necessary.

    Read-only Data

    After data has been moved to the data warehouse successfully, it typically does not change unless the data

    was incorrect in the first place. Because the data stored in a data warehouse represents a point in time, it

    must never be updated. Deletes, inserts, and updates (other than those involved in the data loading process)

    are not applicable in a data warehouse. The only operations that occur in a data warehouse, when it has been

    set up, are loading and querying data.


    Data Granularity

    A significant difference between an OLTP or operational system and a data warehouse is the granularity of the

    data stored. An operational system typically stores data at the lowest level of granularity: the maximum level of detail. However, because the data warehouse contains data representing a long period in time, simply storing all detail data from an operational system can result in an overworked system that takes too long to query.

    A data warehouse typically stores data in different levels of granularity or summarization, depending on the


    data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly

    summarized data is required. The lower the level of granularity of data required by the enterprise, the more resources (specifically data storage) are required to build the data warehouse. The different levels

    of summarization in order of increasing granularity are:

    Current operational data

    Historical operational data

    Aggregated data

    Metadata

    Current and historical operational data are taken, unmodified, directly from operational systems. Historical data

    is operational level data no longer queried on a regular basis, and is often archived onto secondary storage.

    Aggregated, or summary, data is a filtered version of the current operational data. The design of the data

    warehouse affects how the current data is aggregated. Considerations for generating summary data include the period of time used to aggregate the data (for example, weekly, monthly, and so on), and the parts of the operational data to be summarized. For example, an organization can choose to aggregate at the part level the quantity of parts sold per sales representative per week.

    There may be several levels of summary data. It may be necessary to create summary level data based on an

    aggregated version of existing summary data. This can give an organization an even higher level view of the

    business. For example, an organization can choose to aggregate summary-level data further by generating the quantity of parts sold per month.
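The two rollup levels described above can be sketched in a few lines of Python. The part names, week labels, and quantities are invented for illustration; the point is that the monthly summary is computed from the existing weekly summary, not from the detail data:

```python
from collections import defaultdict

# Hypothetical weekly summary rows: (part, iso_week, quantity_sold).
weekly_sales = [
    ("widget", "2011-W01", 120),
    ("widget", "2011-W02", 95),
    ("widget", "2011-W03", 110),
    ("widget", "2011-W04", 80),
    ("gadget", "2011-W01", 40),
    ("gadget", "2011-W02", 55),
]

# Map each week to its month; here we simply assume W01-W04 fall in January.
week_to_month = {"2011-W01": "2011-01", "2011-W02": "2011-01",
                 "2011-W03": "2011-01", "2011-W04": "2011-01"}

# Roll the weekly summary up one more level: parts sold per month.
monthly_sales = defaultdict(int)
for part, week, qty in weekly_sales:
    monthly_sales[(part, week_to_month[week])] += qty

print(monthly_sales[("widget", "2011-01")])  # 405
print(monthly_sales[("gadget", "2011-01")])  # 95
```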

    Metadata does not contain any operational data, but is used to document the way the data warehouse is

    constructed. Metadata can describe the structure of the data warehouse, source of the data, rules used to

    summarize the data at each level, and any transformations of the data from the operational systems.


    Data Marts

    A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own database. A data mart tends to contain data focused at the department level, or on a specific business area. The data can exist at both the detail and summary levels. The data mart can be populated with data taken directly from operational sources, similar to a data warehouse, or with data taken from the data warehouse itself.

    Because the volume of data in a data mart is less than that in a data warehouse, query processing is often

    faster.

    Characteristics of a data mart include:

    Quicker and simpler implementation.

    Lower implementation cost.

    Needs of a specific business unit or function met.

    Protection of sensitive information stored elsewhere in the data warehouse.

    Faster response times due to lower volumes of data.

    Distribution of data marts to user organizations.

    Built from the bottom upward.

    Departmental or regional divisions often determine whether data marts or data warehouses are used. For

    example, if managers in different sales regions require data from only their region, then it can be beneficial to build data marts containing specific regional data. If regional managers require access to all the organization's

    data, then a larger data warehouse is usually necessary.

    Although data marts are often designed to contain data relating to a specific business function, there can be times when users need a broader level of business data. However, because this broader-level data is often only


    needed in summarized form, it is acceptable to store it within each data mart rather than implementing a full

    data warehouse.

    Building a Data Warehouse from Data Marts

    Data warehouses can be built using a top-down or bottom-up approach. Top-down describes the process of

    building a data warehouse for the entire organization, containing data from multiple, heterogeneous operational sources. The bottom-up approach describes the process of building data marts for departments, or specific business areas, and then joining them to provide the data for the entire organization. Building a data

    warehouse from the bottom-up, by implementing data marts, is often simpler because it is less ambitious.

    A common approach to using data marts and data warehouses involves storing all detail data within the data

    warehouse, and summarized versions within data marts. Each data mart contains summarized data per functional split within the business, such as sales region or product group, further reducing the data volume per data mart.
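A minimal sketch of this split, with invented warehouse detail rows: the data mart keeps only a per-product summary for one region, so it holds far less data than the warehouse it is derived from:

```python
from collections import defaultdict

# Hypothetical warehouse detail rows: (region, product, quantity, revenue).
detail = [
    ("East", "widget", 10, 100.0),
    ("East", "gadget", 5, 250.0),
    ("West", "widget", 7, 70.0),
    ("West", "gadget", 2, 100.0),
]

def build_regional_mart(rows, region):
    """Summarize warehouse detail into a per-product mart for one region."""
    mart = defaultdict(lambda: [0, 0.0])
    for r, product, qty, revenue in rows:
        if r == region:
            mart[product][0] += qty       # total quantity
            mart[product][1] += revenue   # total revenue
    return dict(mart)

east = build_regional_mart(detail, "East")
print(east["widget"])  # [10, 100.0]
```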

    Data Mart Considerations

    Data marts can be useful additions or alternatives to the data warehouse, but issues to consider before

    implementation include:

    Additional hardware and software.

    Time required to populate each data mart regularly.

    Consistency with other data marts and the data warehouse.

    Network access (if each data mart is located in a different geographical region).


    Designing and Building a Data Warehouse and OLAP System

    The steps required to build a data warehouse include:

    Determining business, user, and technical requirements.


    Designing and building the database.

    Extracting and loading data into the data warehouse.

    Designing and processing aggregations using OLAP tools.

    Querying and maintaining the data warehouse and OLAP databases.

    Determining Business, User, and Technical Requirements

    Before a data warehouse can be built, a detailed project and implementation plan should be written. The project and implementation plan includes:

    Building a business case.

    Gathering user requirements.

    Determining the technical requirements.

    Defining standard reports required by users.

    Analyzing client application tools being used.

    Building the business case is common at the beginning of any project. It involves determining the business

    needs solved by the project, the costs of the project, and the return on the investment.

    Gathering user requirements largely involves interviewing the intended users of the data warehouse. The user requirements determine:

    Data requirements (level of granularity).

    Operational systems within the enterprise containing the data.

    Business rules followed by the data.

    Queries required to provide the users with data.

    The technical requirements may involve determining:

    Hardware architecture and infrastructure (for example, links to remote geographical regions where data

    marts might be located).

    Backup and recovery mechanisms.

    Security guidelines.

    Methods of loading and transforming data from operational systems to the data warehouse.

    Standard reports required by users should be analyzed to determine the tables, columns, and selection criteria

    necessary to create the reports, and the frequency with which they are generated. Provisions should also be

    made for expanding or modifying the scope of reports as required.

    Client application tools should be analyzed to determine if they can provide enhanced processing capabilities that help in processing data, performing queries, or generating reports.

    Designing and Building the Database

    Designing and building the database is a critical part of building a successful data warehouse. This step is often

    performed by experienced database designers because it can involve taking data from multiple (sometimes

    heterogeneous) sources and combining it into a single, logical model.

    Unlike OLTP systems that store data in a highly normalized fashion, the data in the data warehouse is stored in

    a very denormalized manner to improve query performance. Data warehouses often use star and snowflake schemas to provide the fastest possible response times to complex queries, and the basis for aggregations

    managed by OLAP tools.


    The components of schema design are dimensions, keys, and fact and dimension tables.

    Fact tables

    Contain data that describes a specific event within a business, such as a bank transaction or product sale.

    Alternatively, fact tables can contain data aggregations, such as sales per month per region. Except in cases

    such as product or territory realignments, existing data within a fact table is not updated; new data is simply

    added.

    Because fact tables contain the vast majority of the data stored in a data warehouse, it is important that the

    table structure be correct before data is loaded. Expensive table restructuring can be necessary if data

    required by decision support queries is missing or incorrect.

    The characteristics of fact tables are:

    Many rows; possibly billions.

    Primarily numeric data; rarely character data.

    Multiple foreign keys (into dimension tables).

    Static data.

    Dimension tables

    Contain data used to reference the data stored in the fact table, such as product descriptions, customer

    names and addresses, and suppliers. Separating this verbose (typically character) information from specific

    events, such as the value of a sale at one point in time, makes it possible to optimize queries against the

    database by reducing the amount of data to be scanned in the fact table.

    Dimension tables do not contain as many rows as fact tables, and dimensional data is subject to change, as

    when a customer's address or telephone number changes. Dimension tables are structured to permit change.

    The characteristics of dimension tables are:

    Fewer rows than fact tables; possibly hundreds to thousands.

    Primarily character data.

    Multiple columns that are used to manage dimension hierarchies.

    One primary key (dimensional key).

    Updatable data.

    Dimensions

    Are categories of information that organize the warehouse data, such as time, geography, organization, and so on. Dimensions are usually hierarchical in that one member may be a child of another member. For example, a geography dimension may contain data by country/region, state, and city. A city member is a child to a state member, which is in itself a child to a country member. Thus, the dimension comprises three hierarchical levels: all countries, all states, and all cities in the dimension table. To support this, the dimension table should include the relationship of each member to the higher levels in the hierarchy.
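A minimal sketch of such a hierarchy, with invented members: each dimension row carries its own state and country, so facts keyed at the city level can be rolled up to either higher level:

```python
# Hypothetical geography dimension rows. Each member stores its parents,
# so a query can roll city-level facts up to state or country level.
geography_dim = [
    {"city": "Seattle",  "state": "WA", "country": "USA"},
    {"city": "Spokane",  "state": "WA", "country": "USA"},
    {"city": "Portland", "state": "OR", "country": "USA"},
]

# Rolling up: the distinct members at each level of the hierarchy.
states = {row["state"] for row in geography_dim}
countries = {row["country"] for row in geography_dim}
print(sorted(states))     # ['OR', 'WA']
print(sorted(countries))  # ['USA']
```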

    Dimensional keys

    Are unique identifiers used to query data stored in the central fact table. The dimensional key, like a primary key, links a row in the fact table with one dimension table. This structure makes it easy to construct complex queries and support drill-down analysis in decision support applications. An optimal data warehouse database contains long, narrow fact tables and small, wide dimension tables.

    Star Schema

    The most popular design technique used to implement a data warehouse is the star schema. The star schema

    structure takes advantage of typical decision support queries by using one central fact table for the subject


    area, and many dimension tables containing denormalized descriptions of the facts. After the fact table is

    created, OLAP tools can be used to preaggregate commonly accessed information.

    The star schema design helps to increase query performance by reducing the volume of data that is read from

    disk. Queries analyze data in the smaller dimension tables to obtain the dimension keys that index into the

    central fact table, reducing the number of rows to be scanned.
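The star schema pattern can be sketched with SQLite standing in for SQL Server; the table and column names here are invented, not the chapter's sample schema. The query filters the small dimension table first, and only the matching keys reach the large central fact table:

```python
import sqlite3

# A minimal star schema: one fact table keyed into two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    region_key  INTEGER REFERENCES dim_region(region_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 10, 100.0), (1, 2, 7, 70.0),
                               (2, 1, 5, 250.0);
""")

# A typical decision-support query: resolve the dimension member to its key,
# then aggregate the fact rows that carry that key.
row = con.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE p.name = 'widget'
""").fetchone()
print(row[0])  # 170.0
```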

    Snowflake Schema

    The snowflake schema is a variation of the star schema in which dimension tables are stored in a more normalized form. Normalizing the dimension tables reduces redundant data, which can reduce the number of disk reads for some queries, although the additional joins required can offset this benefit.

    Creating a Database Schema

    The database schema should support the business requirements rather than the typical query-driven requirements of an OLTP database design. For example, given the following database schema from an order

    entry system:


    The steps involved in converting this OLTP schema into a star schema include:

    Determining the fact and dimension tables.

    Designing the fact tables.

    Designing the dimension tables.

    Determining Fact and Dimension Tables

    It is important to determine correctly what existing tables and data in the operational systems should comprise the fact and dimension tables. If these are not correctly identified, then the data warehouse can suffer from

    poor performance, or may have to be redesigned at a later stage. Redesigning an operational data warehouse,

    possibly containing large amounts of data, can be a prohibitively expensive task.

    Although the process of determining the structure and composition of fact and dimension tables can be

    difficult, especially when there are multiple (heterogeneous) operational systems to consider, the two most important steps to follow are identifying the:

    Fundamental business transactions on which the data warehouse will focus (fact tables).

    Data associated with the business transactions that determine how business data will be analyzed

    (dimension tables and hierarchies).

    Identify Fundamental Business Transactions

    The first step involves identifying the transactions that describe the basic operations of the business which the data warehouse will be used to analyze. For example, using the sample order entry system described earlier, the underlying business transaction is a sale of a product. Identifying the fundamental business transactions yields the information that will be represented in the fact tables. The information needed to describe the sale


    of a product is largely found in the Order_Details table.

    When analyzing existing operational systems for potential fact tables, always look for the underlying business

    processes involved. Many operational systems are designed based on necessity rather than an accurate business model. For example, a school database may record only the grade per student for all subjects for a

    year because it does not have enough disk space to store the data at a lower level of detail.

    In this instance, a data warehouse used to store student data from all schools in a region should be designed to capture this summarized data as well as a lower level of detail when the schools are able to provide the information in the future. For example, the fact table might store details regarding the grades for each subject per student, per school, per region, per date period.

    Identify the Dimension Tables

    The next step involves identifying the entities that describe how the fact data will be analyzed. For example,

    given that the order entry system's fundamental transaction is the sale of a product, dimension data from the operational schema could include payment method, product name, date of sale, or shipping method. However, the dimension data chosen should represent the focus of the business analysis. As an example, the business

    analysis performed on the order entry data warehouse will include variations of:

    Sales of a specific product per region.

    Sales of a specific product per time period (for example, a quarter).

    All sales per region.

    All sales per time period.

    Therefore, the dimension tables will include product data, region data, and time period data. In this example,

    payment or shipping methods were not required because the business will not use the data warehouse to

    analyze that data.

    From the original order entry OLTP schema, all the fact and dimension data for the data warehouse can be found in the Customers, Orders, Products, and Order_Details tables.


    Designing Fact Tables

    The primary goal when designing fact tables is to minimize their size without compromising the data requirements. Fact tables are the largest tables in the database because they contain detail-level data representing the underlying business transactions. However, the costs of storing and maintaining these large tables should be considered. For example, larger tables require more online, and potentially offline, storage; take longer to back up and restore in the event of a system failure; and take longer to query when building OLAP aggregations.

    The easiest ways to reduce the size of fact tables include:

    Reducing the number of columns.

    Reducing the size of each column where possible.

    Archiving historical data into separate fact tables.

    Reducing the Number of Columns

    Remove any columns that are not required to analyze the operations of the business. For example, if the data

    does not represent a business transaction, or if the data can be derived using aggregates, remove the data from the fact table. Although aggregated columns often improve query performance, the size of a typical fact table can prohibit using them. For example, if the Order_Details fact table contains one billion rows, and a column, Total_Price, is added representing Quantity multiplied by UnitPrice, one billion new values now exist permanently in the table.

    Important If a column is moved from a fact table to another table, and is referenced frequently in queries

    involving data from the fact table, large join operations may be required. These joins can affect query performance. Therefore, the trade-off between reducing storage costs and affecting query performance should be determined.


    Although Order_Details forms the basis of the fact table, the OrderID column is not required in the final fact table because OrderDetailID is the unique identifier for the business transaction: a sale of a product. In fact, OrderID does not represent a single business transaction; it represents the sale of one or many products to a single customer, and so cannot be used.

    Reducing the Size of Each Column

    Because fact tables tend to have a large number of rows, even one redundant byte per row can add up to a large amount of wasted database space. For example, a fact table containing one billion rows, with one unused byte in one of the columns, represents almost 1 GB of unused database space. To reduce column widths:

    Ensure that all character and binary data is variable length.

    Use data types that require fewer bytes of storage where possible. For example, if a column contains integer values only in the range from 1 through 100, use tinyint rather than int, saving 3 bytes per row for that column.
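The arithmetic behind these savings, for the hypothetical billion-row fact table used in the examples above:

```python
# Back-of-the-envelope cost of one wasted byte per row, and of using a
# 4-byte int where a 1-byte tinyint would do, in a billion-row fact table.
rows = 1_000_000_000

wasted = rows * 1                       # one unused byte per row
print(round(wasted / 2**30, 2))         # 0.93 -> almost 1 GB of dead space

int_vs_tinyint = rows * (4 - 1)         # int is 4 bytes, tinyint is 1
print(round(int_vs_tinyint / 2**30, 2)) # 2.79 -> GB saved by the narrow type
```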

    Archiving Historical Data

    If data within fact tables is rarely used, such as sales data from several years ago, it may be useful to archive the data. This approach reduces the volume of data in the fact table, hence increasing the performance of queries. Exceptional queries on older data can be run against multiple fact tables without affecting the majority of users querying the fact tables containing recent data. When Microsoft SQL Server OLAP Services is used in conjunction with multiple fact tables, the OLAP Services engine manages queries against multiple back-end fact tables. This simplifies the management and use of multiple fact tables containing historical data.

    Designing Dimension Tables

    The primary goal in designing dimension tables is to denormalize the data that references the fact tables into

    single tables. The most commonly used dimension data should reference the fact tables directly, rather than

    indirectly through other tables. This approach minimizes the number of table joins, and speeds up performance.

    For example, the order entry star schema should support the business queries:

    Sales of a specific product per region

    All sales per region

    Currently, the dimension data describing a region (City, StateOrProvince, and Country) is part of the Customers table. However, Customers references Order_Details (fact data) using Orders.


    To better support the business analysis required, the region data should be placed into a new table, Region,

    directly referencing Order_Details. To implement this, a foreign key from the Region dimension table is added

    to Order_Details (now renamed to Sales). Any queries involving sales per region now require only a two-table join between the Region dimension table and the Sales fact table.

    Note The existing relationship between the Sales fact table and the Products dimension data is unchanged.

    Date and Time Information

    Date information is a common requirement in a data warehouse. To minimize the fact table width, a foreign key

    is often created in the fact table referencing a dimension table containing a representation of the date and/or

    time. The representation of the date depends on business analysis requirements.

    For example, the business analysis to be performed on the order entry system requires product sales summarized by month, quarter, and year. The date information should be stored in a form that represents these increments. This is achieved by creating a foreign key in the fact table referencing a date dimension table (named Period) containing the date of the sale in a month, quarter, year format. To increase the flexibility of this data, additional dimension tables are created, referenced by the Period dimension table, that contain months, quarters, and years in more detail. When designing dimension tables for use with Microsoft SQL Server OLAP Services, only a date is needed. The OLAP Services Time Wizard enables dates to be summarized into any combination of weeks, months, quarters, and years.
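A sketch of deriving the Period attributes from a plain date; the column names are invented for illustration, not taken from the chapter's sample schema:

```python
from datetime import date

def period_row(d: date) -> dict:
    """Derive the month/quarter/year attributes for one Period dimension row."""
    return {
        "date": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "year": d.year,
    }

row = period_row(date(2011, 3, 6))
print(row["quarter"], row["year"])  # 1 2011
```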

    Implementing the Database Design


    After the fact and dimension tables have been designed, the final step is to physically implement the database

    in Microsoft SQL Server.

    Creating the Database

    When creating the database, consider the partitioning strategy, if any, that may be used. SQL Server offers

    filegroups that can be used to stripe data, in addition to the disk striping available with Microsoft Windows NT

    and hardware-based implementations.

    Creating the Tables

When creating the tables used to store the fact and dimension data, consider creating the tables across the partitions available to the database, based on usage. For example, create separate fact tables containing data segmented by year or division on separate partitions (such as SQL Server filegroups) to improve read performance.
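
For example, yearly fact tables might be placed on separate filegroups like this (a sketch; the table and filegroup names are assumptions, and the filegroups must already exist in the database):

```sql
-- Fact data for each year on its own filegroup.
CREATE TABLE Sales_1997 (
    PeriodID    int   NOT NULL,
    RegionID    int   NOT NULL,
    ProductID   int   NOT NULL,
    SalesAmount money NOT NULL
) ON FG1997

CREATE TABLE Sales_1998 (
    PeriodID    int   NOT NULL,
    RegionID    int   NOT NULL,
    ProductID   int   NOT NULL,
    SalesAmount money NOT NULL
) ON FG1998
```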

    Creating Any User-defined Views

Create user-defined views if necessary. SQL Server views can be used to merge horizontally partitioned tables together logically, to serve as interfaces to predefined queries, or to act as a security mechanism.
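
A view that logically merges horizontally partitioned fact tables might look like this (the yearly table names Sales_1997 and Sales_1998 are hypothetical):

```sql
-- Present the yearly partitions as one logical fact table.
CREATE VIEW SalesAll AS
SELECT PeriodID, RegionID, ProductID, SalesAmount FROM Sales_1997
UNION ALL
SELECT PeriodID, RegionID, ProductID, SalesAmount FROM Sales_1998
```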

    Creating Indexes

    Indexes should be created to maximize performance. Consider creating indexes on:

    Key columns.

    Columns involved in joins.

    Multiple columns, to take advantage of index coverage.

    All dimension table keys used by the fact table.
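
Applied to a hypothetical Sales fact table, these guidelines might translate into statements such as (a sketch, not a prescription):

```sql
-- One index per dimension key used by the fact table.
CREATE CLUSTERED INDEX IX_Sales_Period ON Sales (PeriodID)
CREATE INDEX IX_Sales_Region  ON Sales (RegionID)
CREATE INDEX IX_Sales_Product ON Sales (ProductID)

-- A composite index can cover frequent queries entirely,
-- so the query never touches the underlying data pages.
CREATE INDEX IX_Sales_RegionAmount ON Sales (RegionID, SalesAmount)
```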

    See Also

    In Other Volumes

    "CREATE VIEW" in Microsoft SQL Server Transact-SQL and Utilities Reference

"Overview of Creating and Maintaining Databases" in Microsoft SQL Server Database Developer's Companion

"Indexes" in Microsoft SQL Server Database Developer's Companion

"Physical Database Design" in Microsoft SQL Server Diagnostics

    Extracting and Loading Data

Extracting and loading data from operational systems to a data warehouse varies in complexity. The process can be simple if there is a direct correlation between the source data and the data that should appear in the data warehouse: for example, if all the source data from a single operational system is in the correct format and does not have to be modified in any way. The process can also be complex: for example, if source data resides in multiple, heterogeneous operational systems and requires significant formatting and modification before loading.

    The extraction and load process involves:

    Validating data in the operational systems.

    Migrating data from the operational systems.

    Scrubbing data.

    Transforming data to the data warehouse.

    Validating Data


Before data is extracted from the operational systems, it may be necessary to ensure that the data is completely valid. If the data is not valid, the integrity of the business analysis relying on the data may be compromised. For example, a value representing a monetary transfer between banks in different countries must be in the correct currency.

Data should be validated at the source by business analysts who understand what the data represents. Any changes should be made in the operational systems, rather than the data warehouse, because the source data is incorrect regardless of where it is located.

Validating data can be a time-consuming process. The validation process can be automated by writing stored procedures that check the data for domain integrity. However, it may be necessary to validate data manually. If any invalid data is discovered, determine where the fault originated and correct any processes contributing to the error.

    For example, the data in the order entry system should be validated, to ensure that:

    Region information (City, State, Country) represents a valid city, state, country/region combination.

    Product information (ProductID, ProductName, UnitPrice) represents valid products.
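
Outside DTS, a stored procedure or ad hoc query can surface invalid rows directly; for example (ValidRegions is a hypothetical reference table of known-good combinations):

```sql
-- Rows whose region combination has no match in the reference table.
SELECT o.City, o.StateOrProvince, o.Country
FROM Orders o
LEFT JOIN ValidRegions v
    ON  o.City = v.City
    AND o.StateOrProvince = v.StateOrProvince
    AND o.Country = v.Country
WHERE v.City IS NULL
```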

    This information can be validated using the Data Transformation Services import and export wizards. A

    Microsoft ActiveX script, executed by the DTS Import and DTS Export wizards when copying data from the

    source to the destination, can determine if the region and product information is valid. Any invalid data can be

    saved to the exception log for later examination by business analysts, to determine why it is incorrect.

    See Also

    In Other Volumes

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    Migrating Data

Migrating data from operational systems typically involves copying the data to an intermediate database before it is finally copied to the data warehouse. Copying data to an intermediate database is necessary if the data must be scrubbed.

Copying data should ideally occur during a period of low activity on the operational system. Otherwise, system performance may degrade, affecting users. Additionally, if the data warehouse is composed of data from multiple interrelated operational systems, it is important to ensure that data migration occurs when the systems are synchronized. If the operational systems are not synchronized, the data in the warehouse can produce unexpected results when queried.

The DTS Import and DTS Export wizards can be used to create a series of tasks that copy data from many heterogeneous operational systems to an intermediate database running on Microsoft SQL Server. Alternatively, you can use a Microsoft ActiveX script with DTS to scrub the data, and then copy it directly to the SQL Server data warehouse, avoiding the need for an intermediate database.

    See Also

    In Other Volumes

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operationsand Replication

    Scrubbing Data

    Scrubbing data involves making the data consistent. It is possible that the same data is represented in

    different ways in multiple operational systems. For example, a product name might be abbreviated in one

operational system, but not in another. If the two values were not made consistent, any queries using the data would likely evaluate the values as different products. If the detail data in the data warehouse is to produce


    consistent information, the product name must be made consistent for all values.

    Data scrubbing can be achieved:

    Using the DTS Import and DTS Export wizards to modify data as it is copied from the operational system

    to the intermediate database, or directly to the data warehouse.

By writing a Microsoft ActiveX script, executed by a program using the DTS API, to connect to the data source and scrub the data. Any data manipulation that can be achieved using ActiveX scripting, or a programming language such as Microsoft Visual C++, can be performed on the data.

Using a DTS Lookup, which provides the ability to perform queries using one or more named, parameterized query strings that allow a custom transformation to retrieve data from locations other than the immediate source or destination row being transformed.

    For example, the data in the order entry system should be scrubbed, such as the values for:

    State (must always be a two-character value, such as WA).

    ProductName (must always be the full product name, with no abbreviations).
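
If the data is staged in an intermediate SQL Server database, the State scrub could also be performed in Transact-SQL rather than in a script (a sketch; StateCodes is a hypothetical lookup table mapping full names to two-character codes):

```sql
-- Replace spelled-out state names with their two-character codes.
UPDATE o
SET State = sc.StateCode
FROM Orders o
JOIN StateCodes sc ON o.State = sc.StateName
WHERE LEN(o.State) <> 2
```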

Using the DTS Import and DTS Export wizards, an ActiveX script can be executed during the copy process that checks the State value and changes it to a known two-character value. Alternatively, the ProductName value could be scrubbed by writing a Visual C++ program that calls the DTS API to execute Microsoft JScript scripts and other executable modules.

    See Also

    In Other Volumes

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "DTS Lookup" in Microsoft SQL Server Distributed Data Operations and Replication

    "Programming DTS Applications" in Microsoft SQL Server Building Applications

    Transforming Data

    During the data migration step, it is often necessary to transform operational data into a separate format

    appropriate to the data warehouse design. Transformation examples include:

    Changing all alphabetic characters to uppercase.

    Calculating new values based on existing data, including data aggregation and summarization.

    Breaking up a single data value into multiple values, such as a product code in nnnn-description format

    into separate code and description values, or a date value in MMDDYY format into separate month, day,and year values.

Merging separate data values into a single value, such as concatenating a first name value with a surname value.

    Mapping data from one representation to another, such as converting data values (1, 2, 3, 4) to (I, II,

    III, IV).

Data transformation also involves formatting and modifying extracted data from operational systems into merged or derived values that are more useful in the data warehouse. For example, copying the OrderDate value from the order entry system to the data warehouse star schema involves splitting the date into Month, Quarter, and Year components. These date components are required for the type of business analysis performed on the data warehouse.
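
The date split can be expressed directly in the extraction query with DATEPART; for example (column names are illustrative):

```sql
-- Derive Month, Quarter, and Year components from OrderDate.
SELECT OrderID,
       DATEPART(month,   OrderDate) AS [Month],
       DATEPART(quarter, OrderDate) AS [Quarter],
       DATEPART(year,    OrderDate) AS [Year]
FROM Orders
```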

The transformation process usually takes place during the migration process, when data is copied either directly from the operational sources or from an intermediate database after the data has been scrubbed.

    For complex data migrations, DTS provides skip return values to assist in splitting data into multiple tables.

    Data transformation and migration can be completed in a single step using the DTS Import and DTS Export

    wizards. Transforming and migrating data from the order entry OLTP operational system schema to the data

    warehouse star schema involves using the DTS Import and DTS Export wizards to:

    Create a query to extract all the required detail level (fact) data.

    Split OrderDate in the Orders table into Month, Quarter, and Year components and add to Period

    using a Microsoft ActiveX script.

    Extract the City, StateOrProvince, and Country data relating to the detail data and add to Region

    using an ActiveX script.

Perform a simple table copy of Products.

    Create a query to generate the data for Summary.

Each step, for example, can be built as a separate package, which is stored in the Microsoft SQL Server msdb database, and scheduled to be executed every Friday night at midnight.

In addition to performing insert-based transformations of data, DTS provides data-driven queries, in which data is read from the source and transformed, and a parameterized query is executed at the destination, using the transformed values in the destination row.

Note When using DTS to create fact tables for use with Microsoft SQL Server OLAP Services, do not create any aggregations while migrating the data. OLAP Services is specifically designed to create the optimal aggregations after the data warehouse has been populated with DTS. It is also unnecessary to segment a date into week, month, quarter, or year columns in the Time dimension table. The OLAP Services Time Wizard provides an automated facility for this type of time transformation.

    See Also

    In Other Volumes

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data-Driven Queries" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "Understanding Data Transformation Services" in Microsoft SQL Server Building Applications

    Designing and Processing Aggregations

OLAP tools are typically used to create and manage summary data. Microsoft SQL Server OLAP Services allows aggregations to be stored in a variety of formats and locations, with dynamic connections to underlying detail in the data warehouse. Summary data is often generated to satisfy the commonly executed queries in the data warehouse. Storing preaggregated data increases query performance and reduces the load on the data warehouse.

If a data warehouse is built so the data in it does not change, then preaggregating data in the fact table saves only the disk space required by the fact table. OLAP Services uses the processing time that would have been used to preaggregate in the fact table when it processes the fact table as it builds a cube. However, precalculated aggregations are stored in the cube and do not need to be recalculated for each query. If a hybrid OLAP (HOLAP) or relational OLAP (ROLAP) cube is used, the fact table is not copied into the cube as it is in multidimensional OLAP (MOLAP) cubes, so the overhead required to retain availability of the detail data is only the fact table size, not processing time or query response time.

The preaggregation strategy when designing a data warehouse for use by OLAP Services depends on the following variables:


    Stability of the data.

    If the source data changes, the preaggregations have to be performed each time, whether

    preaggregated in the fact table or in the OLAP cubes that have to be rebuilt from the fact table.

    Query response time.

    With properly designed OLAP cubes, the granularity of detail in the fact table has no effect on query

    response time for queries that do not access detail facts.

    Storage requirements.

A finer level of granularity in the fact table requires more storage for the fact table and for MOLAP cubes. This is a trade-off against detail availability and choice of OLAP cube storage mode. OLAP cubes tend to be large regardless of the storage type; therefore, the storage required to retain fine granularity in the fact table may not be particularly significant when compared to OLAP storage needs.

When designing the data warehouse for OLAP, the users' needs should drive the preaggregation strategy. The fact table should only be preaggregated to the level of granularity below which no user would want to access detail.
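
For example, if no user ever needs detail finer than product by month, the fact table could be loaded at that grain with a query along these lines (table and column names are assumptions):

```sql
-- Aggregate detail rows to the product-by-month grain before loading.
SELECT s.ProductID, p.[Month], p.[Year],
       SUM(s.SalesAmount) AS TotalSales
INTO SalesSummary
FROM Sales s
JOIN Period p ON s.PeriodID = p.PeriodID
GROUP BY s.ProductID, p.[Month], p.[Year]
```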

For more information, see your OLAP Services documentation.

    Maintaining the Data Warehouse

Maintenance of the data warehouse is an ongoing task that should be designed before the data warehouse is made available to users. Maintenance involves:

    Implementing a backup and recovery mechanism to protect the data in the event of a system failure, or

    some other problem.

Archiving the database. This may be necessary to purge the database of unused historical data and free up space.

    Running SQL Server Profiler to determine which indexes to create to enhance query performance.

    See Also

    In Other Volumes

"Monitoring with SQL Server Profiler" in Microsoft SQL Server Administrator's Companion

    Top Of Page

    Data Transformation Services Data Warehousing Support

Using Data Transformation Services (DTS), you can import and export data between multiple heterogeneous sources using an OLE DB-based architecture, and transfer databases and database objects (for example, indexes and stored procedures) between computers running Microsoft SQL Server version 7.0. You can also use the data transformation capabilities of DTS to build a data warehouse from an online transaction processing (OLTP) system. You can build data warehouses and data marts in SQL Server by importing and transferring data from multiple heterogeneous sources interactively, or automatically on a regularly scheduled basis.

DTS components include the DTS Import Wizard, DTS Export Wizard, and DTS Designer, which are available through SQL Server Enterprise Manager. DTS also includes COM programming interfaces you can use to create customized import, export, and transformation applications.

A transformation is the set of operations applied to source data before it is stored at the destination during the process of building a data warehouse. For example, the DTS transformation capability allows calculating new values from one or more source columns, or even breaking a single column into multiple values to be stored in separate destination columns. Transformations, therefore, make it easy to implement complex data validation, scrubbing, and enhancement during import and export.

    Data Transformation Services (DTS) allows you to import, export, or transform data in a process that can be


    saved as a package. Each package defines a workflow that includes one or more tasks executed in a

    coordinated sequence as steps. Tasks can copy data from a source to a destination, transform data using a

    Microsoft ActiveX script, execute an SQL statement at a server, or even run an external program. Tasks can

    also transfer database objects between computers running SQL Server 7.0.

A DTS package can be created manually by using a language that supports OLE Automation, such as Microsoft Visual Basic, or interactively by using the Data Transformation Services wizards or DTS Designer. After a DTS package has been created and saved, it is completely self-contained and can be retrieved and run using SQL Server Enterprise Manager or the dtsrun utility.

DTS packages can be stored in the Microsoft Repository, providing the ability to record data lineage. This allows you to determine the source of any piece of data and the transformations applied to that data. Data lineage can be tracked at the package and row levels of a table, providing a complete audit trail of data transformation and DTS package execution information in your data warehouse.

DTS Designer is a graphical design environment for creating and executing complex sets of data transformations and workflows, in preparation for moving data to a data warehouse. Experienced users can use DTS Designer to integrate, consolidate, and transform heterogeneous data from multiple sources. Packages created can be stored in the SQL Server msdb database, the Repository, or a COM-structured storage file.

The visual objects used by DTS Designer are based on the DTS object model, an API that includes objects, properties, methods, and collections designed for programs that copy and transform data from an OLE DB data source to an OLE DB destination. This object model can be accessed through ActiveX scripts from within DTS Designer, and through external programs written in languages such as Visual Basic and Microsoft Visual C++.

    You can also access custom programs through DTS Designer, and include their tasks and icons as part of the

    package. Because DTS Designer accesses an underlying programming model, it does most of the programming

    work for you.

    See Also

    In Other Volumes

    "DTS Designer" in Microsoft SQL Server Distributed Data Operations and Replication

    "Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and

    Replication

    Top Of Page

    OLAP Services Data Warehousing Support

    Microsoft SQL Server OLAP Services provides online analytical processing (OLAP) services to applications.

    OLAP focuses on finding trends in aggregated or summarized data. The main objects used by OLAP programs

are multidimensional cubes. A multidimensional cube records a set of data derived from fact tables and dimensions. A fact table records data about a set of transactions. Measures are numeric columns from the fact table that are of interest to the user. For example, a car sales fact table could provide measures such as sale price, invoice cost, tax paid, and discount. A cube represents how these measures vary over multiple dimensions, such as by car dealer, by location, by customer, or by date.

    OLAP Services provides the capability of designing, creating, and managing cubes from a data warehouse, and

    then making them available to client applications written using either the OLE DB 2.0 OLAP extensions, or the

    Microsoft ActiveX Data Objects 2.0 Multidimensional Objects (ADO MD).


    2011 Microsoft. All rights reserved.

The OLAP server performs multidimensional queries of data and stores the results in its multidimensional storage. It speeds the analysis of fact tables into cubes, stores the cubes until needed, and then quickly returns the data to clients.

The OLAP server is managed through an API called Microsoft Decision Support Objects (DSO). OLAP Services provides a snap-in for the Microsoft Management Console (MMC). This MMC snap-in uses DSO to provide administrators with an easy-to-use graphical interface for defining, creating, and managing the cubes built by the OLAP server. DSO can also be called from custom applications, which in turn can be added to the OLAP Manager as an add-in.

OLAP Services passes its multidimensional data to a middle-tier PivotTable Service. The PivotTable Service operates as an OLE DB for OLAP provider. It exposes the multidimensional data to applications using either the OLE DB 2.0 OLAP extensions, or the ADO MD API that wraps the OLE DB OLAP extensions.

    See Also

    In This Volume

    Installing OLAP Services

    Server Improvements and New Features

    Top Of Page
