Chapter 12 - Data Warehousing and Online Analytical Processing


    A data warehouse is often used as the basis for a decision-support system (also referred to from an analytical perspective as a business intelligence system). It is designed to overcome some of the problems encountered when an organization attempts to perform strategic analysis using the same database that is used to perform online transaction processing (OLTP).

    A typical OLTP system is characterized by having large numbers of concurrent users actively adding and modifying data. The database represents the state of a particular business function at a specific point in time, such as an airline reservation system. However, the large volume of data maintained in many OLTP systems can overwhelm an organization. As databases grow larger with more complex data, response time can deteriorate quickly due to competition for available resources. A typical OLTP system has many users adding new data to the database while fewer users generate reports from the database. As the volume of data increases, reports take longer to generate.

    As organizations collect increasing volumes of data by using OLTP database systems, the need to analyze data becomes more acute. Typically, OLTP systems are designed specifically to manage transaction processing and minimize disk storage requirements by a series of related, normalized tables. However, when users need to analyze their data, a myriad of problems often prevents the data from being used:

    Users may not understand the complex relationships among the tables, and therefore cannot generate ad hoc queries.

    Application databases may be segmented across multiple servers, making it difficult for users to find the

    tables in the first place.

    Security restrictions may prevent users from accessing the detail data they need.

    Database administrators prohibit ad hoc querying of OLTP systems, to prevent analytical users from running queries that could slow down the performance of mission-critical production databases.

    By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization can improve response time for reports and queries. Yet a schema optimized for OLTP is often not flexible enough for decision-support applications, largely due to the volume of data involved and the complexity of normalized relational tables.

    For example, each regional sales manager in a company may wish to produce a monthly summary of the sales

    per region. Because the reporting server contains data at the same level of detail as the OLTP system, the

    entire month's data is summarized each time the report is generated. The result is longer-running queries that

    lower user satisfaction.

    Additionally, many organizations store data in multiple heterogeneous database systems. Reporting is more

    difficult because data is not only stored in different places, but in different formats.

    Data warehousing and online analytical processing (OLAP) provide solutions to these problems. Data warehousing is an approach to storing data in which heterogeneous data sources (typically from multiple OLTP databases) are migrated to a separate homogenous data store. Data warehouses provide these benefits to

    analytical users:

    Data is organized to facilitate analytical queries rather than transaction processing.

    Differences among data structures across multiple heterogeneous databases can be resolved.

    Data transformation rules can be applied to validate and consolidate data when data is moved from the OLTP database into the data warehouse.

    Security and performance issues can be resolved without requiring changes in the production systems.

    Sometimes organizations maintain smaller, more topic-oriented data stores called data marts. In contrast to a data warehouse, which typically encapsulates all of an enterprise's analytical data, a data mart is typically a

    3/6/2011 Chapter 12 - Data Warehousing and Onl

    microsoft.com//cc917548(printer).aspx 1/


    account.

    Data should be stored in the data warehouse in a single, acceptable format agreed to by business analysts,

    despite variations in the external operational sources. This allows data from across the organization, such as

    legacy data on mainframes, data in spreadsheets, or even data from the Internet, to be consolidated in the

    data warehouse, and effectively cross-referenced, giving the analysts a better understanding of the business.

    Subject-oriented Data

    Operational data sources across an organization tend to hold a large amount of data about a variety of

    business-related functions, such as customer records, product information, and so on. However, most of this

    information is also interspersed with data that has no relevance to business or executive reporting, and is

    organized in a way that makes querying the data awkward. The data warehouse organizes only the key business information from operational sources so that it is available for business analysis.

    Historical Data

    Data in OLTP systems correctly represents the current value at any moment in time. For example, an order-

    entry application always shows the current value of stock inventory; it does not show the inventory at some time in the past. Querying the stock inventory a moment later may return a different response. However, data

    stored in a data warehouse is accurate as of some past point in time because the data stored represents

    historical information.

    The data stored in a data warehouse typically represents data over a long period of time; perhaps up to ten

    years or more. OLTP systems often contain only current data, because maintaining large volumes of data used to represent ten years of information in an OLTP system can affect performance. In effect, the data

    warehouse stores snapshots of a business's operational data generated over a long period of time. It is

    accurate for a specific moment in time and cannot change. This contrasts with an OLTP system where data is always accurate and can be updated when necessary.

    Read-only Data

    After data has been moved to the data warehouse successfully, it typically does not change unless the data

    was incorrect in the first place. Because the data stored in a data warehouse represents a point in time, it

    must never be updated. Deletes, inserts, and updates (other than those involved in the data loading process)

    are not applicable in a data warehouse. The only operations that occur in a data warehouse, when it has been

    set up, are loading and querying data.


    Data Granularity

    A significant difference between an OLTP or operational system and a data warehouse is the granularity of the

    data stored. An operational system typically stores data at the lowest level of granularity: the maximum level of detail. However, because the data warehouse contains data representing a long period in time, simply storing all detail data from an operational system can result in an overworked system that takes too long to query.

    A data warehouse typically stores data in different levels of granularity or summarization, depending on the


    data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly

    summarized data is required. The lower the level of granularity of data required by the enterprise, the more resources (specifically data storage) are required to build the data warehouse. The different levels

    of summarization in order of increasing granularity are:

    Current operational data

    Historical operational data

    Aggregated data

    Metadata

    Current and historical operational data are taken, unmodified, directly from operational systems. Historical data

    is operational level data no longer queried on a regular basis, and is often archived onto secondary storage.

    Aggregated, or summary, data is a filtered version of the current operational data. The design of the data

    warehouse affects how the current data is aggregated. Considerations for generating summary data include the period of time used to aggregate the data (for example, weekly, monthly, and so on), and the parts of the operational data to be summarized. For example, an organization can choose to aggregate at the part level the quantity of parts sold per sales representative per week.

    There may be several levels of summary data. It may be necessary to create summary level data based on an

    aggregated version of existing summary data. This can give an organization an even higher level view of the

    business. For example, an organization can choose to aggregate summary-level data further by generating the quantity of parts sold per month.
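The two rollup levels described above can be sketched in a few lines of Python. The part names, week labels, and quantities are invented for illustration; the point is that the monthly summary is computed from the existing weekly summary, not from the detail data:

```python
from collections import defaultdict

# Hypothetical weekly summary rows: (part, iso_week, quantity_sold).
weekly_sales = [
    ("widget", "2011-W01", 120),
    ("widget", "2011-W02", 95),
    ("widget", "2011-W03", 110),
    ("widget", "2011-W04", 80),
    ("gadget", "2011-W01", 40),
    ("gadget", "2011-W02", 55),
]

# Map each week to its month; here we simply assume W01-W04 fall in January.
week_to_month = {"2011-W01": "2011-01", "2011-W02": "2011-01",
                 "2011-W03": "2011-01", "2011-W04": "2011-01"}

# Roll the weekly summary up one more level: parts sold per month.
monthly_sales = defaultdict(int)
for part, week, qty in weekly_sales:
    monthly_sales[(part, week_to_month[week])] += qty

print(monthly_sales[("widget", "2011-01")])  # 405
print(monthly_sales[("gadget", "2011-01")])  # 95
```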

    Metadata does not contain any operational data, but is used to document the way the data warehouse is

    constructed. Metadata can describe the structure of the data warehouse, source of the data, rules used to

    summarize the data at each level, and any transformations of the data from the operational systems.


    Data Marts

    A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own database. A data mart tends to contain data focused at the department level, or on a specific business area. The data can exist at both the detail and summary levels. The data mart can be populated with data taken directly from operational sources, similar to a data warehouse, or with data taken from the data warehouse itself.

    Because the volume of data in a data mart is less than that in a data warehouse, query processing is often

    faster.

    Characteristics of a data mart include:

    Quicker and simpler implementation.

    Lower implementation cost.

    Needs of a specific business unit or function met.

    Protection of sensitive information stored elsewhere in the data warehouse.

    Faster response times due to lower volumes of data.

    Distribution of data marts to user organizations.

    Built from the bottom upward.

    Departmental or regional divisions often determine whether data marts or data warehouses are used. For

    example, if managers in different sales regions require data from only their region, then it can be beneficial to build data marts containing specific regional data. If regional managers require access to all the organization's

    data, then a larger data warehouse is usually necessary.

    Although data marts are often designed to contain data relating to a specific business function, there can be times when users need a broader level of business data. However, because this broader-level data is often only


    needed in summarized form, it is acceptable to store it within each data mart rather than implementing a full

    data warehouse.

    Building a Data Warehouse from Data Marts

    Data warehouses can be built using a top-down or bottom-up approach. Top-down describes the process of

    building a data warehouse for the entire organization, containing data from multiple, heterogeneous operational sources. The bottom-up approach describes the process of building data marts for departments, or specific business areas, and then joining them to provide the data for the entire organization. Building a data

    warehouse from the bottom-up, by implementing data marts, is often simpler because it is less ambitious.

    A common approach to using data marts and data warehouses involves storing all detail data within the data

    warehouse, and summarized versions within data marts. Each data mart contains summarized data per functional split within the business, such as sales region or product group, further reducing the data volume per data mart.
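A minimal sketch of this split, with invented warehouse detail rows: the data mart keeps only a per-product summary for one region, so it holds far less data than the warehouse it is derived from:

```python
from collections import defaultdict

# Hypothetical warehouse detail rows: (region, product, quantity, revenue).
detail = [
    ("East", "widget", 10, 100.0),
    ("East", "gadget", 5, 250.0),
    ("West", "widget", 7, 70.0),
    ("West", "gadget", 2, 100.0),
]

def build_regional_mart(rows, region):
    """Summarize warehouse detail into a per-product mart for one region."""
    mart = defaultdict(lambda: [0, 0.0])
    for r, product, qty, revenue in rows:
        if r == region:
            mart[product][0] += qty       # total quantity
            mart[product][1] += revenue   # total revenue
    return dict(mart)

east = build_regional_mart(detail, "East")
print(east["widget"])  # [10, 100.0]
```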

    Data Mart Considerations

    Data marts can be useful additions or alternatives to the data warehouse, but issues to consider before

    implementation include:

    Additional hardware and software.

    Time required to populate each data mart regularly.

    Consistency with other data marts and the data warehouse.

    Network access (if each data mart is located in a different geographical region).


    Designing and Building a Data Warehouse and OLAP System

    The steps required to build a data warehouse include:

    Determining business, user, and technical requirements.


    Designing and building the database.

    Extracting and loading data into the data warehouse.

    Designing and processing aggregations using OLAP tools.

    Querying and maintaining the data warehouse and OLAP databases.

    Determining Business, User, and Technical Requirements

    Before a data warehouse can be built, a detailed project and implementation plan should be written. The project and implementation plan includes:

    Building a business case.

    Gathering user requirements.

    Determining the technical requirements.

    Defining standard reports required by users.

    Analyzing client application tools being used.

    Building the business case is common at the beginning of any project. It involves determining the business

    needs solved by the project, the costs of the project, and the return on the investment.

    Gathering user requirements largely involves interviewing the intended users of the data warehouse. The user requirements determine:

    Data requirements (level of granularity).

    Operational systems within the enterprise containing the data.

    Business rules followed by the data.

    Queries required to provide the users with data.

    The technical requirements may involve determining:

    Hardware architecture and infrastructure (for example, links to remote geographical regions where data

    marts might be located).

    Backup and recovery mechanisms.

    Security guidelines.

    Methods of loading and transforming data from operational systems to the data warehouse.

    Standard reports required by users should be analyzed to determine the tables, columns, and selection criteria

    necessary to create the reports, and the frequency with which they are generated. Provisions should also be

    made for expanding or modifying the scope of reports as required.

    Client application tools should be analyzed to determine if they can provide enhanced processing capabilities that help in processing data, performing queries, or generating reports.

    Designing and Building the Database

    Designing and building the database is a critical part of building a successful data warehouse. This step is often

    performed by experienced database designers because it can involve taking data from multiple (sometimes

    heterogeneous) sources and combining it into a single, logical model.

    Unlike OLTP systems that store data in a highly normalized fashion, the data in the data warehouse is stored in

    a very denormalized manner to improve query performance. Data warehouses often use star and snowflake schemas to provide the fastest possible response times to complex queries, and the basis for aggregations

    managed by OLAP tools.


    The components of schema design are dimensions, keys, and fact and dimension tables.

    Fact tables

    Contain data that describes a specific event within a business, such as a bank transaction or product sale.

    Alternatively, fact tables can contain data aggregations, such as sales per month per region. Except in cases

    such as product or territory realignments, existing data within a fact table is not updated; new data is simply

    added.

    Because fact tables contain the vast majority of the data stored in a data warehouse, it is important that the

    table structure be correct before data is loaded. Expensive table restructuring can be necessary if data

    required by decision support queries is missing or incorrect.

    The characteristics of fact tables are:

    Many rows; possibly billions.

    Primarily numeric data; rarely character data.

    Multiple foreign keys (into dimension tables).

    Static data.

    Dimension tables

    Contain data used to reference the data stored in the fact table, such as product descriptions, customer

    names and addresses, and suppliers. Separating this verbose (typically character) information from specific

    events, such as the value of a sale at one point in time, makes it possible to optimize queries against the

    database by reducing the amount of data to be scanned in the fact table.

    Dimension tables do not contain as many rows as fact tables, and dimensional data is subject to change, as

    when a customer's address or telephone number changes. Dimension tables are structured to permit change.

    The characteristics of dimension tables are:

    Fewer rows than fact tables; possibly hundreds to thousands.

    Primarily character data.

    Multiple columns that are used to manage dimension hierarchies.

    One primary key (dimensional key).

    Updatable data.

    Dimensions

    Are categories of information that organize the warehouse data, such as time, geography, organization, and so on. Dimensions are usually hierarchical in that one member may be a child of another member. For example, a geography dimension may contain data by country/region, state, and city. A city member is a child to a state member, which is in itself a child to a country member. Thus, the dimension comprises three hierarchical levels: all countries, all states, and all cities in the dimension table. To support this, the dimension table should include the relationship of each member to the higher levels in the hierarchy.
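A minimal sketch of such a hierarchy, with invented members: each dimension row carries its own state and country, so facts keyed at the city level can be rolled up to either higher level:

```python
# Hypothetical geography dimension rows. Each member stores its parents,
# so a query can roll city-level facts up to state or country level.
geography_dim = [
    {"city": "Seattle",  "state": "WA", "country": "USA"},
    {"city": "Spokane",  "state": "WA", "country": "USA"},
    {"city": "Portland", "state": "OR", "country": "USA"},
]

# Rolling up: the distinct members at each level of the hierarchy.
states = {row["state"] for row in geography_dim}
countries = {row["country"] for row in geography_dim}
print(sorted(states))     # ['OR', 'WA']
print(sorted(countries))  # ['USA']
```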

    Dimensional keys

    Are unique identifiers used to query data stored in the central fact table. The dimensional key, like a primary key, links a row in the fact table with one dimension table. This structure makes it easy to construct complex queries and support drill-down analysis in decision support applications. An optimal data warehouse database contains long, narrow fact tables and small, wide dimension tables.

    Star Schema

    The most popular design technique used to implement a data warehouse is the star schema. The star schema

    structure takes advantage of typical decision support queries by using one central fact table for the subject


    area, and many dimension tables containing denormalized descriptions of the facts. After the fact table is

    created, OLAP tools can be used to preaggregate commonly accessed information.

    The star schema design helps to increase query performance by reducing the volume of data that is read from

    disk. Queries analyze data in the smaller dimension tables to obtain the dimension keys that index into the

    central fact table, reducing the number of rows to be scanned.
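The star schema pattern can be sketched with SQLite standing in for SQL Server; the table and column names here are invented, not the chapter's sample schema. The query filters the small dimension table first, and only the matching keys reach the large central fact table:

```python
import sqlite3

# A minimal star schema: one fact table keyed into two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    region_key  INTEGER REFERENCES dim_region(region_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 10, 100.0), (1, 2, 7, 70.0),
                               (2, 1, 5, 250.0);
""")

# A typical decision-support query: resolve the dimension member to its key,
# then aggregate the fact rows that carry that key.
row = con.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE p.name = 'widget'
""").fetchone()
print(row[0])  # 170.0
```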

    Snowflake Schema

    The snowflake schema is a variation of the star schema in which dimension tables are stored in a more normalized form. Normalizing the dimension tables reduces redundant data, which can reduce the number of disk reads for some queries, although the additional joins required can offset this benefit.

    Creating a Database Schema

    The database schema should support the business requirements rather than the typical query-driven requirements of an OLTP database design. For example, given the following database schema from an order

    entry system:


    The steps involved in converting this OLTP schema into a star schema include:

    Determining the fact and dimension tables.

    Designing the fact tables.

    Designing the dimension tables.

    Determining Fact and Dimension Tables

    It is important to determine correctly what existing tables and data in the operational systems should comprise the fact and dimension tables. If these are not correctly identified, then the data warehouse can suffer from

    poor performance, or may have to be redesigned at a later stage. Redesigning an operational data warehouse,

    possibly containing large amounts of data, can be a prohibitively expensive task.

    Although the process of determining the structure and composition of fact and dimension tables can be

    difficult, especially when there are multiple (heterogeneous) operational systems to consider, the two most important steps to follow are identifying the:

    Fundamental business transactions on which the data warehouse will focus (fact tables).

    Data associated with the business transactions that determine how business data will be analyzed

    (dimension tables and hierarchies).

    Identify Fundamental Business Transactions

    The first step involves identifying the transactions that describe the basic operations of the business which the data warehouse will be used to analyze. For example, using the sample order entry system described earlier, the underlying business transaction is a sale of a product. Identifying the fundamental business transactions yields the information that will be represented in the fact tables. The information needed to describe the sale


    of a product is largely found in the Order_Details table.

    When analyzing existing operational systems for potential fact tables, always look for the underlying business

    processes involved. Many operational systems are designed based on necessity rather than an accurate business model. For example, a school database may record only the grade per student for all subjects for a

    year because it does not have enough disk space to store the data at a lower level of detail.

    In this instance, a data warehouse used to store student data from all schools in a region should be designed to capture this summarized data as well as a lower level of detail when the schools are able to provide the information in the future. For example, the fact table might store details regarding the grades for each subject per student, per school, per region, per date period.

    Identify the Dimension Tables

    The next step involves identifying the entities that describe how the fact data will be analyzed. For example,

    given that the order entry system's fundamental transaction is the sale of a product, dimension data from the operational schema could include payment method, product name, date of sale, or shipping method. However, the dimension data chosen should represent the focus of the business analysis. As an example, the business

    analysis performed on the order entry data warehouse will include variations of:

    Sales of a specific product per region.

    Sales of a specific product per time period (for example, a quarter).

    All sales per region.

    All sales per time period.

    Therefore, the dimension tables will include product data, region data, and time period data. In this example,

    payment or shipping methods were not required because the business will not use the data warehouse to

    analyze that data.

    From the original order entry OLTP schema, all the fact and dimension data for the data warehouse can be found in the Customers, Orders, Products, and Order_Details tables.


    Designing Fact Tables

    The primary goal when designing fact tables is to minimize their size without compromising the data requirements. Fact tables are the largest tables in the database because they contain detail-level data representing the underlying business transactions. However, the costs of storing and maintaining these large tables should be considered. For example, larger tables require more online, and potentially offline, storage; take longer to back up and restore in the event of a system failure; and take longer to query when building OLAP aggregations.

    The easiest ways to reduce the size of fact tables include:

    Reducing the number of columns.

    Reducing the size of each column where possible.

    Archiving historical data into separate fact tables.

    Reducing the Number of Columns

    Remove any columns that are not required to analyze the operations of the business. For example, if the data

    does not represent a business transaction, or if the data can be derived using aggregates, remove the data from the fact table. Although aggregated columns often improve query performance, the size of a typical fact table can prohibit using them. For example, if the Order_Details fact table contains one billion rows, and a column, Total_Price, is added representing Quantity multiplied by UnitPrice, one billion new values now exist permanently in the table.

    Important If a column is moved from a fact table to another table, and is referenced frequently in queries

    involving data from the fact table, large join operations may be required. These joins can affect query performance. Therefore, the trade-off between reducing storage costs and affecting query performance should be determined.


    Although Order_Details forms the basis of the fact table, the OrderID column is not required in the final fact table because OrderDetailID is the unique identifier for the business transaction: a sale of a product. In fact, OrderID does not represent a single business transaction; it represents the sale of one or many products to a single customer, and so cannot be used.

    Reducing the Size of Each Column

    Because fact tables tend to have a large number of rows, even one redundant byte per row can add up to a large amount of wasted database space. For example, a fact table containing one billion rows, with one unused byte in one of the columns, represents almost 1 GB of unused database space. To reduce column widths:

    Ensure that all character and binary data is variable length.

    Use data types that require fewer bytes of storage where possible. For example, if a column contains integer values only in the range from 1 through 100, use tinyint rather than int, saving 3 bytes per row for that column.
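The arithmetic behind these savings, for the hypothetical billion-row fact table used in the examples above:

```python
# Back-of-the-envelope cost of one wasted byte per row, and of using a
# 4-byte int where a 1-byte tinyint would do, in a billion-row fact table.
rows = 1_000_000_000

wasted = rows * 1                       # one unused byte per row
print(round(wasted / 2**30, 2))         # 0.93 -> almost 1 GB of dead space

int_vs_tinyint = rows * (4 - 1)         # int is 4 bytes, tinyint is 1
print(round(int_vs_tinyint / 2**30, 2)) # 2.79 -> GB saved by the narrow type
```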

    Archiving Historical Data

    If data within fact tables is rarely used, such as sales data from several years ago, it may be useful to archive the data. This approach reduces the volume of data in the fact table, hence increasing the performance of queries. Exceptional queries on older data can be run against multiple fact tables without affecting the majority of users querying the fact tables containing recent data. When Microsoft SQL Server OLAP Services is used in conjunction with multiple fact tables, the OLAP Services engine manages queries against multiple back-end fact tables. This simplifies the management and use of multiple fact tables containing historical data.

    Designing Dimension Tables

    The primary goal in designing dimension tables is to denormalize the data that references the fact tables into

    single tables. The most commonly used dimension data should reference the fact tables directly, rather than

    indirectly through other tables. This approach minimizes the number of table joins, and speeds up performance.

    For example, the order entry star schema should support the business queries:

    Sales of a specific product per region

    All sales per region

    Currently, the dimension data describing a region (City, StateOrProvince, and Country) is part of the Customers table. However, Customers references Order_Details (fact data) using Orders.


    To better support the business analysis required, the region data should be placed into a new table, Region,

    directly referencing Order_Details. To implement this, a foreign key from the Region dimension table is added

    to Order_Details (now renamed to Sales). Any queries involving sales per region now require only a two-table join between the Region dimension table and the Sales fact table.

    Note The existing relationship between the Sales fact table and the Products dimension data is unchanged.

    Date and Time Information

    Date information is a common requirement in a data warehouse. To minimize the fact table width, a foreign key

    is often created in the fact table referencing a dimension table containing a representation of the date and/or

    time. The representation of the date depends on business analysis requirements.

    For example, the business analysis to be performed on the order entry system requires product sales summarized by month, quarter, and year. The date information should be stored in a form that represents these increments. This is achieved by creating a foreign key in the fact table referencing a date dimension table (named Period) containing the date of the sale in a month, quarter, year format. To increase the flexibility of this data, additional dimension tables are created, referenced by the Period dimension table, that contain months, quarters, and years in more detail. When designing dimension tables for use with Microsoft SQL Server OLAP Services, only a date is needed. The OLAP Services Time Wizard enables dates to be summarized into any combination of weeks, months, quarters, and years.
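A sketch of deriving the Period attributes from a plain date; the column names are invented for illustration, not taken from the chapter's sample schema:

```python
from datetime import date

def period_row(d: date) -> dict:
    """Derive the month/quarter/year attributes for one Period dimension row."""
    return {
        "date": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "year": d.year,
    }

row = period_row(date(2011, 3, 6))
print(row["quarter"], row["year"])  # 1 2011
```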

    Implementing the Database Design


    After the fact and dimension tables have been designed, the final step is to physically implement the database

    in Microsoft SQL Server.

    Creating the Database

    When creating the database, consider the partitioning strategy, if any, that may be used. SQL Server offers

    filegroups that can be used to stripe data, in addition to the disk striping available with Microsoft Windows NT

    and hardware-based implementations.

    Creating the Tables

When creating the tables used to store the fact and dimension data, consider creating the tables across the partitions available to the database, based on usage. For example, create separate fact tables containing data segmented by year or division on separate partitions (such as SQL Server filegroups) to improve read performance.
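
For example, yearly fact tables might be placed on separate filegroups like this (a sketch; the table and filegroup names are assumptions, and the filegroups must already exist in the database):

```sql
-- Fact data for each year on its own filegroup.
CREATE TABLE Sales_1997 (
    PeriodID    int   NOT NULL,
    RegionID    int   NOT NULL,
    ProductID   int   NOT NULL,
    SalesAmount money NOT NULL
) ON FG1997

CREATE TABLE Sales_1998 (
    PeriodID    int   NOT NULL,
    RegionID    int   NOT NULL,
    ProductID   int   NOT NULL,
    SalesAmount money NOT NULL
) ON FG1998
```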

    Creating Any User-defined Views

Create user-defined views if necessary. SQL Server views can be used to merge horizontally partitioned tables together logically, to serve as interfaces to predefined queries, or to act as a security mechanism.
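
A view that logically merges horizontally partitioned fact tables might look like this (the yearly table names Sales_1997 and Sales_1998 are hypothetical):

```sql
-- Present the yearly partitions as one logical fact table.
CREATE VIEW SalesAll AS
SELECT PeriodID, RegionID, ProductID, SalesAmount FROM Sales_1997
UNION ALL
SELECT PeriodID, RegionID, ProductID, SalesAmount FROM Sales_1998
```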

    Creating Indexes

    Indexes should be created to maximize performance. Consider creating indexes on:

    Key columns.

    Columns involved in joins.

    Multiple columns, to take advantage of index coverage.

    All dimension table keys used by the fact table.
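
Applied to a hypothetical Sales fact table, these guidelines might translate into statements such as (a sketch, not a prescription):

```sql
-- One index per dimension key used by the fact table.
CREATE CLUSTERED INDEX IX_Sales_Period ON Sales (PeriodID)
CREATE INDEX IX_Sales_Region  ON Sales (RegionID)
CREATE INDEX IX_Sales_Product ON Sales (ProductID)

-- A composite index can cover frequent queries entirely,
-- so the query never touches the underlying data pages.
CREATE INDEX IX_Sales_RegionAmount ON Sales (RegionID, SalesAmount)
```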

    See Also

    In Other Volumes

    "CREATE VIEW" in Microsoft SQL Server Transact-SQL and Utilities Reference

"Overview of Creating and Maintaining Databases" in Microsoft SQL Server Database Developer's Companion

"Indexes" in Microsoft SQL Server Database Developer's Companion

"Physical Database Design" in Microsoft SQL Server Diagnostics

    Extracting and Loading Data

Extracting and loading data from operational systems to a data warehouse varies in complexity. The process can be simple if there is a direct correlation between the source data and the data that should appear in the data warehouse: for example, if all the source data from a single operational system is in the correct format and does not have to be modified in any way. The process can also be complex: for example, if source data resides in multiple, heterogeneous operational systems and requires significant formatting and modification before loading.

    The extraction and load process involves:

    Validating data in the operational systems.

    Migrating data from the operational systems.

    Scrubbing data.

    Transforming data to the data warehouse.

    Validating Data


Before data is extracted from the operational systems, it may be necessary to ensure that the data is completely valid. If the data is not valid, the integrity of the business analysis relying on the data may be compromised. For example, a value representing a monetary transfer between banks in different countries must be in the correct currency.

Data should be validated at the source by business analysts who understand what the data represents. Any changes should be made in the operational systems, rather than the data warehouse, because the source data is incorrect regardless of where it is located.

Validating data can be a time-consuming process. The validation process can be automated by writing stored procedures that check the data for domain integrity. However, it may be necessary to validate data manually. If any invalid data is discovered, determine where the fault originated and correct any processes contributing to the error.

    For example, the data in the order entry system should be validated, to ensure that:

    Region information (City, State, Country) represents a valid city, state, country/region combination.

    Product information (ProductID, ProductName, UnitPrice) represents valid products.
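
Outside DTS, a stored procedure or ad hoc query can surface invalid rows directly; for example (ValidRegions is a hypothetical reference table of known-good combinations):

```sql
-- Rows whose region combination has no match in the reference table.
SELECT o.City, o.StateOrProvince, o.Country
FROM Orders o
LEFT JOIN ValidRegions v
    ON  o.City = v.City
    AND o.StateOrProvince = v.StateOrProvince
    AND o.Country = v.Country
WHERE v.City IS NULL
```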

    This information can be validated using the Data Transformation Services import and export wizards. A

    Microsoft ActiveX script, executed by the DTS Import and DTS Export wizards when copying data from the

    source to the destination, can determine if the region and product information is valid. Any invalid data can be

    saved to the exception log for later examination by business analysts, to determine why it is incorrect.

    See Also

    In Other Volumes

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    Migrating Data

Migrating data from operational systems typically involves copying the data to an intermediate database before it is finally copied to the data warehouse. Copying data to an intermediate database is necessary if the data must be scrubbed.

Copying data should ideally occur during a period of low activity on the operational system. Otherwise, system performance may degrade, affecting users. Additionally, if the data warehouse is composed of data from multiple interrelated operational systems, it is important to ensure that data migration occurs when the systems are synchronized. If the operational systems are not synchronized, the data in the warehouse can produce unexpected results when queried.

The DTS Import and DTS Export wizards can be used to create a series of tasks that copy data from many heterogeneous operational systems to an intermediate database running on Microsoft SQL Server. Alternatively, you can use a Microsoft ActiveX script with DTS to scrub the data, and then copy it directly to the SQL Server data warehouse, avoiding the need for an intermediate database.

    See Also

    In Other Volumes

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operationsand Replication

    Scrubbing Data

    Scrubbing data involves making the data consistent. It is possible that the same data is represented in

    different ways in multiple operational systems. For example, a product name might be abbreviated in one

operational system, but not in another. If the two values were not made consistent, any queries using the data would likely evaluate the values as different products. If the detail data in the data warehouse is to produce


    consistent information, the product name must be made consistent for all values.

    Data scrubbing can be achieved:

    Using the DTS Import and DTS Export wizards to modify data as it is copied from the operational system

    to the intermediate database, or directly to the data warehouse.

By writing a Microsoft ActiveX script, executed by a program using the DTS API, to connect to the data source and scrub the data. Any data manipulation that can be achieved using ActiveX scripting, or a programming language such as Microsoft Visual C++, can be performed on the data.

Using a DTS Lookup, which provides the ability to perform queries using one or more named, parameterized query strings that allow a custom transformation to retrieve data from locations other than the immediate source or destination row being transformed.

    For example, the data in the order entry system should be scrubbed, such as the values for:

    State (must always be a two-character value, such as WA).

    ProductName (must always be the full product name, with no abbreviations).
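
If the data is staged in an intermediate SQL Server database, the State scrub could also be performed in Transact-SQL rather than in a script (a sketch; StateCodes is a hypothetical lookup table mapping full names to two-character codes):

```sql
-- Replace spelled-out state names with their two-character codes.
UPDATE o
SET State = sc.StateCode
FROM Orders o
JOIN StateCodes sc ON o.State = sc.StateName
WHERE LEN(o.State) <> 2
```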

Using the DTS Import and DTS Export wizards, an ActiveX script can be executed during the copy process that checks the State value and changes it to a known two-character value. Alternatively, the ProductName value could be scrubbed by writing a Visual C++ program that calls the DTS API to execute Microsoft JScript scripts and other executable modules.

    See Also

    In Other Volumes

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "DTS Lookup" in Microsoft SQL Server Distributed Data Operations and Replication

    "Programming DTS Applications" in Microsoft SQL Server Building Applications

    Transforming Data

    During the data migration step, it is often necessary to transform operational data into a separate format

    appropriate to the data warehouse design. Transformation examples include:

    Changing all alphabetic characters to uppercase.

    Calculating new values based on existing data, including data aggregation and summarization.

    Breaking up a single data value into multiple values, such as a product code in nnnn-description format

    into separate code and description values, or a date value in MMDDYY format into separate month, day,and year values.

Merging separate data values into a single value, such as concatenating a first name value with a surname value.

    Mapping data from one representation to another, such as converting data values (1, 2, 3, 4) to (I, II,

    III, IV).

Data transformation also involves formatting and modifying extracted data from operational systems into merged or derived values that are more useful in the data warehouse. For example, copying the OrderDate value from the order entry system to the data warehouse star schema involves splitting the date into Month, Quarter, and Year components. These date components are required for the type of business analysis performed on the data warehouse.
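
The date split can be expressed directly in the extraction query with DATEPART; for example (column names are illustrative):

```sql
-- Derive Month, Quarter, and Year components from OrderDate.
SELECT OrderID,
       DATEPART(month,   OrderDate) AS [Month],
       DATEPART(quarter, OrderDate) AS [Quarter],
       DATEPART(year,    OrderDate) AS [Year]
FROM Orders
```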

The transformation process usually takes place during the migration process, when data is copied either directly from the operational sources or from an intermediate database after the data has been scrubbed.

    For complex data migrations, DTS provides skip return values to assist in splitting data into multiple tables.

    Data transformation and migration can be completed in a single step using the DTS Import and DTS Export

    wizards. Transforming and migrating data from the order entry OLTP operational system schema to the data

    warehouse star schema involves using the DTS Import and DTS Export wizards to:

    Create a query to extract all the required detail level (fact) data.

    Split OrderDate in the Orders table into Month, Quarter, and Year components and add to Period

    using a Microsoft ActiveX script.

    Extract the City, StateOrProvince, and Country data relating to the detail data and add to Region

    using an ActiveX script.

Perform a simple table copy of Products.

    Create a query to generate the data for Summary.

Each step, for example, can be built as a separate package, which is stored in the Microsoft SQL Server msdb database, and scheduled to be executed every Friday night at midnight.

In addition to performing insert-based transformations of data, DTS provides data-driven queries, in which data is read from the source and transformed, and a parameterized query is executed at the destination, using the transformed values in the destination row.

Note When using DTS to create fact tables for use with Microsoft SQL Server OLAP Services, do not create any aggregations while migrating the data. OLAP Services is specifically designed to create the optimal aggregations after the data warehouse has been populated with DTS. It is also unnecessary to segment a date into week, month, quarter, or year columns in the Time dimension table. The OLAP Services Time Wizard provides an automated facility for this type of time transformation.

    See Also

    In Other Volumes

    "Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data-Driven Queries" in Microsoft SQL Server Distributed Data Operations and Replication

    "Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations

    and Replication

    "Understanding Data Transformation Services" in Microsoft SQL Server Building Applications

    Designing and Processing Aggregations

OLAP tools are typically used to create and manage summary data. Microsoft SQL Server OLAP Services allows aggregations to be stored in a variety of formats and locations, with dynamic connections to underlying detail in the data warehouse. Summary data is often generated to satisfy the commonly executed queries in the data warehouse. Storing preaggregated data increases query performance and reduces the load on the data warehouse.

If a data warehouse is built so the data in it does not change, then preaggregating data in the fact table saves only the disk space required by the fact table. OLAP Services uses the processing time that would have been used to preaggregate in the fact table when it processes the fact table as it builds a cube. However, precalculated aggregations are stored in the cube and do not need to be recalculated for each query. If a hybrid OLAP (HOLAP) or relational OLAP (ROLAP) cube is used, the fact table is not copied into the cube as it is in multidimensional OLAP (MOLAP) cubes, so the overhead required to retain availability of the detail data is only the fact table size, not processing time or query response time.

The preaggregation strategy when designing a data warehouse for use by OLAP Services depends on the following variables:


    Stability of the data.

    If the source data changes, the preaggregations have to be performed each time, whether

    preaggregated in the fact table or in the OLAP cubes that have to be rebuilt from the fact table.

    Query response time.

    With properly designed OLAP cubes, the granularity of detail in the fact table has no effect on query

    response time for queries that do not access detail facts.

    Storage requirements.

A finer level of granularity in the fact table requires more storage for the fact table and for MOLAP cubes. This is a trade-off against detail availability and choice of OLAP cube storage mode. OLAP cubes tend to be large regardless of the storage type; therefore, the storage required to retain fine granularity in the fact table may not be particularly significant when compared to OLAP storage needs.

When designing the data warehouse for OLAP, the users' needs should drive the preaggregation strategy. The fact table should only be preaggregated to the level of granularity below which no user would want to access detail.
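
For example, if no user ever needs detail finer than product by month, the fact table could be loaded at that grain with a query along these lines (table and column names are assumptions):

```sql
-- Aggregate detail rows to the product-by-month grain before loading.
SELECT s.ProductID, p.[Month], p.[Year],
       SUM(s.SalesAmount) AS TotalSales
INTO SalesSummary
FROM Sales s
JOIN Period p ON s.PeriodID = p.PeriodID
GROUP BY s.ProductID, p.[Month], p.[Year]
```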

For more information, see your OLAP Services documentation.

    Maintaining the Data Warehouse

Maintenance of the data warehouse is an ongoing task that should be designed before the data warehouse is made available to users. Maintenance involves:

    Implementing a backup and recovery mechanism to protect the data in the event of a system failure, or

    some other problem.

Archiving the database. This may be necessary to purge the database of unused historical data and free up space.

    Running SQL Server Profiler to determine which indexes to create to enhance query performance.

    See Also

    In Other Volumes

"Monitoring with SQL Server Profiler" in Microsoft SQL Server Administrator's Companion

    Top Of Page

    Data Transformation Services Data Warehousing Support

Using Data Transformation Services (DTS), you can import and export data between multiple heterogeneous sources using an OLE DB-based architecture, and transfer databases and database objects (for example, indexes and stored procedures) between computers running Microsoft SQL Server version 7.0. You can also use the data transformation capabilities of DTS to build a data warehouse from an online transaction processing (OLTP) system. You can build data warehouses and data marts in SQL Server by importing and transferring data from multiple heterogeneous sources interactively, or automatically on a regularly scheduled basis.

DTS components include the DTS Import Wizard, DTS Export Wizard, and DTS Designer, which are available through SQL Server Enterprise Manager. DTS also includes COM programming interfaces you can use to create customized import, export, and transformation applications.

A transformation is the set of operations applied to source data before it is stored at the destination during the process of building a data warehouse. For example, the DTS transformation capability allows calculating new values from one or more source columns, or even breaking a single column into multiple values to be stored in separate destination columns. Transformations, therefore, make it easy to implement complex data validation, scrubbing, and enhancement during import and export.

    Data Transformation Services (DTS) allows you to import, export, or transform data in a process that can be


    saved as a package. Each package defines a workflow that includes one or more tasks executed in a

    coordinated sequence as steps. Tasks can copy data from a source to a destination, transform data using a

    Microsoft ActiveX script, execute an SQL statement at a server, or even run an external program. Tasks can

    also transfer database objects between computers running SQL Server 7.0.

A DTS package can be created manually by using a language that supports OLE Automation, such as Microsoft Visual Basic, or interactively by using the Data Transformation Services wizards or DTS Designer. After a DTS package has been created and saved, it is completely self-contained and can be retrieved and run using SQL Server Enterprise Manager or the dtsrun utility.

DTS packages can be stored in the Microsoft Repository, providing the ability to record data lineage. This allows you to determine the source of any piece of data and the transformations applied to that data. Data lineage can be tracked at the package and row levels of a table, providing a complete audit trail of data transformation and DTS package execution information in your data warehouse.

DTS Designer is a graphical design environment for creating and executing complex sets of data transformations and workflows, in preparation for moving data to a data warehouse. Experienced users can use DTS Designer to integrate, consolidate, and transform heterogeneous data from multiple sources. Packages created can be stored in the SQL Server msdb database, the Repository, or a COM-structured storage file.

The visual objects used by DTS Designer are based on the DTS object model, an API that includes objects, properties, methods, and collections designed for programs that copy and transform data from an OLE DB data source to an OLE DB destination. This object model can be accessed through ActiveX scripts from within DTS Designer, and through external programs written in languages such as Visual Basic and Microsoft Visual C++.

    You can also access custom programs through DTS Designer, and include their tasks and icons as part of the

    package. Because DTS Designer accesses an underlying programming model, it does most of the programming

    work for you.

    See Also

    In Other Volumes

    "DTS Designer" in Microsoft SQL Server Distributed Data Operations and Replication

    "Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and

    Replication

    Top Of Page

    OLAP Services Data Warehousing Support

    Microsoft SQL Server OLAP Services provides online analytical processing (OLAP) services to applications.

    OLAP focuses on finding trends in aggregated or summarized data. The main objects used by OLAP programs

are multidimensional cubes. A multidimensional cube records a set of data derived from fact tables and dimensions. A fact table records data about a set of transactions. Measures are numeric columns from the fact table that are of interest to the user. For example, a car sales fact table could provide measures such as sale price, invoice cost, tax paid, and discount. A cube represents how these measures vary over multiple dimensions, such as by car dealer, by location, by customer, or by date.

    OLAP Services provides the capability of designing, creating, and managing cubes from a data warehouse, and

    then making them available to client applications written using either the OLE DB 2.0 OLAP extensions, or the

    Microsoft ActiveX Data Objects 2.0 Multidimensional Objects (ADO MD).


    2011 Microsoft. All rights reserved.

The OLAP server performs multidimensional queries of data and stores the results in its multidimensional storage. It speeds the analysis of fact tables into cubes, stores the cubes until needed, and then quickly returns the data to clients.

The OLAP server is managed through an API called Microsoft Decision Support Objects (DSO). OLAP Services provides a snap-in for the Microsoft Management Console (MMC). This MMC snap-in uses DSO to provide administrators with an easy-to-use graphical interface for defining, creating, and managing the cubes built by the OLAP server. DSO can also be called from custom applications, which in turn can be added to the OLAP Manager as an add-in.

OLAP Services passes its multidimensional data to a middle-tier PivotTable Service. The PivotTable Service operates as an OLE DB for OLAP provider. It exposes the multidimensional data to applications using either the OLE DB 2.0 OLAP extensions, or the ADO MD API that wraps the OLE DB OLAP extensions.

    See Also

    In This Volume

    Installing OLAP Services

    Server Improvements and New Features

    Top Of Page
