Chapter 16

McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.

Data Warehouse Technology and Management

16-2

Outline

Basic concepts and characteristics
Business architectures and applications
Data cube concepts and operators
Relational DBMS features
Maintaining a data warehouse

16-3

Comparison of Environments

Transaction processing
    Uses operational databases
    Short-term decisions: fulfill orders, resolve complaints, provide staffing
Decision support processing
    Uses integrated and summarized data
    Medium and long-term decisions: capacity planning, store locations, new lines of business

16-4

Definition and Characteristics

A central repository for summarized and integrated data from operational databases and external data sources

Key characteristics
    Subject-oriented
    Integrated
    Time-variant
    Nonvolatile

16-5

Data Comparison

Characteristic               Operational Database   Data Warehouse
Currency                     Current                Historical
Detail level                 Individual             Individual and summary
Orientation                  Process orientation    Subject orientation
Number of records processed  Few                    Thousands
Normalization level          Mostly normalized      Mostly denormalized
Update level                 Volatile               Nonvolatile (refreshed)
Data model                   Relational             Multidimensional and relational with star schemas

16-6

Architectures and Applications

Data warehouse projects
Top-down architectures
Bottom-up architecture
Applications and data mining

16-7

Data Warehouse Projects

Large efforts with much coordination across departments
Enterprise data model
    Important artifact of data warehouse project
    Structure of data model
    Meta data for data transformation
Top-down vs. bottom-up business architectures

16-8

Two Tier Architecture

[Figure: two-tier architecture. Labels: operational databases, external data source, transformation process, data warehouse (detailed and summarized data, EDM), data warehouse server, user departments.]

16-9

Three Tier Architecture

[Figure: three-tier architecture. Labels: operational databases, external data source, transformation process, data warehouse (detailed and summarized data, EDM), data warehouse server, extraction process, data mart tier (data marts), user departments.]

16-10

Bottom-up Architecture

[Figure: bottom-up architecture. Labels: operational databases, external data source, transformation process, data mart tier (data marts), user departments.]

16-11

Applications

Industry            Key Applications
Airline             Yield management, route assessment
Telecommunications  Customer retention, network design
Insurance           Risk assessment, product design, fraud detection
Retail              Target marketing, supply-chain management

16-12

Maturity Model

Guidance for investment decisions
Stages provide a framework to view an organization’s progress
Insights: difficulty moving between stages
    Infant to child stages because of investment level
    Teenager to adult because of strategic importance of data warehouse

16-13

Data Mining

Discover significant, implicit patterns
    Target promotions
    Change mix and collocation of items
Requires large volumes of transaction data
Important application for data warehouses

16-14

Data Cube Concepts and Operators

Basics
Dimension and measure details
Operators

16-15

Data Cube Basics

Multidimensional arrangement of data
Users think about decision support data as data cubes
Terminology
    Dimension: subject label for a row or column
    Member: value of dimension
    Measure: quantitative data stored in cells

16-16

Data Cube Example

Dimensions: Location, Product, Time (1/1/2006, 1/2/2006, ..., 12/31/2006)

Cell values shown on the visible face of the cube:

Location     Mono Laser   Ink Jet   Photo   Portable
California   80           110       60      25
Utah         40           90        50      30
Arizona      70           55        60      35
Washington   75           85        45      45
Colorado     65           45        85      60

16-17

Dimensions and Measures

Dimensions
    Hierarchies: members can have sub members
    Sparsity: many cells do not have data
Measures
    Derived measures
    Multiple measures in cells

16-18

Time Series Data

Common data type in trend analysis
Reduce dimensionality using time series
Time series properties
    Data type
    Start date
    Calendar
    Periodicity
    Conversion

16-19

Slice Operator

Focus on a subset of dimensions
Set dimension to specific value: 1/1/2006

(Rows: Location; columns: Product)

Location     Mono Laser   Ink Jet   Photo   Portable
California   80           110       60      25
Utah         40           90        50      30
Arizona      70           55        60      35
Washington   75           85        45      45
Colorado     65           45        85      60
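A minimal sketch of the slice operator, assuming the cube is held as a Python dictionary keyed by (location, product, date). The dictionary layout, the function names, and the 1/2/2006 values are illustrative assumptions, not part of the slides. It shows both forms of slicing: fixing a dimension to a single member, and replacing a dimension with a summary of its measure values.

```python
# Cube cells keyed by (location, product, date). The 1/1/2006 values come from
# the slide; the 1/2/2006 values are invented for illustration.
cube = {
    ("California", "Mono Laser", "1/1/2006"): 80,
    ("California", "Ink Jet", "1/1/2006"): 110,
    ("Utah", "Mono Laser", "1/1/2006"): 40,
    ("Utah", "Ink Jet", "1/1/2006"): 90,
    ("California", "Mono Laser", "1/2/2006"): 82,
    ("Utah", "Ink Jet", "1/2/2006"): 95,
}

TIME_AXIS = 2  # position of the Time dimension in each key


def slice_cube(cells, axis, member):
    """Fix one dimension to a single member and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: value
        for key, value in cells.items()
        if key[axis] == member
    }


def slice_summary(cells, axis):
    """Replace one dimension with the sum of its measure values."""
    totals = {}
    for key, value in cells.items():
        reduced = key[:axis] + key[axis + 1:]
        totals[reduced] = totals.get(reduced, 0) + value
    return totals


front_face = slice_cube(cube, TIME_AXIS, "1/1/2006")
all_dates = slice_summary(cube, TIME_AXIS)
```

Summing is only appropriate for additive measures; other measures would need a different summary function.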

16-20

Dice Operator

Focus on a subset of member values
Replace dimension with a subset of values
Dice operation often follows a slice operation

(Rows: Location; columns: Product)

Location   Mono Laser   Ink Jet   Photo   Portable
Utah       40           90        50      30
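The dice shown here can be sketched as a filter on member values. This is an illustration under assumptions: the two-dimensional dictionary and the `dice_cube` name are hypothetical, and the values are those of the 1/1/2006 slice.

```python
# The 1/1/2006 slice from the slides, keyed by (location, product).
products = ["Mono Laser", "Ink Jet", "Photo", "Portable"]
rows = {
    "California": [80, 110, 60, 25],
    "Utah": [40, 90, 50, 30],
    "Arizona": [70, 55, 60, 35],
    "Washington": [75, 85, 45, 45],
    "Colorado": [65, 45, 85, 60],
}
face = {
    (location, product): value
    for location, values in rows.items()
    for product, value in zip(products, values)
}


def dice_cube(cells, axis, members):
    """Keep only the cells whose member on the given dimension is in `members`."""
    keep = set(members)
    return {key: value for key, value in cells.items() if key[axis] in keep}


utah_only = dice_cube(face, axis=0, members=["Utah"])
```

Unlike a slice, a dice keeps the dimension in the result; it only narrows the set of members.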

16-21

Other Operators

Operators for hierarchical dimensions
    Drill-down: add detail to a dimension
    Roll-up: remove detail from a dimension
    Recalculate measure values
Pivot: rearrange dimensions
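A rough sketch of roll-up and pivot on a dictionary-based cube. The state-to-region hierarchy is a made-up example (the slides do not define one), and the function names are hypothetical. Roll-up recalculates measure values by summing, which assumes an additive measure.

```python
from collections import defaultdict

# A few cells from the 1/1/2006 slice, keyed by (location, product).
face = {
    ("California", "Mono Laser"): 80,
    ("Utah", "Mono Laser"): 40,
    ("Arizona", "Mono Laser"): 70,
    ("California", "Ink Jet"): 110,
    ("Utah", "Ink Jet"): 90,
}

# Hypothetical hierarchy: each state (specific level) belongs to a region (general level).
region_of = {"California": "Pacific", "Utah": "Mountain", "Arizona": "Mountain"}


def roll_up(cells, axis, parent_of):
    """Navigate to a more general level and recalculate measure values by summing."""
    totals = defaultdict(int)
    for key, value in cells.items():
        general_key = key[:axis] + (parent_of[key[axis]],) + key[axis + 1:]
        totals[general_key] += value
    return dict(totals)


def pivot(cells, order):
    """Rearrange the dimensions of every key according to `order`."""
    return {tuple(key[i] for i in order): value for key, value in cells.items()}


by_region = roll_up(face, axis=0, parent_of=region_of)
product_first = pivot(face, order=(1, 0))
```

Drill-down is the reverse navigation and needs the detailed cells to still be available, which is one reason the grain of stored data matters.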

16-22

Operator Summary

Operator    Purpose                                        Description
Slice       Focus attention on a subset of dimensions      Replace a dimension with a single member value or with a summary of its measure values
Dice        Focus attention on a subset of member values   Replace a dimension with a subset of members
Drill-down  Obtain more detail about a dimension           Navigate from a more general level to a more specific level
Roll-up     Summarize details about a dimension            Navigate from a more specific level to a more general level
Pivot       Present data in a different order              Rearrange the dimensions in a data cube

16-23

Relational DBMS Support

Data modeling
Dimension representation
GROUP BY extensions
Materialized views and query rewriting
Storage structures and optimization

16-24

Relational Data Modeling

Dimension table: contains member values
Fact table: contains measure values
1-M relationships from dimension to fact tables
Grain: most detailed measure values stored

16-25

Star Schema Example

Customer (dimension): CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
Store (dimension): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
Item (dimension): ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
TimeDim (dimension): TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
Sales (fact): SalesNo, SalesUnits, SalesDollar, SalesCost

1-M relationships from dimension tables to Sales: ItemSales, CustSales, TimeSales, StoreSales
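A small runnable sketch of a star schema, using Python's built-in sqlite3 module rather than the DBMS the slides assume. Only a few columns from the Store, TimeDim, and Sales tables are kept, the foreign key columns in Sales are inferred from the StoreSales and TimeSales relationships, and all data values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold member values.
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);
-- The fact table holds measure values plus one foreign key per dimension (1-M).
CREATE TABLE Sales (
    SalesNo INTEGER PRIMARY KEY,
    StoreId INTEGER REFERENCES Store(StoreId),
    TimeNo INTEGER REFERENCES TimeDim(TimeNo),
    SalesDollar REAL
);
INSERT INTO Store VALUES (1, 'CO', 'USA'), (2, 'BC', 'Canada');
INSERT INTO TimeDim VALUES (10, 1, 2005), (11, 2, 2005);
INSERT INTO Sales VALUES (100, 1, 10, 50.0), (101, 1, 11, 70.0), (102, 2, 10, 30.0);
""")

# A typical star join: the fact table joined to its dimensions, then summarized.
sales_by_state = conn.execute("""
    SELECT StoreState, SUM(SalesDollar)
    FROM Sales
    JOIN Store ON Sales.StoreId = Store.StoreId
    JOIN TimeDim ON Sales.TimeNo = TimeDim.TimeNo
    WHERE TimeYear = 2005
    GROUP BY StoreState
    ORDER BY StoreState
""").fetchall()
```

The grain here is one row per sale; a coarser grain (for example, daily totals per store) would shrink the fact table at the cost of detail.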

16-26

Constellation Schema

Store (dimension): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
Inventory (fact): InvNo, InvQOH, InvCost, InvReturns
Supplier (dimension): SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation

1-M relationships to Sales: ItemSales, CustSales, TimeSales, StoreSales
1-M relationships to Inventory: SuppInv, ItemInv, StoreInv, TimeInv

16-27

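Snowflake Schema Example

Store (dimension): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation
Division (dimension): DivId, DivName, DivManager

1-M relationships to Sales: ItemSales, CustSales, TimeSales, StoreSales
1-M relationship from Division to Store: DivStore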

16-28

Handling M-N Relationships

Source data may have M-N relationships, not 1-M relationships
Adjust fact or dimension tables for a fixed number of exceptions
More complex solutions to support M-N relationships with a variable number of connections

16-29

Time Representation

Timestamp
Time dimension table for organization-specific calendar features
Two fact tables for international operations
Accumulating fact table for representation of multiple events

16-30

Level of Historical Integrity

Primarily an issue for dimension updates
Type I: overwrite old values
Type II: version numbers for an unlimited history
Type III: new columns for a limited history

16-31

Historical Integrity Example

Type II representation
    Customer: CustId, VersionNo, CustName, CustPhone, CustStreet, CustCity, CustCityBegEffDate, CustCityEndEffDate, CustState, CustZip, CustNation

Type III representation
    Customer: CustId, CustName, CustPhone, CustStreet, CustCityCurr, CustCityCurrBegEffDate, CustCityCurrEndEffDate, CustCityPrev, CustCityPrevBegEffDate, CustCityPrevEndEffDate, CustCityPast, CustCityPastBegEffDate, CustCityPastEndEffDate, CustState, CustZip, CustNation
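One way a Type II update could work, sketched in Python under assumptions: the `CustomerVersion` class, the function name, and the convention that an open end date (None) marks the current version are all illustrative, and only the city-related columns are modeled.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerVersion:
    cust_id: int
    version_no: int
    cust_city: str
    beg_eff_date: date
    end_eff_date: Optional[date]  # None marks the current version (an assumption)


def apply_type2_city_change(
    history: List[CustomerVersion], cust_id: int, new_city: str, change_date: date
) -> List[CustomerVersion]:
    """Close the customer's current version and append a new one; old rows are kept."""
    updated = []
    latest_version = 0
    for row in history:
        if row.cust_id == cust_id:
            latest_version = max(latest_version, row.version_no)
            if row.end_eff_date is None:
                row = replace(row, end_eff_date=change_date)
        updated.append(row)
    updated.append(
        CustomerVersion(cust_id, latest_version + 1, new_city, change_date, None)
    )
    return updated


history = [CustomerVersion(1, 1, "Denver", date(2005, 1, 1), None)]
history = apply_type2_city_change(history, 1, "Boulder", date(2006, 6, 1))
```

A Type I update would simply overwrite `cust_city`, and a Type III update would shift the current value into the "previous" columns, so its history is limited by the number of column sets.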

16-32

Dimension Representation

Star schema and variations lack dimension representation
Explicit dimension representation important to data cube operations and optimization
Proprietary extensions for dimension representation
Represent levels, hierarchies, and constraints

16-33

Oracle Dimension Representation

Levels: dimension components
Hierarchies: may have multiple hierarchies
Constraints: functional dependency relationships

16-34

CREATE DIMENSION Example

CREATE DIMENSION StoreDim
  LEVEL StoreId IS Store.StoreId
  LEVEL City    IS Store.StoreCity
  LEVEL State   IS Store.StoreState
  LEVEL Zip     IS Store.StoreZip
  LEVEL Nation  IS Store.StoreNation
  LEVEL DivId   IS Division.DivId
  HIERARCHY CityRollup (
    StoreId CHILD OF City CHILD OF State CHILD OF Nation )
  HIERARCHY ZipRollup (
    StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )
  HIERARCHY DivisionRollup (
    StoreId CHILD OF DivId
    JOIN KEY Store.DivId REFERENCES DivId )
  ATTRIBUTE DivId DETERMINES Division.DivName
  ATTRIBUTE DivId DETERMINES Division.DivManager ;

16-35

GROUP BY Extensions

ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
    Ranking
    Ratios
    Moving summary values

16-36

CUBE Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear = 2005
  GROUP BY CUBE (StoreZip, TimeMonth);
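To make the effect of CUBE concrete, the sketch below emulates it in plain Python: it totals a measure for every subset of the grouping columns. The rows are invented stand-ins for the joined Sales, Store, and Time data, `cube_totals` is a hypothetical helper, and None plays the role of the NULL that SQL shows in subtotal rows.

```python
from collections import defaultdict
from itertools import combinations

# Invented joined rows: (StoreZip, TimeMonth, SalesDollar).
rows = [("80111", 1, 100), ("80111", 2, 150), ("80222", 1, 200)]


def cube_totals(rows, n_dims):
    """Sum the last field for every subset of the first n_dims columns."""
    totals = defaultdict(int)
    for size in range(n_dims + 1):
        for kept in combinations(range(n_dims), size):
            for row in rows:
                # Columns outside the current subset are blanked out with None.
                key = tuple(row[i] if i in kept else None for i in range(n_dims))
                totals[key] += row[-1]
    return dict(totals)


result = cube_totals(rows, n_dims=2)
```

With two grouping columns there are four grouping combinations: (StoreZip, TimeMonth), StoreZip alone, TimeMonth alone, and the grand total. In general, n columns give 2^n combinations, so CUBE output grows quickly.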

16-37

ROLLUP Example

SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear BETWEEN 2005 AND 2006
  GROUP BY ROLLUP (TimeMonth, TimeYear);
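ROLLUP differs from CUBE in that it only drops grouping columns from right to left. A plain-Python emulation, with invented rows and a hypothetical `rollup_totals` helper, shows which subtotals ROLLUP (TimeMonth, TimeYear) yields; None stands in for SQL NULL.

```python
from collections import defaultdict

# Invented joined rows: (TimeMonth, TimeYear, SalesDollar).
rows = [(1, 2005, 100), (1, 2006, 120), (2, 2005, 80)]


def rollup_totals(rows, n_dims):
    """Sum the last field for each prefix of the first n_dims columns."""
    totals = defaultdict(int)
    for prefix_len in range(n_dims, -1, -1):
        for row in rows:
            # Keep the leading prefix_len columns; blank out the rest with None.
            key = tuple(row[i] if i < prefix_len else None for i in range(n_dims))
            totals[key] += row[-1]
    return dict(totals)


result = rollup_totals(rows, n_dims=2)
```

Column order matters: with (TimeMonth, TimeYear) the subtotals are per month across years, while (TimeYear, TimeMonth) would give per-year subtotals. For n columns, ROLLUP produces n + 1 grouping combinations.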

16-38

GROUPING SETS Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear = 2005
  GROUP BY GROUPING SETS ((StoreZip, TimeMonth), StoreZip, TimeMonth, ());

16-39

Variations of the Grouping Operators

Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPING SETS operation

16-40

Materialized Views

Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query-dominant environments

16-41

Materialized View Example

CREATE MATERIALIZED VIEW MV1
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE AS
  SELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1
    FROM Sales, Store, Time
    WHERE Sales.StoreId = Store.StoreId
      AND Sales.TimeNo = Time.TimeNo
      AND TimeYear > 2003
    GROUP BY StoreState, TimeYear;

16-42

Query Rewriting

Substitution process
Materialized view replaces references to fact and dimension tables in a query
Query optimizer must evaluate whether the substitution will improve performance over the original query
More complex than query modification process for traditional views

16-43

Query Rewriting Process

QueryFD -> Rewrite -> QueryMV -> SQL Engine -> Results

QueryFD: query that references fact and dimension tables

QueryMV: rewrite of QueryFD such that materialized views are substituted for fact and dimension tables whenever justified by expected performance improvements.

16-44

Query Rewriting Matching

Row conditions: query conditions at least as restrictive as MV conditions

Grouping detail: query grouping columns at least as general as MV grouping columns

Grouping dependencies: query columns must match or be derivable by functional dependencies

Aggregate functions: query aggregate functions must match or be derivable from MV aggregate functions

16-45

Query Rewriting Example

-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND StoreNation IN ('USA','Canada')
    AND TimeYear = 2005
  GROUP BY StoreState, TimeYear;

-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
  FROM MV1, Store
  WHERE MV1.StoreState = Store.StoreState
    AND TimeYear = 2005
    AND StoreNation IN ('USA','Canada');
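The equivalence of the two queries can be checked on a toy data set. The sketch below uses Python's sqlite3 module, which has no materialized views, so MV1 is approximated by an ordinary summary table; the time table is named TimeDim as in the star schema slide, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeYear INTEGER);
CREATE TABLE Sales (SalesNo INTEGER PRIMARY KEY, StoreId INTEGER,
                    TimeNo INTEGER, SalesDollar REAL);
INSERT INTO Store VALUES (1, 'CO', 'USA'), (2, 'CO', 'USA'),
                         (3, 'BC', 'Canada'), (4, 'NL', 'Mexico');
INSERT INTO TimeDim VALUES (1, 2004), (2, 2005);
INSERT INTO Sales VALUES (1, 1, 2, 10), (2, 2, 2, 20), (3, 3, 2, 5),
                         (4, 4, 2, 7), (5, 1, 1, 99);
-- Stand-in for MV1: a plain summary table built with the MV's defining query.
CREATE TABLE MV1 AS
  SELECT StoreState, TimeYear, SUM(SalesDollar) AS SumDollar1
  FROM Sales, Store, TimeDim
  WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = TimeDim.TimeNo
    AND TimeYear > 2003
  GROUP BY StoreState, TimeYear;
""")

# QueryFD: references the fact and dimension tables.
query_fd = conn.execute("""
    SELECT StoreState, TimeYear, SUM(SalesDollar)
    FROM Sales, Store, TimeDim
    WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = TimeDim.TimeNo
      AND StoreNation IN ('USA', 'Canada') AND TimeYear = 2005
    GROUP BY StoreState, TimeYear
    ORDER BY 1
""").fetchall()

# QueryMV: the rewrite, with MV1 substituted for Sales and TimeDim.
query_mv = conn.execute("""
    SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
    FROM MV1, Store
    WHERE MV1.StoreState = Store.StoreState
      AND TimeYear = 2005 AND StoreNation IN ('USA', 'Canada')
    ORDER BY 1
""").fetchall()
```

DISTINCT is needed because MV1 joins to Store on StoreState, which matches every store in a state. The rewrite also relies on each StoreState value belonging to a single nation; if two nations shared a state code, the MV1 totals would mix their sales.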

16-46

Storage and Optimization Technologies

MOLAP: direct storage and manipulation of data cubes
ROLAP: relational extensions to support multidimensional data
HOLAP: combine MOLAP and ROLAP storage engines

16-47

ROLAP Techniques

Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution

16-48

Maintaining a Data Warehouse

Data sources
Workflow representation
Optimizing the refresh process

16-49

Data Sources

Cooperative
    Notification using triggers
    Requires source system changes
Logged
    Readily available
    Extraneous data in logs
Queryable
    Queries using timestamps
    Requires timestamps in source data
Snapshot
    Periodic dumps of source data
    Significant processing for difference operations
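The difference operation needed for snapshot sources can be sketched as a comparison of two dumps keyed by primary key. The function name and the customer rows are hypothetical; a real refresh would work on files or tables far larger than these dictionaries, which is where the processing cost comes from.

```python
def snapshot_diff(old, new):
    """Compare two snapshots (dicts keyed by primary key).

    Returns the rows inserted, deleted, and changed between the dumps.
    """
    inserted = {key: new[key] for key in new.keys() - old.keys()}
    deleted = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return inserted, deleted, changed


# Invented customer snapshots: CustId -> (CustName, CustCity).
yesterday = {1: ("Ann", "Denver"), 2: ("Bob", "Provo"), 3: ("Cal", "Tucson")}
today = {1: ("Ann", "Boulder"), 3: ("Cal", "Tucson"), 4: ("Dee", "Seattle")}

inserted, deleted, changed = snapshot_diff(yesterday, today)
```

Queryable sources avoid this comparison by selecting rows whose timestamp is later than the last refresh, at the price of requiring timestamps in the source data.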

16-50

Maintenance Workflow

[Figure: maintenance workflow. Phases: Preparation, Integration, Update. Task labels: Notification, Extraction, Transportation, Cleaning, Auditing, Merging, Auditing, Propagation.]

16-51

Data Quality Problems

Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times

16-52

ETL Tools

Extraction, Transformation, and Loading
Specification-based
Eliminate custom coding
Third-party and DBMS-based tools

16-53

Refresh Processing

[Figure: refresh processing. Labels: Accounting, Unknown Processes, External Data Sources, Internal Data Sources, ETL Tools, Data Warehouse, Valid Time Lag, Load Time Lag, Fact and Dimension Changes, Primarily Dimension Changes.]

16-54

Determining the Refresh Frequency

Maximize net refresh benefit
    Value of data timeliness
    Cost of refresh
Satisfy data warehouse and source system constraints

16-55

Refresh Constraints

Source access: restrictions on time and frequency

Integration: restrictions that require concurrent reconciliation

Completeness/consistency: loading in the same refresh period

Availability: load scheduling restrictions due to storage capacity, online availability, and server usage

16-56

Summary

Data warehouse requirements differ from transaction processing.
Architecture choice is important.
Multidimensional data model is intuitive.
Relational representation and storage techniques are significant.
Maintaining a data warehouse is an important, operational problem.
