McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Chapter 16
Data Warehouse Technology and Management
16-2
Outline
- Basic concepts and characteristics
- Business architectures and applications
- Data cube concepts and operators
- Relational DBMS features
- Maintaining a data warehouse
16-3
Comparison of Environments
- Transaction processing
  - Uses operational databases
  - Short-term decisions: fulfill orders, resolve complaints, provide staffing
- Decision support processing
  - Uses integrated and summarized data
  - Medium and long-term decisions: capacity planning, store locations, new lines of business
16-4
Definition and Characteristics
- A central repository for summarized and integrated data from operational databases and external data sources
- Key characteristics
  - Subject-oriented
  - Integrated
  - Time-variant
  - Nonvolatile
16-5
Data Comparison

Characteristic                Operational Database   Data Warehouse
Currency                      Current                Historical
Detail level                  Individual             Individual and summary
Orientation                   Process orientation    Subject orientation
Number of records processed   Few                    Thousands
Normalization level           Mostly normalized      Mostly denormalized
Update level                  Volatile               Nonvolatile (refreshed)
Data model                    Relational             Multidimensional and relational with star schemas
16-6
Architectures and Applications
- Data warehouse projects
- Top-down architectures
- Bottom-up architecture
- Applications and data mining
16-7
Data Warehouse Projects
- Large efforts with much coordination across departments
- Enterprise data model
  - Important artifact of data warehouse project
  - Structure of data model
  - Meta data for data transformation
- Top-down vs. bottom-up business architectures
16-8
Two Tier Architecture
[Figure: two tier architecture. Components: operational databases, external data source, transformation process, data warehouse (detailed and summarized data), EDM, data warehouse server, user departments.]
16-9
Three Tier Architecture
[Figure: three tier architecture. Components: operational databases, external data source, transformation process, data warehouse (detailed and summarized data), EDM, data warehouse server, extraction process, data mart tier with data marts, user departments.]
16-10
Bottom-up Architecture
[Figure: bottom-up architecture. Components: operational databases, external data source, transformation process, data mart tier with data marts, user departments.]
16-11
Applications

Industry             Key Applications
Airline              Yield management, route assessment
Telecommunications   Customer retention, network design
Insurance            Risk assessment, product design, fraud detection
Retail               Target marketing, supply-chain management
16-12
Maturity Model
- Guidance for investment decisions
- Stages provide a framework to view an organization’s progress
- Insights: difficulty moving between stages
  - Infant to child stages because of investment level
  - Teenager to adult because of strategic importance of data warehouse
16-13
Data Mining
- Discover significant, implicit patterns
  - Target promotions
  - Change mix and collocation of items
- Requires large volumes of transaction data
- Important application for data warehouses
16-14
Data Cube Concepts and Operators
- Basics
- Dimension and measure details
- Operators
16-15
Data Cube Basics
- Multidimensional arrangement of data
- Users think about decision support data as data cubes
- Terminology
  - Dimension: subject label for a row or column
  - Member: value of dimension
  - Measure: quantitative data stored in cells
16-16
Data Cube Example
[Figure: data cube with dimensions Location, Product, and Time (1/1/2006, 1/2/2006, ..., 12/31/2006). Cell values shown:]

Location     Mono Laser   Ink Jet   Photo   Portable
California   80           110       60      25
Utah         40           90        50      30
Arizona      70           55        60      35
Washington   75           85        45      45
Colorado     65           45        85      60
16-17
Dimensions and Measures
- Dimensions
  - Hierarchies: members can have sub members
  - Sparsity: many cells do not have data
- Measures
  - Derived measures
  - Multiple measures in cells
16-18
Time Series Data
- Common data type in trend analysis
- Reduce dimensionality using time series
- Time series properties
  - Data type
  - Start date
  - Calendar
  - Periodicity
  - Conversion
16-19
Slice Operator
- Focus on a subset of dimensions
- Set dimension to specific value: 1/1/2006

Location     Mono Laser   Ink Jet   Photo   Portable
California   80           110       60      25
Utah         40           90        50      30
Arizona      70           55        60      35
Washington   75           85        45      45
Colorado     65           45        85      60
16-20
Dice Operator
- Focus on a subset of member values
- Replace dimension with a subset of values
- Dice operation often follows a slice operation

Location   Mono Laser   Ink Jet   Photo   Portable
Utah       40           90        50      30
16-21
Other Operators
- Operators for hierarchical dimensions
  - Drill-down: add detail to a dimension
  - Roll-up: remove detail from a dimension
  - Recalculate measure values
- Pivot: rearrange dimensions
16-22
Operator Summary

Slice
  Purpose: Focus attention on a subset of dimensions
  Description: Replace a dimension with a single member value or with a summary of its measure values
Dice
  Purpose: Focus attention on a subset of member values
  Description: Replace a dimension with a subset of members
Drill-down
  Purpose: Obtain more detail about a dimension
  Description: Navigate from a more general level to a more specific level
Roll-up
  Purpose: Summarize details about a dimension
  Description: Navigate from a more specific level to a more general level
Pivot
  Purpose: Present data in a different order
  Description: Rearrange the dimensions in a data cube
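The operators summarized above can be sketched on the 1/1/2006 values from the slice example. This is a minimal illustration, assuming the cube is held as a Python dictionary keyed by (location, product, date); the function names are hypothetical, and the roll-up shown only summarizes a whole dimension rather than navigating a multi-level hierarchy.

```python
from collections import defaultdict

# Cells for 1/1/2006 only: (location, product, date) -> sales measure.
PRODUCTS = ["Mono Laser", "Ink Jet", "Photo", "Portable"]
ROWS = {
    "California": [80, 110, 60, 25],
    "Utah": [40, 90, 50, 30],
    "Arizona": [70, 55, 60, 35],
    "Washington": [75, 85, 45, 45],
    "Colorado": [65, 45, 85, 60],
}
cube = {
    (loc, prod, "1/1/2006"): value
    for loc, values in ROWS.items()
    for prod, value in zip(PRODUCTS, values)
}

def slice_cube(cube, axis, member):
    """Slice: fix one dimension to a single member and drop that dimension."""
    return {
        key[:axis] + key[axis + 1:]: v
        for key, v in cube.items()
        if key[axis] == member
    }

def dice_cube(cube, axis, members):
    """Dice: keep only a subset of the members of one dimension."""
    return {key: v for key, v in cube.items() if key[axis] in members}

def roll_up(cube, axis):
    """Roll-up to the top level: sum the measure over one dimension."""
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[:axis] + key[axis + 1:]] += v
    return dict(totals)

def pivot(cube, order):
    """Pivot: rearrange the dimensions in the given order."""
    return {tuple(key[i] for i in order): v for key, v in cube.items()}

by_day = slice_cube(cube, axis=2, member="1/1/2006")  # keys: (location, product)
utah = dice_cube(by_day, axis=0, members={"Utah"})
by_location = roll_up(by_day, axis=1)                  # keys: (location,)
by_product_first = pivot(by_day, order=(1, 0))         # keys: (product, location)

print(utah[("Utah", "Ink Jet")])     # 90
print(by_location[("California",)])  # 275
```

A dice after a slice reproduces the single Utah row of the dice example; a drill-down would be the inverse of the roll-up and needs the detailed cells to still be available.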
16-23
Relational DBMS Support
- Data modeling
- Dimension representation
- GROUP BY extensions
- Materialized views and query rewriting
- Storage structures and optimization
16-24
Relational Data Modeling
- Dimension table: contains member values
- Fact table: contains measure values
- 1-M relationships from dimension to fact tables
- Grain: most detailed measure values stored
16-25
Star Schema Example
[Figure: star schema diagram. Tables and columns:]
- Sales (fact table): SalesNo, SalesUnits, SalesDollar, SalesCost
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
- Relationships to Sales: ItemSales, CustSales, TimeSales, StoreSales
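A star schema of this shape can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden reduction: it keeps only two of the dimension tables and a few of their columns, the foreign key columns StoreId and TimeNo in Sales are assumed from the StoreSales and TimeSales relationships, and all row values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold member values (columns trimmed for brevity).
CREATE TABLE Store   (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);
-- Fact table holds measures plus one foreign key per 1-M relationship.
CREATE TABLE Sales (
    SalesNo     INTEGER PRIMARY KEY,
    StoreId     INTEGER NOT NULL REFERENCES Store(StoreId),
    TimeNo      INTEGER NOT NULL REFERENCES TimeDim(TimeNo),
    SalesUnits  INTEGER,
    SalesDollar REAL
);
-- Made-up sample rows.
INSERT INTO Store   VALUES (1, 'CO', 'USA'), (2, 'WA', 'USA');
INSERT INTO TimeDim VALUES (10, 1, 2005), (11, 2, 2005);
INSERT INTO Sales   VALUES (100, 1, 10, 3, 30.0), (101, 1, 11, 2, 20.0),
                           (102, 2, 10, 5, 50.0);
""")

# A typical star join: fact rows joined to dimensions, then summarized.
rows = conn.execute("""
    SELECT StoreState, TimeYear, SUM(SalesDollar)
    FROM Sales
    JOIN Store   ON Sales.StoreId = Store.StoreId
    JOIN TimeDim ON Sales.TimeNo  = TimeDim.TimeNo
    GROUP BY StoreState, TimeYear
    ORDER BY StoreState
""").fetchall()
print(rows)  # [('CO', 2005, 50.0), ('WA', 2005, 50.0)]
```

The grain here is one row per sale; a coarser grain, such as one row per store per day, would shrink the fact table at the cost of detail.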
16-26
Constellation Schema
[Figure: constellation schema diagram. Tables and columns:]
- Sales (fact table): SalesNo, SalesUnits, SalesDollar, SalesCost
- Inventory (fact table): InvNo, InvQOH, InvCost, InvReturns
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
- Supplier: SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation
- Relationships to Sales: ItemSales, CustSales, TimeSales, StoreSales
- Relationships to Inventory: SuppInv, ItemInv, StoreInv, TimeInv
16-27
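Snowflake Schema Example
[Figure: snowflake schema diagram. Tables and columns:]
- Sales (fact table): SalesNo, SalesUnits, SalesDollar, SalesCost
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation
- Division: DivId, DivName, DivManager
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
- Relationships to Sales: ItemSales, CustSales, TimeSales, StoreSales
- Relationship from Division to Store: DivStore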
16-28
Handling M-N Relationships
- Source data may have M-N relationships, not 1-M relationships
- Adjust fact or dimension tables for a fixed number of exceptions
- More complex solutions to support M-N relationships with a variable number of connections
16-29
Time Representation
- Timestamp
- Time dimension table for organization specific calendar features
- Two fact tables for international operations
- Accumulating fact table for representation of multiple events
16-30
Level of Historical Integrity
- Primarily an issue for dimension updates
- Type I: overwrite old values
- Type II: version numbers for an unlimited history
- Type III: new columns for a limited history
16-31
Historical Integrity Example
- Type II representation
  Customer: CustId, VersionNo, CustName, CustPhone, CustStreet, CustCity, CustCityBegEffDate, CustCityEndEffDate, CustState, CustZip, CustNation
- Type III representation
  Customer: CustId, CustName, CustPhone, CustStreet, CustCityCurr, CustCityCurrBegEffDate, CustCityCurrEndEffDate, CustCityPrev, CustCityPrevBegEffDate, CustCityPrevEndEffDate, CustCityPast, CustCityPastBegEffDate, CustCityPastEndEffDate, CustState, CustZip, CustNation
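A Type II update against the representation above might look like the following sketch. It assumes rows are held as Python dictionaries, that a missing end date (None) marks the current version, and that only CustCity is versioned; the function name and the sample customer are hypothetical.

```python
from datetime import date

def change_city_type2(versions, cust_id, new_city, change_date):
    """Type II update: close the current version and append a new one."""
    current = max(
        (v for v in versions if v["CustId"] == cust_id),
        key=lambda v: v["VersionNo"],
    )
    # Close the old version rather than overwriting it (that would be Type I).
    current["CustCityEndEffDate"] = change_date
    new_version = dict(
        current,
        VersionNo=current["VersionNo"] + 1,
        CustCity=new_city,
        CustCityBegEffDate=change_date,
        CustCityEndEffDate=None,  # None marks the open-ended current version
    )
    versions.append(new_version)
    return new_version

# Made-up sample data.
customers = [{
    "CustId": 1, "VersionNo": 1, "CustName": "Example Customer",
    "CustCity": "Denver",
    "CustCityBegEffDate": date(2005, 1, 1), "CustCityEndEffDate": None,
}]
change_city_type2(customers, 1, "Seattle", date(2006, 6, 1))
print([(v["VersionNo"], v["CustCity"]) for v in customers])
# [(1, 'Denver'), (2, 'Seattle')]
```

Because every change adds a row, the history is unlimited; a Type III design would instead shift values through the Curr, Prev, and Past columns and lose anything older.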
16-32
Dimension Representation
- Star schema and variations lack dimension representation
- Explicit dimension representation important to data cube operations and optimization
- Proprietary extensions for dimension representation
- Represent levels, hierarchies, and constraints
16-33
Oracle Dimension Representation
- Levels: dimension components
- Hierarchies: may have multiple hierarchies
- Constraints: functional dependency relationships
16-34
CREATE DIMENSION Example

CREATE DIMENSION StoreDim
  LEVEL StoreId IS Store.StoreId
  LEVEL City    IS Store.StoreCity
  LEVEL State   IS Store.StoreState
  LEVEL Zip     IS Store.StoreZip
  LEVEL Nation  IS Store.StoreNation
  LEVEL DivId   IS Division.DivId
  HIERARCHY CityRollup (
    StoreId CHILD OF City CHILD OF State CHILD OF Nation )
  HIERARCHY ZipRollup (
    StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )
  HIERARCHY DivisionRollup (
    StoreId CHILD OF DivId
    JOIN KEY Store.DivId REFERENCES DivId )
  ATTRIBUTE DivId DETERMINES Division.DivName
  ATTRIBUTE DivId DETERMINES Division.DivManager ;
16-35
GROUP BY Extensions
- ROLLUP operator
- CUBE operator
- GROUPING SETS operator
- Other extensions
  - Ranking
  - Ratios
  - Moving summary values
16-36
CUBE Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear = 2005
  GROUP BY CUBE (StoreZip, TimeMonth);
16-37
ROLLUP Example

SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear BETWEEN 2005 AND 2006
  GROUP BY ROLLUP (TimeMonth, TimeYear);
16-38
GROUPING SETS Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND (StoreNation = 'USA' OR StoreNation = 'Canada')
    AND TimeYear = 2005
  GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ());
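The relationship between the three operators can be made concrete by listing the grouping sets each one expands to. The sketch below is plain Python rather than SQL, with hypothetical helper names and made-up rows: CUBE corresponds to every subset of the grouping columns, ROLLUP to every prefix of the column list, and a rolled-up column is marked with None in the same way SQL reports NULL.

```python
from itertools import combinations

def cube_sets(columns):
    """All subsets of the grouping columns: the grouping sets of CUBE."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

def rollup_sets(columns):
    """Prefixes of the column list: the grouping sets of ROLLUP."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

def aggregate(rows, columns, grouping_sets, measure):
    """Sum the measure per grouping set; None marks a rolled-up column."""
    result = {}
    for gset in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in columns)
            result[key] = result.get(key, 0) + row[measure]
    return result

# Made-up sample rows.
rows = [
    {"StoreZip": "80111", "TimeMonth": 1, "SalesDollar": 100},
    {"StoreZip": "80111", "TimeMonth": 2, "SalesDollar": 150},
    {"StoreZip": "98105", "TimeMonth": 1, "SalesDollar": 200},
]
cols = ["StoreZip", "TimeMonth"]

print(cube_sets(cols))
# [('StoreZip', 'TimeMonth'), ('StoreZip',), ('TimeMonth',), ()]
print(rollup_sets(cols))
# [('StoreZip', 'TimeMonth'), ('StoreZip',), ()]

totals = aggregate(rows, cols, cube_sets(cols), "SalesDollar")
print(totals[("80111", None)], totals[(None, 1)], totals[(None, None)])
# 250 300 450
```

The four sets printed for CUBE are the same four listed explicitly in the GROUPING SETS example, so for two columns the two queries request the same groupings.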
16-39
Variations of the Grouping Operators
- Partial cube
- Partial rollup
- Composite columns
- CUBE and ROLLUP inside a GROUPING SETS operation
16-40
Materialized Views
- Stored view
- Periodically refreshed with source data
- Usually contain summary data
- Fast query response for summary data
- Appropriate in query dominant environments
16-41
Materialized View Example

CREATE MATERIALIZED VIEW MV1
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE AS
  SELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1
    FROM Sales, Store, Time
    WHERE Sales.StoreId = Store.StoreId
      AND Sales.TimeNo = Time.TimeNo
      AND TimeYear > 2003
    GROUP BY StoreState, TimeYear;
16-42
Query Rewriting
- Substitution process
- Materialized view replaces references to fact and dimension tables in a query
- Query optimizer must evaluate whether the substitution will improve performance over the original query
- More complex than query modification process for traditional views
16-43
Query Rewriting Process
[Figure: QueryFD -> Rewrite -> QueryMV -> SQL Engine -> Results]
- QueryFD: query that references fact and dimension tables
- QueryMV: rewrite of QueryFD such that materialized views are substituted for fact and dimension tables whenever justified by expected performance improvements
16-44
Query Rewriting Matching
Row conditions: query conditions at least as restrictive as MV conditions
Grouping detail: query grouping columns at least as general as MV grouping columns
Grouping dependencies: query columns must match or be derivable by functional dependencies
Aggregate functions: query aggregate functions must match or be derivable from MV aggregate functions
16-45
Query Rewriting Example

-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId
    AND Sales.TimeNo = Time.TimeNo
    AND StoreNation IN ('USA','Canada')
    AND TimeYear = 2005
  GROUP BY StoreState, TimeYear;

-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
  FROM MV1, Store
  WHERE MV1.StoreState = Store.StoreState
    AND TimeYear = 2005
    AND StoreNation IN ('USA','Canada');
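The equivalence of the two queries can be checked on a small data set. The following sketch uses Python's sqlite3 module under stated assumptions: SQLite has no materialized views or automatic query rewriting, so MV1 is stood in for by an ordinary summary table built from the view's defining query, and both queries are run by hand. All rows are made up, and each state is assumed to belong to a single nation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE Time  (TimeNo INTEGER PRIMARY KEY, TimeYear INTEGER);
CREATE TABLE Sales (SalesNo INTEGER PRIMARY KEY, StoreId INTEGER, TimeNo INTEGER,
                    SalesDollar REAL);
-- Made-up rows: two stores share state CO; JAL lies outside the nation filter.
INSERT INTO Store VALUES (1,'CO','USA'), (2,'CO','USA'), (3,'BC','Canada'),
                         (4,'JAL','Mexico');
INSERT INTO Time  VALUES (1,2004), (2,2005);
INSERT INTO Sales VALUES (1,1,2,10), (2,2,2,20), (3,3,2,40), (4,4,2,80), (5,1,1,5);
-- Stand-in for MV1: a summary table filled once from the defining query.
CREATE TABLE MV1 AS
  SELECT StoreState, TimeYear, SUM(SalesDollar) AS SumDollar1
  FROM Sales, Store, Time
  WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo
    AND TimeYear > 2003
  GROUP BY StoreState, TimeYear;
""")

original = conn.execute("""
    SELECT StoreState, TimeYear, SUM(SalesDollar)
    FROM Sales, Store, Time
    WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo
      AND StoreNation IN ('USA','Canada') AND TimeYear = 2005
    GROUP BY StoreState, TimeYear
    ORDER BY StoreState
""").fetchall()

rewritten = conn.execute("""
    SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
    FROM MV1, Store
    WHERE MV1.StoreState = Store.StoreState
      AND TimeYear = 2005 AND StoreNation IN ('USA','Canada')
    ORDER BY MV1.StoreState
""").fetchall()

print(original)   # [('BC', 2005, 40.0), ('CO', 2005, 30.0)]
print(rewritten)  # [('BC', 2005, 40.0), ('CO', 2005, 30.0)]
```

DISTINCT matters in the rewrite: joining MV1 to Store on StoreState yields one row per store in the state, so CO would otherwise appear twice.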
16-46
Storage and Optimization Technologies
- MOLAP: direct storage and manipulation of data cubes
- ROLAP: relational extensions to support multidimensional data
- HOLAP: combine MOLAP and ROLAP storage engines
16-47
ROLAP Techniques
- Bitmap join indexes
- Star join optimization
- Query rewriting
- Summary storage advisors
- Parallel query execution
16-48
Maintaining a Data Warehouse
- Data sources
- Workflow representation
- Optimizing the refresh process
16-49
Data Sources
- Cooperative
  - Notification using triggers
  - Requires source system changes
- Logged
  - Readily available
  - Extraneous data in logs
- Queryable
  - Queries using timestamps
  - Requires timestamps in source data
- Snapshot
  - Periodic dumps of source data
  - Significant processing for difference operations
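The difference operation needed for snapshot sources can be sketched as a comparison of two dumps. This is a minimal illustration, assuming each snapshot fits in memory as a dictionary keyed by primary key; the function name and the sample rows are hypothetical, and a production difference operation over large dumps would more likely sort or hash on disk.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by primary key.

    Returns the rows to insert, update, and delete in order to turn
    the old snapshot into the new one.
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserts, updates, deletes

# Made-up snapshots of a store table: StoreId -> (StoreCity, StoreState).
monday = {1: ("Denver", "CO"), 2: ("Seattle", "WA"), 3: ("Tucson", "AZ")}
tuesday = {1: ("Boulder", "CO"), 2: ("Seattle", "WA"), 4: ("Provo", "UT")}

inserts, updates, deletes = snapshot_diff(monday, tuesday)
print(inserts)  # {4: ('Provo', 'UT')}
print(updates)  # {1: ('Boulder', 'CO')}
print(deletes)  # {3: ('Tucson', 'AZ')}
```

Every row of both dumps has to be examined even when little has changed, which is where the significant processing cost comes from; queryable sources avoid this by filtering on timestamps.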
16-50
Maintenance Workflow
[Figure: maintenance workflow with three phases (preparation, integration, update). Tasks shown: extraction, transportation, cleaning, auditing, merging, auditing, propagation, notification.]
16-51
Data Quality Problems
- Multiple identifiers
- Multiple field names
- Different units
- Missing values
- Orphaned values
- Multipurpose fields
- Conflicting data
- Different update times
16-52
ETL Tools
- Extraction, Transformation, and Loading
- Specification based
- Eliminate custom coding
- Third party and DBMS based tools
16-53
Refresh Processing
[Figure: refresh processing. Labels: internal data sources, accounting, external data sources, unknown processes, ETL tools, data warehouse, valid time lag, load time lag, fact and dimension changes, primarily dimension changes.]
16-54
Determining the Refresh Frequency
- Maximize net refresh benefit
  - Value of data timeliness
  - Cost of refresh
- Satisfy data warehouse and source system constraints
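One way to read the objective above is as a search over candidate frequencies: keep those that satisfy the constraints, then pick the one whose value of timeliness minus cost of refresh is largest. The sketch below illustrates only that reading; the function name, the candidate frequencies, the value and cost figures, and the constraint are all hypothetical.

```python
def best_refresh_frequency(candidates, value, cost, feasible):
    """Return the feasible candidate with the highest net refresh benefit."""
    allowed = [f for f in candidates if feasible(f)]
    return max(allowed, key=lambda f: value(f) - cost(f))

# Hypothetical figures, keyed by refreshes per week.
VALUE = {1: 100, 7: 400, 14: 450, 28: 470}  # value of data timeliness
COST = {1: 20, 7: 140, 14: 280, 28: 560}    # cost of refresh

def feasible(freq):
    """Hypothetical source access constraint: at most 14 refreshes per week."""
    return freq <= 14

best = best_refresh_frequency([1, 7, 14, 28], VALUE.get, COST.get, feasible)
print(best)  # 7
```

With these made-up numbers the value of timeliness flattens out while cost keeps growing, so the daily refresh (net benefit 260) beats both the weekly one (80) and the twice-daily one (170).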
16-55
Refresh Constraints
Source access: restrictions on time and frequency
Integration: restrictions that require concurrent reconciliation
Completeness/consistency: loading in the same refresh period
Availability: load scheduling restrictions due to storage capacity, online availability, and server usage
16-56
Summary
- Data warehouse requirements differ from transaction processing.
- Architecture choice is important.
- Multidimensional data model is intuitive.
- Relational representation and storage techniques are significant.
- Maintaining a data warehouse is an important, operational problem.