Upload
stella-sims
View
237
Download
0
Tags:
Embed Size (px)
Citation preview
Chapter 16Chapter 16Data Warehouse Technology and Management
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Outline Outline
Basic concepts and characteristicsBusiness architectures and applicationsData cube concepts and operatorsRelational DBMS featuresPopulating a data warehouse
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Comparison of Processing Comparison of Processing EnvironmentsEnvironmentsTransaction processing
– Uses operational databases– Short-term decisions: fulfill orders, resolve
complaints, provide staffingDecision support processing
– Uses integrated and summarized data– Medium and long-term decisions: capacity
planning, store locations, new lines of business
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Warehouse Definition and Data Warehouse Definition and CharacteristicsCharacteristics
A central repository for summarized and integrated data from operational databases and external data sources
Key Characteristics– Subject-oriented– Integrated– Time-variant– Nonvolatile
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data ComparisonData ComparisonCharacteristic Operational
Database Data Warehouse
Currency Current Historical
Detail level Individual Individual and Summary
Orientation Process orientation
Subject orientation
Number of records processed
Few Thousands
Normalization level Mostly normalized
Mostly denormalized
Update level Volatile Nonvolatile (refreshed)
Data model Relational Multidimensional
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Business Architectures and Business Architectures and ApplicationsApplicationsData warehouse projectsTop-down architecturesBottom-up architectureApplications and data mining
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Warehouse ProjectsData Warehouse Projects
Large efforts with much coordination across departments
Enterprise data model– Important artifact of data warehouse project– Structure of data model– Meta data for data transformation
Top-down vs. bottom-up business architectures
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Two Tier ArchitectureTwo Tier Architecture
Data warehouse
Operationaldatabase
Operationaldatabase
Externaldata source
EDM
Summarizeddata
Transformationprocess
Data warehouseserver
User departments
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Three Tier ArchitectureThree Tier Architecture
Data warehouse
Operationaldatabase
Operationaldatabase
Externaldata source
EDM
Summarizeddata
Transformationprocess
Data warehouseserver
Userdepartments
Data mart
Data mart
Data mart tier
Extractionprocess
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Bottom-up ArchitectureBottom-up Architecture
Operationaldatabase
Operationaldatabase
Externaldata source
Transformationprocess
Userdepartments
Data mart
Data mart
Data mart tier
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
ApplicationsApplications
Industry Key Applications Airline Yield management,
route assessment Telecommunications Customer retention,
network design Insurance Risk assessment, product
design, fraud detection Retail Target marketing,
supply-chain management
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data MiningData Mining
Discover significant, implicit patterns– Target promotions– Change mix and collocation of items
Requires large volumes of transaction dataImportant application for data warehouses
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Cube Concepts and Data Cube Concepts and OperatorsOperatorsBasicsDimension and measure detailsOperators
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Cube BasicsData Cube Basics
Multidimensional arrangement of dataUsers think about decision support data as
data cubesTerminology
– Dimension: subject label for a row or column– Member: value of dimension– Measure: quantitative data stored in cells
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Cube ExampleData Cube Example
Product
Soda Diet Orange Lime soda soda soda
Lo
cati
on California
Utah
Arizona
Washington
Colorado Time 1/1/2003 1/2/2003
….. 12/31/2003
80 110 60 25
40 90 50 30
70 55 60 35
75 85 45 45
65 45 85 60
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Dimension and Measure Dimension and Measure DetailsDetailsDimensions
– Hierarchies: members can have sub members– Sparsity: many cells do not have data
Measures– Derived measures– Multiple measures in cells
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Time Series DataTime Series Data
Common data type in trend analysisReduce dimensionality using time seriesTime series properties
– Data type– Start date– Calendar– Periodicity– Conversion
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Slice OperatorSlice Operator
Focus on a subset of dimensions Set dimension to specific value: 1/1/2003
Location Product Soda Diet soda Lime soda Orange soda
California 80 110 60 25 Utah 40 90 50 30 Arizona 70 55 60 35 Washington 75 85 45 45 Colorado 65 45 85 60
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Dice OperatorDice OperatorFocus on a subset of member valuesReplace dimension with a subset of valuesDice operation often follows a slice
operation
PRODUCT
Soda Diet Orange Lime soda soda soda
LO
CA
TIO
N
Utah 40 90 50 30
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Other OperatorsOther Operators
Operators for hierarchical dimensions– Drill-down: add detail to a dimension– Roll-up: remove detail from a dimension– Recalculate measure values
Pivot: rearrange dimensions
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Operator SummaryOperator Summary
Operator Purpose Description
Slice Focus attention on a
subset of dimensions Replace a dimension with a single member value or with a summary of
its measure values Dice Focus attention on a
subset of member values Replace a dimension with a subset of members
Drill-down Obtain more detail about a dimension
Navigate from a more general level to a more specific level
Roll-up Summarize details about
a dimension Navigate from a more specific level to a more general level
Pivot Present data in a different order
Rearrange the dimensions in a data
cube
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Relational DBMS SupportRelational DBMS Support
Data modelingDimension representationGROUP BY extensionsMaterialized views and query rewritingStorage structures and optimization
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Relational Data ModelingRelational Data Modeling
Dimension table: contains member valuesFact table: contains measure values1-M relationships from dimension to fact
tablesGrain: most detailed measure values stored
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Star Schema ExampleStar Schema Example
CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation
StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNationDivIdDivNameDivManager
SalesSalesNoSalesUnitsSalesDollarSalesCost
ItemItemIdItemNameItemUnitPriceItemBrandItemCategory
TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear
ItemSales
CustSales
TimeSales
StoreSales
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Constellation Schema ExampleConstellation Schema Example
CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation
StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNationDivIdDivNameDivManager
SalesSalesNoSalesUnitsSalesDollarSalesCost
ItemItemIdItemNameItemUnitPriceItemBrandItemCategory
TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear
ItemSales
CustSales
TimeSales
StoreSales
InventoryInvNoInvQOHInvCostInvReturns
SupplierSuppIdSuppNameSuppCitySuppStateSuppZipSuppNation
SuppInv
ItemInv
StoreInv
TimeInv
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Snowflake Schema ExampleSnowflake Schema Example
CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation
StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNation
SalesSalesNoSalesUnitsSalesDollarSalesCost
ItemItemIdItemNameItemUnitPriceItemBrandItemCategory
TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear
ItemSales
CustSales
TimeSales
StoreSales
DivisionDivIdDivNameDivManager
DivStore
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Handling M-N RelationshipsHandling M-N Relationships
Source data may have M-N relationships, not 1-M relationships
Adjust fact or dimension tables for a fixed number of exceptions
More complex solutions to support M-N relationships with a variable number of connections
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Dimension RepresentationDimension Representation
Star schema and variations lack dimension representation
Explicit dimension representation important to data cube operations and optimization
Proprietary extensions for dimension representation
Represent levels, hierarchies, and constraints
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Oracle Dimension RepresentationOracle Dimension Representation
Levels: dimension componentsHierarchies: may have multiple hierarchiesConstraints: functional dependency
relationships
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
CREATE DIMENSION ExampleCREATE DIMENSION ExampleCREATE DIMENSION StoreDim LEVEL StoreId IS Store.StoreId LEVEL City IS Store.StoreCity LEVEL State IS Store.StoreState LEVEL Zip IS Store.StoreZip LEVEL Nation IS Store.StoreNation LEVEL DivId IS Division.DivId HIERARCHY CityRollup ( StoreId CHILD OF City CHILD OF State CHILD OF Nation )HIERARCHY ZipRollup ( StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )HIERARCHY DivisionRollup ( StoreId CHILD OF DivId JOIN KEY Store.DivId REFERENCES DivId )ATTRIBUTE DivId DETERMINES Division.DivNameATTRIBUTE DivId DETERMINES Division.DivManager ;
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
GROUP BY ExtensionsGROUP BY Extensions
ROLLUP operatorCUBE operatorGROUPING SETS operatorOther extensions
– Ranking– Ratios– Moving summary values
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
CUBE ExampleCUBE Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2002 GROUP BY CUBE (StoreZip, TimeMonth)
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
ROLLUP ExampleROLLUP Example
SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear BETWEEN 2002 AND 2003 GROUP BY ROLLUP (TimeMonth,TimeYear);
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
GROUPING SETS ExampleGROUPING SETS Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2002 GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ());
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Variations of the Grouping Variations of the Grouping OperatorsOperatorsPartial cubePartial rollupComposite columnsCUBE and ROLLUP inside a
GROUPIING SETS operation
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Materialized ViewsMaterialized Views
Stored viewPeriodically refreshed with source dataUsually contain summary dataFast query response for summary dataAppropriate in query dominant
environments
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Materialized View ExampleMaterialized View Example
CREATE MATERIALIZED VIEW MV1BUILD IMMEDIATEREFRESH COMPLETE ON DEMANDENABLE QUERY REWRITE ASSELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1 FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND TimeYear > 2000 GROUP BY StoreState, TimeYear;
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Query RewritingQuery Rewriting
Substitution processMaterialized view replaces references to
fact and dimension tables in a query Query optimizer must evaluate whether the
substitution will improve performance over the original query
More complex than query modification process for traditional views
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Query Rewriting ExampleQuery Rewriting Example-- Data warehouse querySELECT StoreState, TimeYear, SUM(SalesDollar) FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND StoreNation IN ('USA','Canada') AND TimeYear = 2002 GROUP BY StoreState, TimeYear;-- Query Rewrite: replace Sales and Time tables with MV1SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1FROM MV1, StoreWHERE MV1.StoreState = Store.StoreState AND TimeYear = 2002 AND StoreNation IN ('USA','Canada');
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Storage and Optimization Storage and Optimization TechnologiesTechnologiesMOLAP: direct storage and manipulation
of data cubesROLAP: relational extensions to support
multidimensional dataHOLAP: combine MOLAP and ROLAP
storage engines
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
ROLAP TechniquesROLAP Techniques
Bitmap join indexesStar join optimizationQuery rewritingSummary storage advisorsParallel query execution
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Populating a Data WarehousePopulating a Data Warehouse
Data sourcesWorkflow representationOptimizing the refresh process
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data SourcesData Sources
CooperativeLoggedQueryableSnapshot
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Maintenance WorkflowMaintenance Workflow
PreparationPhase
IntegrationPhase
UpdatePhase
Propagation
Notification
Extraction
Cleaning
Auditing
Transportation
Merging
Auditing
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Data Quality ProblemsData Quality Problems
Multiple identifiersMultiple field namesDifferent unitsMissing valuesOrphaned valuesMultipurpose fieldsConflicting dataDifferent update times
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
ETL ToolsETL Tools
Extraction, Transformation, and LoadingSpecification basedEliminate custom codingThird party and DBMS based tools
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Refresh OptimizationRefresh Optimization
Accounting
UnknownProcesses
ExternalData Sources
InternalData Sources
DataWarehouse
ETLTools
Valid Time Lag
Load Time Lag
Fact andDimensionChanges
PrimarilyDimensionChanges
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Determining the Refresh Determining the Refresh FrequencyFrequencyMaximize net refresh benefitValue of data timelinessCost of refreshSatisfy data warehouse and source system
constraints
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
Determining the Level of Historical Determining the Level of Historical IntegrityIntegrityPrimarily an issue for dimension updatesType I: overwrite old valuesType II: version numbers for an unlimited
historyType III: new columns for a limited
history
McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.
SummarySummary
Data warehouse requirements differ from transaction processing.
Architecture choice is important.Multidimensional data model is intuitiveRelational representation and storage
techniques are significant.Maintaining a data warehouse is an
important, operational problem.