1
Dimensional Data Modeling
Module 6
2
Course Agenda
• Rationale for dimensional modeling
• Dimensional modeling basics
• Dimensional modeling details
• Fact table details
• Dimension table details
• Design process
• Aggregate schemas
• Multiple fact tables
• Architected data marts
3
Rationale for Dimensional Modeling
4
The Business Value Chain
• A series of interrelated business processes that contribute to increased product value for the customer and to profit for the enterprise (Porter, 1985)
[Diagram: value chain processes: Operations, Sales and Marketing, Customer Services, Product Development]
5
Drive to Compete
• Businesses constantly strive to optimize each process in the value chain
• Optimization requires measuring and analyzing the effectiveness of each process as well as of the value chain as a whole
6
The Role of Information Technology
• Process optimization
  • Supported by on-line transaction processing (OLTP) systems
• Measuring and analyzing processes
  • Supported by 'analytic' systems
  • Data warehouse
7
Example OLTP Systems
• Manufacturing and Process Control
• Sales Order Entry and Campaign Management
• Customer Support and Relationship Management
• Shipping and Inventory Management
8
OLTP Systems & Business Events
• Events are the heart of every business
  • Book an order
  • Print a pick list
  • Record a cash withdrawal
  • Post a payment
• Event detail is collected by OLTP systems
  • Atomic focus
  • Transaction consistency
9
OLTP System Reporting
• OLTP systems answer event-oriented questions well
  • Run invoices
  • Print ledger
  • Pull up customer detail
• Operational reporting
  • Focused on detail
  • Predictable requirements and query patterns
  • Does not reveal the overall performance of a process
10
OLTP Design Characteristics
• Focus of OLTP design
  • Individual data elements
  • Data relationships
• Design goals
  • Accurately model the business
  • Remove redundancy
11
OLTP Design Shortcomings
• Complex
• Unfamiliar to business people
• Incomplete history
• Slow query performance
12
Emergence of Dimensional Model
• Logical modeling technique
  • For designing relational database structures
• Addresses OLTP design shortcomings
  • For use in analytic systems
• First developed in the early 1980s
  • Packaged goods industry
• Popularized by Ralph Kimball, PhD
  • 1996 book: 'The Data Warehouse Toolkit'
14
Dimensional Modeling Basics
15
Sample Value Chain Analysis
• Process-oriented business questions
  • "I need to see overall gross margin by category"
  • "How do inventory levels compare with sales by product and warehouse?"
  • "What are outstanding receivables by G/L account?"
  • "What is the return rate for each supplier?"
16
Measurement Focus
• Process-oriented business measures
  • Gross margin
  • Inventory levels, sales
  • Receivables
  • Return rate
17
Process Measurement
• Measures
  • Metrics or indicators by which people evaluate a business process
  • Referred to as “Facts”
• Examples
  • Margin
  • Inventory Amount
  • Sales Dollars
  • Receivable Dollars
  • Return Rate

Coffee Maker Fulfillment Report (facts: Units Sold, Units Shipped, % Shipped)

Brand           Product                Units Sold  Units Shipped  % Shipped
Captain Coffee  Standard Coffee Maker  5,000       3,800          76%
                Thermal Coffee Maker   2,400       1,632          68%
                Deluxe Coffee Maker    2,073       1,658          80%
                All Products           9,473       7,090          75%
18
Perspective Focus
• Process-oriented business perspectives
  • Category
  • Product, warehouse
  • G/L account
  • Supplier
19
Process Perspectives
• Dimensions
  • The parameters by which measures are viewed
  • Used to break out, filter or roll up measures
  • Often found after the word “by” in a business question
  • Descriptive business terms
• Examples
  • Product
  • Warehouse
  • Customer
  • Supplier

Coffee Maker Fulfillment Report (dimensions: Brand, Product)

Brand           Product                Units Sold  Units Shipped  % Shipped
Captain Coffee  Standard Coffee Maker  5,000       3,800          76%
                Thermal Coffee Maker   2,400       1,632          68%
                Deluxe Coffee Maker    2,073       1,658          80%
                All Products           9,473       7,090          75%
20
Dimensional Model
• Definition
  • Logical data model used to represent the measures and dimensions that pertain to one or more business subject areas
  • Dimensional Model = Star Schema
• Serves as the basis for the design of a relational database schema
• Can easily translate into a multi-dimensional database design if required
• Overcomes OLTP design shortcomings
21
Dimensional Model Advantages
• Understandable
• Systematically represents history
• Reliable join paths
• High performance query
• Enterprise scalability
22
Schema Simplicity
• Fewer tables
  • Denormalized
  • Consolidated
• Dimensional
  • Familiar to users
  • Facts go in the fact tables
  • Dimensions in dimension tables
• Increases understandability
[Diagram: star schema with a central Facts table joined to Store, Time and Product dimensions]
23
Data Familiarity
• Adding business context
  • Single source field
  • Expanded into parts
  • Decoded into business terms
  • Add special indicators and flags
  • e.g. time dimension
• Increases understandability
[Diagram: source field ord_date expanded into a Time Dimension: year, quarter, month, date, day of the week, holiday flag]
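The expansion of a single source date into time dimension attributes can be sketched as follows. This is a minimal illustration, not a prescribed load routine: the function name `time_dimension_row`, the `HOLIDAYS` set and its single entry are assumptions made for the example.

```python
from datetime import date

# Hypothetical holiday calendar; a real warehouse would load this from a reference table.
HOLIDAYS = {date(1997, 1, 1)}

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def time_dimension_row(ord_date: date) -> dict:
    """Expand one source date (e.g. ord_date) into descriptive time attributes."""
    return {
        "year": ord_date.year,
        "quarter": f"Q{(ord_date.month - 1) // 3 + 1}",
        "month": MONTHS[ord_date.month - 1],
        "date": ord_date.isoformat(),
        "day_of_week": DAYS[ord_date.weekday()],   # weekday(): Monday is 0
        "holiday_flag": ord_date in HOLIDAYS,
    }


row = time_dimension_row(date(1997, 1, 15))
print(row)
```

Business users can then filter or group on terms such as quarter or day of week without writing date arithmetic in every query.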
24
Representing History
• Time dimension
  • Part of every star schema
  • Marks the date when the facts (process measurements) occurred
  • Allows the schema to easily add and query data over time
  • Especially useful for performing comparison queries
[Diagram: star schema (Store, Product, Time, Facts) with the Time Dimension highlighted: year, quarter, month, date, day of the week, holiday flag]
25
Fewer Join Paths
• Star schema joins
  • Defined during schema design, not at runtime
  • Business people can easily understand these relationships
  • One-to-many relations between dimensions and facts
  • Referential integrity always enforced
26
High Performance Design
• Fewer joins mean less 'expensive' queries
• Deterministic query patterns
• Star schema query optimization supported by all major RDBMS vendors
27
Subject Area Models
• Subject area E/R models
• Subject area dimensional models
[Diagram labels: OLTP systems (Manufacturing and Process Control; Sales Order Entry and Campaign Management; Customer Support and Relationship Management; Shipping and Inventory Management) and value chain processes (Operations, Sales and Marketing, Customer Services, Product Development)]
28
Enterprise Models
• Enterprise scope E/R model
• Enterprise scope dimensional model
29
Exercise 1
• Scenario
  • Industry: Automobile manufacturing
  • Company: Millennium Motors
  • Value chain focus: Sales
• Sample business questions:
  • What are the top 10 selling car models this month?
  • How do this month's top 10 selling models compare to the top 10 over the last six months?
  • Show me dealer sales by region by model by day
  • What is the total number of cars sold by month by dealer by state?
• List facts and dimensions
30
Exercise 1 - worksheet
31
Exercise 1 Solution
• Facts
  • Sales revenue
  • Quantity sold
• Dimensions
  • Model name
  • Month
  • Dealer name
  • Region
  • State
  • Date
33
Dimensional Design Details
34
Star Schema Dimension Tables
• Dimension tables
  • Store dimension values
  • Textual content
  • Dimension tables are usually referred to simply as 'dimensions'
  • Spend extra effort to add dimensional attributes
[Diagram: three dimension tables]
35
Dimension Keys
• Synthetic keys
  • Each table is assigned a unique primary key, specifically generated for the data warehouse
  • Primary keys from source systems may be present in the dimension, but are not used as primary keys in the star schema
[Diagram: three dimension tables, each with a key column]
36
Dimension Columns
• Dimension attributes
  • Specify the way in which measures are viewed: rolled up, broken out or summarized
  • Often follow the word “by” as in “Show me Sales by Region and Quarter”
  • Frequently referred to as 'Dimensions'
[Diagram: three dimension tables, each with a key column and several attribute columns]
37
Star Schema Fact Table
• Process measures
  • Start by assigning one fact table per business subject area
  • Fact tables store the process measures (aka facts)
  • Compared to dimension tables, fact tables usually have a very large number of rows
[Diagram: fact table with columns fact1, fact2, fact3]
38
Fact Table Primary Key
• Every fact table
  • Multi-part primary key added
  • Made up of foreign keys referencing dimensions
[Diagram: fact table with three key columns followed by fact1, fact2, fact3]
39
Fact Table Sparsity
• Sparsity
  • Term used to describe the very common situation where a fact table does not contain a row for every combination of every dimension table row for a given time period
  • Because fact tables contain a very small percentage of all possible combinations, they are said to be "sparsely populated" or "sparse"
40
Fact Table Grain
• Grain
  • The level of detail represented by a row in the fact table
  • Must be identified early
  • Cause of greatest confusion during design process
• Example
  • Each row in the fact table represents the daily item sales total
41
Sparsity Example
• Assume
  • 5,000 rows in 'dealer' dimension
  • 50 rows in 'model' dimension
• If all dealers sold all models every day:
  • 5,000 * 50 = 250,000 sales every day
  • 91,250,000 sales every year
  • Assuming only one unit of each model sold at every dealer!
• Sparsity
  • Means that only a small fraction of the total possible 250,000 will be sold on a given day
  • Generally, only record sales, not zeroes, in the fact table
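The arithmetic behind the sparsity example can be checked with a short sketch. The dimension sizes come from the slide; the figure for rows actually recorded per day is a made-up assumption used only to show how density would be computed.

```python
dealers = 5_000          # rows in the 'dealer' dimension (from the slide)
models = 50              # rows in the 'model' dimension (from the slide)
days_per_year = 365

# Upper bound: every dealer sells every model every day.
possible_rows_per_day = dealers * models
possible_rows_per_year = possible_rows_per_day * days_per_year

# Hypothetical number of dealer/model combinations that really had a sale today.
actual_rows_per_day = 10_000
density = actual_rows_per_day / possible_rows_per_day

print(possible_rows_per_day, possible_rows_per_year, f"{density:.0%}")
```

Under that assumption only 4% of the possible combinations would be stored for the day, which is what "sparse" means here.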
42
Designing a Star Schema
• Five initial design steps
  • Based on Kimball's six steps
  • Start designing in order
  • Re-visit and adjust over project life
43
Step One
• Identify fact table
  • Start by naming the fact table with the name of the business subject area
44
Step Two
• Identify fact table grain
  • Describe what a row in the fact table represents, in business terms
45
Step Three
• Identify dimensions
46
Step Four
• Select facts
47
Step Five
• Identify dimensional attributes
48
Exercise 2
• Scenario
  • Industry: Automobile manufacturing
  • Company: Millennium Motors
  • Value chain focus: Sales
• Sample business questions:
  • What are the top 10 selling car models this month?
  • How do this month's top 10 selling models compare to the top 10 over the last six months?
  • Show me dealer sales by region by model by day
  • What is the total number of cars sold by month by dealer by state?
49
Exercise 2 - continued
• Using these source data elements, design a star schema that answers the proposed business questions
  • Sales revenue
  • Quantity sold
  • Model name
  • Dealer name
  • Dealer city
  • Product line
  • Region where sold
  • State
  • Vehicle category
  • Month
  • Date of sales
50
Exercise 2 - sample data
51
Exercise 2 - worksheet
52
Exercise 2 - solution
• Step 1 - Fact table name: 'Sales facts'
• Step 2 - Fact table grain: Every row in the sales facts table is a summary of car model sales for that day at a single dealer
• Step 3 - Dimensions: Time, Model, Dealer
• Step 4 - Facts: Total revenue, Quantity sold
• Step 5 - Dimensional attributes: See next page
53
Exercise 2 - Dimensional Model
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
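As a rough illustration, the Exercise 2 model can be built in an in-memory SQLite database and used to answer "total number of cars sold by month by dealer by state". The table names (`time_dim`, `model_dim`, `dealer_dim`, `sales_facts`), the model rows and all fact rows are assumptions for this sketch; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT, date TEXT);
CREATE TABLE model_dim  (model_key INTEGER PRIMARY KEY, category TEXT, line TEXT, model TEXT);
CREATE TABLE dealer_dim (dealer_key INTEGER PRIMARY KEY, region TEXT, state TEXT, city TEXT, dealer TEXT);
CREATE TABLE sales_facts (
    model_key  INTEGER REFERENCES model_dim(model_key),
    dealer_key INTEGER REFERENCES dealer_dim(dealer_key),
    time_key   INTEGER REFERENCES time_dim(time_key),
    revenue    REAL,
    quantity   INTEGER,
    PRIMARY KEY (model_key, dealer_key, time_key)  -- multi-part key made of foreign keys
);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
    (1, 1997, "Q1", "January", "1997-01-15"),
    (2, 1997, "Q1", "January", "1997-01-16"),
    (150, 1997, "Q2", "April", "1997-04-01"),
])
conn.executemany("INSERT INTO model_dim VALUES (?, ?, ?, ?)", [
    (1, "Sedan", "Family", "M100"),
    (2, "Truck", "Utility", "T200"),
])
conn.executemany("INSERT INTO dealer_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Northeast", "Massachusetts", "Boston", "Honest Ted's"),
    (3, "Southwest", "Arizona", "Tucson", "Wright Motors"),
])
conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 40000.0, 2),
    (2, 1, 1, 30000.0, 1),
    (1, 1, 2, 20000.0, 1),
    (1, 3, 1, 20000.0, 1),
    (2, 3, 150, 60000.0, 2),
])

# Each dimension is one join away from the fact table; the "by" words become GROUP BY columns.
rows = conn.execute("""
    SELECT t.month, d.state, d.dealer, SUM(f.quantity) AS cars_sold
    FROM sales_facts f
    JOIN time_dim t   ON t.time_key = f.time_key
    JOIN dealer_dim d ON d.dealer_key = f.dealer_key
    GROUP BY t.month, d.state, d.dealer
    ORDER BY t.month, d.state, d.dealer
""").fetchall()

for row in rows:
    print(row)
```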
55
Fact Table Details
56
Example Fact Table
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
57
Example Fact Table Records
Sales Facts (primary key: time_key, model_key, dealer_key; facts: revenue, quantity)

time_key  model_key  dealer_key  revenue    quantity
1         1          1           75840.27   2
1         2          1           152260.37  3
1         3          1           28360.15   1
1         4          1           132675.22  4
1         5          1           43789.45   1
1         1          2           35678.98   1
1         3          2           57864.78   2
1         5          2           92876.67   2
58
Facts
• Fully additive
  • Can be summed across any and all dimensions
  • Stored in fact table
  • Examples: revenue, quantity
59
Example: Additive Facts
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
60
Facts
• Semi-additive
  • Can be summed across most dimensions but not all
  • Examples: Inventory quantities, account balances, or personnel counts
  • Anything that measures a “level”
  • Must be careful with ad-hoc reporting
  • Often aggregated across the “forbidden dimension” by averaging
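A small sketch of the semi-additive behavior, using made-up inventory snapshots: units on hand can be summed across dealers for one date, but across the time dimension they are averaged rather than summed. The data and the helper name `inventory_by_date` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily inventory snapshots: (date, dealer, units on hand).
inventory_facts = [
    ("1997-01-15", "Dealer A", 10),
    ("1997-01-15", "Dealer B", 20),
    ("1997-01-16", "Dealer A", 12),
    ("1997-01-16", "Dealer B", 18),
]


def inventory_by_date(facts):
    """Summing across the dealer dimension is valid for a single date."""
    totals = defaultdict(int)
    for snapshot_date, _dealer, units in facts:
        totals[snapshot_date] += units
    return dict(totals)


daily_totals = inventory_by_date(inventory_facts)

# Across time (the "forbidden dimension") take an average of the daily levels.
average_inventory = mean(daily_totals.values())

# Summing every row double counts stock that simply sat on the lot for two days.
misleading_total = sum(units for _, _, units in inventory_facts)

print(daily_totals, average_inventory, misleading_total)
```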
61
Example: Semi-additive Facts
Sales Facts: model_key, dealer_key, time_key, inventory
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
62
Facts
• Non-additive
  • Cannot be summed across any dimension
  • All ratios are non-additive
  • Break down to fully additive components, store them in fact table
63
Example: Non-Additive Facts
• Margin_rate is non-additive
• Margin_rate = margin_amt/revenue
Sales Facts: model_key, dealer_key, time_key, revenue, margin_amt
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
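The reason for storing the additive components can be shown with a short sketch. The two fact rows below are invented for illustration; the formula margin_rate = margin_amt / revenue is the one given on the slide.

```python
# Hypothetical fact rows holding only the additive components.
sales_facts = [
    {"model": "A", "revenue": 100.0, "margin_amt": 20.0},   # row-level rate 0.20
    {"model": "B", "revenue": 300.0, "margin_amt": 30.0},   # row-level rate 0.10
]


def margin_rate(rows):
    """Sum the additive components first, then divide."""
    revenue = sum(r["revenue"] for r in rows)
    margin = sum(r["margin_amt"] for r in rows)
    return margin / revenue


overall_rate = margin_rate(sales_facts)

# Summing (or even averaging) the stored ratios gives a different, wrong answer.
sum_of_row_rates = sum(r["margin_amt"] / r["revenue"] for r in sales_facts)
average_of_row_rates = sum_of_row_rates / len(sales_facts)

print(overall_rate, sum_of_row_rates, average_of_row_rates)
```

Here the correct overall rate is 50 / 400, while the row-level rates add up to roughly 0.30 and average to roughly 0.15.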
64
Unit Amounts
• Unit price, unit cost, etc.
  • Are numeric, but not measures
  • Store the extended amounts, which are additive
  • Unit amounts may be useful as dimensions for “price point analysis”
  • May store unit values to save space
65
Factless Fact Table
• A fact table with no measures in it
  • Nothing to measure...
  • ...except the convergence of dimensional attributes
  • Sometimes store a “1” for convenience
  • Examples: Attendance, Customer Assignments, Coverage
67
Dimension Table Details
68
Example Dimension Tables
Dealer: dealer_key, region, state, city, dealer
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
69
Example Dimension Table Records
Time Dimension (synthetic key: time_key; attributes: year, quarter, month, date)

time_key  year  quarter  month    date
1         1997  Q1       January  1/15/97
2         1997  Q1       January  1/16/97
3         1997  Q1       January  1/17/97
150       1997  Q2       April    4/1/97
777       1998  Q4       October  10/13/98
70
Example Dimension Table Records
Dealer Dimension (synthetic key: dealer_key; attributes: region, state, city, dealer)

dealer_key  region     state          city       dealer
1           Northeast  Massachusetts  Boston     Honest Ted's
2           Northeast  Massachusetts  Boston     Stoller Co.
3           Southwest  Arizona        Tucson     Wright Motors
12          Southwest  California     San Diego  American
245         Central    Illinois       Chicago    Lugwig Motors
71
Dimension Tables
• Characteristics
  • Hold the dimensional attributes
  • Usually have a large number of attributes (“wide”)
  • Add flags and indicators that make it easy to perform specific types of reports
  • Have a small number of rows in comparison to fact tables (most of the time)
72
Don’t Normalize Dimensions
• Saves very little space
• Impacts performance
• Can confuse matters when multiple hierarchies exist
• A star schema with normalized dimensions is called a "snowflake schema"
• Usually advocated by software vendors whose products require snowflake for performance
73
Example Snowflake Schema
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, model, line_key
Line: line_key, line, category_key
Category: category_key, category, brand_key
Brand: brand_key, brand
Day: date_key, date, month_key
Month: month_key, month, quarter_key
Quarter: quarter_key, quarter, year_key
Year: year_key, year
Dealer: dealer_key, dealer, city_key
City: city_key, city, state_key
State: state_key, state, region_key
Region: region_key, region
74
Slowly Changing Dimensions
• Dimension source data may change over time
• Relative to fact tables, dimension records change slowly
• Allows dimensions to have multiple 'profiles' over time to maintain history
• Each profile is a separate record in a dimension table
75
Slowly Changing Dimension Example
• Example: A woman gets married
• Possible changes to customer dimension
  • Last Name
  • Marriage Status
  • Address
  • Household Income
• Existing facts need to remain associated with her single profile
• New facts need to be associated with her married profile
76
Slowly Changing Dimension Types
• Three types of slowly changing dimensions
  • Type 1
    • Updates existing record with modifications
    • Does not maintain history
  • Type 2
    • Adds new record
    • Does maintain history
    • Maintains old record
  • Type 3
    • Keeps old and new values in the existing row
    • Requires a design change
77
Designing Loads to Handle SCD
• Design and implementation guidelines
  • Gather SCD requirements when designing data mapping and loading
  • SCD needs to be defined and implemented at the dimensional attribute level
  • Each column in a dimension table needs to be identified as a Type 1 or a Type 2 SCD
  • If one Type 1 column changes, then all Type 1 columns will be updated
  • If one Type 2 column changes, then a new record will be inserted into the dimension table
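The column-level rules above could be sketched roughly as follows. This is an illustrative in-memory version, not an actual load routine: the function `apply_change`, the `SCD_TYPE` mapping, the dictionary row layout and the `is_current` flag are all assumptions made for the example.

```python
# Hypothetical column-level SCD classification for a customer dimension.
SCD_TYPE = {"name": 2, "marital_status": 2, "home_income": 1}


def apply_change(dimension, cust_id, new_values):
    """Apply one source-system change and return the surrogate key facts should now use."""
    current = next(r for r in dimension if r["cust_id"] == cust_id and r["is_current"])
    changed = {col for col, value in new_values.items() if current[col] != value}

    if any(SCD_TYPE[col] == 2 for col in changed):
        # Type 2: keep the old profile untouched and insert a new row with a new key.
        current["is_current"] = False
        new_row = {
            **current,
            **new_values,
            "cust_key": max(r["cust_key"] for r in dimension) + 1,
            "is_current": True,
        }
        dimension.append(new_row)
        return new_row["cust_key"]

    # Only Type 1 columns changed: overwrite in place, no history kept.
    for col in changed:
        current[col] = new_values[col]
    return current["cust_key"]


customer_dim = [{"cust_key": 1, "cust_id": 123, "name": "Sue Jones",
                 "marital_status": "S", "home_income": 30000, "is_current": True}]

# Sue gets married: name and marital status are Type 2, so a second profile appears.
married_key = apply_change(customer_dim, 123, {"name": "Sue Smith",
                                               "marital_status": "M",
                                               "home_income": 60000})

# A later income-only change is Type 1: the current profile is updated in place.
same_key = apply_change(customer_dim, 123, {"home_income": 65000})

print(married_key, same_key, len(customer_dim))
```

New fact rows would carry the returned key, so earlier facts stay attached to the earlier profile.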
78
Designing Loads to Handle SCD
• Design and implementation guidelines
  • For large dimension tables, change data capture techniques may be used to minimize the data volume
  • For smaller dimension tables, compare all OLTP records with dimension table records
  • Balance data volume with change data capture logic complexities
79
Designing Loads to Handle SCD
• Type 1 example: a woman gets married

Customer Dimension Table
Column Name     SCD Type
Customer Key    N/A
Customer ID     1
Name            1
Marital Status  1
Home Income     1
80
Type 1 Example

Before
  OLTP
    Customer OLTP
    CustID  Name       MaritalStatus  HomeIncome
    123     Sue Jones  S              $30K
  Star Schema
    Customer Dim
    CustKey  CustID  Name       MaritalStatus  HomeIncome  Status
    1        123     Sue Jones  S              $30K        0
    Sales Facts
    CustKey  DayKey  Sales
    1        1       $40
    Day Dim
    DayKey  BusinessDate
    1       1/31/01

After: Sue Gets Married 2/1/01
  OLTP
    Customer OLTP
    CustID  Name       MaritalStatus  HomeIncome
    123     Sue Smith  M              $60K
  Star Schema
    Customer Dim
    CustKey  CustID  Name       MaritalStatus  HomeIncome  Status
    1        123     Sue Smith  M              $60K        0
    Sales Facts
    CustKey  DayKey  Sales
    1        1       $40
    1        2       $50
    Day Dim
    DayKey  BusinessDate
    1       1/31/01
    2       2/01/01
81
Type 1 Example
• Observations
  • Customer history is not maintained in the OLTP system
  • Customer history is not maintained in the star schema
  • Sue only has one customer 'profile' in the customer dimension table
  • Sue’s sales facts across all history are associated with her married profile
  • The association between Sue’s earlier sales facts and her single profile has been lost
82
Designing Loads to Handle SCD
• Type 2 example: a woman gets married

Customer Dimension Table
Column Name     SCD Type
Customer Key    N/A
Customer ID     2
Name            2
Marital Status  2
Home Income     1
83
Type 2 Example

Before
  OLTP
    Customer OLTP
    CustID  Name       MaritalStatus  HomeIncome
    123     Sue Jones  S              $30K
  Star Schema
    Customer Dim
    CustKey  CustID  Name       MaritalStatus  HomeIncome  Status
    1        123     Sue Jones  S              $30K        0
    Sales Facts
    CustKey  DayKey  Sales
    1        1       $40
    Day Dim
    DayKey  BusinessDate
    1       1/31/01

After: Sue Gets Married 2/1/01
  OLTP
    Customer OLTP
    CustID  Name       MaritalStatus  HomeIncome
    123     Sue Smith  M              $60K
  Star Schema
    Customer Dim
    CustKey  CustID  Name       MaritalStatus  HomeIncome  Status
    1        123     Sue Jones  S              $30K        1
    2        123     Sue Smith  M              $60K        0
    Sales Facts
    CustKey  DayKey  Sales
    1        1       $40
    2        2       $50
    Day Dim
    DayKey  BusinessDate
    1       1/31/01
    2       2/01/01
84
Type 2 Example
• Type 2 observations
  • Customer history is not maintained in the OLTP system
  • Customer history is maintained in the star schema
  • Sue has two 'profiles' in the customer dimension
  • Sue’s sales facts may be analyzed for when she was single, when she was married, and across all history by using the customer id field
  • Home income was updated in the new profile record
85
Slowly Changing Dimension Advice
• 'When in doubt, design type 2'
86
Rapidly Changing Dimension (RCD)
• Values change rapidly over time
• There is no yardstick for telling whether a dimension is slowly changing or not; this is based on the judgment of the data modeler
• An SCD may become an RCD over time, or vice versa
87
Large Dimensions
• Dimensions containing several million records!!!
• HOW TO SUPPORT?
  • Use a database whose indexing technology supports rapid browsing
  • Find and suppress duplicate entries in the dimension (e.g. name and address matching)
  • Never use Type 2 to solve changing dimensions (adding records)
88
Rapidly Changing Monster Dimensions
• Dimensions containing > 100 million records!!!
• HOW TO SUPPORT?
  • Break the monster dimension into separate dimension tables
  • Constant information in original table
  • New dimension table can have discrete values for each attribute
  • Choose pre-defined set of values per attribute
89
Indexing
• Bitmap indexes on the foreign key columns in the fact tables
• Bitmap indexes on low cardinality columns in dimension tables, such as Month, Product Category, Store Category, etc.
• B-Tree indexes on dimension key columns
90
Rapidly Changing Monster Dimensions
• Build the data in this dimension with all possible combinations of values for each attribute
• Identify each combination uniquely
• Every time an event occurs and is recorded in the fact table, attach it with the unique combination ID
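Building a dimension from all possible combinations of a pre-defined value set per attribute could look like the sketch below. The attribute names, the band values and the helper `lookup_key` are hypothetical; the slides do not specify them.

```python
from itertools import product

# Hypothetical pre-defined value sets for the rapidly changing attributes.
bands = {
    "age_band": ["18-29", "30-49", "50+"],
    "income_band": ["low", "medium", "high"],
    "marital_status": ["S", "M"],
}
attribute_names = list(bands)

# One row per possible combination, each identified by a unique surrogate key.
combination_keys = {
    combo: key
    for key, combo in enumerate(product(*bands.values()), start=1)
}


def lookup_key(**attributes):
    """Return the combination ID to attach to a fact row when an event is recorded."""
    return combination_keys[tuple(attributes[name] for name in attribute_names)]


first_key = lookup_key(age_band="18-29", income_band="low", marital_status="S")
last_key = lookup_key(age_band="50+", income_band="high", marital_status="M")
print(len(combination_keys), first_key, last_key)
```

Because the rows are built up front, a change in a customer's band only changes which key new facts point to; no new dimension rows are needed.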
91
Example RCD
92
Example RCD
93
Degenerate Dimensions
• Dimensions with no other place to go
  • Stored in the fact table
  • Are not facts
  • Common examples include invoice numbers or order numbers
94
Example Degenerate Dimension
95
Junk/Dirty Dimension
• A convenient grouping of random flags and attributes
• After carving out all the dimensions, some flags or text attributes are left over in the fact table that do not belong to any of the dimension tables
96
Junk/Dirty Dimension
• Alternatives to be avoided:
  • Leaving the flags and attributes unchanged in the fact table record
  • Making each flag and attribute into its own separate dimension
  • Stripping out all of these flags and attributes from the design
• Make a convenient grouping of the flags and attributes to get them out of a fact table into a useful dimensional framework
97
Drilling
• Drilling down
  • Adding dimensional detail
  • Further breaks out a measure in some way
[Diagram: Quarterly Auto Sales Summary (Units Sold, Revenue) by Region (Northeast, Southeast, Central, Northwest, Southwest), drilled down to Region and State (Maine, New York, Massachusetts, Florida, Georgia, Virginia) for the Northeast and Southeast regions]
98
Drilling
• Rolling up
  • Removing dimensional detail
  • Rolls up a measure
[Diagram: Quarterly Auto Sales Summary (Units Sold, Revenue) by Region and State (Maine, New York, Massachusetts, Florida, Georgia, Virginia), rolled up to Region only (Northeast, Southeast, Central, Northwest, Southwest)]
99
Drilling
• Drilling across
  • A query that involves more than one fact table
  • Not necessarily an action that changes how a user is looking at the data
  • Best resolved by multiple SQL passes
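A multi-pass drill-across might be sketched like this: each fact table is summarized in its own SQL pass at the same dimensional level, and the result sets are merged afterwards. The table layout is simplified (the product name stands in for a product key) and the row split is invented; the totals match the Coffee Maker Fulfillment Report shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_facts    (product TEXT, units_sold INTEGER);
CREATE TABLE shipment_facts (product TEXT, units_shipped INTEGER);
""")
conn.executemany("INSERT INTO sales_facts VALUES (?, ?)", [
    ("Standard Coffee Maker", 3000), ("Standard Coffee Maker", 2000),
    ("Thermal Coffee Maker", 2400), ("Deluxe Coffee Maker", 2073),
])
conn.executemany("INSERT INTO shipment_facts VALUES (?, ?)", [
    ("Standard Coffee Maker", 3800), ("Thermal Coffee Maker", 1632),
    ("Deluxe Coffee Maker", 1000), ("Deluxe Coffee Maker", 658),
])

# Pass 1 and pass 2: one query per fact table, grouped by the shared dimension.
sold = dict(conn.execute(
    "SELECT product, SUM(units_sold) FROM sales_facts GROUP BY product"))
shipped = dict(conn.execute(
    "SELECT product, SUM(units_shipped) FROM shipment_facts GROUP BY product"))

# Merge step: combine the two result sets on the common dimension value.
report = {}
for product in sorted(set(sold) | set(shipped)):
    units_sold = sold.get(product, 0)
    units_shipped = shipped.get(product, 0)
    pct = round(100 * units_shipped / units_sold) if units_sold else None
    report[product] = (units_sold, units_shipped, pct)

for product, values in report.items():
    print(product, values)
```

Joining the two fact tables directly in one statement would multiply rows wherever a product has several rows in each table, which is why separate passes are used.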
101
Dimensional Design Process
Project Context
102
Data Mart Development
• Dimensional modeling is a critical part of the data mart development effort
[Diagram: Design Phase, Development Phase, Deployment Phase]
103
Data Mart Development
• Design phase
  • Determine requirements and design schema
• Development phase
  • Iterative build and feedback
• Deployment phase
  • Automate load, document, train users
104
Project Deliverables
• Design
  • Project definition document
  • Project plan
  • Schema design
  • Mapping document
  • Report design
• Development
  • Populated data mart
  • Load routines (Sagent “Plans”)
  • Query and reporting environment
• Deployment
  • Automation
  • Documentation
  • Training materials
105
Project Approach
• The dimensional model is developed during the design stage
• Scope of the project has already been determined
[Diagram: Design Phase, Development Phase, Deployment Phase]
106
Design Stage Activities
• Gather requirements through requirements workshops
• Develop star schema
• Conduct design review
[Diagram: Design Phase, Development Phase, Deployment Phase]
107
Gather Requirements
• Requirements definition
  • User workshops
  • Spreadsheets
  • Sample reports
• Source systems analysis
  • DBA interviews
  • Copybooks
  • E/R diagrams
108
Design Deliverables
• Deliverables
  • The star schema itself
  • Load mapping document
• How these primary components are delivered will depend on needs and format chosen
  • Modeling tools
  • Spreadsheets
  • Text documents
109
Notation Example
• IDEF1X
  • Dependent entities - fact tables
  • Independent entities - dimension tables
[Diagram: Sales Facts (time_key, model_key, dealer_key) related to Time (time_key), Model (model_key) and Dealer (dealer_key)]
110
Notation Example
• Martin IE
  • Entities - fact or dimension tables
  • Attributes not shown
[Diagram: Sales Facts related to Time, Dealer and Model]
111
Notation Example
• Kimball
  • Simple structure
  • Cardinality implied
[Diagram: Sales Facts (time_key, model_key, dealer_key) surrounded by Time (time_key), Model (model_key) and Dealer (dealer_key)]
112
Design Naming Standards
• Responsibility of data administration
  • Extended to the data warehouse
  • Important to start early in the project
• Suggested conventions
  • Fact tables
  • Dimension tables
  • Aggregate tables
  • Keys
113
Data Element Definitions
• Clear descriptions
  • Facts
  • Calculated formulae
  • Dimensional attributes
  • Multiple meanings/synonymous terms
  • Aliases
114
Data Element Instances
• Example of data
  • As it will exist in the warehouse
  • After decoding
  • Adds to model understanding
  • Removes ambiguity/uncertainty
115
Data Element Mapping
• Where is the data coming from
  • Source system
  • Table
  • Column
  • Record
  • Field
116
Data Transformation
• Changing the data
  • Serves as spec for ETL process
  • Decodes
  • Type conversion
  • Conditional logic
  • Handling of NULLs
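The four kinds of transformation rule listed above can be illustrated with one small function. The decode table, the income threshold, the band labels and the function name `transform` are assumptions for this sketch, not part of any mapping document in the course.

```python
# Hypothetical decode table: source codes to business terms.
MARITAL_DECODE = {"S": "Single", "M": "Married"}


def transform(record: dict) -> dict:
    """Turn one source record into warehouse-ready attribute values."""
    code = record.get("marital_status")
    income_raw = record.get("home_income")

    # Type conversion with NULL handling: empty or missing input stays None.
    income = int(income_raw) if income_raw not in (None, "") else None

    # Conditional logic: derive a band from the numeric value.
    if income is None:
        income_band = "Unknown"
    elif income < 50_000:
        income_band = "Under $50K"
    else:
        income_band = "$50K and over"

    return {
        # Decode, with unknown or NULL codes mapped to an explicit value.
        "marital_status": MARITAL_DECODE.get(code, "Unknown"),
        "home_income": income,
        "income_band": income_band,
    }


clean = transform({"marital_status": "M", "home_income": "60000"})
missing = transform({"marital_status": None, "home_income": ""})
print(clean, missing)
```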
118
Aggregate Schemas
119
Aggregate Designs
• Aggregates
  • Pre-stored fact summaries
  • Along one or more dimensions
  • The most effective tool for improving performance
• Examples
  • Summary of sales by region, by product, by category
  • Monthly sales
120
Aggregate Background
• Aggregate rationale
  • Improve end user query performance
  • Reduce required CPU cycles
  • Powerful cost saving tool
• Restrictions
  • Additive facts only
  • Must use dimensional design
121
Aggregate Guidelines
• Don’t start with aggregates
• Design and build based on usage
• Sooner or later you'll need to build aggregates
122
Aggregate Types
• Separate tables
  • Separate fact table for every aggregate
  • Separate dimension table for every aggregate dimension
  • Same number of fact records as level field tables
• Advantage
  • Removes possibility of double counting
  • Schema clarity
• Caveat
  • Requires software with aggregate navigation capability
123
Separate Tables
One-Way Aggregate
Sales Facts: time_key, product_key, market_key, Quantity, Amount
Mthly Sales Facts Agg: month_key, product_key, market_key, Quantity, Amount
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Month: month_key, Year, Fiscal Period, Month
Product: product_key, Category, Brand, Product, Diet Indicator
Market: market_key, Region, District, State, City
124
Separate Tables
Two-Way Aggregate
Sales Facts: time_key, product_key, market_key, Quantity, Amount
Mnthly Cat Sales Facts Agg: month_key, category_key, market_key, Quantity, Amount
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Month: month_key, Year, Fiscal Period, Month
Product: product_key, Category, Brand, Product, Diet Indicator
Category: category_key, Category
Market: market_key, Region, District, State, City
125
Aggregate Pitfalls
• Sparsity failure
  • Term used to describe the result of building too many aggregate fact tables that do not summarize enough rows
  • When sparsity failure occurs, a relatively small star schema can grow (in terms of disk size) thousands of times
  • Sparsity failure = aggregate explosion
126
Aggregate Design Guidelines
• Rule of twenty
  • To avoid aggregate explosion
  • Make sure each aggregate record summarizes 20 or more lower-level records
• Remember
  • Total number of possible fact tables in any given dimensional model = Cartesian product of all levels in all the dimensions
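Both guidelines reduce to simple arithmetic, sketched below with the Millennium Motors hierarchies. The level lists follow the exercise model; whether an "All" level is counted as well is a modeling choice, and this sketch leaves it out. The helper name and the use of hierarchy cardinalities as a stand-in for fact row counts are assumptions.

```python
from math import prod

# Hierarchy levels per dimension, lowest level first (no "All" level counted).
levels = {
    "time": ["date", "month", "quarter", "year"],
    "model": ["model", "line", "category"],
    "dealer": ["dealer", "city", "state", "region"],
}

# One possible fact table per combination of levels, including the base table.
possible_fact_tables = prod(len(dimension_levels) for dimension_levels in levels.values())


def satisfies_rule_of_twenty(lower_level_rows: int, aggregate_rows: int, threshold: int = 20) -> bool:
    """True when each aggregate row summarizes, on average, at least `threshold` rows."""
    return lower_level_rows / aggregate_rows >= threshold


# Time hierarchy over 5 years: 1825 days, 60 months, 20 quarters.
days_to_months = satisfies_rule_of_twenty(1825, 60)      # about 30 days per month
months_to_quarters = satisfies_rule_of_twenty(60, 20)    # only 3 months per quarter

print(possible_fact_tables, days_to_months, months_to_quarters)
```

Real compression depends on sparsity, so actual fact row counts, not just hierarchy cardinalities, should be checked before building an aggregate.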
127
Hierarchies & Aggregate Design
• Hierarchy diagram
  • Helps visualize options for building aggregates
  • Adding cardinalities ensures following the rule of 20
  • Not required to build initial star schema

Time hierarchy
Level    Per year  Over 5 years
Year     1         5 years
Quarter  4         20 quarters
Month    12        60 months
Date     365       1825 days
128
Aggregate Navigation
• Description
  • Function provided by software layer: Aggregate Navigator
  • Directs user queries to the most favorable available aggregate
  • Transparent to the end user
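The core decision an aggregate navigator makes can be sketched as: among the tables that carry every attribute the query asks for, pick the one with the fewest rows. The class `FactTable`, the table names and the row counts below are hypothetical; commercial navigators are considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTable:
    name: str
    attributes: frozenset   # dimensional attributes reachable from this table
    row_count: int


# Hypothetical catalog: the base star plus one month-by-state aggregate.
TABLES = [
    FactTable("sales_facts",
              frozenset({"date", "month", "quarter", "year",
                         "model", "line", "category",
                         "dealer", "city", "state", "region"}),
              1_000_000),
    FactTable("agg_sales_month_state",
              frozenset({"month", "quarter", "year",
                         "model", "line", "category",
                         "state", "region"}),
              40_000),
]


def choose_table(requested_attributes, tables=TABLES) -> FactTable:
    """Pick the smallest table that can answer a query on the requested attributes."""
    candidates = [t for t in tables if set(requested_attributes) <= t.attributes]
    if not candidates:
        raise LookupError("no table can answer this query")
    return min(candidates, key=lambda t: t.row_count)


monthly = choose_table({"month", "state"})
daily = choose_table({"date", "state"})
print(monthly.name, daily.name)
```

The user writes the same query either way, which is what makes the aggregate transparent.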
129
Aggregate Framework
[Diagram: Business View and Designer View]
130
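Aggregate Architecture
[Diagram: three aggregate navigation configurations between Client PC and RDBMS: a Client PC issuing SQL that reaches the RDBMS as aggregate-aware SQL; a Client PC issuing SQL to an Application Server, which issues aggregate-aware SQL to the RDBMS; a Client PC issuing aggregate-aware SQL directly to the RDBMS]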
131
Aggregate Deployment
• Incremental
• Based on usage
• Transparent to users
• Typically warehouse DBA responsibility
132
Aggregate Deployment
[Diagram: staggered deployment timeline]
• Build Subject Area 1, no aggregates
• Build Subject Area 2, no aggregates
• Build aggregates for Subject Area 1
• Build Subject Area 3, no aggregates
• Build aggregates for Subject Area 2
• Build Subject Area 4, no aggregates
• Build aggregates for Subject Area 3
• Some re-work required
133
Exercise 3
• Scenario
  • Given the original star schema and the following hierarchy, design a two-way aggregate table structure that will drastically increase performance
  • Make your own assumptions about summary levels
134
Exercise 3 - Dimensional Model
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
135
Exercise 3
• Scenario
  • Industry: Automobile manufacturing
  • Company: Millennium Motors
  • Value chain focus: Sales
• Sample business questions:
  • What are the top 10 selling car models this month?
  • How do this month's top 10 selling models compare to the top 10 over the last six months?
  • Show me dealer sales by region by model by day
  • What is the total number of cars sold by month by dealer by state?
136
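Exercise 3
Millennium Motors' dimensions (hierarchy levels, highest first)
Model: All, Category, Line, Model name
Time: All, Year, Quarter, Month, Date
Dealer: All, Region, State, City, Dealer name
[Cardinality labels from the diagram, in extraction order: 5, 50, 1000, 1000, 40, 10, 20, 20, 60, 1825, 5]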
137
Exercise 3 Worksheet
138
Exercise 3 Solution
Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Agg Sales Facts: state_key, month_key, model_key, revenue, quantity
Model: model_key, category, line, model
Time: time_key, year, quarter, month, date
Month: month_key, year, quarter, month
Dealer: dealer_key, region, state, city, dealer
State: state_key, region, state
140
Multiple Fact Tables
141
Multiple Fact Tables
• Different business processes usually require different fact tables
• There are also several cases where a single business process will require multiple fact tables
  • Core and custom
  • Snapshot and transaction
  • Coverage
  • Aggregates
142
Different Business Processes
• Different business processes usually require different fact tables
• In practice, it may be hard to identify what a “process” is
• Sometimes you can spot different processes because measures are recorded
  • With different dimensions
  • At differing grains
143
Different Dimensions or Grain
Shipment Facts: time_key, product_key, shipper_key, market_key, Quantity, Weight
Sales Facts: time_key, product_key, market_key, Quantity, Amount
Product: product_key, Category, Brand, Product, Diet Indicator
Shipper: shipper_key, name, type, mode, address
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Market: market_key, Region, District, State, City
144
Different Dimensions or Grain
Don’t take shortcuts with grain
The 'not applicable' dimension value
Using a 'not applicable' row in a dimension confuses the grain and can introduce reporting difficulty
145
Different Points in Time
Sometimes, it is not easy to identify the discrete business processes
All measures may have the same dimensionality or grain
Different measures are recorded at different times
Quantity sold is not recorded at the same time as quantity shipped
146
Different Timing
Building a single fact table would require recording zero or null for measures that are not applicable at a point in time
Reports would contain a confusing combination of zeros, nulls, and absence of data
147
Different Timing - One Fact Table
Sales and Shipment Facts: time_key, product_key, market_key, Quantity_sold, Amount_sold, Quantity_shipped, Amount_shipped
Quantity_shipped and Amount_shipped initially will be null
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Market: market_key, Region, District, State, City
Product: product_key, Category, Brand, Product, Diet Indicator
148
Different Timing - Two Fact Tables
Shipment Facts: time_key, product_key, market_key, Quantity, Amount
Sales Facts: time_key, product_key, market_key, Quantity, Amount
Product: product_key, Category, Brand, Product, Diet Indicator
Market: market_key, Region, District, State, City
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
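With two fact tables, each table holds only rows for events that actually happened, and a report combines them at query time. The sketch below uses Python's built-in sqlite3 to show one way such a report could be assembled. The lowercase table and column names, the sample rows, and the UNION ALL technique are assumptions, not something the slides prescribe.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_facts    (time_key INT, product_key INT, market_key INT, quantity INT, amount REAL);
CREATE TABLE shipment_facts (time_key INT, product_key INT, market_key INT, quantity INT, amount REAL);
""")
# Invented sample data: product 200 has been sold but not yet shipped
con.executemany("INSERT INTO sales_facts VALUES (?,?,?,?,?)",
                [(1, 100, 7, 5, 50.0), (2, 100, 7, 3, 30.0), (2, 200, 7, 4, 80.0)])
con.executemany("INSERT INTO shipment_facts VALUES (?,?,?,?,?)",
                [(3, 100, 7, 6, 60.0)])

# Stack both fact tables at a common grain (product), then sum.
# No null or zero placeholder rows are ever stored in either fact table.
report = con.execute("""
SELECT product_key, SUM(qty_sold) AS qty_sold, SUM(qty_shipped) AS qty_shipped
FROM (
    SELECT product_key, quantity AS qty_sold, 0 AS qty_shipped FROM sales_facts
    UNION ALL
    SELECT product_key, 0 AS qty_sold, quantity AS qty_shipped FROM shipment_facts
)
GROUP BY product_key
ORDER BY product_key
""").fetchall()
```

The zero for unshipped products appears only in the report, where its meaning is unambiguous. Nothing in the stored data mixes zeros, nulls, and missing rows.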
149
Identifying Different Processes
Look at the measures in question
Sort them into fact tables based on
Dimensions
Grain
Differing timings of events measured
150
One Process, Multiple Fact Tables
Core and custom
Coverage
Snapshot and transaction
Aggregates
151
Core and Custom Schemas
There is a set of dimension attributes and measures shared in all cases
Depending on the value in a dimension, certain extra dimension attributes or measures are recorded
Heterogeneous products
Types of customers
152
Core and Custom
Product: product_key, ...
Customer: customer_key, ...
Time: time_key, ...
Branch: branch_key, ...
Checking Account: checking_key, ..., custom checking attributes
Account Facts: time_key, product_key, branch_key, customer_key, Balance, Transaction_count
Checking Account Facts: time_key, checking_key, branch_key, customer_key, Balance, Transaction_count, ..., custom checking facts
153
Core and Custom
Core fact table and dimensions:
All attributes shared no matter what
Appropriate for analysis across entire subject area
Custom fact table and/or dimensions:
Contain attributes specific to a particular dimension value (e.g. “Checking”)
Only appropriate when the business question is limited to that particular dimension value
Should repeat shared facts to minimize need to access two fact tables
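A small sketch of how core and custom fact tables might sit side by side, using Python's built-in sqlite3. The overdraft_count column is a hypothetical stand-in for the unspecified "custom checking facts", and the sample rows are invented. Only the shared columns come from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Core: one row per account of any product type, shared facts only
CREATE TABLE account_facts (time_key INT, product_key INT, branch_key INT, customer_key INT,
                            balance REAL, transaction_count INT);
-- Custom: checking accounts only; repeats the shared facts and adds
-- a hypothetical custom fact (overdraft_count)
CREATE TABLE checking_account_facts (time_key INT, checking_key INT, branch_key INT, customer_key INT,
                                     balance REAL, transaction_count INT, overdraft_count INT);
""")
con.executemany("INSERT INTO account_facts VALUES (?,?,?,?,?,?)", [
    (1, 1, 1, 501, 1000.0, 12),   # checking
    (1, 2, 1, 501, 5000.0, 1),    # savings
    (1, 1, 1, 502, 250.0, 30)])   # checking
con.executemany("INSERT INTO checking_account_facts VALUES (?,?,?,?,?,?,?)", [
    (1, 1, 1, 501, 1000.0, 12, 0),
    (1, 1, 1, 502, 250.0, 30, 2)])

# Question spanning the whole subject area: use the core table
total_balance = con.execute("SELECT SUM(balance) FROM account_facts").fetchone()[0]

# Question limited to checking: the custom table answers it alone,
# because the shared facts are repeated there
checking_balance, overdrafts = con.execute(
    "SELECT SUM(balance), SUM(overdraft_count) FROM checking_account_facts").fetchone()
```

Because balance is repeated in the custom table, the checking-only question never needs a join back to the core fact table.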
154
Coverage Schema
A star schema usually measures events that happen
Relationships between the dimensions involved are not captured if events do not happen
A coverage table fills the gap:
What did not sell that was on promotion?
Who was assigned to that customer?
Usually “factless”
155
Measuring What Happened
Product: product_key, Category, Brand, Product, SKU
Customer: customer_key, Name, Company, Account, Phone_num
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Sales_rep: rep_key, rep_name, rep_phone, Region, District, State, City
Sales Facts: time_key, product_key, customer_key, rep_key, quantity, sales_dollars
Sales Facts does not reveal which rep is assigned to a customer when that rep has made no sale to the customer
156
Coverage Table
Customer_coverage_facts shows who is assigned to a customer at a point in time
Customer: customer_key, Name, Company, Account, Phone_num
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Sales_rep: rep_key, rep_name, rep_phone, Region, District, State, City
Customer Coverage Facts: time_key, customer_key, rep_key
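A factless coverage table becomes useful when it is compared with the fact table that records events. The sketch below uses Python's built-in sqlite3 to find the assignments that produced no sale in a period. The sample rows and the anti-join formulation are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_facts (time_key INT, product_key INT, customer_key INT, rep_key INT,
                          quantity INT, sales_dollars REAL);
-- Factless: the row itself records that the assignment exists
CREATE TABLE customer_coverage_facts (time_key INT, customer_key INT, rep_key INT);
""")
# Invented sample data: three assignments, one sale
con.executemany("INSERT INTO customer_coverage_facts VALUES (?,?,?)",
                [(1, 501, 9), (1, 502, 9), (1, 503, 8)])
con.executemany("INSERT INTO sales_facts VALUES (?,?,?,?,?,?)",
                [(1, 100, 501, 9, 2, 40.0)])

# Assignments in period 1 with no matching sale (anti-join)
assigned_without_sale = con.execute("""
SELECT c.customer_key, c.rep_key
FROM customer_coverage_facts c
LEFT JOIN sales_facts s
       ON s.time_key = c.time_key
      AND s.customer_key = c.customer_key
      AND s.rep_key = c.rep_key
WHERE c.time_key = 1
  AND s.customer_key IS NULL
ORDER BY c.customer_key
""").fetchall()
```

Sales Facts alone could not answer this question, since customers 502 and 503 never appear in it.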
157
Snapshot and Transaction
Viewing a single process multiple ways
Transactions: the changes to what is being measured
Snapshot: the status at a point in time
Example:
Changes to inventory
Current status of inventory
158
Snapshot
How much is on hand today?
How much was on hand yesterday?
Product: product_key, Category, Brand, Product, SKU
Location: location_key, Warehouse, WH_code, City, State
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Inventory Snapshot: time_key, product_key, location_key, quantity_on_hand
159
Transaction
How did inventory change today?
How much product was returned due to failed inspection?
Product: product_key, Category, Brand, Product, SKU
Location: location_key, Warehouse, WH_code, City, State
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Transaction_type: transaction_type_key, transaction_type_code, transaction_type, transaction_category
Inventory Transactions: time_key, product_key, location_key, transaction_type_key, transaction_amount
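The two views are related: a snapshot is the running total of the transactions. The sketch below illustrates that relationship in plain Python under stated assumptions: transaction_amount is a signed quantity, the transaction type is shown as a string instead of a surrogate key, opening inventory is zero, and the sample rows are invented.

```python
from collections import defaultdict

# (time_key, product_key, location_key, transaction_type, transaction_amount)
transactions = [
    (1, 100, 1, "receipt", 50),
    (1, 100, 1, "shipment", -20),
    (2, 100, 1, "return_failed_inspection", -5),
    (3, 100, 1, "receipt", 10),
]

def build_snapshots(transactions, days):
    """Return {(time_key, product_key, location_key): quantity_on_hand}."""
    # Net change per day, product, and location
    change = defaultdict(int)
    for time_key, product, location, _type, amount in transactions:
        change[(time_key, product, location)] += amount

    snapshots = {}
    for product, location in sorted({(p, l) for (_, p, l) in change}):
        on_hand = 0  # assumed opening balance
        for day in days:
            on_hand += change.get((day, product, location), 0)
            # A snapshot row exists for every day, even with no activity
            snapshots[(day, product, location)] = on_hand
    return snapshots

snapshots = build_snapshots(transactions, days=[1, 2, 3, 4])
```

Day 4 has no transactions, yet the snapshot still carries a row for it. That is why "how much is on hand today?" is easy to answer from the snapshot table and awkward to answer from the transaction table.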
160
Aggregate Tables
Aggregate table
A fact table that summarizes another fact table
Created for performance reasons
Covered in previous section
161
Design Tools for Multiple Tables
Create a set of matrices:
Facts vs dimension
Facts vs dimensional attributes
Mark where facts apply to dimensions
Mark where facts apply to dimensional attributes
When facts don't apply, assume separate fact table
162
Bus Matrix
A planning methodology for large data warehouses with multiple data marts or dimensional models
Enables technical planning as well as executive communication
Exceptionally effective for distributed data warehouses without a center
Simply a vertical list of data marts and a horizontal list of dimensions
163
Example Matrix
Fact vs dimensional attribute matrix
[Matrix: rows Fact 1 to Fact 4, columns Attribute 1 to Attribute 8. Fact 1 and Fact 2 are each marked against four attributes, Fact 3 and Fact 4 against five. The rows are grouped into Fact Table 1 and Fact Table 2.]
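The matrix can be turned into candidate fact tables mechanically: facts marked against exactly the same attributes are grouped together. The sketch below shows that grouping in Python. The positions of the marks are invented for illustration, and the rule covers only dimensionality, so differences in grain or timing still have to be checked by hand.

```python
from collections import defaultdict

# Fact -> set of attributes it applies to (mark positions are assumed)
matrix = {
    "Fact 1": {"Attribute 1", "Attribute 2", "Attribute 3", "Attribute 4"},
    "Fact 2": {"Attribute 1", "Attribute 2", "Attribute 3", "Attribute 4"},
    "Fact 3": {"Attribute 1", "Attribute 2", "Attribute 5", "Attribute 6", "Attribute 7"},
    "Fact 4": {"Attribute 1", "Attribute 2", "Attribute 5", "Attribute 6", "Attribute 7"},
}

def group_into_fact_tables(matrix):
    """Group facts that share an identical attribute set.

    Returns a list of (sorted attribute list, list of facts), one entry per
    candidate fact table, in order of first appearance.
    """
    groups = defaultdict(list)
    for fact, attributes in matrix.items():
        groups[frozenset(attributes)].append(fact)
    return [(sorted(attrs), facts) for attrs, facts in groups.items()]

fact_tables = group_into_fact_tables(matrix)
for number, (attrs, facts) in enumerate(fact_tables, start=1):
    print(f"Fact Table {number}: {', '.join(facts)} by {', '.join(attrs)}")
```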
164
Exercise 4
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
Sample business questions:
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over the last six months?
Show me dealer sales by region by model by day.
How many cars have been purchased over the last six months by customers with yearly household incomes greater than $200,000?
165
Exercise 4 - continued
Using these source data elements, design a star schema that answers the proposed business questions
Daily sales revenue
Daily quantity sold
Model
Dealer
Dealer city
Product line
Region where sold
State
Vehicle category
Date of sales
Customer name
Customer zip code
Customer yearly income
P.O. Number
Purchase price
Discount amount
Brand of car
166
Exercise 4 - Worksheet
167
Exercise 4 Solution - Matrix
Attribute columns: Customer name, Customer zip code, Model, Customer income, Dealer, P.O. Number, Dealer city, Product line, Brand of car, Region where sold, State, Vehicle category, Date of sales
facts
daily_sales: X in 9 columns (Model, Dealer, Dealer city, Product line, Brand of car, Region where sold, State, Vehicle category, Date of sales)
daily_quantity: X in the same 9 columns
purchase_price: X in all 13 columns
discount_amount: X in all 13 columns
168
Exercise 4 - Star Schema
Customer: customer_key, customer_name, customer_zip, yearly_income
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
Daily Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Customer Sales Facts: model_key, dealer_key, time_key, customer_key, po_number, purchase_price, discount_amt
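The customer income question can only be answered from Customer Sales Facts, since Daily Sales Facts carries no customer key. The sketch below uses Python's built-in sqlite3 to show one possible query. It assumes each Customer Sales Facts row represents one purchased car, uses a fixed hypothetical as-of date, and runs against invented sample rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT,
                       customer_zip TEXT, yearly_income REAL);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INT, quarter INT, month INT, date TEXT);
CREATE TABLE customer_sales_facts (model_key INT, dealer_key INT, time_key INT, customer_key INT,
                                   po_number TEXT, purchase_price REAL, discount_amt REAL);
""")
# Invented sample data
con.executemany("INSERT INTO customer VALUES (?,?,?,?)", [
    (1, "A. Rivera", "94105", 250000.0),
    (2, "B. Chen", "10001", 90000.0),
    (3, "C. Okafor", "60601", 310000.0)])
con.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?)", [
    (1, 2023, 4, 11, "2023-11-15"),
    (2, 2024, 1, 3, "2024-03-10"),
    (3, 2024, 2, 5, "2024-05-22")])
con.executemany("INSERT INTO customer_sales_facts VALUES (?,?,?,?,?,?,?)", [
    (10, 1, 1, 1, "PO-1", 42000.0, 1000.0),   # outside the six-month window
    (10, 1, 2, 1, "PO-2", 45000.0, 0.0),      # qualifies
    (11, 2, 3, 2, "PO-3", 28000.0, 500.0),    # income too low
    (12, 2, 3, 3, "PO-4", 61000.0, 2000.0)])  # qualifies

AS_OF = "2024-06-30"  # hypothetical reporting date
(cars_purchased,) = con.execute("""
SELECT COUNT(*)
FROM customer_sales_facts f
JOIN customer c ON c.customer_key = f.customer_key
JOIN time_dim t ON t.time_key = f.time_key
WHERE c.yearly_income > 200000
  AND t.date >  date(?, '-6 months')
  AND t.date <= ?
""", (AS_OF, AS_OF)).fetchone()
```

The top 10 and dealer sales questions do not involve the customer, so they could be served from the smaller Daily Sales Facts table instead.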
170
Architected Data Marts
171
Data Mart
Meaning of the term 'data mart' has shifted over the last several years...
172
Data Mart Architecture 1993
Operational Systems -> E.T.L. Software -> Data Warehouse -> E.T.L. Software -> Data Marts -> Query & Reporting Software -> Analysis Users
173
Data Mart Architecture 1997
Operational Systems -> E.T.L. Software -> Data Marts -> Query & Reporting Software -> Analysis Users
174
Architected Data Marts
Operational Systems -> E.T.L. Software -> Data Warehouse (containing the Data Mart) -> Query & Reporting Software -> Analysis Users
175
Data Mart
Warehouse Subject Area
Incremental warehouse development
Centralized architecture
Not new
Well-suited to star schemas
176
“Stovepipe” Data Marts
Sales Facts: Store, Product, Time (Day)
Shipments Facts: Product, Time (Day), Warehouse
Inventory Facts: Warehouse, Product, Month
Dimensions not conformed
“Stovepipe” data marts:
Inconsistent and overlapping data
Difficult and costly to maintain
Redundant data load
Can’t drill across
Integration requires starting over
177
Conformed Dimensions
Definition
Dimensions are conformed when they are the same
-or-
When one dimension is a strict rollup of another
178
Conformed Dimensions
Same dimensions must:
1. ... have exactly the same set of primary keys, and
2. ... have the same number of records
179
Conformed Dimensions
Rolled up dimension
When one dimension is a strict rollup of another
Which means
Two conformed dimensions can be combined into a single logical dimension by creating a union of the attributes
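One way to make "strict rollup" testable is to require that the rolled-up dimension contains exactly the distinct attribute combinations found in the detailed dimension, with no extra rows, no missing rows, and no duplicates. The sketch below encodes that reading in Python. The check itself and the sample rows are assumptions, not a definition given in the slides.

```python
def is_strict_rollup(base_rows, rollup_rows, rollup_attrs):
    """True if rollup_rows holds exactly the distinct projection of
    base_rows onto rollup_attrs (surrogate keys are ignored)."""
    projected = {tuple(row[a] for a in rollup_attrs) for row in base_rows}
    actual = [tuple(row[a] for a in rollup_attrs) for row in rollup_rows]
    no_duplicates = len(actual) == len(set(actual))
    return no_duplicates and set(actual) == projected

# Invented sample dimensions
day_dim = [
    {"time_key": 1, "year": 2024, "quarter": 1, "month": 1, "date": "2024-01-05"},
    {"time_key": 2, "year": 2024, "quarter": 1, "month": 1, "date": "2024-01-20"},
    {"time_key": 3, "year": 2024, "quarter": 1, "month": 2, "date": "2024-02-03"},
]
month_dim = [
    {"month_key": 1, "year": 2024, "quarter": 1, "month": 1},
    {"month_key": 2, "year": 2024, "quarter": 1, "month": 2},
]
# Month 2 placed in quarter 2 contradicts the day dimension
bad_month_dim = month_dim + [{"month_key": 3, "year": 2024, "quarter": 2, "month": 2}]

attrs = ["year", "quarter", "month"]
conforms = is_strict_rollup(day_dim, month_dim, attrs)
conflicts = is_strict_rollup(day_dim, bad_month_dim, attrs)
```

Here conforms is True and conflicts is False, because the bad month dimension assigns February to a quarter that the day dimension never uses.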
180
Conformed Dimensions
Description
Shared common dimensions
Integrates logical design
Ensures consistency between data marts
Allows incremental development
Independent of physical location
Some re-work may be required
181
Conformed Dimensions
Advantages
Enables an incremental development approach
Easier and cheaper to maintain
Drastically reduces extraction and loading complexity
Answers business questions that cross data marts
Supports both centralized and distributed architectures
182
Conformed Dimensions
Interlocking Star Schemas
Sales Facts: Store Dimension, Product Dimension, Time Dimension
Shipment Facts: Product Dimension, Time Dimension, Warehouse Dimension
Inventory Facts: Product Dimension, Warehouse Dimension, Month Dimension
183
Kimball’s Data Warehouse Bus
[Bus matrix: rows Sales Facts, Shipment Facts, Inventory Facts; columns Store, Product, Day, Warehouse, Month]
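A bus matrix is easy to hold as a small data structure and query for planning purposes. The sketch below is written in Python, and the marks are an assumption: they assign each fact table the dimensions suggested by the star schema diagrams earlier in this section. The function names are hypothetical.

```python
# Fact table (data mart) -> dimensions it uses; marks are assumed
bus_matrix = {
    "Sales Facts": {"Store", "Product", "Day"},
    "Shipment Facts": {"Product", "Day", "Warehouse"},
    "Inventory Facts": {"Product", "Warehouse", "Month"},
}
columns = ["Store", "Product", "Day", "Warehouse", "Month"]

def shared_dimensions(matrix):
    """Dimensions used by more than one fact table; these must be conformed."""
    counts = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n > 1)

def drill_across_dimensions(matrix, mart_a, mart_b):
    """Dimensions two fact tables have in common, usable for drill-across."""
    return sorted(matrix[mart_a] & matrix[mart_b])

def render(matrix, columns):
    """Plain-text matrix: one row per fact table, X where a dimension applies."""
    lines = []
    for fact_table, dims in matrix.items():
        marks = " ".join("X" if c in dims else "." for c in columns)
        lines.append(f"{fact_table:<16}{marks}")
    return "\n".join(lines)

to_conform = shared_dimensions(bus_matrix)
sales_vs_shipments = drill_across_dimensions(bus_matrix, "Sales Facts", "Shipment Facts")
print(render(bus_matrix, columns))
```

This simple set intersection treats Day and Month as unrelated. Under the rollup rule for conformed dimensions, Month would also count as shared once it is built as a strict rollup of Day.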
184
When to Conform
Two approaches:
Up-front
As-you-go
Both approaches work
Choose the approach that works for you
185
Conform Up Front
Cross Enterprise Analysis
Create First-Cut Stars, All Subject Areas
Conform all Dimensions
Finalize Design & Build Subject Area 1
Finalize Design & Build Subject Area 2
Finalize Design & Build Subject Area 3
186
Conform As-You-Go
Design & Build Subject Area 1
Design & Build Subject Area 2
Conform Dimensions
Design & Build Subject Area 3
Conform Dimensions
Design & Build Subject Area 4
Conform Dimensions
Some re-work required
188
Course Review
Rationale for dimensional modeling
Dimensional modeling basics
Dimensional modeling details
Fact table details
Dimension table details
Design process
Aggregate schemas
Multiple fact tables
Architected data marts