Upload
mark-fowler
View
220
Download
3
Embed Size (px)
Citation preview
MIS 451
Building Business Intelligence Systems
Logical Design (1)
2
Project Planning
Requirements Analysis
Physical Design
Logical Design
Data Staging
Data Analysis (OLAP)
3
Introduction to Dimensional Modeling
Dimensional Modeling is a DW logical design technique that seeks to present data in a standard framework that is intuitive for data access and allows for high performance data access.
Intuitive: easy to write SQL High performance: high performance SQL
4
Customer
Places
1
Order
M
Contain
1
OrderLine
MOrder
M
Product
1
Belong to
M
ProductCategory
1
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
ER Model
Dimensional Model (Star Schema)
For detailed information, please refer handout 1.
5
Introduction to Dimensional Modeling
Analytical Report: 2-dimension January sales report by customer state and product category
Query: list sales in Jan. by customer state and product category?
6
Introduction to Dimensional Modeling
Query based on ER Model:
Select State, PCName, SUM(Price*Quantity)
From OrderLine OL, Customer C, Product_Category PC, Product P, Order O
Where OL.OID = O.OID and OL.PID = P.PID and O.CID = C.CID and to_char(O.OrderDate,’MON’) = ’JAN’ and P.PCID = PC.PCID
Group by State, PCName
Join: 5 tables
Query based on Dimensional Model:
Select State, PCName, SUM(Sales)
From Sales S, Customer C, Product P, Time T
Where S.Time_ Key = T.Time_Key and S.Product_ Key = P.Product_Key and S.Customer_Key = C.Customer_Key and T.Month= ’JAN’
Group by State, PCName
Join: 4 tables
7
Fact and Dimension
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
Fact table
Dimension table
8
Fact and Dimension
There are two types of tables in dimensional modeling: Fact table: attributes in fact tables are
measurements for analysis or contents in reports.
Dimension table: attributes in dimension tables are constraints for the measurements or headers in reports.
Dimensions Facts
9
Facts and Dimensions
Criteria Fact Attributes Dimension Attributes
Purpose Measurements for analysis Constraints for the measurements
Reporting use Report content Row or column report headers
Data type Most facts are numeric and additive. There are semi-additive or no-additive facts.
Textual, descriptive
Size Larger number of records Smaller number of records
10
Facts and Dimensions
How to identify facts and dimensions? Requirements Analysis:
Analytical requirements: Marketing managers want to know sales performance for different product category in different states?
Information requirements: quantity of product sold, sales amount, product category, and customer states
ER Model
11
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
12
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
F1: Calculation
F: refers to special considerations for fact table or special type of fact table
13
F1: Calculation
Normalization in RDB 1NF 2NF 3NF
Non-volatile property of data warehouse enables DW design to resist normalization and improve query performance.
14
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
D1: Slowly changing dimension
D: refers to special considerations for dimension table or special type of dimension table
15
D1: Slowly changing dimension
Values of attributes in dimension tables may evolve over time. For example, customers moved from one city to another city.
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Tucson
103 Mark Arizona Phoenix
Tom moved from Tucson to Phoenix
Phoenix
16
D1: Slowly changing dimension There are three ways to handle slowly changing dimension. Method 1: Overwrite old values with new values
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Tucson
103 Mark Arizona Phoenix
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Phoenix
103 Mark Arizona Phoenix
17
D1: Slowly changing dimension
Drawbacks of method 1:
Historical information is totally lost.
We will never know that customer 102 lived in Tucson before.
Moreover, when listing sales by city, all the sales of customer 102 will be counted as part of Phoenix sales, although 102 was in Tucson before.
18
D1: Slowly changing dimension Method 2: Add a new attribute to record current value of the changing attribute.
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Tucson
103 Mark Arizona Phoenix
CID CName State Original City Current City
101 Jon Arizona Tucson Tucson
102 Tom Arizona Tucson Phoenix
103 Mark Arizona Phoenix Phoenix
19
D1: Slowly changing dimension
Drawbacks of method 2:
Only partial Historical information (original & current) is kept.
Considering that customer 102 moved from Tucson to Flagstaff then to Phoenix, the customer information of customer 102 only includes Tucson and Phoenix.
20
D1: Slowly changing dimension Method 3: Add a record whenever a dimension attribute changes.
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Tucson
103 Mark Arizona Phoenix
21
D1: Slowly changing dimension
Method 3 keep all the information. However,
Is there any problem?
22
D1: Slowly changing dimension Method 4: warehouse key + method 3 Warehouse key is a sequence of non-negative integers served as primary keys of tables in data warehouse.
CID CName State City
101 Jon Arizona Tucson
102 Tom Arizona Tucson
103 Mark Arizona PhoenixWarehouse key
23
D1: Slowly changing dimension
Why warehouse key is needed in data warehouse?
Solve slowly changing dimension problem
Compared with natural keys (i.e., primary keys of tables in RDB, such as CID of customer table), warehouse keys have high join performance.
24
D1: Slowly changing dimension
Warehouse key
Primary keys in dimensional tables are warehouse keys.
Primary key in fact table is a collection of warehouse keys of all/part of its associated dimensions.
25
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
D1: Slowly changing dimension
Notation: Primary key
26
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
D2: Time Dimension
D: refers to special considerations for dimension table or special type of dimension table
27
D2: Time Dimension Data warehouse needs an explicit time
dimension table instead of just a time attribute (e.g, ORDERDATE).
Besides the time attribute, time dimension table includes the following additional attributes:
Day_of_week (1-7); Day_number_in_month (1-31); Day_number_in_year (1-365) Week_number (1-52); month (1-12), Quarter (1-4) Holiday_flag (y/n) Fiscal_quarter, Fiscal_year
28
D2: Time Dimension
Time dimension can:
Save computation effort and improve query performance
Complex queries regarding calendar calculation are hidden from end users of data warehouse.
29
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
D3: Snowflake
D: refers to special considerations for dimension table or special type of dimension table
30
D3: Snowflake
PRODUCT_CATEGORY# PRODUCT_CATEGORY_KEY* PCID* PCNAME
CUSTOMERTIME
SALES
PRODUCT# PRODUCT_KEY* PID* PNAME* PRODUCT_CATEGORY_KEY
REFERECEREFERENCED BY
REFERENCE
REFERENCED BY
REFERENCE
REFERENCED BY
REFERENCE
REFERENCED BY
Snowflake structure
31
D3: Snowflake
Snowflake structure should be avoided in data warehouse design
Tradeoff of avoiding snowflake
Advantage: improve query performance
Disadvantage: require more storage space