SQLBits 8, 9 th April 2011, Brighton Vincent Rainardi
[email protected] Blog: dwbi1.wordpress.com
Slide 2
Advanced Dimensional Modelling 1. Dimensions - Structure SCD
Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1
Attribute Transaction Level Dimension 2. Fact Tables Fact Table
Primary Key Snapshotting Transaction Fact Tables Aggregate Fact
Tables Vertical Fact Tables 3. Dimensions - Behavior Rapidly
Changing Dimension Very Large Dimensions Banding Dimension Rows
Stamping Dimension Rows Dimensions with Multi Valued Attributes 4.
Combinations Real Time Fact Table Dealing with Currency Rates
Dealing with Status 4 sections: 2 dims, 1 fact, 1 combi. Lots of
material, may not able to finish. 44 slides, some slides we may
have to touch lightly. Questions between sections, available
after.
Slide 3
SCD Type 6 is a combination of Type 1, 2 & 3 e.g. type 2 +
type 1 : DimAccount (telco example) 6 = 1 + 2 + 3 (Ref: Ross &
Kimball 2005, Wikipedia)
http://www.rkimball.com/html/articles_search/articles%202005/0503IE.html
http://en.wikipedia.org/wiki/Slowly_changing_dimension
Business/Natural Key SCD Type 6 1/2
Slide 4
Used for As Was reporting e.g. balances by tariff (price plan)
at the end of last year, if the customers were on todays tariff.
Fact Dim SCD Type 6 Type 12 Natural Key 2/2
Slide 5
1 or 2 dimensions Simplicity, 1 dim Hierarchy from customer
attribute &account attribute Use when we dont have fact tables
requiring customer grain. We can get the customer attributes
without knowing the account key Disadvantage: cant go from account
to customer without going through the fact table - performance 1/4
customer attributes Fact Table Dim Account a) One Dimensionb) Two
Dimensions Fact Table Dim Account Dim Customer
Slide 6
1 or 2 dimensions Dim customer is needed by another fact table
Modular: 2 separate dim tables but we can combine them easily to
create a bigger dimension To get the breakdown of a measure by a
customer attribute is a bit more complicated than a) select c.
attribute, sum(f.measure1) from fact1 f inner join dim_account a on
f.account_key = a.account_key inner join dim_customer c on
a.customer_key = c.customer_key group by c. attribute 2/4 c)
Snowflake Fact Table Dim Account Dim Customer
Slide 7
1 or 2 dimensions Try to fix weakness on b and c: We can go
direct from account dim to customer dim We can access dim customer
directly from the fact table. Weakness: maintain customer key in 2
places: fact table and dim account. a.k.a. Star with a Back Door
3/4 d) Two Dimensions with inter-dimension link Fact Table Dim
Account Dim Customer
Slide 8
1 or 2 dimensions Try to fix weakness of a: unable to build a
fact table with grain = customer. Add a column in dim account:
customer key Not as popular as c) and d) in solving Dim Customer
issue. It is indecisive : trying to create Dim Customer but doesnt
want to create Dim Customer. 4/4 e) One Dimension with Customer Key
Fact Table Dim Account Fact Table Disadvantage: Dim Customer is
hidden inside Dim Account, making it: a) more difficult to maintain
(especially for a type 2), and b) less modular/flexible
Slide 9
When to Snowflake 1. When the sub dim is used by several dims
City-Country-Region columns exist in DimBroker, DimPolicy,
DimOffice and DimInsured Replaced by Location/GeoKey pointing to
DimLocation / DimGeography Advantage: consistent hierarchy, i.e.
relationship between City, Country & Region. Weakness: we would
lose flexibility. City to Country are more or less fixed, but the
grouping of countries might be different between dimensions.
1/3
Slide 10
When to Snowflake 2. When the sub dim is used by both the main
dim and the fact table(s) DimCustomer is used in DimAccount, and is
also used in the fact table. DimManufacturer is used in DimProduct,
and is also used in the fact table. DimProductGroup is used in
DimProduct, and is also used in some fact table. The alternative is
maintaining two full dimensions (star classic). 2/3
Slide 11
4. To enrich a date attribute When to Snowflake 3. To make base
dim and detail dim Insurance classes, account types (banking),
product lines, diagnosis, treatment (health care) Policies for
marine, aviation & property classes have different attributes.
Pull common attributes into 1 dim: DimBasePolicy Put class-specific
attributes into DimMarine, DimProperty, DimAviation Month, Quarter,
Year, etc. Like #1, a sub dim used by several dims. Ref: Kimball DW
Toolkit 2 nd edition page 213 3/3
Slide 12
A dimension with only 1 attribute Reasons for putting single
attribute in its own dim: Keep fact table slim (4 bytes int not 100
bytes varchar) When the value changes, we dont have to update the
BIG fact table ETL performance Grain is much lower than fact table
small dim Yes its only 1 attribute today, but in the future there
could be another attribute. Could become a junk dim. Should we put
the attribute in the fact table? (like DD = Degenerate Dim)
Probably, if the grain = fact table, and its short or its a number.
1/2
Slide 13
A dimension with only 1 attribute Exception: snapshot month (or
day/week/quarter) Snapshot month is used in periodic snapshot fact
table. Snapshot month is in the form of an integer (201104 for
April 2011). Doesnt violate the 3 points above. It is an integer,
not char(6). The value never changes, April 2011 will be April 2011
forever There will not be other attributes in the dim 2/2
Slide 14
Transaction Level Dimension A dim with grain = the transaction
fact table Most granular event in any business process Examples: IT
Helpdesk DW: Dim Ticket Telco DW: Dim Call Banking/Asset Mgt DW:
Dim Trade Insurance DW: Dim Premium Transaction Level Dim
Transaction, not accumulative or periodic snapshot 1/5
Slide 15
Transaction Level Dimension 1.Query Performance DD columns are
moved to a dim, away from the heavy traffic in fact tables. DW
queries dont touch those DD columns unless they need to
performance. DD attributes totalling 30 bytes, replaced by 4 bytes
int column. Slimmer fact table, better for queries. 2.Periodic
Snapshot Fact Table For periodic snapshot fact table, saving is
even greater. Monthly snapshot fact, 10 years / 120 months. Rather
than specifying the DDs repeatedly 120x, they are specified once in
the transaction dim. All that is left on the fact table is a slim 1
int col: the transaction key. Advantages: 2/5
Slide 16
3.Some fact tables have grains greater than the transaction A
payment from a customer is posted into 4 accounts in the GL fact
table. That single financial transaction becomes 4 fact rows but
only has 1 row in the trans dim. Fact table with 10m rows, trans
dim only 3 million rows. 4.Related Transactions Some transactions
are related, e.g. in retail, a purchase of a kitchen might need to
be created as 2 related orders, because the worktop is made-to-
order. Rather than creating a related order column on the fact
tables, it might be better (depends on how its used) to create it
on the trans dim because: a) an order can consist of many fact rows
(1 row per item) so the related order number will be duplicated
across these fact rows b) slimmer fact table c) the transaction
could be on many fact tables, not only one. Transaction Level
Dimension 3/5
Slide 17
Transaction Level Dimension Transaction fact table and the
grain of the trans dim = grain of the fact table, and only 1 DD
column: perhaps better leave the DD in the fact table. Not a lot of
space/speed gain by putting it on trans dim. Mart/DW only used for
SSAS: there is little point of having trans dim physically. In SSAS
we can create the transaction dimension on the fly from the fact
table (fact dimension). Disadvantages/not suitable: Using trans dim
to put attributes as opposed to put them in the main dimensions,
with the argument of: thats the value of the attribute when the
transaction happened this is not right, use type 2 SCD for this.
Main Trans Acct type Location 4/5
Slide 18
Transaction Level Dimension Any dim with grain = fact table
(like trans dim) is questionable Do we really need this dim at this
grain? Perhaps it should be divided into several dims instead? A
dim with grain = fact table - potential performance issue (unless
the fact table is small). e.g. fact table = 10m rows, trans dim =
10m rows. Joining 10m to 10m potentially slow, especially if the
physical ordering of the trans dim is not the joining column.
Disadvantages/not suitable: 5/5
Slide 19
1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To
Snowflake A Dimension with Only 1 Attribute Transaction Level
Dimension 2. Fact Tables Fact Table Primary Key Snapshotting
Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables
3. Dimensions - Behavior Rapidly Changing Dimension Very Large
Dimensions Banding Dimension Rows Stamping Dimension Rows
Dimensions with Multi Valued Attributes 4. Combinations Real Time
Fact Table Dealing with Currency Rates Dealing with Status 25%
Time, Questions
Slide 20
Fact Table Primary Key Should we have a PK? Yes, if we need to
be able to identify each fact row 1. Need to refer to a fact row
from another fact row e.g. chain of events 2. Many identical fact
rows and we need to update/delete only one 3. To link the fact
table to another fact table Some experts totally disagree PK FK (no
RI) PK FK (not enforced) Related Trans Header - Detail PK
Uniqueness previous/next transaction 1/3
Slide 21
Fact Table Primary Key Single or Multi Column? Single Column:
Generated Identity Multi Column: Dimension Keys Single-column PK is
better than multi-column PK because : 1) A multi-column PK may not
be unique. A single-column PK guarantees that the PK is unique,
because it is an identity column. 2) A single-column PK is slimmer
than a multi-column PK, better query performance. To do a self join
in the fact table (e.g. to link the current fact row to the
previous fact row), we join on a single integer column. 2/3
Slide 22
Fact Table Primary Key Advantage: Prevent duplicate rows, query
performance Disadvantage: loading performance Indexing the PK:
cluster or not? Cluster the PK if: the PK is an identity column
Dont cluster the PK if: the PK is a composite, or when you need the
cluster index for query performance (with partitioning) Example of
not having a PK If duplicate fact rows are allowed. e.g. retail DW:
Store Key, Date Key, Product Key, Customer Key Same customer buying
the same milk in the same shop on the same day twice --- Order Line
ID as DD to make it unique (not all EPOS has it) 3/3
Slide 23
Snapshotting Transaction Fact Tables Potentially huge billions
rows Only take what you need Smart date key/month, e.g. 20110409
Monthly or daily Trunc-reload of current month/day Daily (4 wk),
Weekly (1 yr), Monthly (10 yr) Purging & Archiving Load from
staging (cached) Index/partition on snapshot date Trans Snapshot
Staging 1/1
Slide 24
Aggregate Fact Tables What are they? High level aggregation of
base fact tables A select group by query on a 2 billion rows fact
table can take 30 mins if it joins with two big fact tables, even
with indexes in place So we do this query in advance as part of the
DW load and store it as an Aggregate Fact Table The report only
takes 1 second to run. Aggregate Fact Table Base Fact Tables Report
30 mins 1 sec 1/2
Slide 25
Aggregate Fact Tables What For? For report performance (group
by is costly) BO: aggregate aware Not SSAS: aggregate in cubes, not
tables Loading & indexing: Best to load from staging (at the
same time as loading the main fact table) not from the main fact
table (this would be working 2x) Partition for data distribution or
narrow query Indexing: by the main dim keys 2/2
Slide 26
Vertical Fact Tables Normalised 1 measure column The meaning of
that measure column depends on measure type column Used for
Finance/GL mart Advantage: flexibility: using accounts, balance, Dr
Cr Disadvantage: non additive Normal Fact Table Vertical Fact Table
many measures 1 measure (actual & budget) 1/1 Measure Type Dim
Key
Slide 27
1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To
Snowflake A Dimension with Only 1 Attribute Transaction Level
Dimension 2. Fact Tables Fact Table Primary Key Snapshotting
Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables
3. Dimensions - Behavior Rapidly Changing Dimension Very Large
Dimensions Banding Dimension Rows Stamping Dimension Rows
Dimensions with Multi Valued Attributes 4. Combinations Real Time
Fact Table Dealing with Currency Rates Dealing with Status 50%
Time, Questions
Slide 28
Rapidly Changing Dimension Why is it a problem Large SCD2 dim
Attributes change every day Slow query when join with large fact
tables What to do Put into a separate dim, link direct to fact
table. Just store the latest, type 1 attributes (or dual) Store in
the fact table (for small attribute, e.g. indicator) Type2 Type1
Type2 1/1
Slide 29
Very Large Dimension Why is it a problem SSAS: 4 GB string
store limit for dimension SSAS: dim is select distinct on each
attribute long processing time Valid date join on SCD2 for as was
Usually customer dim where the quality stamp changes daily or
because of high number of distinct values Difficult to browse high
cardinality attribute Join with fact tables performance 1/2
Slide 30
What to do Split into 2 dims, same grain. Always cut
vertically. Remove SCD2, or at least only certain columns. Most
common: separate the attributes with high cardinality/change
frequency Bucketing/banding, group values into ranges Very Large
Dimension VLD 2/2
Slide 31
Banding Dimension Rows It is grouping numerical values
(numerical attributes, not measure) into several bands, e.g. engine
size, distance from station, amount purchased (last complete year).
Benefits: easier for analysis & reporting, comparing between
categories. Issue/problem: limit e.g. bucketing criteria 1 hour to
implement, 3 months to argue 1/1
Slide 32
Stamping Dimension Row Calculate internally or buy data from
outside Customer categories (loyalty programme) e.g. A, B, C of
customer class. To reflect c0nsumer interest on the product
(product categorisation based on customer interest level) Any other
dates or measures summarized as stamped attribute, i.e. new
customer, big spender, or results from recommendation
analysis/algorithm e.g. customer behaviour based on previous
purchases. Used for analysis / reporting Stamped Attributes
1/1
Slide 33
Dimensions with Multi Valued Attributes What is a Multi Valued
Attribute? An attribute which has more than 1 value per dimension
row. MV Attribute or MV Dimension? MV Dim = For each fact row there
could be more than 1 dimension row Why do I need to know this? To
be able to model it If wrong, difficult at BI/report 1/4
Slide 34
Dimensions with Multi Valued Attributes Approaches to deal with
MV Attributes 1. Lower the grain of the dim 2. Put the MV
attributes in a separate dim, link direct to the fact table Before
After Before After Fact table requires that the product dimension
is at Product Code grain, e.g. no sales info per size, but only per
product code. Often we dont have the allocation information e.g.
50-50 or 30-70, we only know that product1 has 2 sizes 2/4
Slide 35
Dimensions with Multi Valued Attributes 3. Use a bridge table
to link the 2 dims Fact Table Dim Product Bridge Table Dim Size 4.
Have several columns in the dim for that attribute If the number of
attributes is small and fixed, this is a popular approach. But if
the number of attributes is large (e.g. >10) or if its variable
(e.g. sometimes 2, sometimes 20), approach 2 and 3 above are more
popular, and more appropriate. 3/4
Slide 36
5. Put the attribute in a snowflake sub dim Dimensions with
Multi Valued Attributes We cant really do this, as it is 1 to many
(1 row in the main dim corresponds to many rows in the sub dim). So
we need a bridge table, which brings us back to approach 3. 6. Keep
in one column using delimiters e.g. Small|Medium . A crazy idea.
More flexible than having several columns (approach 4) and simpler
than approach 3 or 2. If the purpose of the attribute is display
only on a report (rather than analyse or slice & dice), there
is an argument for using this approach, particularly if the number
of attributes is small (e.g. 1 to 4). 4/4
Slide 37
Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To
Snowflake A Dimension with Only 1 Attribute Transaction Level
Dimension Fact Tables Fact Table Primary Key Snapshotting
Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables
Dimensions - Behavior Rapidly Changing Dimension Very Large
Dimensions Banding Dimension Rows Stamping Dimension Rows
Dimensions with Multi Valued Attributes Combinations Real Time Fact
Table Dealing with Currency Rates Dealing with Status 75% Time,
Questions
Slide 38
Real Time Fact Table Reporting the transaction system in real
time View to union with the normal fact table, or use partitions
Freezing the dims for key lookup, -3 unknown key Key corrections
next day Real time partition (intraday today) Dims as of yesterday
Main partition (up to last night) 1/1 -1 null in source -2 not in
dim table -3 not in dim table as dim was frozen to be resolved next
batch Unknown keys: dim key
Slide 39
Dealing with Currency Rates What for/background/requirements
Report in 3 reporting currencies, using today rates or past Analyse
over time without the impact of currency rates (using fixed
currency rates, e.g. 2010 EOY rates) Had the transactions happened
today Currency rates historical analysis Transaction Currency DW
Currency Reporting Currency Transaction Rates Reporting Rates (many
transaction dates) ( 1 reporting date) 100 countries 40 currencies
1 currency 3-4 currencies GBP, USD, EUR, Original 1/3 e.g. GBP
Slide 40
Dealing with Currency Rates Approaches Store in original
currencies, convert to DW currency at runtime. Or convert at load,
store in DW currency inaccuracy. Or store in both original and DW
currency Currency rate fact table (date, currency, rate) Or store
rates in the fact table On report/cube: date input at run time
(default = today) Fact Tables FX Fact Table 2/3 Rate
Slide 41
Dealing with Currency Rates Concept of FX Rate Type/Profile in
original currency, DW currency or both 3/3
Slide 42
Dealing with Status What/background Workflow (policies,
contracts, documents) Bottleneck analysis (no of days between
stages) How many on each stage Status 1 Status 3 Status 4 Status 5
Status 6Status 2 date1 date4 date3 date2 1/2
Slide 43
Dealing with Status Approaches Accumulative Snapshot Fact, 1
row per application SCD2 on DimApp App Status fact table 2/2
Slide 44
Thanks Email: [email protected] Blog:
http://dwbi1.wordpress.com Covers many of the topics in this
presentation This PowerPoint: in my blog, scroll to bottom, click
on SQLBit8 Special thanks to Guang Ming Xing and Simon Jensen who
helped reviewing this presentation and provided useful comments
(doesnt mean that they agree with the content)