DATA WAREHOUSE DATA MODELLING, SQLbits IV, Manchester, 28th March 2009, Vincent Rainardi


DATA WAREHOUSE DATA MODELLING

SQLbits IV, Manchester

28th March 2009

Vincent Rainardi

Vincent Rainardi
• Data warehousing & BI
• Data warehousing book on SQL Server
• Data warehousing articles in SQLServerCentral.com
• [email protected]

About you
• Data warehousing
• Data modelling
• Dimensional modelling

Data Warehouse Data Modelling

• What is it
• Why is it important
• How to do it (case study)
• Miscellaneous topics (time permitting)
• Questions

Data Warehouse

A data warehouse is a system that periodically retrieves and consolidates data from source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Data Store

• Flat files
• Cubes
• Database
• Relational
• Normalised
• Denormalised
• Dimensional
• Flat

• Stage
• Operational Data Store (ODS)
• Normalized Data Store (NDS)
• Dimensional Data Store (DDS)
• Multi-dimensional Database (MDB)
• Metadata
• Data Quality
• Standing Data

Data Model

• Defines how the data is arranged within the data store
• Defines the relationships between entities (elements)

The data model most appropriate for a data store depends on the function of the data store. For example, should the Stage or the ODS be dimensional, normalised or flat?

Dimensional
• Particular business events
• Query oriented
• Large data packets
• Multiple versions
• Analytics

Normalised
• All business events
• Efficient to update
• Small data packets
• Single version
• Operational

Why is it important

• Functionality: it defines the data warehouse, i.e. what's available and what's not
• Foundation on which ETL, DQ, reports and cubes are built, so it is costly to rectify
• Performance: loading and query

[Figure: the data model in the centre, with ETL, reports, cubes and DQ built on it]

Case Study: Valerie Media Group

Publish and send newsletters, articles, white papers, news alerts
• Daily, weekly, monthly
• IT, travel, health care, consumer retail (Business Unit)
• Email, RSS, text, web site

Publications are managed by business units.
Customers subscribe via agencies.

The business needs to analyze subscriptions by: customer demographic, publication type, media and cost

Business Events

• Event 1: A customer subscribes via an agent to a publication issued by a business unit to be delivered via a certain media
• Event 2: A business unit sends a certain edition of a publication to 2M subscribers via a certain network, on a certain media
• Other events: customer payment/refund, renewal, publish a new pub, deactivate/reactivate a pub, change email address, agency payment, cancel subscription, ...

Source System

[Figure: source system diagram]

Star Schema

[Figure: a fact table in the centre, linked to six dimension tables]

• Dimensional model, aka the Kimball method
• Query performance (OLAP) and flexibility
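As a rough illustration of the star shape, the sketch below builds one fact table and two dimensions in an in-memory SQLite database (standing in for SQL Server) and runs a typical star join. The table names, column names and sample rows are assumptions made up for the example; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer    (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_publication (publication_key INTEGER PRIMARY KEY, title TEXT);
-- The fact table sits in the centre and references each dimension.
CREATE TABLE fact_subscription (
    customer_key    INTEGER REFERENCES dim_customer(customer_key),
    publication_key INTEGER REFERENCES dim_publication(publication_key),
    fee             REAL
);
INSERT INTO dim_customer    VALUES (1, 'Andy'), (2, 'Beth');
INSERT INTO dim_publication VALUES (1, 'IT Weekly'), (2, 'Travel Monthly');
INSERT INTO fact_subscription VALUES (1, 1, 10.0), (2, 1, 10.0), (2, 2, 25.0);
""")

def fee_by_publication(conn):
    """Star join: aggregate a measure, grouped by a dimension attribute."""
    return conn.execute("""
        SELECT p.title, SUM(f.fee)
        FROM fact_subscription AS f
        JOIN dim_publication AS p ON p.publication_key = f.publication_key
        GROUP BY p.title
        ORDER BY p.title
    """).fetchall()

print(fee_by_publication(conn))
```

Every analytical query follows the same pattern: start from the fact table, join out to whichever dimensions are needed, and group by their attributes.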

Steps

1. Identify event, dimensions, measures
2. Define grain
3. Add attributes and measures
4. Add natural keys
5. Add surrogate keys
6. Add role-playing dimensions
7. Add degenerate dimensions
8. Add junk dimensions
9. Add fact key

Event, Dimension, Measure

Subscription Event

• Event: a point in the business process
  A customer subscribes via an agent to a publication issued by a business unit to be delivered via a certain media
• Dimension: party/object involved in the event
  The who, what, whom (+ when, where): customer, publication, BU, media, agent
• Measure: the amount in the event
  unit, fee, discount, paid

Dimensions

[Figure: the Subscription fact table surrounded by its dimensions: Date, Media, Customer, Agent, Publication, Business Unit]

Grain: a row in this fact table corresponds to a customer subscribing to a publication

Attributes & Measures

Grain: a customer subscribes to a publication

• Customer: Customer Name, Address, Email Address, Registration Date, ...
• Agent: Agent Name, Category, Fee Type, Active Subscribers, ...
• Publication: Publication Title, Frequency, Editor, First Edition Date, ...
• Business Unit: Short Name, Industry, Manager, ...
• Media: Media Code, Media Name, Format, ...
• Date: Date, Month, Year, ...
• Subscription (fact): Unit, Fee, Discount, Paid

Natural Key

The primary key in the source system

• Customer: Customer ID, Customer Name, Address, Email Address, Registration Date
• Agent: Agent ID, Agent Name, Category, Fee Type, Active Subscribers
• Publication: Publication ID, Publication Title, Frequency, Editor, First Edition Date
• Business Unit: Business Unit ID, Short Name, Industry, Manager
• Media: Media Code, Media Name, Format
• Date: Date, Month, Year
• Subscription (fact): Unit, Fee, Discount, Paid

Surrogate Keys

• Multiple sources
• Change of natural key
• Maintain history
• Unknown, N/A, Late Arriving
• Performance

• Integer
• Identity
• 0, -1
• Dim PK
• Clustered index
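A minimal sketch of how these points might fit together, assuming SQLite's `INTEGER PRIMARY KEY` as a stand-in for a SQL Server identity column. The convention that 0 means Unknown and -1 means Not Applicable is an assumption (the slide lists the two values without saying which is which), and `add_customer` and `lookup_customer_key` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key, assigned by the database
    source_system TEXT,                 -- which source system the row came from
    customer_id   TEXT,                 -- natural key within that source system
    customer_name TEXT
);
-- Reserved rows (assumed convention): 0 = Unknown, -1 = Not Applicable
INSERT INTO dim_customer VALUES (0, NULL, NULL, 'Unknown');
INSERT INTO dim_customer VALUES (-1, NULL, NULL, 'Not Applicable');
""")

def add_customer(conn, source_system, customer_id, name):
    """Insert a dimension row and return the surrogate key assigned to it."""
    cur = conn.execute(
        "INSERT INTO dim_customer (source_system, customer_id, customer_name) "
        "VALUES (?, ?, ?)",
        (source_system, customer_id, name),
    )
    return cur.lastrowid

def lookup_customer_key(conn, source_system, customer_id):
    """Translate a natural key into a surrogate key; fall back to 0 (Unknown)."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer "
        "WHERE source_system = ? AND customer_id = ?",
        (source_system, customer_id),
    ).fetchone()
    return row[0] if row else 0

# The same natural key arriving from two sources gets two distinct surrogate keys.
k_crm = add_customer(conn, "crm", "C001", "Andy")
k_billing = add_customer(conn, "billing", "C001", "Andrew")
print(k_crm, k_billing, lookup_customer_key(conn, "crm", "C999"))
```

The fallback to the Unknown row is what lets a fact row load even when its customer has not yet arrived in the dimension.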

Result

[Figure: resulting data model diagram]

What Date?

Role-playing dimension
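One way to picture a role-playing dimension: a single date dimension joined to the fact table more than once, each time under a different alias. The sketch below assumes two date roles, `subscribe_date_key` and `start_date_key`; these roles and all sample values are invented for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT);
CREATE TABLE fact_subscription (
    subscribe_date_key INTEGER REFERENCES dim_date(date_key),  -- role 1 (assumed)
    start_date_key     INTEGER REFERENCES dim_date(date_key),  -- role 2 (assumed)
    fee                REAL
);
INSERT INTO dim_date VALUES (1, '2009-03-28', 'March'), (2, '2009-04-01', 'April');
INSERT INTO fact_subscription VALUES (1, 2, 10.0);
""")

# The same physical table plays two roles by being aliased twice.
rows = conn.execute("""
    SELECT sub.month_name AS subscribed_in, st.month_name AS starts_in, f.fee
    FROM fact_subscription AS f
    JOIN dim_date AS sub ON sub.date_key = f.subscribe_date_key
    JOIN dim_date AS st  ON st.date_key  = f.start_date_key
""").fetchall()
print(rows)
```

In a reporting layer the two roles are often exposed as separate views over the one date table, so that users see distinctly named columns.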

Degenerate Dimension

The identifier (PK) of a transaction table

Junk Dimension

Low cardinality
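A junk dimension is commonly built by combining several low-cardinality attributes into one small table, with one row per combination of values. The sketch below shows that idea in plain Python; the three attributes and their values are hypothetical, chosen only to make the row count easy to check.

```python
from itertools import product

# Hypothetical low-cardinality attributes; names and values are made up.
flags = {
    "is_trial": ["Y", "N"],
    "payment_method": ["Card", "Invoice", "Direct Debit"],
    "auto_renew": ["Y", "N"],
}

def build_junk_dimension(attributes):
    """Return one row per combination of values, each with an integer surrogate key."""
    names = list(attributes)
    rows = []
    for key, combo in enumerate(product(*attributes.values()), start=1):
        rows.append({"junk_key": key, **dict(zip(names, combo))})
    return rows

junk_dim = build_junk_dimension(flags)
print(len(junk_dim))   # 2 * 3 * 2 combinations
print(junk_dim[0])
```

The fact table then carries a single `junk_key` column instead of three separate flag columns or three tiny dimensions.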

Fact Key

• To enable referring to a fact table row
• SQL Server: clustered index

• Identity
• Bigint

Result

[Figure: resulting data model diagram]

So Far
• Event, Dimensions, Measures
• Grain
• Attributes & Measures
• Natural Keys
• Surrogate Keys
• Role-playing Dimension
• Degenerate Dimension
• Junk Dimension
• Fact Key

Next
• Slowly Changing Dimension
• Snowflake

Slowly Changing Dimension

Type 1: Overwrite old values

Before:
Key  Name  Email
1    Andy  [email protected]

After:
Key  Name  Email
1    Andy  [email protected]

Type 2: Create a new row (keep old values)

Before:
Key  Name  Email
1    Andy  [email protected]

After:
Key  Name  Email
1    Andy  [email protected]
2    Andy  [email protected]

Type 3: Put old values in another column

Before:
Key  Name  Email
1    Andy  [email protected]

After:
Key  Name  Email             Previous Email
1    Andy  [email protected]  [email protected]
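The three types can be sketched as three small functions over a list of dimension rows. This is an illustration in plain Python rather than an ETL implementation; the email addresses are placeholders (the originals are not legible in this transcript), and the function names are hypothetical.

```python
def scd_type1(rows, name, new_email):
    """Type 1: overwrite the old value in place; no history is kept."""
    for row in rows:
        if row["name"] == name:
            row["email"] = new_email
    return rows

def scd_type2(rows, name, new_email):
    """Type 2: keep the old row and append a new row with a new surrogate key."""
    next_key = max(row["key"] for row in rows) + 1
    rows.append({"key": next_key, "name": name, "email": new_email})
    return rows

def scd_type3(rows, name, new_email):
    """Type 3: move the old value into a 'previous' column, then overwrite."""
    for row in rows:
        if row["name"] == name:
            row["previous_email"] = row["email"]
            row["email"] = new_email
    return rows

def before():
    """A fresh copy of the starting dimension (placeholder email)."""
    return [{"key": 1, "name": "Andy", "email": "andy@old.example"}]

print(scd_type1(before(), "Andy", "andy@new.example"))
print(scd_type2(before(), "Andy", "andy@new.example"))
print(scd_type3(before(), "Andy", "andy@new.example"))
```

Only type 2 changes the number of rows, which is why it is the one that needs a surrogate key distinct from the natural key.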

Slowly Changing Dimension Type 2

Key  Name  Email             Valid From  Valid To    Current
1    Andy  [email protected]  1900-01-01  2009-03-27  N
2    Andy  [email protected]  2009-03-28  9999-12-31  Y

• Valid From & Valid To (a.k.a. Effective Date & Expiry Date)
  • To put the right surrogate key in the fact table
  • Datetime (not date)
• Current Flag: to query the current version

Not all attributes are type 2:
• Attribute 1, 2, 3: type 1 (update)
• Attribute 4, 5, 6: type 2 (new row)
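A sketch of how the validity columns might be used when loading a fact row: find the dimension version whose validity interval contains the transaction time. Several details here are assumptions rather than anything stated on the slide: the intervals are treated as half-open (`valid_from <= t < valid_to`), the change is given a made-up time of 09:30 on 2009-03-28 to show why datetime precision matters, the emails are placeholders, and 0 is used as the Unknown key.

```python
from datetime import datetime

CHANGE_TIME = datetime(2009, 3, 28, 9, 30)  # assumed moment the email changed

dim_customer = [
    {"key": 1, "name": "Andy", "email": "andy@old.example",
     "valid_from": datetime(1900, 1, 1), "valid_to": CHANGE_TIME, "current": "N"},
    {"key": 2, "name": "Andy", "email": "andy@new.example",
     "valid_from": CHANGE_TIME, "valid_to": datetime(9999, 12, 31), "current": "Y"},
]

def lookup_key(rows, name, event_time):
    """Return the surrogate key of the version valid at event_time, else 0 (Unknown)."""
    for row in rows:
        if row["name"] == name and row["valid_from"] <= event_time < row["valid_to"]:
            return row["key"]
    return 0

def current_version(rows, name):
    """Return the row flagged as current for this name, or None."""
    for row in rows:
        if row["name"] == name and row["current"] == "Y":
            return row
    return None

# Two transactions on the same day land on different versions.
print(lookup_key(dim_customer, "Andy", datetime(2009, 3, 28, 8, 0)))
print(lookup_key(dim_customer, "Andy", datetime(2009, 3, 28, 12, 0)))
print(current_version(dim_customer, "Andy")["key"])
```

With date-only columns the two same-day transactions could not be told apart, which is one reading of the "Datetime (not date)" note.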

Snowflake

[Figure: a fact table linked to six main dimensions, which in turn link to further dimension tables]

Example: product, product group, product category
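The product example can be sketched as three linked tables, again using in-memory SQLite as a stand-in. All table names, column names and sample rows are assumptions for illustration, including the `fact_sales` table used to show the query path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_group (
    group_key INTEGER PRIMARY KEY, group_name TEXT,
    category_key INTEGER REFERENCES dim_product_category(category_key));
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    group_key INTEGER REFERENCES dim_product_group(group_key));
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key), amount REAL);

INSERT INTO dim_product_category VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO dim_product_group VALUES (1, 'Laptops', 1), (2, 'Monitors', 1), (3, 'Databases', 2);
INSERT INTO dim_product VALUES (1, 'Laptop A', 1), (2, 'Monitor B', 2), (3, 'Database C', 3);
INSERT INTO fact_sales VALUES (1, 1000.0), (2, 200.0), (3, 500.0);
""")

def sales_by_category(conn):
    """Walk fact -> product -> group -> category: one extra join per snowflake level."""
    return conn.execute("""
        SELECT c.category_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product          AS p ON p.product_key  = f.product_key
        JOIN dim_product_group    AS g ON g.group_key    = p.group_key
        JOIN dim_product_category AS c ON c.category_key = g.category_key
        GROUP BY c.category_name
        ORDER BY c.category_name
    """).fetchall()

print(sales_by_category(conn))
```

In a star schema the group and category names would instead be columns on the product dimension, trading the extra joins for some repeated values.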

Miscellaneous Topics

• What is it
• Why is it important
• How to do it
• Miscellaneous topics
  • Smart Date Key
  • Dimensional Grain
  • Real Time Fact Table
• Questions

Smart Date Key

8-digit integer: YYYYMMDD

Why use a smart date key?
• Fact table partitioning
• Reference dimension
• Measure group partition
• No lookup (everywhere)

Why not? The usual reasons for a surrogate key, each marked X on the slide:
• Multiple sources X
• Change of natural key X
• Maintain history X
• Unknown, N/A, Late Arriving X
• Performance X

Unknown date?
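The YYYYMMDD encoding is simple arithmetic, which is why no dimension lookup is needed to produce it. A small sketch follows; using 0 as the key for an unknown date is an assumption, since the slide raises the question without answering it, and the function names are hypothetical.

```python
from datetime import date

UNKNOWN_DATE_KEY = 0  # assumed convention; the slide leaves "Unknown date?" open

def to_smart_date_key(d):
    """Encode a date as an 8-digit integer YYYYMMDD; None maps to the unknown key."""
    if d is None:
        return UNKNOWN_DATE_KEY
    return d.year * 10000 + d.month * 100 + d.day

def from_smart_date_key(key):
    """Decode a YYYYMMDD integer back into a date; the unknown key maps to None."""
    if key == UNKNOWN_DATE_KEY:
        return None
    return date(key // 10000, key // 100 % 100, key % 100)

print(to_smart_date_key(date(2009, 3, 28)))
print(from_smart_date_key(20090328))
```

Because the keys sort in calendar order, a range of dates corresponds to a range of integers, which is what makes them convenient for partitioning.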

Dimension Grain

• Dim Product Line: 2 attributes, product_key
• Dim Product: 10 attributes, product_grp_key
• Dim Product Group: 5 attributes

Combine into 1 dimension?

[Figure: snowflake versus star. Fact 1, Fact 2 and Fact 3 join at the Product Line (PL), Product (P) and Product Group (PG) levels. Snowflake: 3 tables, linked FK-PK, with 2, 10 and 5 attributes. Star: 17, 15 and 5 attributes.]

3 tables:
• Different surrogate keys
• More flexible (attributes)

1 table with 3 views:
• Same surrogate keys
• Simpler load

Real Time Fact Table

Updated every time a transaction happens in the source system

• Depends on frequency: telco, retail, insurance, utilities, CRM
• 1-2 fact tables only: transactional, narrow tables
• Stored in natural keys: look up SK on query

• Today's transactions only
• Stored in surrogate keys
• Limited dim updates -> unknown SK
• Heap
• Union with main fact table on query
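The second approach (a small table holding today's transactions, unioned with the main fact table at query time) might look roughly like the sketch below. The table and view names, the sample rows, the use of key 0 for a customer not yet in the dimension, and the `end_of_day` step are all assumptions; SQLite has no direct equivalent of a SQL Server heap, so the real-time table simply has no indexes declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales       (date_key INTEGER, customer_key INTEGER, amount REAL);
-- Today's transactions only; no indexes declared, to keep inserts cheap.
CREATE TABLE fact_sales_today (date_key INTEGER, customer_key INTEGER, amount REAL);

-- Queries read from the view, so they see history and today's rows together.
CREATE VIEW v_fact_sales AS
    SELECT date_key, customer_key, amount FROM fact_sales
    UNION ALL
    SELECT date_key, customer_key, amount FROM fact_sales_today;

INSERT INTO fact_sales VALUES (20090327, 1, 10.0), (20090327, 2, 25.0);
-- customer_key 0 = unknown: the customer is not in the dimension yet (assumed)
INSERT INTO fact_sales_today VALUES (20090328, 1, 10.0), (20090328, 0, 5.0);
""")

def total_amount(conn):
    """Sum the measure across the main and real-time tables via the view."""
    return conn.execute("SELECT SUM(amount) FROM v_fact_sales").fetchone()[0]

def end_of_day(conn):
    """Move today's rows into the main fact table and empty the real-time table."""
    with conn:  # both statements commit together
        conn.execute("INSERT INTO fact_sales SELECT * FROM fact_sales_today")
        conn.execute("DELETE FROM fact_sales_today")

total_before = total_amount(conn)
end_of_day(conn)
total_after = total_amount(conn)
print(total_before, total_after)
```

A real batch would also resolve the unknown surrogate keys once the dimensions have been refreshed; that step is omitted here.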

Questions

• Event, dimensions, measures
• Grain
• Attributes and measures
• Natural keys
• Surrogate keys
• Role-playing dimensions
• Degenerate dimensions
• Junk dimensions
• Fact key
• Slowly Changing Dimension
• Snowflake
• Smart Date Key
• Dimensional Grain
• Real Time Fact Table

Further Resources

• Kimball & Ross: Data Warehouse Toolkit
• Imhoff, Galemmo, Geiger: Mastering Data Warehouse Design
• Kimball Group's articles: www.kimballgroup.com
• Kimball Forum: forum.kimballgroup.com