The Data Warehouse Designer's School of Hard Knocks
A Graduate's Perspective
David Stanford, Sr. Vice President
Cognicase Inc.
[email protected]
International Oracle Users Group, April 15, 2002
Objectives
• Obtain a clear understanding of data warehouse design 'hot points'
• Identify solutions and alternatives for these 'hot points'
• See how real-world solutions are implemented
Agenda
• Top 10 Gotchas
• Warehouse Design
• Surrogate Keys
• Tracking History
• Row-Level Security in the Warehouse
• BAM Rules, Audit, and Administrative Fields
• Other Tidbits of Advice
Dave's Top 10 Gotchas
1. Failing to model for both a) the view of the data when the event occurred and b) the view of the data as of today's reality
2. Limiting the number of dimensions
3. Failing to model and populate a meta data repository
4. Failing to provide sufficient audit capabilities to verify loads against source systems
5. Not using surrogate keys for everything
Dave's Top 10 Gotchas (cont'd)
6. Failing to design an error correction process
7. Normalizing too much
8. Not using a staging area
9. Failing to load ALL of the fact data
10. Failing to classify incorrect data
Overall Architecture
• At the 20,000-foot level we must decide on the use of:
– The Operational Data Store (ODS)
– The Staging Area
– The Data Warehouse proper
– The Data Mart(s)
• All of which together are being coined "The Corporate Information Factory" (CIF)
Design Considerations
• The ODS vs. the Warehouse
• The ODS vs. the Staging Area
• The Warehouse vs. the Data Mart
• The lines become blurred
• Plan for the world, design for the future, and build for today
Not Designing For The Future…
[Diagram: source OLTP systems each feeding their own stove-pipe data mart (HR, Sales, Management, Manufacturing) directly to users]
Basic Data Warehouse Architecture
[Diagram: Source OLTP Systems → Staging Area → Data Warehouse]
Data Mart DW Architecture
[Diagram: Source OLTP Systems → Staging Area → Data Warehouse → Data Marts. Source: Enterprise Group]
Data Warehouse Process
[Diagram: Source OLTP Systems → Staging Area → Data Warehouse → Data Marts, with Meta Data and System Monitoring spanning all stages. Source: Enterprise Group]
Process steps:
• Design, Mapping
• Extract, Scrub, Transform
• Load, Index, Aggregation
• Replication, Data Set Distribution
• Access & Analysis; Resource Scheduling & Distribution
Data characteristics by stage:
• Source: raw detail, no/minimal history
• Staging: integrated, scrubbed
• Warehouse: history, summaries
• Marts: targeted, specialized (OLAP)
Where The Work Is
[Diagram: the same data warehouse process flow, with a callout "Over 80% of the work is here" pointing at the staging-area steps (extract, scrub, transform). Source: Enterprise Group]
Data Warehouse Architecture
[Diagram: components of a data warehousing architecture – source databases (legacy, ERP, e-commerce, relational, external) → data extraction → staging area → data transformation & load → central data warehouse → architected data marts (RDBMS and MDB) → data access and analysis. Supporting components: data modeling tool, ETL tool, data cleansing tool, warehouse administration, and central meta data with metadata exchange to local metadata at each mid-tier data mart]
The Staging Area
• Holds a mirror copy of the extract files
• Allows pre-processing of the data before loading
• Allows easier reloading (you WILL do this)
• Keeps more control with the DW team, rather than an external group (the extract team)
• Facilitates easier audit processes
• Can facilitate error correction processes
Modelling Is Not Straightforward
[Diagram: a Donation fact surrounded by Member, Income, Campaign, Time, Gender, Marital Status, Location, and Age dimensions]
Should These Dimensions Be Combined?
[Same diagram: the Donation fact with Member, Income, Campaign, Time, Gender, Marital Status, Location, and Age dimensions]
The 10-Step Process – Data Model Design
1. Identify major subject areas or topics
2. Add the element of time to the tables
3. Create appropriate names for tables, columns, and views
4. Add derived fields where applicable
5. Add administrative fields
6. Consider security and privacy in design
7. Make sure the data model answers the critical business questions
8. Consider meta data
9. Consider error correction
10. Performance considerations: Tune, Tune, Tune
Independent of Approach…
…the goal of the data model is to satisfy two primary criteria:
1. Meet business objectives
2. Provide good performance
Warehouse Design
• Normalized (Relational) Design
• Dimensional Design
• Hybrid Design
• Behind the Scenes
Normalized/Relational Schema
• Usually as normalized as possible
• Used mostly in OLTP databases
• Uses entities and relations to describe data
• Fast for inserts and updates
The Star Schema
• Used in OLAP (BI) and DWH
• Uses FACT and DIMENSION tables
• Normalized FACT table
• Denormalized dimensions
Snowflake Schema
• Contains FACT and DIMENSION tables
• A dimension table can be the FACT table of another STAR
• Dimension hierarchies are normalized
Hybrid
• In reality, the DW is more normalized but has elements of dimensional design
• The data marts are star schemas but have elements of normalization
Behind The Scenes
• There are several aspects of a design that users don't directly see:
– Meta Data
– Error Correction
– Audit
– Load Control (if not using a scheduling tool)
– Transformation Tables (used for transforming the data prior to loading into the DW)
Behind The Scenes
[Diagram: Source OLTP Systems → Staging Area → Data Warehouse → Data Marts, with Error Correction, Meta Data, Audit, Load Control, and Transform Tables alongside]
Surrogate Keys
• A surrogate key is a single-column, unique identifier for each row within a table
• Always use surrogate keys for dimensions
• Always use surrogate keys for the time dimension
• Always use surrogate keys for facts
• Always use surrogate keys for transformation tables
• Always use surrogate keys for EVERY table
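The mechanics behind this advice can be sketched in a few lines. The following is an illustrative sketch (not code from the talk), assuming a simple in-memory key map; a real ETL process would persist the (source system, business key) → surrogate mapping in a transformation table. All names here are hypothetical.

```python
# Illustrative surrogate-key assignment during a dimension load.
# Business keys from any source system map to a warehouse-generated integer,
# so duplicate or recycled source keys cannot collide in the warehouse.

def make_key_generator(start=1):
    """Return a lookup that maps (source_system, business_key) -> surrogate key."""
    key_map = {}
    counter = [start]

    def lookup_or_assign(source_system, business_key):
        bk = (source_system, business_key)
        if bk not in key_map:
            key_map[bk] = counter[0]   # mint a new surrogate
            counter[0] += 1
        return key_map[bk]

    return lookup_or_assign

surrogate = make_key_generator(start=100)
k1 = surrogate("HR", "EMP-001")
k2 = surrogate("SALES", "EMP-001")   # same literal key, different source system
k3 = surrogate("HR", "EMP-001")      # repeat lookup returns the same surrogate
```

Because the warehouse never reuses a source system's own key, an application upgrade that renumbers its primary keys only changes the mapping table, not the warehouse.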
Surrogate Keys Avoid…
• Duplicate keys from different source systems
• Recycling of primary keys
• Use of the same key for different business rows
• Lengthy composite-key joins
• Space in fact tables
• Application changes or upgrades in source systems
Using Surrogates In Fact Tables
• You will need a surrogate key on the fact table if you allow 'unknown' values into the fact table (which is recommended, by the way)
• The PK of a fact is typically the combination of the base dimension keys
Surrogates In Fact Tables
[Diagram: FCT_CLAIMS with foreign keys (all NUMBER(10,0)) to DIM_DATES_OF_FIRST_SERVICE, DIM_ICD9_PRIMARY_DIAGNOSES, DIM_BENEFIT_PACKAGES, DIM_MEMBERS, DIM_SERVICE_PROVIDERS, DIM_CONTRACTS, and DIM_PRODUCTS; the primary key is the combination of the dimension keys]
Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract: 123456789
Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: Broken Arm
Amount: $123.34
Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract: 123456789
Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: MISSING (Broken Arm)
Amount: $123.34
Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract: 123456789
Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: MISSING (Heart Attack)
Amount: $16,239.00
• This results in a duplicate primary key in the table
Surrogates In Fact Tables
[Diagram: FCT_CLAIMS now keyed by a single surrogate, Claim_Line_Key: NUMBER(10,0), surrounded by the same dimension tables]
• Thus the need for a surrogate primary key
Tracking History in Dimensions
• Type 1 – No history
• Type 2 – All history
• Type 3 – Some history
Type 1 – No History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms., Date 01-Jan-2001
Type 1 – No History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #2: Key 100, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs., Date 15-Mar-2001 (the existing row, Key 100, is simply overwritten)
Type 2 – All History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms., Date 01-Jan-2001
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #2: Key 101, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs., Date 15-Mar-2001
Type 3 – Some History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Original Salutation Ms., Salutation Ms., Date 01-Jan-2001
Type 3 – Some History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Original Salutation Ms., Salutation Mrs., Date 15-Mar-2001
More Dimension Types…Combinations
• Type 3 Prime – Types 1 and 2 (the most common)
• Type 4 – Types 1 and 3
• Type 5 – Types 2 and 3
• Type 6 – Types 1, 2, and 3 (the second most common)
Trigger Fields
• Trigger fields are fields within a table for which you want to track history
• Non-trigger fields are those for which you do not want to track history
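The trigger-field idea is what drives the combined types on the next slides. A minimal sketch (not the author's code, assuming hypothetical table and field names): a change in any trigger field inserts a new row with a new surrogate key (Type 2); a change limited to non-trigger fields overwrites the current row in place (Type 1).

```python
# Hypothetical Type 3 Prime loader driven by trigger fields.

TRIGGER_FIELDS = {"Salutation"}   # fields whose history we keep (illustrative)

def apply_transaction(rows, next_key, txn, load_date):
    """rows: warehouse rows for one business Id, newest last. Returns next free key."""
    current = rows[-1] if rows else None
    if current is None:
        rows.append({**txn, "Key": next_key, "Date": load_date})
        return next_key + 1
    changed = {f for f in txn if txn[f] != current.get(f)}
    if changed & TRIGGER_FIELDS:
        # Type 2: a trigger field changed -> new row, new surrogate key
        rows.append({**txn, "Key": next_key, "Date": load_date})
        return next_key + 1
    # Type 1: only non-trigger fields changed -> overwrite in place
    current.update(txn)
    return next_key

rows, next_key = [], 100
t1 = {"Id": 1, "Name": "Sandy Rubble", "Address": "23 Boulder Rd",
      "City": "Bedrock", "Salutation": "Ms."}
t2 = {"Id": 1, "Name": "Sandy Rubble", "Address": "42 Slate Ave",
      "City": "GravelPit", "Salutation": "Mrs."}
next_key = apply_transaction(rows, next_key, t1, "01-Jan-2001")
next_key = apply_transaction(rows, next_key, t2, "15-Mar-2001")
```

With `Salutation` as the only trigger field, the second transaction creates a second row; had only the address changed, the first row would have been overwritten instead.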
Type 3 Prime – All and No History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Type 3 Prime – All and No History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Ms.
Warehouse Record #2: Key 100, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Ms., Date 15-Mar-2001 (only non-trigger fields changed, so the row is overwritten in place)
Type 3 Prime – All and No History
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms., Date 01-Jan-2001
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #2: Key 101, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs., Date 15-Mar-2001 (a trigger field changed, so a new row is created)
Expect To Track Everything
• Users want to view the data as it was when the transaction or event occurred
AND…
• Users want to view the data in the context of today's realities
THUS, model for both!
Add 'Current' Columns
• In order to provide these two views, consider adding 'current' columns to tables. This is a special Type 6.
• These fields get updated in historical records when a trigger field changes value in the current record.
• This simplifies the use of the DW by the users
• It's easier to understand than having to write complex SQL
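The maintenance step can be sketched as follows. This is an illustrative sketch (assuming hypothetical column names, not the author's implementation): after the Type 2 insert, the load pushes the newest trigger-field values back onto every historical row for that business key.

```python
# Hypothetical 'current column' maintenance (special Type 6): after a trigger
# field changes, every historical row for the business key has its
# Current_Salutation refreshed, so users can query today's reality without
# writing self-join SQL against the dimension.

def refresh_current_columns(dim_rows, business_id, current_values):
    for row in dim_rows:
        if row["Id"] == business_id:
            for col, val in current_values.items():
                row[col] = val

dim_rows = [
    {"Key": 100, "Id": 1, "Salutation": "Ms.",  "Current_Salutation": "Ms."},
    {"Key": 101, "Id": 1, "Salutation": "Mrs.", "Current_Salutation": "Mrs."},
]
# The Mar-2001 load saw Salutation change to Mrs.; push it to history rows too.
refresh_current_columns(dim_rows, 1, {"Current_Salutation": "Mrs."})
```

Note that the as-at-event value (`Salutation`) on the historical row is left untouched; only the 'current' column is rewritten.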
Type 6 – All, Some, and No History, Ex #1
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms., Current Sal'n Mrs., Date 01-Jan-2001
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #2: Key 101, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs., Current Sal'n Mrs., Date 15-Mar-2001
Type 6 – All, Some, and No History, Ex #2
Source Transaction #1: Id 1, Name Sandy Rubble, Address 23 Boulder Rd, City Bedrock, Salutation Ms.
Source Transaction #2: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Ms.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Ms., Current Sal'n Ms., Date 01-Jan-2001
Type 6 – All, Some, and No History, Ex #2 (cont'd)
Source Transaction #3: Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs.
Warehouse Record #1: Key 100, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Ms., Current Sal'n Mrs., Date 01-Jan-2001
Warehouse Record #2: Key 101, Id 1, Name Sandy Rubble, Address 42 Slate Ave, City GravelPit, Salutation Mrs., Current Sal'n Mrs., Date 15-Mar-2001
History Tracking – Some Closing Thoughts
• Double keying – Slowly Changing Dimensions (SCDs)
– Consider adding a second surrogate key for the business keys
– Only if you know you have volatile, multiple source systems
• Rapidly Changing Dimensions (RCDs) need to be partitioned
– Use Oracle partitioning
– Include the native partition key in the dimension
– Or split into several tables
Row-Level Security
• Three key pieces of data are required:
– Who are the users?
– What is their relationship to the system?
– What is their relationship to each other?
• Combine these three pieces of information and you have the key to row-level security
An Example … The Users Table
• Identifies who the users are and their relationship to the DW
User Id   Broker #   Broker/MGA Name
Fred 1 Broker #1
Barney 1 Broker #1
Wilma 2 Broker #2
Betty 2 Broker #2
Joe 3 Broker #3
Slate 3 Broker #3
Dino 4 Broker #4
Gazoo 4 Broker #4
Pebbles 5 Best Insurance
Bambam 5 Best Insurance
Arnold 6 Life R Us
Elmo 6 Life R Us
The Hierarchy Table
• Identifies the relationship amongst the brokers

CREATE TABLE BROKER_HIERARCHIES (
  BBH_KEY                NUMBER(12) DEFAULT -99 NOT NULL,
  PARENT_BROKER_KEY      NUMBER(12) DEFAULT -99 NOT NULL,
  CHILD_BROKER_KEY       NUMBER(12) DEFAULT -99 NOT NULL,
  PARENT_ROLE_CD         VARCHAR2(10),
  PARENT_ROLE_DSC        VARCHAR2(240),
  CHILD_ROLE_CD          VARCHAR2(10),
  CHILD_ROLE_DSC         VARCHAR2(240),
  TOP_MOST_FLG           NUMBER(1),
  BOTTOM_MOST_FLG        NUMBER(1),
  BBH_REPORTING_ORDER    NUMBER(10),
  BBH_DEPTH_FROM_PARENT  NUMBER(10),
  BBH_EFFECTIVE_DT       DATE,
  BBH_END_DT             DATE,
  BBH_CONTRACT_TYPE_CD   VARCHAR2(10),
  BBH_CONTRACT_TYPE_DSC  VARCHAR2(240),
  BBH_ACTIVE_FLG         NUMBER(1),
  BBH_CREATE_DT          DATE,
  BBH_CREATE_SOURCE      VARCHAR2(25),
  BBH_UPDATE_DT          DATE,
  BBH_UPDATE_SOURCE      VARCHAR2(25),
  BBH_BK                 VARCHAR2(50),
  CONSTRAINT PK_BRG_BROKER_HIERARCHIES PRIMARY KEY (BBH_KEY)
);
Which Looks Like
Parent Broker Key   Child Broker Key   Parent Name   Child Name
1 1 Broker #1 Broker #1
2 2 Broker #2 Broker #2
3 3 Broker #3 Broker #3
4 4 Broker #4 Broker #4
5 1 Best Insurance Broker #1
5 2 Best Insurance Broker #2
6 3 Life R Us Broker #3
6 4 Life R Us Broker #4
5 5 Best Insurance Best Insurance
6 6 Life R Us Life R Us
Combining Users & Hierarchies
• Every time we query, we need to pass through these two tables
• Hierarchies will result in self-joins or the dreaded CONNECT BY
• We are data warehousers – let's build a helper or bridge table: the User Broker Security (UBS) table
User Broker Security (UBS) Table
• Every possible combination of who reports to whom, denormalized, by user
• Only has 2 columns – a skinny table

CREATE TABLE USER_BROKER_SECURITY (
  USER_ID     VARCHAR2(15),
  BROKER_KEY  NUMBER(12,0)
);
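The refresh-script logic that populates a bridge table like this can be sketched as follows. This is an illustrative sketch of the idea (not the author's script): expand each user's broker through the parent/child rows of the hierarchy table, materializing one (user, broker key) row for every broker the user may see. All names are hypothetical.

```python
# Hypothetical refresh logic for a USER_BROKER_SECURITY bridge table:
# flatten the broker hierarchy (a transitive closure) so queries can join one
# skinny table instead of doing self-joins or CONNECT BY at query time.

def build_ubs(users, hierarchy):
    """users: {user_id: broker_key}; hierarchy: [(parent_key, child_key), ...]."""
    children = {}
    for parent, child in hierarchy:
        children.setdefault(parent, set()).add(child)

    def descendants(key, seen=None):
        seen = set() if seen is None else seen
        for child in children.get(key, ()):   # self-rows (key, key) included
            if child not in seen:
                seen.add(child)
                descendants(child, seen)
        return seen

    return {(user, broker)
            for user, top in users.items()
            for broker in descendants(top)}

users = {"Fred": 1, "Pebbles": 5}                     # Pebbles works for the MGA
hierarchy = [(1, 1), (2, 2), (5, 5), (5, 1), (5, 2)]  # 5 = Best Insurance
ubs = build_ubs(users, hierarchy)
```

Fred sees only Broker #1's rows, while Pebbles (at the MGA) sees Best Insurance plus Brokers #1 and #2, matching the "Results in…" slide.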
Results in…
• We've added the 'From' and 'Broker Name' columns for simplicity
User Id   From             Broker Key   Broker Name
Pebbles Best Insurance 5 Best Insurance
Pebbles Best Insurance 1 Broker #1
Pebbles Best Insurance 2 Broker #2
Bambam Life R Us 6 Life R Us
Bambam Life R Us 3 Broker #3
Bambam Life R Us 4 Broker #4
The UBS Table Is The Key
• The User Broker Security table is the key to all solutions
• Row-level security can be built using:
– Materialized Views
– Virtual Private Database (VPD)
• No matter which option is used to implement row-level security, you will need a table like this in some shape or form
What It All Looks Like
[Diagram: a refresh script maintains the UBS table from the Hierarchies and Users tables; UBS filters the full Fact Policies table down to the row-secured Fact Policies that Oracle users see]
Materialized Views
[Diagram: a refresh script maintains the UBS table from the Hierarchies table; a materialized view joins UBS to the full Fact Policies table to produce the row-secured Fact Policies seen by Oracle users]
Virtual Private Database
[Diagram: components – an Oracle policy, an application context, and a database trigger apply the UBS, Hierarchies, and Oracle Users tables to give each user a row-secured view of Fact Policies]
Bad & Missing Data
• Bad and/or missing data will always be an issue
• The source data is never completely clean
• There are always exceptions
• Recall that you need to tie back to the source systems for your audit, thus you must load this 'incorrect' data
• Put the decisions into the hands of your users – don't decide for them whether the data is good enough or not
• You need to develop Bad & Missing (BAM) rules
BAM Rules
• Used in the ETL process when loading data that references other tables (e.g. loading a fact table and looking up the dimension record)
• Need a series of rules to follow if the lookup fails
• Create a set of 'dummy' records for each referenced table (for referential integrity purposes)
• In snapshots, you may need a set of dummy records per snapshot period
BAM Rules – Dummy Records
-99  Error/Missing
-88  Not Available
-77  Acceptable Error
-66  Temporarily Not Available
-1   Not Applicable
(A great hockey team: Gretzky, Lindros, Coffey, Lemieux, Bunny Larocque)
Dummy Record Meanings
-99  A data element is missing or a lookup into another table cannot find a matching value (e.g. a missing foreign key). The source record is still loaded and the column value is set to -99.
-88  'Not Applicable'. This data element is not required in the context of the record.
-77  'Acceptable Errors' that will not be corrected. This data element was invalid (set to -99) during the initial load and will not be corrected or reloaded.
-66  Data is temporarily not available. Usually used in a multiple-pass loading process.
-1   'Data not available'. This data element is not available from the source record.
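Applied during a fact load, the -99 rule looks roughly like this sketch (illustrative only; dimension contents and names are hypothetical): the fact row is always loaded, a failed lookup substitutes the dummy key, and the miss is logged for the error-correction area.

```python
# Hypothetical BAM lookup during a fact load: the row is ALWAYS loaded so
# totals still tie back to the source system; a failed dimension lookup
# substitutes the -99 'Error/Missing' dummy key and records the miss.

ERROR_MISSING = -99   # dummy dimension record created up front for RI

def resolve_key(dim_lookup, business_key, errors, column):
    surrogate = dim_lookup.get(business_key)
    if surrogate is None:
        errors.append((column, business_key))  # feed the error-correction area
        return ERROR_MISSING
    return surrogate

diagnosis_dim = {"813.0": 4001}   # one known ICD-9 code -> surrogate (made up)
errors = []
fact_rows = [
    {"diag_key": resolve_key(diagnosis_dim, "813.0", errors, "diag"),
     "amount": 123.34},
    {"diag_key": resolve_key(diagnosis_dim, "410.9", errors, "diag"),
     "amount": 16239.00},   # unknown code -> loaded anyway with -99
]
```

The $16,239 claim still lands in the fact table, so the audit totals match the source; the error log tells the users which rows to fix or flag as -77 later.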
Error Correction Process
• An area that you can report from and reload from
• Hold, or point to, the original source record and be able to recreate it (the DW has lost the original value once it is tagged to a BAM rule)
• Can be one summary table with standard error types
• For more detail, create one error table for each target table
• Create a series of error-flag columns in the error table indicating what went wrong
Error Correction Model – Summary Mode

Error_Type
  error_type_cd: VARCHAR(2) NOT NULL
  error_type_desc: VARCHAR(255)
  last_update_ts: TIMESTAMP NOT NULL
  record_expiry_ts: TIMESTAMP

Severity_Level
  severity_cd: VARCHAR(3) NOT NULL
  severity_desc: VARCHAR(255)
  last_update_ts: TIMESTAMP NOT NULL
  record_expiry_ts: TIMESTAMP

ETL_ERROR
  etl_load_key: INTEGER NOT NULL (FK)
  sys_load_col_name: VARCHAR(30) NOT NULL
  source_name: VARCHAR(80) NOT NULL (FK)
  error_type_cd: VARCHAR(2) NOT NULL (FK)
  source_row_id: INTEGER
  severity_cd: VARCHAR(3) NOT NULL (FK)
Audit Considerations
• A key area that is quite often ignored
• You must match the source systems or be able to explain the differences
• Audit the data loads (when did a load start and what is its status?)
• Without proof, you will not get the credibility!
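At its core the audit reduces to a reconciliation check, sketched below (an illustrative sketch, not the deck's audit model): rows read from the source must equal rows loaded into the warehouse plus rows parked in the error area, and any residual must be explainable.

```python
# Hypothetical load reconciliation backing the audit claim: if
# read != loaded + errored, the load cannot be proven against the source.

def reconcile(num_rows_read, num_rows_loaded, num_rows_errored):
    """Return (balanced, unexplained_difference)."""
    diff = num_rows_read - (num_rows_loaded + num_rows_errored)
    return diff == 0, diff

ok, diff = reconcile(num_rows_read=10_000, num_rows_loaded=9_950,
                     num_rows_errored=50)
bad, missing = reconcile(num_rows_read=10_000, num_rows_loaded=9_940,
                         num_rows_errored=50)
```

Counts like `num_rows_read` in the ETL_AUDIT_TABLE_LOADS table on the next slide are exactly what makes this check possible per table, per load.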
Audit Model

ETL_AUDIT
  etl_load_key: INTEGER NOT NULL
  academic_yr: CHAR(9)
  prev_etl_load_key: INTEGER
  most_rcnt_fy_ind: CHAR NOT NULL
  system_cd: VARCHAR(5) NOT NULL (FK)
  load_status_flg: VARCHAR(12)
  load_type_flg: CHAR
  stage_archvd_date: DATE
  wh_archvd_date: DATE
  stage_start_ts: TIMESTAMP
  warehouse_start_ts: TIMESTAMP
  num_rows_read: INTEGER
  fct_cleanup_ind: CHAR
  acad_yr_transt_ind: CHAR

ETL_Source_System
  system_cd: VARCHAR(5) NOT NULL
  system_name: VARCHAR(20)
  system_desc: VARCHAR(255)
  sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADS
  etl_load_key: INTEGER NOT NULL (FK)
  source_name: VARCHAR(80) NOT NULL
  num_rows_read: INTEGER
  num_records_reqd: INTEGER
  load_status_flg: VARCHAR(12)
  extract_num: INTEGER
  extract_ts: TIMESTAMP
  stop_source_row_id: INTEGER
  load_session_name: VARCHAR(80)
  load_start_ts: TIMESTAMP
  load_stop_ts: TIMESTAMP
Pulling Audit & Error Correction Together
[Diagram: the ETL_AUDIT, ETL_Source_System, and ETL_AUDIT_TABLE_LOADS audit tables joined to the Error_Type, Severity_Level, and ETL_ERROR error-correction tables via etl_load_key and source_name]
Administrative Fields
• Support the 'behind the scenes' aspects
– Loading
– Querying
• Different requirements for dimensions and facts
• But try to standardize across all tables, even if the fields aren't utilized today
Dimension Tables
• Record Type – indicates New, Modify, Delete, Correction
• Active Flg – indicates a business key is active
• Most Recent Flg – indicates the most recent row loaded within a business key
• Effective Date – for the instance of that row
• End Date – for the instance of that row
• Create Date
• Update Date
• Create User
• Update User
Fact Tables
• Record Type
• Active Flg
• Most Recent Flg
• Row Cnt
• Partition Date – store the actual date value
• Create Date
• Update Date
• Create User
• Update User
Random Thoughts
• Ensure you secure…
– Budget
– Top management commitment
• Have focus (scope definition)
• Develop incrementally
• Have a business-driven solution
• Use experienced designers and implementers
• Use industry tools for development
More Random Thoughts
• Generally, make all of your column names unique across tables
• Conform fact table measures (same name)
• Don't normalize too much – jump right into a dimensional design
• Avoid retroactive changes
• Don't be afraid of many dimensions
Don't Be Afraid Of Too Many Dimensions
• 1 fact table, 41 CONFORMED dimensions:
DIM_PRODUCTS
DIM_ICD9_ADMITTING_DIAGNOSES
DIM_AUTHORIZING_PROVIDERS
DIM_PROVIDER_ROLES
DIM_HCP_CODES
DIM_CONTRACTS
DIM_PCP_PANELS
DIM_AGES
DIM_MR_CLASSIFICATIONS
DIM_PLACES_OF_SERVICE
DIM_SEXES
DIM_MARITAL_STATUSES
FCT_CLAIMS
DIM_SERVICE_PROVIDERS
DIM_MEMBERS
DIM_BENEFIT_PACKAGES
DIM_TIER_PLAN_TYPES
DIM_CORPORATIONS
DIM_EMPLOYER_GROUPS
DIM_MODIFIERS
DIM_ICD9_PRIMARY_DIAGNOSES
DIM_ICD9_SECONDARY_DIAGNOSES
DIM_MEMBER_LOCATIONS
DIM_PROVIDER_LOCATIONS
DIM_DATES_OF_FIRST_SERVICE
DIM_DATES_OF_LAST_SERVICE
DIM_PAID_DATES
DIM_CLAIM_RECEIVED_DATES
DIM_ADMISSION_DATES
DIM_CLAIM_INVOICE_DATES
DIM_DISCHARGE_DATES
DIM_REFERRING_PROVIDERS
DIM_PCPS
DIM_CPT4_CODES
DIM_HCPCS_CODES
DIM_REVENUE_CODES
DIM_ICD9_PROCEDURE_CODES
DIM_PCP_NETWORKS
DIM_MEDICAL_PCP_NETWORKS
DIM_SERV_PROV_NETWORKS
CLAIMS_DETAIL
DIM_DRG_CODES
12 Common DW Design Mistakes (Intelligent Enterprise: Ralph Kimball, Oct 2001)
1. Placing text attributes in a fact table when you want to use them as constraints and groupings
2. Limiting the use of verbose descriptions in your dimensions to save space
3. Splitting hierarchies and hierarchy levels into multiple dimension tables
4. Delaying dealing with slowly changing dimensions
5. Using smart keys to join dimension and fact tables
6. Adding dimensions to fact tables before declaring the grain
7. Declaring that the dimensional model is based on a specific report
8. Mixing different grains in one fact table
9. Leaving lowest-level atomic data in non-dimensional format
10. Avoiding building aggregates and using hardware for performance improvements
11. Failing to conform fact data
12. Failing to conform dimension data
Dave's Top 10 Gotchas
1. Failing to model for both a) the view of the data when the event occurred and b) the view of the data as of today's reality
2. Limiting the number of dimensions
3. Failing to model and populate a meta data repository
4. Failing to provide sufficient audit capabilities to verify loads against source systems
5. Not using surrogate keys for everything
Dave's Top 10 Gotchas (cont'd)
6. Failing to design an error correction process
7. Normalizing too much
8. Not using a staging area
9. Failing to load ALL of the fact data
10. Failing to classify incorrect data
In Summary
• Be careful in your design
• Meet business requirements
• Address the 'behind the scenes' issues
• Remember: DW design is not a science, it is an art
• Thus be an artist and create
Q U E S T I O N S  &  A N S W E R S  &  D R A W
David Stanford
[email protected]
Thank You!