Worst Practices in Data Warehouse Design
Kent Graziano
Data Warrior LLC
Twitter @KentGraziano
Agenda
My Bio
My Book
Survey
Backstory
What’s wrong with this picture?
The fallacy of the unconstrained data warehouse
Moral of the Story
© Data Warrior LLC
My Bio
Kent Graziano
● Oracle ACE Director (BI/DW)
● Data Architecture and Data Warehouse Specialist
● 30+ years in IT
● 20+ years of Oracle-related work
● 15+ years of data warehousing experience
● Member: Boulder BI Brain Trust (http://www.boulderbibraintrust.org/)
● Co-Author of
● The Business of Data Vault Modeling
● The Data Model Resource Book (1st Edition)
● Past-President of Oracle Development Tools User Group and Rocky Mountain Oracle User Group
Most recent book:
http://www.amazon.com/Check-Doing-Design-Reviews-ebook/dp/B008RG9L5E/
Survey
Who are you?
● Data Modeler or Architect
● Project Managers
● IT Managers
● DBA
● Developer
Experience
● Data Warehousing?
● Less than 1 yr?
● 1-5 yrs?
● Over 5 years?
The Backstory
Metrics data mart
Outsourced
POC worked great
● 500 records loaded!
Real world: 100K ++ rows
● 1st run – DBA cancelled after 8 hours
● Filled up 665GB temp space
Something wrong?
Next step
DBA says
● Too many parallel sessions
● Too many partitions on fact table
● Load includes
● Select *
● Select distinct
Me
● Reverse engineer the tables first
● Look at the design
● Yikes!
My email to management
“In general, the designs of both the source star schema and the target reporting table do not conform to best practices from either an Oracle tuning or data warehouse design perspective.”
“My only conclusion is that the folks who did the design were not well versed or experienced in designing high performance, high volume data warehouse databases on Oracle.”
“Some of the omissions are so basic that it is hard to comprehend how this could have been considered a completed system.”
What’s wrong with this picture?
● All optional columns
● The measure is optional!
● Even meta data!
● Extra Varchar columns
● No PK
● No UK
● No FKs
● No Indexes!
So what?
Works fine for 500 rows
● Full table scans
No clues for the optimizer
No clues for customer!
● Design intent?
● Data profile?
No PK/UK – could get duplicates in load
No FK – could be missing dimension keys
Lazy design!
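The two risks above (duplicates without a PK/UK, missing dimension keys without an FK) can be checked for after the fact. This is a minimal sketch using Python's built-in sqlite3 rather than Oracle; the table names, column names, and sample rows are hypothetical and are not taken from the system being reviewed.

```python
import sqlite3

# Hypothetical, unconstrained star schema: no PK, UK, FK, or NOT NULL anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_project (project_key INTEGER, project_name TEXT);
    CREATE TABLE fact_metric (project_key INTEGER, date_key INTEGER, metric_value REAL);
    INSERT INTO dim_project VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO fact_metric VALUES
        (1, 20140101, 10.0),
        (1, 20140101, 10.0),   -- duplicate of the row above
        (2, 20140101, 5.0),
        (9, 20140102, 7.5);    -- project_key 9 has no dimension row
""")

def find_duplicates(conn):
    """Rows that share the assumed grain (project_key, date_key)."""
    return conn.execute("""
        SELECT project_key, date_key, COUNT(*) AS n
        FROM fact_metric
        GROUP BY project_key, date_key
        HAVING COUNT(*) > 1
    """).fetchall()

def find_orphans(conn):
    """Fact keys with no matching dimension row."""
    return conn.execute("""
        SELECT DISTINCT f.project_key
        FROM fact_metric f
        LEFT JOIN dim_project d ON d.project_key = f.project_key
        WHERE d.project_key IS NULL
    """).fetchall()

duplicates = find_duplicates(conn)
orphans = find_orphans(conn)
print("duplicate grain rows:", duplicates)
print("fact keys missing from dimension:", orphans)
```

Note that the duplicate check only works once someone has decided what the grain is, which is exactly the design intent the missing PK/UK fails to record.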
What’s wrong with this picture?
● All optional columns
● Even the PK and meta data!
● No UK
● PK on an optional column?
So what?
No clue on business key
SCD Type 1 or 2?
There is a CRC Key and CRC Attr
● But which date is the Type 2 date?
Again no clues in the indexes or NOT NULL
Have to look at data to see if DW_REC_CREATED_DT and DW_REC_UPDATED_DT are different
Can’t discern the intent
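When the model gives no hint, the data has to be profiled instead. The sketch below, again using Python's sqlite3 as a stand-in for Oracle, counts rows where the two audit dates differ and counts candidate business keys that appear more than once (a hint of Type 2 history). Only DW_REC_CREATED_DT and DW_REC_UPDATED_DT come from the design under review; the table name, the customer_code column, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER,
        customer_code TEXT,          -- hypothetical candidate business key
        DW_REC_CREATED_DT TEXT,
        DW_REC_UPDATED_DT TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'C100', '2014-01-01', '2014-01-01'),
        (2, 'C200', '2014-01-01', '2014-03-15'),
        (3, 'C200', '2014-03-15', '2014-03-15');
""")

def profile_dimension(conn):
    # How many rows were ever updated after creation?
    total, changed = conn.execute("""
        SELECT COUNT(*),
               SUM(CASE WHEN DW_REC_CREATED_DT <> DW_REC_UPDATED_DT THEN 1 ELSE 0 END)
        FROM dim_customer
    """).fetchone()
    # How many candidate business keys have more than one row (possible versions)?
    multi_version = conn.execute("""
        SELECT COUNT(*) FROM (
            SELECT customer_code FROM dim_customer
            GROUP BY customer_code HAVING COUNT(*) > 1
        )
    """).fetchone()[0]
    return {"rows": total, "created_ne_updated": changed, "keys_with_versions": multi_version}

profile = profile_dimension(conn)
print(profile)
```

A profile like this only suggests the intent. It cannot confirm whether the designers meant Type 1 or Type 2, which is why the constraints and indexes should have said so.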
How about the Date Dimension?
● All optional columns
● Assume 1st column is PK?
● No PK
● No UK
● No Indexes
More examples
Let’s look into the data model…
Other Stuff
Untested partitioning scheme
● Target report table partitioning and sub-partition is non-standard – not on date field
● Pre-created 200 list-based partitions
● But the domain only had 37 values!
Did not use partition-aware loading approach
No indexes on partitions or sub partition
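A quick comparison of the pre-created partition list against the values that actually occur would have exposed the mismatch before go-live. The sketch below uses the counts from the slide (200 partitions, 37 domain values); the partition value names are made up for illustration.

```python
def unused_partitions(partition_values, observed_values):
    """Return partition values that no observed domain value maps to."""
    return sorted(set(partition_values) - set(observed_values))

# Hypothetical value names; only the counts (200 and 37) come from the slide.
partition_values = [f"P{i:03d}" for i in range(1, 201)]   # 200 pre-created list partitions
observed_values = [f"P{i:03d}" for i in range(1, 38)]     # 37 values in the real domain

unused = unused_partitions(partition_values, observed_values)
print(len(unused), "of", len(partition_values), "partitions would stay empty")
```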
Load approach
Uses a “select *” from source in a view
UPPER function in predicate
● Not needed
● Cancels index usage
Degree of parallelism hardcoded into view
Dummy columns coded into view
No documentation on why
NEVER TESTED with real data!
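The effect of wrapping an indexed column in UPPER can be seen in a query plan. This sketch uses SQLite through Python's sqlite3 module, not Oracle, so the plan text differs from what the DBA would have seen, but the principle is the same: an ordinary index on the bare column cannot serve a predicate on a function of that column. The table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_metric (status_code TEXT, metric_value REAL);
    CREATE INDEX ix_src_metric_status ON src_metric (status_code);
""")

def plan_for(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

# Predicate on the bare column: the index is usable.
plain_plan = plan_for(conn, "SELECT * FROM src_metric WHERE status_code = 'OPEN'")
# Predicate on UPPER(column): the plain index no longer applies.
upper_plan = plan_for(conn, "SELECT * FROM src_metric WHERE UPPER(status_code) = 'OPEN'")

print("plain predicate:", plain_plan)
print("UPPER predicate:", upper_plan)
```

If the case-insensitive match were truly required, Oracle offers function-based indexes for that purpose. On the slide's account the UPPER call was not needed at all, so removing it is the simpler fix.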
The Fallacy of the Unconstrained Data Warehouse
Rationale
● Fast to load – no constraints
● All the validation is in the code
Reality
● May be fast load, but slow query
● Not tuned for extract!
● Code may not have been QA’d well
● No model to tell the programmers the rules
● What columns are required?
● What are the FKs to check?
● What defines a duplicate row?
Cost
● Slow query response
● Bad data loaded
● Few clues to help tune
Moral of the story?
Be careful who you outsource to
Have someone independent do touch point reviews of design
● Costs extra, but we have spent MONTHS fixing this
Insist on documentation
Insist on knowledge transfer with internal DBA
Require load testing with performance criteria
Trust but Verify!
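A load test with an explicit pass/fail criterion does not need to be elaborate. The sketch below loads a realistic row count (100K, the real-world volume mentioned in the backstory) into an in-memory SQLite table and checks it against a time budget. The table layout, generated data, and 60-second budget are all assumptions for illustration; a real test would run against the target Oracle environment with production-like data.

```python
import sqlite3
import time

def run_load_test(row_count, max_seconds):
    """Load row_count generated rows and compare elapsed time to the budget."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_metric (project_key INTEGER, date_key INTEGER, metric_value REAL)"
    )
    # Synthetic rows; real tests should use data shaped like production.
    rows = ((i % 50, 20140101 + i % 28, float(i)) for i in range(row_count))
    started = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO fact_metric VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - started
    loaded = conn.execute("SELECT COUNT(*) FROM fact_metric").fetchone()[0]
    passed = loaded == row_count and elapsed <= max_seconds
    return {"loaded": loaded, "elapsed": elapsed, "passed": passed}

result = run_load_test(row_count=100_000, max_seconds=60.0)
print(result)
```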
Contact Information
Kent Graziano
The Oracle Data Warrior
Data Warrior LLC
On Twitter @KentGraziano
Visit my blog at http://kentgraziano.com