Theory, Practice & Methodology of Relational Database Design and Programming
Introduction to Data Warehouse Design
Copyright © Ellis Cohen 2002-2006

These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them, please see http://www.openlineconsult.com/db


© Ellis Cohen, 2003-2006

Topics

• Overview
• Star Schema: Fact & Dimension Tables
• The Star Schema & Denormalization
• The Data Cube
• ETL: Extraction, Transformation & Loading


Overview


Data Warehousing & Data Mining

Data Warehousing
• Techniques for representing & querying large amounts of relatively static data
• Potentially stored in Multi-Dimensional Databases
• On-line Analysis & Decision Support

Data Mining
• Automated analysis: Discovering (potentially) unexpected patterns in large amounts of data


Operational vs Analytical DBs

Operational Database
• Data needed and updated constantly to directly support business operations
• Focus on OLTP (on-line transaction processing): Transactional access & modification of relatively small # of data points at a time

Analytical Database: Data Warehouse & Data Mart
• Copious amounts of relatively static data, culled & integrated across enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
• Focus on OLAP (on-line analytical processing): Querying large amounts of data, scheduled modifications


Operational vs Analytical DBs

                Operational                     Warehouse
Usage           Transactional (OLTP)            Analytical (OLAP)
Organized for   Modifications                   Queries
Modifications   Continual                       Periodic
Queries         Narrow-scope, Low-complexity    Broad-scope, High-complexity
Database        Relational                      Relational/Dimensional
Data            Normalized                      Denormalized, Aggregated & Derived


Central Data Warehouse

(from Oracle 9i Data Warehousing Guide)


Warehouse Questions

How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?

What are the top 25 selling products by category and region for this past quarter?

What percent of the market do we own for each product we make?

Which of our customers' zipcodes were responsible for the top 10% of total sales over the last year?


Star Schema: Fact & Dimension Tables


Star Schema

Data Warehouses are organized using Star Schema models

A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables

DailySales (Fact): storid, prodid, date, price, units
• storid, prodid: foreign keys referencing the dimension tables
• price, units: Measures (what each fact measures)

Stores (Dimension): storid, …

Products (Dimension): prodid, …
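As a minimal sketch of the structure above, the fact table can be declared with a composite primary key whose parts are foreign keys into the dimension tables. This uses Python's built-in sqlite3 module rather than the Oracle setting of the slides; the dimension attributes (stornam, color) and all row values are illustrative assumptions.

```python
import sqlite3

# Hypothetical minimal star schema; values and extra columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT);
CREATE TABLE DailySales (
    storid INTEGER NOT NULL REFERENCES Stores(storid),
    prodid INTEGER NOT NULL REFERENCES Products(prodid),
    date   TEXT    NOT NULL,
    price  REAL,      -- measure
    units  INTEGER,   -- measure
    PRIMARY KEY (storid, prodid, date)  -- composite key over the dimensions
);
""")
conn.execute("INSERT INTO Stores VALUES (1, 'Pittsburgh')")
conn.execute("INSERT INTO Products VALUES (10, 'red')")
conn.execute("INSERT INTO DailySales VALUES (1, 10, '2002-03-01', 59.95, 4)")

def try_insert(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO DailySales VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Same store, product and date as an existing fact: violates the primary key.
duplicate_ok = try_insert((1, 10, '2002-03-01', 59.95, 2))
# Store 99 does not exist in the Stores dimension: violates the foreign key.
orphan_ok = try_insert((99, 10, '2002-03-01', 59.95, 2))
fact_count = conn.execute("SELECT COUNT(*) FROM DailySales").fetchone()[0]
```

Both rejected inserts show the two roles of the key columns: together they identify a fact, and individually they must point at an existing dimension row.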


Subjects (Facts) & Dimensions

Instead of thinking about entities & relationships, design a data warehouse by thinking about

Subjects (represented by fact tables)
• Sales, Distribution, Purchases

Dimensions (represented by dimension tables)
• How to uniquely identify the facts about each subject
  – Sales: Product, Stores, Dates (maybe also Employee, Customer: depends what you want to analyze)
  – Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
  – Purchases: Products, Vendors, Dates (maybe also Employees)


Fact & Dimension Tables

Fact Tables
• Composite primary key
  – identify dimensions
  – uniquely identify each fact (or measurement)
• Additional attributes: measures
  – what is measured about each fact

Dimension Tables
• Primary key
  – Surrogate key uniquely identifies each dimension value
• Additional attributes
  – Properties of each dimension value


Dimensions & Granularity

Dimensions have different levels of granularity

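Stores dimension (finest to coarsest): Stores → Districts → Regions

Products dimension (finest to coarsest): Products → ProductTypes → SubCategories → Categories
Products dimension (second path): ProductTypes → Manufacturers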


Snowflake Schema (with Normalized Dimensions)

DailySales (Fact): storid, prodid, date, price, units

Stores (Dimension): storid, stornam, city, state, distid
Districts: distid, distnam, distarea, regid
Regions: regid, regnam

Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam


Typical Warehouse Query

How many red Bally shoes did we sell in each region in 2002?

    SELECT r.regnam as region, sum(f.units) as sumunits
    FROM DailySales f
      NATURAL JOIN Stores
      NATURAL JOIN Districts
      NATURAL JOIN Regions r
      NATURAL JOIN Products p
      NATURAL JOIN ProductTypes
      NATURAL JOIN SubCategories s
      NATURAL JOIN Manufacturers m
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND m.manfnam = 'Bally'
      AND s.subnam = 'Shoe'
    GROUP BY r.regnam


The Star Schema & Denormalization


Snowflake Schema is Normalized

Snowflake Schema has normalized dimension tables

• Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)

• Each sub-dimension table has attributes appropriate to the level of granularity
  – Product: color, size
  – ProductType: prodnam, prodescr
  – etc.


Denormalization

Denormalized:

Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Normalized:

Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam

Why is there redundancy in the denormalized Products table?


Star Schema is Denormalized

The Star Schema has denormalized dimension tables

• Each dimension is formed by joining together its sub-dimension tables into a single dimension table

• The dimension table has attributes at different levels of granularity

• The dimension tables contain lots of redundancy, but queries use far fewer joins

• Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately)
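The join that produces a denormalized dimension can be sketched as follows, using Python's sqlite3 module. The table and column names follow the slides, except that the normalized product table is called ProductsNorm here (a hypothetical name, chosen so both versions can coexist); the description columns are omitted for brevity and all row values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Manufacturers (manfid INTEGER PRIMARY KEY, manfnam TEXT);
CREATE TABLE Categories    (catid INTEGER PRIMARY KEY, catnam TEXT);
CREATE TABLE SubCategories (subcatid INTEGER PRIMARY KEY, subnam TEXT, catid INTEGER);
CREATE TABLE ProductTypes  (prodtyp INTEGER PRIMARY KEY, prodnam TEXT,
                            subcatid INTEGER, manfid INTEGER);
CREATE TABLE ProductsNorm  (prodid INTEGER PRIMARY KEY, color TEXT, size TEXT,
                            prodtyp INTEGER);

INSERT INTO Manufacturers VALUES (1, 'Bally');
INSERT INTO Categories    VALUES (1, 'Footwear');
INSERT INTO SubCategories VALUES (1, 'Shoe', 1);
INSERT INTO ProductTypes  VALUES (1, 'Loafer', 1, 1);
INSERT INTO ProductsNorm  VALUES (100, 'red', '9', 1), (101, 'black', '9', 1);

-- One wide dimension row per product: every level of the hierarchy is
-- copied onto the product row, which is where the redundancy comes from.
CREATE TABLE Products AS
SELECT p.prodid, p.color, p.size,
       t.prodtyp, t.prodnam,
       m.manfid, m.manfnam,
       s.subcatid, s.subnam,
       c.catid, c.catnam
FROM ProductsNorm p
JOIN ProductTypes  t ON t.prodtyp  = p.prodtyp
JOIN Manufacturers m ON m.manfid   = t.manfid
JOIN SubCategories s ON s.subcatid = t.subcatid
JOIN Categories    c ON c.catid    = s.catid;
""")

rows = conn.execute(
    "SELECT prodid, color, manfnam, subnam, catnam FROM Products ORDER BY prodid"
).fetchall()
```

Both product rows carry the same manufacturer, subcategory and category names, so a query can filter on any level without further joins.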


Star Schema (Fully Denormalized Dimensions)

DailySales (Fact): storid, prodid, date, price, units
• Why should date be replaced by a dateid?

Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam

Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
• Maybe catdescr not included here if it is a GIF or a 4000 byte description


Query with Denormalized Schema

How many red Bally shoes did we sell in each region in 2002?

    SELECT s.regnam as region, sum(f.units) as sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam

Costly (the slide's annotation, most plausibly on the to_char(f.date,'YYYY') test)


Typical Date Dimension Attributes

It is common and almost always more efficient to treat Dates as a dimension with a number of attributes

Field        Example Value
Year         2005
Month        Feb
Quarter      1
DayOfMonth   12
DayOfYear    43
WeekOfYear   7
DayOfWeek    Sat

Requires Month + Year to identify a month within a year. Might want to add a single MonthYr field to represent the pair.

Note: Quarter is less granular than Month. Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields.
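The attributes in the table can all be derived from a single calendar date, which is how a Dates dimension is typically populated. The sketch below is one way to do it in Python; the function name and the week-numbering rule are assumptions (weeks counted as consecutive 7-day blocks from Jan 1, which matches the slide's example but is not the ISO week number).

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")  # date.weekday() order

def date_dimension_row(d: date) -> dict:
    """Derive the slide's date attributes from one calendar date."""
    day_of_year = d.timetuple().tm_yday
    month = MONTHS[d.month - 1]
    return {
        "Year": d.year,
        "Month": month,
        "Quarter": (d.month - 1) // 3 + 1,
        "DayOfMonth": d.day,
        "DayOfYear": day_of_year,
        # Assumption: week 1 is Jan 1-7, week 2 is Jan 8-14, and so on.
        "WeekOfYear": (day_of_year - 1) // 7 + 1,
        "DayOfWeek": DAYS[d.weekday()],
        # Single field identifying a month within a year, as suggested above.
        "MonthYr": f"{month}{d.year}",
    }

row = date_dimension_row(date(2005, 2, 12))
```

For Feb 12, 2005 this reproduces the example values in the table (Quarter 1, DayOfYear 43, WeekOfYear 7, DayOfWeek Sat).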


Extended Date Dimension Hierarchy

Date (e.g. Feb 12, 2005)

DayOfWeek (e.g. Sat)
WeekYr (e.g. 2005Wk7)
MonthYr (e.g. Feb2005)
QuarterYr (e.g. 2005Q1)
Year (e.g. 2005)
Quarter (e.g. 1)
Month (e.g. Feb)
WeekOfYear (e.g. 7)
DayOfYear (e.g. 43)
DayOfMonth (e.g. 12)


Star Schema with Date Dimension

In general, represent dates by a Dates dimension table

DailySales (Fact): storid, prodid, dateid, price, units

Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam

Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year


Query using Dates Dimension

How many red Bally shoes did we sell in each region in 2002?

    SELECT s.regnam as region, sum(f.units) as sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
      NATURAL JOIN Dates d
    WHERE d.year = 2002
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam

Needs an extra join, but simpler query. Executes faster if Dates is indexed by year.
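To see the query run end to end, here is a small sketch against an in-memory SQLite database. It is not the Oracle environment of the slides: the joins are written as JOIN ... USING instead of NATURAL JOIN to make the join columns explicit, the dimension tables are cut down to the columns the query needs, and every row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT, regnam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT,
                       manfnam TEXT, subnam TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, date TEXT, year INTEGER);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER,
                         price REAL, units INTEGER,
                         PRIMARY KEY (storid, prodid, dateid));
CREATE INDEX dates_by_year ON Dates(year);  -- the index the slide refers to

INSERT INTO Stores   VALUES (1, 'Pittsburgh', 'East'), (2, 'NYC', 'East'),
                            (3, 'Seattle', 'West');
INSERT INTO Products VALUES (10, 'red', 'Bally', 'Shoe'),
                            (11, 'black', 'Bally', 'Shoe');
INSERT INTO Dates    VALUES (1, '2001-12-31', 2001), (2, '2002-01-01', 2002),
                            (3, '2002-01-02', 2002);
INSERT INTO DailySales VALUES
    (1, 10, 1, 59.95, 7),   -- sold in 2001: excluded by the year test
    (1, 10, 2, 59.95, 4),
    (2, 10, 3, 59.95, 5),
    (3, 10, 2, 59.95, 2),
    (3, 11, 2, 59.95, 9);   -- black shoes: excluded by the color test
""")

rows = conn.execute("""
    SELECT s.regnam AS region, SUM(f.units) AS sumunits
    FROM DailySales f
    JOIN Stores   s USING (storid)
    JOIN Products p USING (prodid)
    JOIN Dates    d USING (dateid)
    WHERE d.year = 2002
      AND p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
    GROUP BY s.regnam
    ORDER BY s.regnam
""").fetchall()
```

The year test is now a plain comparison on an indexed dimension column, rather than a function applied to the date of every fact row.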


The Data Cube


Data Cube Representation

[Figure: the Sales Cube, with one axis each for the Products, Stores and Dates dimensions. Labeled elements: "Sales of Beanie Babies in Pittsburgh Store Today", "Sales of Beanie Babies in Pittsburgh Store Yesterday", and "All Sales (of all products over time) in NYC Store". The Stores axis shows Pgh and NYC.]


Data Cube Characteristics

Each axis represents a dimension

– Elements along axis are at lowest granularity for that dimension

Measures are the data within the cells at intersections of the cube

– Information about the topic of the cube

– e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)


Data Cube Views

Slice
View data relative to a point in one or more dimensions
• View sales today (for each store & each product category)
• View Bally shoe sales at the NYC store (for each date)

Dice
View data relative to (sets of) ranges in one or more dimensions
• View sales for the last 4 days (for each store & each product category)
• View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters
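Slice and dice can be sketched over a very small cube held as a Python dictionary keyed by (store, product, date). This is only an illustration of the two views, not how an OLAP engine stores data; the function names slice_ and dice and all cell values are assumptions.

```python
# Sparse cube: only cells that actually have sales are stored.
# Key = (store, product, date); value = units sold. All data is made up.
cube = {
    ("Pgh", "Beanie Baby", "2005-02-11"): 3,
    ("Pgh", "Beanie Baby", "2005-02-12"): 5,
    ("NYC", "Beanie Baby", "2005-02-12"): 8,
    ("NYC", "Bally shoe",  "2005-02-12"): 2,
}
DIMS = ("store", "product", "date")

def dice(cube, **allowed):
    """Keep cells whose coordinate in each named dimension is in the allowed set."""
    idx = {DIMS.index(name): set(values) for name, values in allowed.items()}
    return {key: units for key, units in cube.items()
            if all(key[i] in values for i, values in idx.items())}

def slice_(cube, **point):
    """A slice fixes a single value in one or more dimensions."""
    return dice(cube, **{name: [value] for name, value in point.items()})

# Slice: sales on one date, for every store and product.
today = slice_(cube, date="2005-02-12")
units_today = sum(today.values())

# Dice: the Pittsburgh store over a two-day range.
pgh_two_days = dice(cube, store=["Pgh"], date=["2005-02-11", "2005-02-12"])
units_pgh = sum(pgh_two_days.values())
```

A slice is treated here as the special case of a dice in which each chosen dimension is restricted to a single value.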


MDDB: MultiDimensional DataBase

• Knows about Fact & Dimension Tables
• Uses direct (n dimensional) hypercube representation to provide fast access to fact elements in query
• Supports sparse representations
  – The Pittsburgh store doesn't sell lingerie
  – The Cape Cod store is not open in the winter
  – Baked Beanie Babies are only sold in the NE region
• Uses specialized query language
  – e.g. MDX (used by Microsoft OLAP Server) with basic data types: cube, slice, dice


ETL: Extraction, Transformation & Loading


ETL: Extraction, Transformation & Loading

ETL accounts for 80% of the total cost of building a warehouse

Extraction → Transformation → Loading


Extraction

Sources
• Multiple DB's
• Flat Files
• External Data Sources
  – e.g. Census, Geographic, Weather, Financial, Unemployment Data
  – Standard DB/Spreadsheet format or semi-structured data from the web

Frequency
• Periodic (hourly, daily, weekly, …)
• Triggered
  – Single event
  – #, sequence, pattern of events

Mechanisms
• Snapshots / Materialized Views / Replication
• Database Triggers
• Process Logs
• Query Sources (full vs incremental)


Transformation

Cleaning
• Scrubbing
• Filtering
• Conformance

Integration
• Renaming
• Fusion & Merging
• Determine Surrogate Keys
• Timestamping
• Summarization

Schema Organization
• Dimension Tables
• Pre-Aggregation via Materialized Views
• Derivation


(Transformation) Cleaning

Scrubbing
• Use domain-specific knowledge
• e.g. SS#, phone-number, zipcode

Filtering
• Check for inconsistent data
• Use data validation rules

Conformance
• Map similarly typed data to standard representation
• Convert
  – units (inch => cm, $ => euro)
  – scale (mm => cm)
  – formats (string => integer, string with/wo $)
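A few of these cleaning steps can be sketched as small Python functions. The function names are hypothetical, the zipcode rule assumes US 5-digit or ZIP+4 formats, and the amount parser assumes US-style numbers with optional thousands separators.

```python
import re

CM_PER_INCH = 2.54  # exact by definition

def inches_to_cm(inches: float) -> float:
    """Conformance: convert units (inch => cm)."""
    return inches * CM_PER_INCH

def mm_to_cm(mm: float) -> float:
    """Conformance: convert scale (mm => cm)."""
    return mm / 10

def parse_amount(text: str) -> float:
    """Conformance: convert format, accepting a string with or without '$'."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def is_valid_zip(text: str) -> bool:
    """Scrubbing: domain-specific check for a US zipcode (5 digits or ZIP+4)."""
    return bool(ZIP_RE.match(text.strip()))

amounts = [parse_amount("$1,234.50"), parse_amount("1234.50")]
zip_checks = [is_valid_zip("15213"), is_valid_zip("15213-3890"), is_valid_zip("1521")]
```

Currency conversion ($ => euro) is left out because it needs an exchange rate for the relevant date, which would have to come from an external source.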


(Transformation) Integration

Renaming
• Resolve name conflicts

Fusion - e.g. merge
• properties in city db
• properties in developer lists

Determine Surrogate Keys
• Do not use keys from operational data as primary key in warehouse data

Timestamping
• Add timestamps to fact data where missing to enable historical queries

Reorganization & Evolution
• Support Data Reorganization & Schema Evolution

Summarization
• Summarize original operational data and combine into less detailed tables


Integration (Data Reorganization)

What do we do when attributes change?

Suppose districts are reorganized and a store is now part of a different district

Consistently changing mapping of store to district
– Allows new and old data to be compared reasonably by district
– But causes incorrect comparisons by district among older data alone

Solutions
1. Keep fields for both old and new mapping -- in fact, potentially a separate field for each reorganization
2. Add effective date to store dimension. Have multiple rows for same store - each with different effective date
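Solution 2 can be sketched with an in-memory SQLite database: the store dimension holds one row per version of a store, and each fact is matched to the version in effect on its sale date. The column names storkey and effective, the district names and all rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- storkey is a surrogate key: one row per version of a store.
CREATE TABLE Stores (storkey INTEGER PRIMARY KEY, storid INTEGER,
                     distnam TEXT, effective TEXT);
INSERT INTO Stores VALUES (1, 7, 'North',   '2000-01-01'),
                          (2, 7, 'Central', '2004-07-01');  -- after reorganization

CREATE TABLE DailySales (storid INTEGER, date TEXT, units INTEGER);
INSERT INTO DailySales VALUES (7, '2003-05-10', 4), (7, '2005-02-12', 6);
""")

# For each fact, use the store row with the latest effective date that is
# not after the sale date (ISO date strings compare in date order).
rows = conn.execute("""
    SELECT s.distnam, SUM(f.units)
    FROM DailySales f
    JOIN Stores s ON s.storid = f.storid
    WHERE s.effective = (SELECT MAX(s2.effective) FROM Stores s2
                         WHERE s2.storid = f.storid
                           AND s2.effective <= f.date)
    GROUP BY s.distnam
    ORDER BY s.distnam
""").fetchall()
```

The 2003 sale is attributed to the store's old district and the 2005 sale to its new one. In practice the fact table would often store the surrogate key of the matching version directly, so the lookup happens once at load time instead of in every query.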


(Integration) Summarization

DailySales (Fact): storid, prodid, date, price, units

CustomerTransaction: transid, custid, empid, posid, time
ItemPurchase: transid, lineno, prodid, price, units
PointOfSaleTerminals: posid, postyp, storid, loc

Might build different fact tables for different purposes:
e.g. ones involving Customers, ones involving Store Locations

Tradeoff: Smaller Fact Tables vs. Missed Relationships
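The roll-up from the operational tables to DailySales can be sketched in SQLite as below. The operational tables are reduced to the columns the summary needs, the rows are invented, and treating the price measure as the average unit price for the day is an assumption; the slides do not say how price is summarized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PointOfSaleTerminals (posid INTEGER PRIMARY KEY, storid INTEGER);
CREATE TABLE CustomerTransaction  (transid INTEGER PRIMARY KEY, custid INTEGER,
                                   posid INTEGER, time TEXT);
CREATE TABLE ItemPurchase (transid INTEGER, lineno INTEGER, prodid INTEGER,
                           price REAL, units INTEGER,
                           PRIMARY KEY (transid, lineno));

INSERT INTO PointOfSaleTerminals VALUES (1, 7), (2, 7);
INSERT INTO CustomerTransaction VALUES
    (100, 501, 1, '2005-02-12 09:15:00'),
    (101, 502, 2, '2005-02-12 17:40:00'),
    (102, 501, 1, '2005-02-13 10:05:00');
INSERT INTO ItemPurchase VALUES
    (100, 1, 10, 60.0, 1), (101, 1, 10, 60.0, 2), (102, 1, 10, 60.0, 4);

-- Roll transactions up to one row per store, product and day.
-- Customer and terminal detail is lost, which is the tradeoff noted above.
CREATE TABLE DailySales AS
SELECT t.storid, i.prodid, date(c.time) AS date,
       SUM(i.price * i.units) / SUM(i.units) AS price,  -- average unit price
       SUM(i.units) AS units
FROM ItemPurchase i
JOIN CustomerTransaction  c ON c.transid = i.transid
JOIN PointOfSaleTerminals t ON t.posid   = c.posid
GROUP BY t.storid, i.prodid, date(c.time);
""")

rows = conn.execute(
    "SELECT storid, prodid, date, price, units FROM DailySales ORDER BY date"
).fetchall()
```

Three transaction lines collapse into two daily facts, and the summarized table can no longer answer questions about which customer bought what.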


Loading

Alternatives
– Incremental vs Full Refresh: most data is incrementally added to the warehouse
– Off-line vs on-line
– Frequency
  • Nightly
  • Weekly
  • Monthly
– All-at-once vs Staged

What indices to create or drop?
What statistics to collect (& use)?


Constellation Schema

Data warehouses often are designed as constellations
• Multiple fact tables
• Shared/related dimension tables

Examples
– Sales: store, product, date
– Distribution: distributor, store, product, carrier, period
– Advertising: store, medium, product, period

Query across same or related dimensions
– Compare advertising and sales by store within various periods


Data Marts

Store different fact tables (or different groups of fact tables) in separate data marts


Data Mart Architectures

Subset of Data Warehouse
Meets needs of subgroup of users

• Top-down:
  – Extracted from Data Warehouse
  – Problem: early availability
• Bottom-up:
  – Built directly from staging area
  – Can be combined to form warehouse
  – Problem: Conformance. ETL tool must provide metadata
• Hybrid:
  – Some data marts built directly from staging area
  – Others extracted from Data Warehouse


Metadata Management

Identify & define each attribute
– Source(s)
– Transformation(s) applied
– How aggregated
– Description of what it represents
– Relationships to other attributes
– History