NLS/IITB/DWH 1

Data Warehouse : Modeling and Design

N. L. Sarda


Outline

• Introduction

• Warehouse structure

• A case study

• Dimensional analysis


Introduction

• A data warehouse (DW) is a single, complete and consistent store of data, drawn from different sources, used to understand and analyze the business

• Contains historical data

• Facilitates browsing, navigating, aggregating and visualizing related data to understand performance, problems, customer preferences, trends, etc.

• Warehouse data is organized by important business subjects (customer, product, etc.)


Warehouse Structure

• Organized to facilitate ease of access and aggregation

• Warehouse structure is decomposed into dimensions and facts

– Dimensions, like ‘independent variables’, represent entities for analysis

– A fact represents business data and relates to a set of dimensions

– E.g., customer, time and type of account are dimensions, and balances are facts


Warehouse Structure...

• The complex network of business entities and their relationships, as depicted in an operational DB (using, say, the ER model), is difficult to navigate and analyze

• A ‘2-level’ structure defined by a ‘star schema’ is preferred, where a fact is at the center and dimensions form ‘spokes’

• Data is not stored in ‘normalized’ form


Star Schema

• Contains a fact table and for each dimension one dimension table

[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables Time, Prod, Cust and City.]
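A star schema of this shape can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (dim_time, dim_cust, dim_prod, dim_city, fact_sales), the descriptive columns, the `amount` measure and all sample rows are assumptions, since the slide lists only the fact table's key columns.

```python
import sqlite3

# In-memory database; all table contents below are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (date TEXT PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE dim_cust (custno INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
CREATE TABLE dim_prod (prodno INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE dim_city (cityname TEXT PRIMARY KEY, state TEXT);
-- The fact table holds one key per dimension plus a numeric measure.
CREATE TABLE fact_sales (
    date     TEXT    REFERENCES dim_time(date),
    custno   INTEGER REFERENCES dim_cust(custno),
    prodno   INTEGER REFERENCES dim_prod(prodno),
    cityname TEXT    REFERENCES dim_city(cityname),
    amount   REAL    -- assumed measure, not named in the slide
);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [
    ("2024-01-15", 1, 2024), ("2024-01-20", 1, 2024), ("2024-02-03", 2, 2024)])
conn.executemany("INSERT INTO dim_cust VALUES (?, ?, ?)", [
    (10, "A. Rao", "teacher"), (11, "B. Shah", "engineer")])
conn.executemany("INSERT INTO dim_prod VALUES (?, ?, ?)", [
    (1, "Savings account", "deposit"), (2, "Home loan", "loan")])
conn.executemany("INSERT INTO dim_city VALUES (?, ?)", [
    ("Mumbai", "Maharashtra"), ("Pune", "Maharashtra")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    ("2024-01-15", 10, 1, "Mumbai", 100.0),
    ("2024-01-20", 11, 1, "Pune", 50.0),
    ("2024-02-03", 10, 2, "Mumbai", 200.0),
    ("2024-02-03", 11, 1, "Pune", 25.0)])

# A typical star query: join the fact to the dimensions of interest,
# then aggregate the measure by descriptive dimension attributes.
rows = conn.execute("""
    SELECT p.category, t.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_prod p ON p.prodno = f.prodno
    JOIN dim_time t ON t.date = f.date
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which is what keeps such queries easy to write and to browse.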


Dimensions

• Stored as a database table

• Contains many descriptive attributes for analysis

• Small and slowly changing data

• Data often group-able for analysis

– Customers by age, occupation, income level

– Time by weeks, months, years

– Branches as rural, suburban or by size

• Thus, dimension data viewable as a hierarchy


Facts

• Contain business activity data

• May be at detailed level or status level; called transaction-oriented or snapshot-oriented

• Deciding on granularity: every sale, or total sales of a day?

• Often contain numeric attributes for aggregation (additive, semi-additive, …)

• Also contain dimension table keys
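The granularity choice can be made concrete with a small sketch. The sample rows and the helper name `daily_totals` are hypothetical; the point is only that a coarser grain (daily totals) can be derived from the finer grain (every sale), but not the other way round.

```python
from collections import defaultdict

# Transaction grain: one row per sale (date, product, amount). Sample data only.
sales = [
    ("2024-03-01", "P1", 120.0),
    ("2024-03-01", "P1", 80.0),
    ("2024-03-01", "P2", 40.0),
    ("2024-03-02", "P1", 60.0),
]

def daily_totals(rows):
    """Roll transaction-grain rows up to one row per (date, product)."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        totals[(day, product)] += amount
    return dict(totals)

daily = daily_totals(sales)
print(daily)
```

Four transaction rows collapse into three daily rows here; the individual sales of P1 on 2024-03-01 can no longer be recovered from the daily table.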


Snowflake Schema

• Hierarchies not captured explicitly in a star schema

• Snowflake schema represents hierarchy directly

• Saves on storage but requires more joins


Snowflake Schema

• Represent dimensional hierarchy directly by normalizing tables.

[Figure: snowflake schema. A central fact table (date, custno, prodno, cityname, ...) with the dimension tables time, prod, cust and city, plus a region table.]
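The extra join that a snowflake schema costs can be seen in a short sqlite3 sketch. It assumes that the region table is normalized out of the city dimension (the figure shows the tables but the extracted text does not show the links), and the `region_id` column, the `amount` measure and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed hierarchy: each city belongs to one region.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE city (
    cityname  TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE fact (
    date     TEXT,
    cityname TEXT REFERENCES city(cityname),
    amount   REAL  -- assumed measure
);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)", [(1, "West"), (2, "South")])
conn.executemany("INSERT INTO city VALUES (?, ?)", [
    ("Mumbai", 1), ("Pune", 1), ("Chennai", 2)])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)", [
    ("2024-01-15", "Mumbai", 100.0),
    ("2024-01-16", "Pune", 50.0),
    ("2024-01-16", "Chennai", 70.0)])

# Grouping by region needs two joins (fact -> city -> region); in a star
# schema the region name would be a column of the city dimension itself.
by_region = conn.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM fact f
    JOIN city c ON c.cityname = f.cityname
    JOIN region r ON r.region_id = c.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(by_region)
```

The region name is stored once per region rather than once per city, which is the storage saving the slide refers to.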


Conformed Dimensions and Facts

• Goal is to produce a master suite of conformed dimensions and to standardize facts

• A conformed dimension means the same thing with every fact table (e.g., customer, time, geography)

• It may contain data brought together from many sources

• Ensures the same units and meaning, and the same time durations and geographies, across marts


Financial Services : A Case Study

• A bank offers various products/services like saving/checking accounts, mortgage loans, personal loans, TD, credit cards, etc.

• Purpose: track various accounts, customer profiles, etc., for marketing and offering new services

• Requirements:

– Get end-of-month summary of accounts for the last 5 years

– Valid snapshot as of yesterday for the current month (with full details)

– Ability to group accounts in various ways & compare balances

– Demographic behavior


Case Study ...

• Each account type has some unique attributes (requiring customized dimension and facts for each)

• Old data (accounts & customers) may be incomplete or even different

• The warehouse data may come from multiple sources:

– Loan processing system (customer, loan, dues, payment)

– Fixed deposit system (customer, TD, …)

– Front-office system (customer, account, transaction, ..)

– Credit-card system (customer, transactions, interest, ..)


Case Study ...

• Must plan extraction, correlation, consistent representation,…

• Let us consider a possible warehouse design for the indicated requirements

• Core fact table : balance in each account, # of transactions, grain : month

• Dimensions : a/c, household, branch, product, status, time

• A/c and household separate : many accounts per family; household definitions change


Case Study ...

• Product dimension permits hierarchy and defining specific attributes; separate because it changes

• Status : active or not, closed, etc. with reasons

• Account contains customer’s data; for historical reasons, customer to accounts relationship not well maintained


The household data warehouse

Household Facts (fact table): account_key, household_key, branch_key, product_key, status_key, time_key, primary_balance, transaction_count

Account dimension: account key, primary_name, secondary_name, account_address, account_city, account_state, account_zip, date_opened, primary_age, primary_sex, primary_marital

Household dimension: household key, household_head_name, household_address, household_city, household_state, household_zip, household_income, household_type

Branch dimension: branch key, branch_name, branch_address, branch_city, branch_state, branch_zip, branch_type

Product dimension: product key, product_description, type, category

Status dimension: status key, status_description, status_reason, new_account_flag, closed_account_flag

Time dimension: time key, month, year, fiscal_quarter


Case Study ...

• Balance is semi-additive : it cannot be added across time

• Products are highly heterogeneous : different attributes characterize different accounts (balance, deposit options, interest rate, overdraft limit, ..)

• Cannot combine all attributes in one dimension, as many are not applicable to all products
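The semi-additive behavior of balances can be illustrated with a few lines of Python. The account identifiers, months and amounts are made up, and the helper names are hypothetical; the sketch only shows that a balance sums sensibly across accounts within one month, while across months an average (or the latest value) is the meaningful aggregate.

```python
# Month-end balances keyed by (account, month). Sample data only.
balances = {
    ("A1", "2024-01"): 1000.0,
    ("A1", "2024-02"): 1200.0,
    ("A2", "2024-01"): 500.0,
    ("A2", "2024-02"): 300.0,
}

def total_for_month(month):
    """Adding balances across accounts within one month is meaningful."""
    return sum(v for (_, m), v in balances.items() if m == month)

def average_over_time(account):
    """Across time, average the balance instead of summing it."""
    values = [v for (a, _), v in balances.items() if a == account]
    return sum(values) / len(values)

jan_total = total_for_month("2024-01")
a1_average = average_over_time("A1")
print(jan_total, a1_average)
```

Summing A1's two month-end balances would give 2200.0, a figure that corresponds to no amount the customer ever held.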


Case Study ...

• Solution: create many facts, customized for each product, and one core fact with a product dimension having common attributes; leads to 100% replication, but facilitates clarifications, browsing, etc. and avoids join of customized and core facts

• When many facts are to be stored together go for snapshots (eg. monthly)


Case Study ...

• Transaction-grained facts usually have a single fact (eg. amount) that is directly involved in the transaction; we need a transaction dimension to represent these amounts

• In transaction grained fact table, we do not need customized fact tables per product; instead we create customized dimension tables


Data Warehouse Life Cycle

[Figure: life cycle diagram. Boxes shown: Project Planning; Business Requirement Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; Data Staging Design & Development; End-User Application Specification; End-User Application Development; Deployment; Maintenance & Growth; Project Management.]


Life Cycle : summary

• Project planning

• Business requirements definition

• Data track

– Dimensional modeling

– Physical design

– Data staging design and development

• Technology track

– Technical architectural design

– Product selection and installation


Life Cycle...

• Application track

– End user application specification

– End user application development

• Deployment

• Maintenance and growth

• Project management


Collecting Requirements...

• Interviews/write-ups

• Requirements findings document

– Project overview

– Review of business objectives

– Analytic and information requirements

– Preliminary source systems analysis

– Preliminary success criteria

• Prepare and publish the requirements


Collecting Data about Existing Systems

• Understanding the candidate data sources

• Detailed criteria for selecting the data sources

– Data accessibility

– Longevity of the feed

– Data accuracy

– Project scheduling

• Customer matching and house-holding

• Browsing and data content

• Mapping data from source to target


Designing the Data Warehouse / Data Marts

• Identifying marts and dimensions

• Identify marts based on facts likely to be used together, as a mart is a kind of subject area or application (divide-and-conquer strategy)

• Often based on a single business process or a single source

• 10 to 30 marts are common for a large organization

• Build a matrix of marts versus dimensions
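A marts-versus-dimensions matrix can be kept as a simple mapping. The three marts and their dimension lists below are hypothetical, chosen only to fit the banking case study; the sketch builds the matrix and picks out the dimensions used by more than one mart, which are the natural candidates for conforming.

```python
# Hypothetical marts and the dimensions each one uses.
mart_dimensions = {
    "savings": {"time", "customer", "branch", "product"},
    "loans": {"time", "customer", "branch", "product", "status"},
    "credit_cards": {"time", "customer", "merchant"},
}

# Column order of the matrix: every dimension used anywhere, sorted by name.
all_dimensions = sorted(set().union(*mart_dimensions.values()))

# One row per mart, one boolean per dimension.
matrix = {
    mart: [dim in dims for dim in all_dimensions]
    for mart, dims in mart_dimensions.items()
}

# Dimensions that appear in more than one mart.
shared = [
    dim for dim in all_dimensions
    if sum(dim in dims for dims in mart_dimensions.values()) > 1
]

print(all_dimensions)
for mart, row in matrix.items():
    print(mart, ["X" if used else "-" for used in row])
print("shared:", shared)
```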


Designing a Fact

• Define fact grain based on the basic business facts stored in legacy systems

• Choose dimensions and match them with granularity of facts

• Combine as many facts as possible within the context of the defined granularity


Detailed Design Tips

• Names for dimensions and attributes should be chosen carefully to refer to corresponding business entities

• An attribute (in a dimension) is not replicated, but a fact may be present in many fact tables

• If a dimension occurs multiple times (eg, time), it is playing multiple roles; name them uniquely

• Every fact should have a default aggregation rule so that it is not aggregated wrongly


Dimension Attributes

• The quality of the data warehouse is measured by the quality of the dimension attributes

• The user interface responses and final reports are restricted to the precise contents of the dimension table attributes

• Properties

– Verbose, descriptive, complete

– Quality assured, indexed

– Equally available, documented


Time Dimension

• Every data warehouse fact table is a time series of some observations

• We always seem to have one or more time dimensions in our fact table designs

• Provides useful hierarchies : week, month, quarter, year, etc

• Represents calendar with many useful attributes like day of week, day of month, week#, day#, quarter, weekday-flag, last-day-of-month-flag, holiday flag, etc.
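A time dimension with the attributes listed above can be generated rather than loaded from a source system. The following is a minimal sketch: the function name `build_time_dimension`, the column names and the single holiday entry are illustrative assumptions, and week numbers follow the ISO 8601 convention, which is only one of several possible choices.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_time_dimension(start, end, holidays=frozenset()):
    """Return one dictionary per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        next_day = day + timedelta(days=1)
        rows.append({
            "date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_of_month": day.day,
            "day_number": day.timetuple().tm_yday,   # day within the year
            "week_number": day.isocalendar()[1],     # ISO week number
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "weekday_flag": day.weekday() < 5,       # Monday to Friday
            "last_day_of_month_flag": next_day.month != day.month,
            "holiday_flag": day in holidays,
        })
        day = next_day
    return rows

# One year of calendar rows with a single illustrative holiday.
calendar = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31),
                                holidays={date(2024, 1, 26)})
print(len(calendar), calendar[0])
```

Because every attribute is precomputed, queries can filter or group on flags such as `last_day_of_month_flag` without any date arithmetic.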


Slowly Changing Dimensions

• The product key or customer key does not change, but the description of the product or customer does

• The data warehouse has three options for the above changes

– Overwrite the dimension record with the new values, thereby losing history

• It is used whenever the old value of the attribute has no significance

• The correction of any error falls into this category


Slowly Changing Dimensions...

– Create a new additional dimension record using a new value of the surrogate key

• It is the primary technique for accurately tracking a change in an attribute within a dimension

• It requires use of a surrogate key

• It is used when a true physical change to the dimension entity has taken place

– Create an “old” field in the dimension record to store the immediate previous attribute value

• It is used when a change is tentative
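The three options can be compared side by side on a one-row customer dimension. This is a sketch under stated assumptions: the dimension is a plain list of dictionaries, and the function names, the `production_key` and `surrogate_key` fields and the sample city change are all hypothetical.

```python
import copy
from itertools import count

def overwrite(dimension, production_key, attribute, new_value):
    """Option 1: overwrite in place; the old value is lost."""
    for row in dimension:
        if row["production_key"] == production_key:
            row[attribute] = new_value

def add_new_record(dimension, production_key, attribute, new_value, next_key):
    """Option 2: add a record under a new surrogate key; history is kept."""
    current = [r for r in dimension if r["production_key"] == production_key][-1]
    new_row = dict(current)
    new_row["surrogate_key"] = next(next_key)
    new_row[attribute] = new_value
    dimension.append(new_row)
    return new_row["surrogate_key"]

def keep_old_field(dimension, production_key, attribute, new_value):
    """Option 3: move the previous value into an 'old' field, then update."""
    for row in dimension:
        if row["production_key"] == production_key:
            row["old_" + attribute] = row[attribute]
            row[attribute] = new_value

base = [{"surrogate_key": 1, "production_key": "C100", "city": "Pune"}]

dim1 = copy.deepcopy(base)
overwrite(dim1, "C100", "city", "Mumbai")

dim2 = copy.deepcopy(base)
new_key = add_new_record(dim2, "C100", "city", "Mumbai", count(2))

dim3 = copy.deepcopy(base)
keep_old_field(dim3, "C100", "city", "Mumbai")

print(dim1)
print(dim2)
print(dim3)
```

Only the second option lets older fact rows keep pointing at the record that was true when they were loaded, which is why it depends on surrogate keys.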


Time Stamping the Changes

• The design of slowly changing dimension may be established by adding begin and end time stamps and a transaction description in each instance of a dimension record

• This design allows very precise time slicing of the dimension by itself
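Time slicing with begin and end stamps can be sketched as an "as of" lookup. The field names, the far-future sentinel date used for the current record and the sample rows are assumptions made for this illustration.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still current"

# Two versions of the same customer, each stamped with its validity period.
customer_dim = [
    {"surrogate_key": 1, "production_key": "C100", "city": "Pune",
     "begin_date": date(2020, 1, 1), "end_date": date(2023, 6, 30),
     "transaction_description": "initial load"},
    {"surrogate_key": 2, "production_key": "C100", "city": "Mumbai",
     "begin_date": date(2023, 7, 1), "end_date": OPEN_END,
     "transaction_description": "customer moved"},
]

def as_of(dimension, production_key, when):
    """Return the record valid for production_key on the date `when`, or None."""
    for row in dimension:
        if (row["production_key"] == production_key
                and row["begin_date"] <= when <= row["end_date"]):
            return row
    return None

print(as_of(customer_dim, "C100", date(2022, 5, 1))["city"])
print(as_of(customer_dim, "C100", date(2024, 1, 1))["city"])
```

The lookup uses only the dimension table, which is what slicing the dimension "by itself" amounts to: no fact rows are needed to know what the customer record looked like on a given day.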


Large Dimensions

• Data warehouses that store extremely granular data may require some extremely large dimensions

• To support large dimensions we must choose indexing technologies and data design approaches that:

– Support rapid browsing of the unconditional dimension, especially for low cardinality attributes

– Support efficient browsing of cross-constrained values in the dimension table

– Find and suppress duplicate entries in the dimension


Foreign Key, Primary Key, Surrogate Key

• All dimensional tables have single keys, which, by definition, are primary keys

• All data warehouse keys must be meaningless surrogate keys; you must not use the original production keys

• A four-byte integer makes a good surrogate key

• Surrogate date keys

• Avoid smart keys

• Avoid production keys
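Surrogate key assignment during data staging can be sketched as a lookup table from production keys to sequential integers. The class name `SurrogateKeyMap`, the source system names and the key values are hypothetical. The sketch keys on the pair (source system, production key) and does not attempt customer matching across systems.

```python
from itertools import count

class SurrogateKeyMap:
    """Assigns meaningless integer surrogate keys to production keys."""

    def __init__(self):
        self._next = count(1)   # sequential integers starting at 1
        self._keys = {}         # (source_system, production_key) -> surrogate

    def key_for(self, source_system, production_key):
        natural = (source_system, production_key)
        if natural not in self._keys:
            self._keys[natural] = next(self._next)
        return self._keys[natural]

keys = SurrogateKeyMap()
k1 = keys.key_for("loans", "CUST-0042")
k2 = keys.key_for("cards", "CUST-0042")  # same production key, other source
k3 = keys.key_for("loans", "CUST-0042")  # repeated lookup returns the same key
print(k1, k2, k3)
```

The surrogate values carry no meaning of their own, so a source system can reuse or reformat its production keys without disturbing the warehouse.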


Heterogeneous Product Schemas

• Multiple fact tables are needed when a business has heterogeneous products

• The global view needs a single core fact table crossing all lines of business, whereas local view focuses on specific product

• There are many attributes and facts which apply only to a specific product; a single fact table is not feasible

• Create a customized fact table and (product) dimension table for each product, and build a core fact table with attributes that make sense across all lines of business; this allows a single portfolio (of products) to be created for each customer


Transaction Schema

• Every data mart needs two separate models

– Transaction version

– Periodic snapshot version

• ‘Rolling’ snapshot containing averages across time

• Snapshots allow us to quickly measure the status of the enterprise

• The transaction schema

– Low-level transactions in the organization make for a good dimensional framework

– The fact record for an individual transaction frequently contains only a single value


Transaction Schema...

• The transaction-based warehouse is commonly used in

– Time of day analysis

– Queue analysis

– Fraud detection

– Basket analysis

– Current status


Factless Fact Tables

• Useful to describe events and their coverage

• An event fact table records the occurrence of an event; it has only a flag and dimension keys (e.g., student attendance)

• A coverage fact table is frequently needed when a primary fact table in a dimensional data warehouse is sparse; e.g., the primary fact table will not show items which were on promotion but did not sell; the coverage table, containing only dimension keys, lists all items on sale
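The coverage idea can be shown with plain Python sets. The key layout (date, product, store) and all sample rows are assumptions for this sketch; the coverage table carries only dimension keys, and the question "what was promoted but did not sell?" becomes a set difference against the sparse sales facts.

```python
# Coverage (factless) table: dimension keys for every item on promotion.
promotion_coverage = {
    ("2024-03-01", "P1", "S1"),
    ("2024-03-01", "P2", "S1"),
    ("2024-03-01", "P3", "S1"),
}

# Primary sales fact table: sparse, a row exists only when something sold.
sales_facts = [
    {"date": "2024-03-01", "product": "P1", "store": "S1", "units": 5},
    {"date": "2024-03-01", "product": "P3", "store": "S1", "units": 2},
    {"date": "2024-03-01", "product": "P4", "store": "S1", "units": 1},
]

# Keys that actually appear in the sales facts.
sold = {(f["date"], f["product"], f["store"]) for f in sales_facts}

# Items covered by a promotion that have no matching sales row.
promoted_but_not_sold = sorted(promotion_coverage - sold)
print(promoted_but_not_sold)
```

Product P2 appears in the result even though the sales fact table contains no row for it, which is exactly the information the sparse primary table cannot provide on its own.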


Facts of Different Granularity

• The dimensional model gains power as the individual fact records become more and more atomic

• At the lowest level of individual transactions, the design is most powerful because

– More of the descriptive attributes have single values

– The design withstands surprise in the form of new facts, new dimensions, or new attributes within existing dimensions

– More expressiveness at the lowest levels of granularity


Metadata Catalog

• It is an integral part of the overall architecture

• It contains information that describes the warehouse and plays an active role in its creation, use, and maintenance

• Contains source system metadata (data and processes), data staging metadata (dimensions, transformations, aggregations), DBMS metadata (tables, indexes, stored procedures), and front-room metadata (users, applications)


Technical Architecture

• Metadata driven

– Metadata provides flexibility by buffering the various components of the system from each other

– The metadata catalog provides parameters and information that allow the applications to perform their tasks


Conclusion

• Building a corporate-wide data warehouse is a challenging task

• A systematic methodology is essential

• Plan the architecture globally but build it incrementally

• Keep user requirements at the core of all development activities