NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

NIC Exposure Level Training

Vijayendra Gururao, Business Intelligence Consultant


Agenda

Data warehousing Concepts - Day 1
Govt Case Study - Day 2
Defense HR Case Study - Day 3
Manufacturing Case Study - Day 4
Data mining - Day 5


The Evolution of Business Intelligence

[Figure: stages of analysis moving from passive to active and from human-driven to technology-driven: Report, Summarize (MIS), Focus (EIS), Analyze (OLAP), Recommend (Data Mining), Act (Intelligent Agents). Timeline markers: 1986, 1991, 1996, 2001. Axis labels: Type of Analysis, Action. Side labels: What's important, What it means, What to do about it.]


Data Warehousing


Introduction:

Definitions
Legacy Systems
Dimensions
Data Dependencies Model
Dimensional Model


An ER Model

[Figure: ER diagram with the entities Ship Type, Shipper, District Credit, Order Item, Ship To, Product, Contact Locat., Product Line, Sales Order, Cust. Locat., Product Group, Contract, Contract Type, Customer, Sales Rep, Sales District, Sales Region, Sales Division, Contact.]


Why Data Warehouses?

To meet the long-sought-after goal of providing the user with more flexible databases containing data that can be accessed “every which way.”


OLTP vs. OLAP

OLTP (Online transaction processing) has been the standard reason for IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.

OLAP (Online analytical processing) is a set of procedures for defining and using a dimension framework for decision support


The Goals for and Characteristics of a DW

Make organizational data accessible
Facilitate consistency
Adaptable and yet resilient to change
Secure and reliable
Designed with a focus on supporting decision making


The Goals for and Characteristics of a DW

Generate an environment in which data can be sliced and diced in multiple ways

It is more than data, it is a set of tools to query, analyze, and present information

The DW is the place where operational data is published (cleaned up, assembled, etc.)


Data Warehousing is Changing!

[Figure: applications surrounding the data warehouse: ERP, Campaign Management, Supply Chain, Customer Relationship Mgmt., E-commerce, Target Marketing, Knowledge Management, Call Center.]

Application requirements--not just data requirements--are now driving need.


Organization of data in the presentation area of the data warehouse

Data in the warehouse are dimensional, not normalized relations

However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs

Data should be atomic someplace in the warehouse; even if the presentation is aggregate

Uses the bus architecture to support a decentralized set of data marts


Updates to a data warehouse

For many years, the dogma stated that data warehouses are never updated.

This is unrealistic since labels, titles, etc. change.

Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)


Basic elements of the data warehouse

Flow: Operational Source Systems -> (Extract) -> Data Staging Area -> (Load) -> Data Presentation Area -> (Access) -> Data Access Tools

Data Staging Area
  Services: Clean, combine, and standardize; Conform dimensions; No user query services
  Data Store: Flat files and relational tables
  Processing: Sorting and sequential processing

Data Presentation Area
  Data Mart #1: Dimensional; Atomic and summary data; Based on a single business process
  Data Mart #2: Similar design
  DW Bus: Conformed facts and dimensions

Data Access Tools
  Ad hoc query tools
  Report writers
  Analytical applications
  Modeling: Forecasting, Scoring, Data mining


Data Staging Area

Extract-Transformation-Load
Extract: Reading the source data and copying the data to the staging area
Transformation:
  Cleaning
  Combining
  De-duplicating
  Assigning keys
Load: present data to the bulk loading facilities of the data mart
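
The three staging steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source extracts (`crm_rows`, `billing_rows`), the column names, and the in-memory "data mart" list are all hypothetical, and the duplicate-handling step is read here as removing duplicate rows.

```python
from itertools import count

# Hypothetical source extracts; names and columns are illustrative only.
crm_rows = [{"cust_code": " sa111 ", "name": "Safeway"},
            {"cust_code": "LO222", "name": "Loblaws"}]
billing_rows = [{"cust_code": "SA111", "name": "Safeway"},
                {"cust_code": "SE555", "name": "7-11"}]

def extract(*sources):
    """Extract: copy source rows into a staging list without changing them."""
    return [dict(row) for source in sources for row in source]

def transform(staged):
    """Transform: clean, combine, remove duplicates, and assign warehouse keys."""
    next_key = count(1)
    seen = {}
    for row in staged:
        code = row["cust_code"].strip().upper()             # cleaning
        if code not in seen:                                 # removing duplicates
            seen[code] = {"customer_key": next(next_key),    # assigning keys
                          "cust_code": code,
                          "name": row["name"]}
    return list(seen.values())

def load(rows, target):
    """Load: hand the finished rows to the data mart's bulk loader (a list here)."""
    target.extend(rows)
    return len(rows)

customer_dim = []
loaded = load(transform(extract(crm_rows, billing_rows)), customer_dim)
```

Note that no query runs against the staged rows; only the loaded `customer_dim` would be exposed to users.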


Dimensional Modeling Terms and Concepts
Fact table
Dimension tables


Fact Tables

Fact table: a table in the data warehouse that contains
  Numerical performance measures
  Foreign keys that tie the fact table to the dimension tables


Fact Tables

Each row records a measurement describing a transaction

Where? When? Who? How much? How many?

The level of detail represented by this data is referred to as the grain of the data warehouse

Questions can only be asked down to a level corresponding with the grain of the data warehouse


Dimension tables

Tables containing textual descriptors of the business
Dimension tables are usually wide (e.g., 100 columns)
Dimension tables are usually shallow (hundreds of thousands or a few million rows)
Values in the dimensions usually provide
  Constraints on queries (e.g., view customer by region)
  Report headings


Dimension tables

The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes

Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations
  Example: Product part -> Product -> Brand

Dimensions are usually highly denormalized


Dimension tables

The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support


Bringing together facts and dimensions – Building the dimensional Model
Start with the normalized ER Model
Group the ER diagram components into segments based on common business processes and model each as a unit
Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table
Denormalize the other tables as needed and designate one field as a primary key


A Dimensional Model

Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Time Dimension: time_key, day_of_Week, month, quarter, year, holiday_flag
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
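
As a rough illustration, the model above can be built and queried with Python's standard `sqlite3` module. The table names (`sales_fact`, `time_dim`, and so on), the column types, and every sample row are assumptions made for this sketch; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT,
                          quarter TEXT, year INTEGER, holiday_flag INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          address TEXT, floor_plan_type TEXT);
CREATE TABLE sales_fact  (time_key    INTEGER REFERENCES time_dim(time_key),
                          product_key INTEGER REFERENCES product_dim(product_key),
                          store_key   INTEGER REFERENCES store_dim(store_key),
                          dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
""")

# Made-up sample rows, only so that the query below returns something.
conn.execute("INSERT INTO time_dim VALUES (1, 'Mon', 'Jan', 'Q1', 1999, 0)")
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)",
                 [(1, "Soup", "BrandA", "Canned"), (2, "Beans", "BrandB", "Canned")])
conn.execute("INSERT INTO store_dim VALUES (1, 'Store 1', '1 Main St', 'Small')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 10.0, 5, 6.0), (1, 2, 1, 8.0, 4, 5.0), (1, 1, 1, 2.0, 1, 1.2)])

# Dimension attributes constrain and label the report; the facts are summed.
sales_by_brand = conn.execute("""
    SELECT p.brand, t.year, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t    ON t.time_key = f.time_key
    GROUP BY p.brand, t.year
    ORDER BY p.brand
""").fetchall()
```

The query joins the fact table to only the dimensions it needs, which is the usual access pattern for a star-shaped model.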


Kimball Methodology

Conformed Dimensions


Review: A Private Data Mart

A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.

What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?


Understand Conformed Dimensions

[Figure: the Sales Fact, Distribution Fact, and Order Fact tables share the conformed dimensions Location, Customer, Product, Time, Distributor, and Promotion.]

A conformed dimension is a dimension that is standardized across all data marts.


Advantages of Conformed Dimensions

Deliver incremental data marts in a short period of time.

Independent data marts become part of a fully integrated data warehouse.

Deliver a consistent view across your business.


Conformed Dimensions Within Bus Architecture

[Figure: bus matrix with the facts (Sales Fact, Distribution Fact, Order Fact) as rows and the dimensions (Location, Customer, Product, Time, Distributor, Promotion) as columns; an X marks each dimension that a fact table uses.]

Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.


Design of Conformed Dimensions

A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.

Lay out a broad dimensional map for the enterprise.
Define conformed dimensions at the most granular (atomic) level possible.
Conformed dimensions should always use surrogate keys.
Define standard definitions for dimension and fact attributes.


Granularity in Conformed Dimensions

D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
D_TimeDay: Day Id, Day, Month, Year
D_Customer: Customer Id, Last Name, First Name, Address

Conformed dimensions should be defined at the most granular (atomic) level so that each record in the base-level fact table maps to exactly one record in each of these tables.


Flexibility of Conformed Dimensions

Product: Product Id, Description, Product Type, Product Line
Customer: Customer Id, Last Name, First Name, Address
Time (Day): Day Id, Day, Month Id, Period
Time (Month) View: Month Id, Month, Period (a view or snowflake table of the Time dimension)

Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
Sales Fact: Month Id, Product Id, Customer Id, Amount Sold, Revenue

Conformed dimensions are usually designed within star schema data marts. For multiple granularity fact tables, higher level views of dimensions can be used (or a snowflake table).
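
One way to picture the month-level view is the `sqlite3` sketch below. It assumes a day-level table that also carries the month name, and the table names, key formats, and sample rows are invented for illustration; the month-grain fact table joins to the view while a day-grain fact table would join to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_day (day_id INTEGER PRIMARY KEY, day TEXT,
                       month_id INTEGER, month TEXT, period TEXT);
-- Month-level view derived from the day-level conformed dimension.
CREATE VIEW time_month AS
    SELECT DISTINCT month_id, month, period FROM time_day;
CREATE TABLE sales_fact (month_id INTEGER, product_id INTEGER, customer_id INTEGER,
                         amount_sold INTEGER, revenue REAL);
""")
conn.executemany("INSERT INTO time_day VALUES (?, ?, ?, ?, ?)",
                 [(19990101, "1999-01-01", 199901, "Jan", "1999"),
                  (19990102, "1999-01-02", 199901, "Jan", "1999"),
                  (19990201, "1999-02-01", 199902, "Feb", "1999")])
conn.execute("INSERT INTO sales_fact VALUES (199901, 1, 1, 7, 70.0)")

# The view collapses the day rows to one row per month.
months = conn.execute(
    "SELECT month_id, month FROM time_month ORDER BY month_id").fetchall()

# A month-grain fact table joins to the view, not to the day-level table.
jan_sales = conn.execute("""
    SELECT m.month, s.revenue
    FROM sales_fact s
    JOIN time_month m ON m.month_id = s.month_id
""").fetchall()
```

Because the view is derived from the day-level table, both grains keep the same month labels without maintaining a second copy of the dimension.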


So, What is a DW?

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions

W.H. Inmon (the father of DW)


Subject Oriented

Data in a data warehouse are organized around the major subjects of the organization


Integrated

Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment


Non-Volatile

Once added to the DW, data are not changed (barring the existence of major errors)


Time Variant

The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!)

The snap-shot is called a production data extract


Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)

Comparison of operational and informational systems


Data Warehouse Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and @ctive Warehouse
Three-Layer Architecture

All involve some form of extraction, transformation and loading (ETL)


Generic two-level architecture

[Figure: generic two-level architecture, with E, T, and L steps feeding the warehouse]

One, company-wide warehouse

Periodic extraction: data is not completely current in warehouse


Independent Data Mart

Data marts: mini-warehouses, limited in scope

[Figure: independent data mart architecture, with E, T, and L steps]

Separate ETL for each independent data mart

Data access complexity due to multiple data marts


Dependent data mart with operational data store

[Figure: dependent data mart architecture, with E, T, and L steps]

Single ETL for enterprise data warehouse (EDW)

Simpler data access

ODS provides option for obtaining current data

Dependent data marts loaded from EDW


Logical data mart and @ctive data warehouse

[Figure: logical data mart architecture, with E, T, and L steps]

Near real-time ETL for @ctive Data Warehouse

ODS and data warehouse are one and the same

Data marts are NOT separate databases, but logical views of the data warehouse

Easier to create new data marts


Three-layer architecture


DW Design

Mainly consists of
  Logical Design
  Physical Design


Logical Design of DW

Identification of
  Entities
  Relationships
  Attributes
  Unique identifiers

Conceptual and abstract. Results in fact and dimension tables

Created using pen and paper or modeling tools


Physical Design of DW

Conversion of data gathered in Logical design to Physical database structure

Mainly driven by query performance


[Figure: logical design compared with physical design]


Data Characteristics: Status vs. Event Data

[Figure: example of a DBMS log entry, showing the status before and after an event]

Event = a database action (create/update/delete) that results from a transaction


Data Characteristics: Transient vs. Periodic Data

Figure 11-8: Transient operational data

Changes to existing records are written over previous records, thus destroying the previous data content


Data Characteristics: Transient vs. Periodic Data

Periodic warehouse data

Data are never physically altered or deleted once they have been added to the store


Data Reconciliation

Typical operational data is:
  Transient – not historical
  Not normalized (perhaps due to denormalization for performance)
  Restricted in scope – not comprehensive
  Sometimes poor quality – inconsistencies and errors

After ETL, data should be:
  Detailed – not summarized yet
  Historical – periodic
  Denormalized
  Comprehensive – enterprise-wide perspective
  Quality controlled – accurate with full integrity


Extract Transform Load

Extract data from operational system, transform and load into data warehouse

Why ETL?
  Will your warehouse produce correct information with the current data?
  How can I ensure warehouse credibility?


Excuses for NOT Transforming Legacy Data
Old data works fine, new will work as well.
Data will be fixed at point of entry through GUI.
If needed, data will be cleaned after new system populated; after proof-of-concept pilot.
Keys join the data most of the time.
Users will not agree to modifying or standardizing their data.


Levels of Migration Problem
Existing metadata is insufficient and unreliable
  Metadata must hold for all occurrences
  Metadata must represent business and technical attributes
Data values incorrectly typed and accessible
  Value's form extracted from storage
  Value's meaning inferred from its content
Entity keys unreliable or unavailable
  Inferred from related values


Metadata Challenge

Metadata gets out of sync with details it summarizes
  Business grows faster than systems designed to capture business info
Not at the right level of detail
  Multiple values in a single field
  Multiple meanings to a single field
  No fixed format for value
Expressed in awkward or limited terms
  Program/compiler view rather than business view


Character-level Challenge
Value instance level
  Spelling, aliases
  Abbreviations, truncations, transpositions
  Inconsistent storage formats
Named type level
  Multiple meanings, contextual meanings
  Synonyms, homonyms
Entity level
  No common keys or representation
  No integrated view across records, files, systems


The ETL Process

Capture
Scrub or data cleansing
Transform
Load and Index

ETL = Extract, transform, and load


The ETL Process

[Figure: Source Systems -> Extract -> Staging Area (Transform) -> Load -> Presentation System]


Source Data

Record the name, location, and data that exist in the TPS environment.
  File names and location
  Layout
  Attribute meaning

[Table columns: Source | Business Owner | IS Owner | Platform | Location | Data Source Description]


Extraction

Copy specific data directly from the source tables into a working dataset in the staging area.

[Table columns: Target Table | Target Column | Data Type | Len | Target Column Description | Source System | Source Table / File | Source Col / Field | Data Txform Notes]


Transformation (Dimension Tables)
Generate surrogate key in a primary-surrogate table. Make this permanent.
Insert the surrogate key into the working dimension tables.
Conduct any editing/cleaning operations you need (usually on the working table)
Generate any derived attributes you need.
Generate and retain process logs.


Transformation (Fact Tables)
Join all dimensions to the fact table (using original primary keys).
Insert surrogate keys
Generate derived facts
Generate indicator flags

[Table columns: Chg Flag | Fact Group | Derived Fact Name | Derived Fact Description | Type | Agg Rule | Formula | Constraints | Transformations]


Target Data

Describe the presentation data structure.

  Model
  Metadata
  Usage and constraints

[Table columns: Table Name | Column Name | Data Type | Len | Nulls? | Column Description | PK | PK Order | FK]


Flow Documentation

DFD for the ETL process
ERD for Source, Staging and Target databases.
Metadata
Usage notes.


Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract

Capture = extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
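
The difference between the two kinds of extract can be sketched as follows. The source rows and the `updated_at` change-tracking column are assumptions for this example; real systems may instead rely on log files or change-data-capture features.

```python
from datetime import datetime

# Hypothetical source table; `updated_at` is an assumed change-tracking column.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(1999, 1, 5)},
    {"id": 2, "amount": 250, "updated_at": datetime(1999, 2, 10)},
    {"id": 3, "amount": 75,  "updated_at": datetime(1999, 3, 1)},
]

def static_extract(rows):
    """Static extract: a snapshot of all chosen source rows at this moment."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract_time):
    """Incremental extract: only the rows changed since the previous extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(1999, 2, 1))
```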


Steps in data reconciliation (continued)

Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
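
A few of these scrubbing tasks (reformatting, decoding, catching erroneous dates, and error logging) are shown below as plain rules. This is a deliberately simple, rule-based sketch rather than the pattern-recognition or AI techniques mentioned above; the field names, the `GENDER_CODES` table, and the day/month/year date format are all hypothetical.

```python
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}   # hypothetical decode table

def scrub(record, error_log):
    """Return a cleaned copy of `record`, logging anything that cannot be fixed."""
    clean = dict(record)
    # Reformatting: collapse stray whitespace and normalize capitalization.
    clean["name"] = " ".join(record["name"].split()).title()
    # Decoding: translate a one-letter code into its meaning.
    clean["gender"] = GENDER_CODES.get(record["gender"].strip().upper())
    if clean["gender"] is None:
        error_log.append((record["id"], "unknown gender code"))
    # Erroneous dates: impossible dates are blanked out and logged.
    try:
        clean["hired"] = datetime.strptime(record["hired"], "%d/%m/%Y").date()
    except ValueError:
        clean["hired"] = None
        error_log.append((record["id"], "bad hire date"))
    return clean

errors = []
rows = [{"id": 1, "name": "  mary   SMITH", "gender": "f", "hired": "01/02/1999"},
        {"id": 2, "name": "jack", "gender": "x", "hired": "31/02/1999"}]
cleaned = [scrub(r, errors) for r in rows]
```

Keeping the error log separate from the cleaned rows lets the load continue while the rejected values are reviewed later.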


Steps in data reconciliation (continued)

Transform = convert data from format of operational system to format of data warehouse

Record-level:
  Selection – data partitioning
  Joining – data combining
  Aggregation – data summarization

Field-level:
  single-field – from one field to one field
  multi-field – from many fields to one, or one field to many
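
The three record-level functions can be illustrated on a small in-memory data set. The `orders` and `customers` structures and their values are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical operational data.
orders = [{"order_id": 1, "cust_id": 10, "region": "East", "amount": 40.0},
          {"order_id": 2, "cust_id": 20, "region": "West", "amount": 25.0},
          {"order_id": 3, "cust_id": 10, "region": "East", "amount": 60.0}]
customers = {10: "Safeway", 20: "Loblaws"}

# Selection: partition the data by keeping only rows that meet a condition.
east_orders = [o for o in orders if o["region"] == "East"]

# Joining: combine each order with its customer record.
joined = [{**o, "cust_name": customers[o["cust_id"]]} for o in orders]

# Aggregation: summarize detail rows into one total per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["cust_name"]] += row["amount"]
```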


Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
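
A minimal sketch of the two load modes, assuming the target is a plain dictionary keyed by a hypothetical `id` column (the dictionary key plays the role of the index here):

```python
def refresh_load(target, source_rows):
    """Refresh mode: rewrite the whole target from the source in bulk."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})

def update_load(target, changed_rows):
    """Update mode: write only the changed source rows into the target."""
    for row in changed_rows:
        target[row["id"]] = row

warehouse = {}
refresh_load(warehouse, [{"id": 1, "qty": 5}, {"id": 2, "qty": 8}])
update_load(warehouse, [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])
```

Refresh mode is simpler but touches every row on each run; update mode does less work per run but depends on reliably identifying what changed in the source.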


Single-field transformation

In general – some transformation function translates data from old form to new form

Algorithmic transformation uses a formula or logical expression

Table lookup – another approach
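
Both approaches fit in a few lines. The temperature conversion and the `STATE_NAMES` table are illustrative choices, not taken from the slides.

```python
def fahrenheit_to_celsius(temp_f):
    """Algorithmic transformation: a formula maps the old value to the new one."""
    return (temp_f - 32) * 5 / 9

STATE_NAMES = {"TX": "Texas", "NY": "New York"}   # hypothetical lookup table

def state_name(code):
    """Table lookup: the new value is found by key in a reference table."""
    return STATE_NAMES.get(code, "Unknown")
```

For example, `fahrenheit_to_celsius(212)` gives `100.0`, and `state_name("TX")` gives `"Texas"`.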


Multi field transformation

M:1 – from many source fields to one target field

1:M – from one source field to many target fields
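
A short sketch of each direction, using hypothetical address and name fields:

```python
def full_address(street, city, postal_code):
    """M:1 - several source fields are combined into one target field."""
    return f"{street}, {city} {postal_code}"

def split_name(full_name):
    """1:M - one source field is split into several target fields."""
    first, _, last = full_name.strip().partition(" ")
    return {"first_name": first, "last_name": last.strip()}
```

The name split assumes the first space separates first and last name, which real data will often violate; it is only meant to show the shape of a 1:M transformation.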


Derived Data

Objectives
  Ease of use for decision support applications
  Fast response to predefined user queries
  Customized data for particular target audiences
  Ad-hoc query support
  Data mining capabilities

Characteristics
  Detailed (mostly periodic) data
  Aggregate (for summary)
  Distributed (to departmental servers)

Most common data model = star schema (also called “dimensional model”)


Components of a star schema

Fact tables contain factual or quantitative data

Dimension tables contain descriptions about the subjects of the business

1:N relationship between dimension tables and fact tables

Excellent for ad-hoc queries, but bad for online transaction processing

Dimension tables are denormalized to maximize performance


Star schema example

Fact table provides statistics for sales broken down by product, period and store dimensions


Star schema with sample data


Advanced concepts
  Slowly Changing Dimensions
  Ragged Hierarchies


What if Our Data is not Static?
Small occasional changes in dimension data are normal in business.
Examples of these changes include:
  addition of new members (a new product is launched)
  changing of relationships within the dimension (a sales rep moves to another branch)
  properties of members changed (a product is reformulated or renamed)
  deletion of members (this is rare in data warehousing)


Understand Surrogates

4-byte integer key (can hold more than 2 billion positive integers)

Internally assigned and meaningless
  ensures uniqueness
  always known

Used in conjunction with business keys
  business key is often mnemonic; for example, OTA used for Ottawa office
  surrogate key is numeric; for example, 000128

Surrogate keys are never used in reports. They are used to link dimension tables to fact tables.


Understand Surrogate Keys Used In Operational Systems

Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.

A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.

You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).

Operational surrogate keys are usually considered business keys in the data mart.


Understand Natural Keys: Example

Product Dimension
  Prod Code     Name
  PR X 002 39   Soup
  PR X 003 40   Beans
  PR Y 003 40   Peas

Customer Dimension
  Cust Code   Name
  SA 1 11     Safeway
  LO 2 22     Loblaws
  SE 5 55     7-11

Fact Table
  Prod Code     Cust Code   Measures (Metrics)
  PR X 002 39   SA 1 11     ...
  PR X 003 40   LO 2 22     ...
  PR Y 003 40   SE 5 55     ...


Understand Surrogate Keys: Example

Product Dimension
  Prod Sur   Prod Code     Name
  1          PR X 002 39   Soup
  2          PR X 003 40   Beans
  3          PR Y 003 40   Peas

Customer Dimension
  Cust Sur   Cust Code   Name
  10         SA 1 11     Safeway
  20         LO 2 22     Loblaws
  30         SE 5 55     7-11

Fact Table
  Prod Sur   Cust Sur   Measures
  1          10         ...
  2          20         ...
  3          30         ...
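
The substitution of surrogate keys for natural keys can be sketched with a small mapping class. `SurrogateKeyMap` is a hypothetical helper written for this example; the start and step values are chosen only so that the results match the numbers in the tables above.

```python
class SurrogateKeyMap:
    """Maps business (natural) keys to meaningless integer surrogate keys."""

    def __init__(self, start=1, step=1):
        self._next = start
        self._step = step
        self._by_business_key = {}

    def key_for(self, business_key):
        # Reuse the existing surrogate, or assign the next integer.
        if business_key not in self._by_business_key:
            self._by_business_key[business_key] = self._next
            self._next += self._step
        return self._by_business_key[business_key]

products = SurrogateKeyMap(start=1, step=1)
customers = SurrogateKeyMap(start=10, step=10)

# Fact rows as they arrive from the source, keyed by natural keys.
source_facts = [("PR X 002 39", "SA 1 11"),
                ("PR X 003 40", "LO 2 22"),
                ("PR Y 003 40", "SE 5 55")]

# The same fact rows keyed by compact surrogate keys.
fact_rows = [(products.key_for(p), customers.key_for(c)) for p, c in source_facts]
```

The fact table now stores two small integers per row instead of two long composite codes, while the codes themselves remain available in the dimension tables.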


Track Dimensional Changes Over Time

Operational systems tend to contain data about the current state of the business.

A data warehouse is expected to hold data for five to 10 years.

Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January/1999 and December/1999?).

If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?


Understand Slowly Changing Dimensions (SCD)

Operational dimensional data may often be thought of as static. It may only need to reflect the current state.

Data warehouse dimensional data often must show how the dimensional data changes over time. It is not static.

The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.


Understand Issues With Slowly Changing Dimensions

Maintaining SCDs can be complex without surrogates.

  Surrogate   Emp. No (Business Key)   Name     Branch   Position   Hire Date   Salary
              SCD handling:            Normal   Type2    Type2      Normal      Type2
  1           10001                    Jack     OTA      VP         Jan88'      50K
  2           10002                    Jane     ARL      MK         Jan92'      40K
  3           10003                    Tom      NY       SS         Jan93'      35K
  4           10001                    Jack     SJ       VP         Jan88'      50K    *
  5           10001                    Jack     SJ       S-VP       Jan88'      50K    **
  6           10001                    Jack     SJ       S-VP       Jan88'      60K    ***

Natural key needed to identify the row without a surrogate:
  *   (Emp. No + Branch)
  **  (Emp. No + Branch + Position)
  *** (Emp. No + Branch + Position + Salary)

Imagine the effect of having such a large natural key in the fact table.


Use Different Methods of Handling Dimensional Changes

Two most commonly used types of SCDs (according to Kimball):
  Type 1. Overwrite the old value with the new value (do not track the changes).
  Type 2. Add a new dimension record with a new surrogate key (track changes over time).

A single row may have a combination of columns of different types.


Type 1: Overwrite the Original Value

Sales Rep Dimension Table (before the change)
  Rep Key   Name         Marital Status   Office   …
  00128     Mary Smith   Single           Dallas

Sales Rep Dimension Table (after the change)
  Rep Key   Name         Marital Status   Office   …
  00128     Mary Jones   Married          Dallas

Sales Fact Table
  Rep Key   Order Date   Cust Key
  00128     1/1/1999     12345
  00128     2/1/1999     12345

The organization may choose not to track certain data changes because:
  the original data may have been incorrect
  the change is not considered relevant for tracking
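
A Type 1 change amounts to an in-place update, as in this sketch. The dictionary layout and the `apply_type1` helper are assumptions for illustration; the values come from the tables above.

```python
# Dimension keyed by Rep Key; fact rows reference the same key.
rep_dim = {"00128": {"name": "Mary Smith", "marital_status": "Single",
                     "office": "Dallas"}}
sales_fact = [("00128", "1/1/1999", "12345"), ("00128", "2/1/1999", "12345")]

def apply_type1(dimension, rep_key, **changes):
    """Type 1: overwrite attributes in place; the previous values are lost."""
    dimension[rep_key].update(changes)

apply_type1(rep_dim, "00128", name="Mary Jones", marital_status="Married")
```

The fact rows are untouched, so every past order now reports under the new name and status.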


Type 2: Add a New Dimension Record

When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.

Usually, an effective begin/end date is also updated on the new and old rows.

Multiple rows may have the same business key, but they will always have unique surrogate keys.

Sales Rep Dimension Table
  RepSur Key   Rep Key   Name         Office   Eff Date
  11111        00128     Mary Smith   Dallas   9901
  11112        00128     Mary Smith   NYC      9903

Sales Fact Table
  RepSur Key   Order Date   Cust Key
  11111        01/01/1999   12345
  11111        02/01/1999   12345
  11112        03/01/1999   12345
  11112        04/01/1999   12345
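
A Type 2 change can be sketched as "close the current row, append a new one". The `end_date` column, the helper names, and the way the next surrogate key is picked (largest existing key plus one) are assumptions for this example; dates are the YYMM strings used in the table above.

```python
rep_dim = [{"rep_sur_key": 11111, "rep_key": "00128", "name": "Mary Smith",
            "office": "Dallas", "eff_date": "9901", "end_date": None}]

def apply_type2(dimension, rep_key, eff_date, **changes):
    """Type 2: close the current row and append a new row with a new surrogate key."""
    current = next(r for r in dimension
                   if r["rep_key"] == rep_key and r["end_date"] is None)
    new_row = {**current, **changes,
               "rep_sur_key": max(r["rep_sur_key"] for r in dimension) + 1,
               "eff_date": eff_date, "end_date": None}
    current["end_date"] = eff_date   # the old row stops being current
    dimension.append(new_row)
    return new_row["rep_sur_key"]

def surrogate_as_of(dimension, rep_key, date):
    """Find the surrogate key that was in effect on `date` (a YYMM string)."""
    for r in dimension:
        if r["rep_key"] == rep_key and r["eff_date"] <= date and (
                r["end_date"] is None or date < r["end_date"]):
            return r["rep_sur_key"]
    return None

new_key = apply_type2(rep_dim, "00128", "9903", office="NYC")
```

New fact rows would be loaded with the surrogate returned by `surrogate_as_of` for their order date, so earlier sales stay attached to the Dallas row and later sales to the NYC row.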


Balanced and Ragged Hierarchies

Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).

Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.

In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.

Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.


Parent-Child Relationships

[Figure: an Employees table with a recursive "Reports To" relationship, linked to an Orders table]

Parent-child relationships are recursive relationships.

The levels of the hierarchy are determined by rows of the same table.


Ragged Hierarchies

Employee Hierarchy

Andrew Fuller
    Nancy Davolio (leaf)
    Janet Leverling (leaf)
    Margaret Peacock (leaf)
    Laura Callahan (leaf)
    Steven Buchanan
        Michael Suyama (leaf)
        Robert King (leaf)
        Anne Dodsworth (leaf)

Leaf nodes have no children. DecisionStream fact builds only look for leaf nodes at the lowest level.

Page 91: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Resolve Ragged Hierarchies: Step 1

Create an auto-level hierarchy to obtain the number of levels.

Create a dimension build to create a physical table that will identify for each row the level it belongs to.

Page 92: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Resolve Ragged Hierarchies: Step 1 (cont’d)

Use Auto-Level Hierarchies

Report to Steven Buchanan (Level 3)

Report to Andrew Fuller (Level 2)

Report to Andrew Fuller (Level 2)

Top Level (Level 1)

The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy.
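What "determining the number of levels" means can be shown in plain Python, independent of DecisionStream. This is a sketch only: the `reports_to` mapping is an assumption reconstructed from the level labels on this slide, and `level_of` is a hypothetical helper, not a DecisionStream function.

```python
# Parent-child rows: employee -> manager (None for the top of the hierarchy).
# The assignments are assumed from the level labels on the slide.
reports_to = {
    "Andrew Fuller": None,
    "Nancy Davolio": "Andrew Fuller",
    "Janet Leverling": "Andrew Fuller",
    "Margaret Peacock": "Andrew Fuller",
    "Laura Callahan": "Andrew Fuller",
    "Steven Buchanan": "Andrew Fuller",
    "Michael Suyama": "Steven Buchanan",
    "Robert King": "Steven Buchanan",
    "Anne Dodsworth": "Steven Buchanan",
}


def level_of(employee):
    """Level 1 is the top; each step up the Reports To chain adds one level."""
    level = 1
    manager = reports_to[employee]
    while manager is not None:
        level += 1
        manager = reports_to[manager]
    return level


levels = {name: level_of(name) for name in reports_to}

# A leaf is anyone who is nobody's manager.
managers = {m for m in reports_to.values() if m is not None}
leaf_levels = {name: levels[name] for name in reports_to if name not in managers}

number_of_levels = max(levels.values())
is_ragged = len(set(leaf_levels.values())) > 1   # leaves sit at different levels
```

Here the hierarchy has three levels and is ragged, because some leaves report directly to Andrew Fuller (level 2) while others report to Steven Buchanan (level 3).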

Page 93: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Issues Regarding Star Schema

Dimension table keys must be surrogate (non-intelligent and non-business related), because:

Business keys may change over time

Surrogate keys keep length and format consistent

Granularity of Fact Table – what level of detail do you want?

Transactional grain – finest level

Aggregated grain – more summarized

Finer grain: better market basket analysis capability

Finer grain: more dimension tables, more rows in fact table
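A small sketch of why grain matters for market basket analysis. The order lines are invented for illustration. At the transactional grain each row still carries its order, so products bought together can be counted. Once the data is aggregated to one row per product, that link is gone.

```python
from collections import Counter
from itertools import combinations

# Transactional grain: one row per order line (hypothetical data).
order_lines = [
    ("T1", "bread", 1), ("T1", "butter", 1),
    ("T2", "bread", 2), ("T2", "butter", 1), ("T2", "jam", 1),
    ("T3", "jam", 1),
]

# Aggregated grain: total quantity per product. The order id is no longer
# available, so "bought together" questions cannot be answered from this.
qty_by_product = Counter()
for order_id, product, qty in order_lines:
    qty_by_product[product] += qty

# Market basket analysis needs the finer grain: group lines by order first.
baskets = {}
for order_id, product, _ in order_lines:
    baskets.setdefault(order_id, set()).add(product)

pair_counts = Counter()
for products in baskets.values():
    for pair in combinations(sorted(products), 2):
        pair_counts[pair] += 1
```

The aggregated table has three rows instead of six, which is the storage side of the same trade-off.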

Page 94: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

The User Interface: Metadata (data catalog)

Identify subjects of the data mart

Identify dimensions and facts

Indicate how data is derived from enterprise data warehouses, including derivation rules

Indicate how data is derived from the operational data store, including derivation rules

Identify available reports and predefined queries

Identify data analysis techniques (e.g. drill-down)

Identify responsible people

Page 95: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Q & A

Page 96: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Multi-dimensional data

Page 97: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

On-Line Analytical Processing (OLAP)

The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

Relational OLAP (ROLAP): traditional relational representation

Multidimensional OLAP (MOLAP): cube structure

OLAP Operations

Cube slicing – come up with 2-D view of data

Drill-down – going from summary to more detailed views
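The two operations can be illustrated with a toy cube held in a Python dictionary. This is a sketch under assumptions: the products, regions and figures are invented, and `slice_region` and `summarize` are hypothetical helper names rather than functions of any OLAP product.

```python
from collections import defaultdict

# Toy cube: (product, region, quarter) -> sales. All values are invented.
cube = {
    ("Shoes", "East", "Q1"): 10,
    ("Shoes", "East", "Q2"): 12,
    ("Shoes", "West", "Q1"): 7,
    ("Shoes", "West", "Q2"): 9,
    ("Hats", "East", "Q1"): 4,
    ("Hats", "East", "Q2"): 5,
    ("Hats", "West", "Q1"): 3,
    ("Hats", "West", "Q2"): 6,
}


def slice_region(cube, region):
    """Cube slicing: fix one dimension to get a 2-D (product x quarter) view."""
    return {(p, q): v for (p, r, q), v in cube.items() if r == region}


def summarize(cube, *axes):
    """Sum sales over every dimension except the key positions listed in axes."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in axes)] += value
    return dict(totals)


east_view = slice_region(cube, "East")        # 2-D slice
by_product = summarize(cube, 0)               # summary report
by_product_quarter = summarize(cube, 0, 2)    # drill-down: add the quarter
```

Drilling down never changes the totals. It only splits each summary figure across one more dimension.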

Page 98: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Overall Plan

We need fast answers to analytical questions

Relational model may not be the answer

We can restructure data specifically for analysis

First we need to find out how people analyse data

Page 99: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Overall Plan

Analysing how people analyse data reveals the importance of measures and dimensions

So we structure the data with that in mind

The star schema is the physical structure that emerges

We can implement this as ROLAP, MOLAP and HOLAP

We achieve our objective – rapid analytical processing

Page 100: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

How do we make databases faster?

Lots of ways:

Indexing

Query design

Application design

Care with locking

Data structuring

Page 101: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Relational

Page 102: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Data structuring – Relational model

Pros

Data integrity in the face of user updates

Small data volume

Good for transactional queries

Cons

Poor analytical query performance

Page 103: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Poor analytical query performance

Why are relational databases slow?

Joins

Functions

Aggregations

Page 104: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Poor analytical query performance

So, there is a tension between:

Transactions

Analytical querying

Solution: Split them up

Take a copy of the transactional database and structure it in a totally different way that is optimised for analytical querying

Page 105: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Structure for analytical querying

Great idea, but first we need to find out how people analyse their data

Page 109: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

How people analyse their data

People analyse their data in terms of:

Graphs

Grids

Reports

Do these have anything in common?

Page 110: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

How people analyse their data

Do these have anything in common?

Measures: numerical values, typically plotted on the Y axis

Dimensions: discontinuous variables that slice the measures into aggregated groups, typically plotted on the X axis

Page 114: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

How people analyse their data

Dimensions are often hierarchical

People want to analyse:

Time by Year, Quarter, Month, Day

Product by Warehouse, Type, Product

Customer by Country, County, Person
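A hierarchical dimension can be sketched by deriving the Time levels from a date and summing a measure at any chosen level. The sales figures are invented, and `time_levels` and `roll_up` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import date

# (order date, sales amount): invented sample rows.
sales = [
    (date(2004, 1, 15), 100),
    (date(2004, 2, 3), 50),
    (date(2004, 3, 20), 70),
    (date(2004, 4, 2), 30),
    (date(2005, 1, 9), 40),
]


def time_levels(d):
    """Return the Time hierarchy for a date: (Year, Quarter, Month, Day)."""
    return (d.year, (d.month - 1) // 3 + 1, d.month, d.day)


def roll_up(rows, depth):
    """Sum the measure at a hierarchy depth: 1 = Year, 2 = Quarter, 3 = Month, 4 = Day."""
    totals = defaultdict(int)
    for d, amount in rows:
        totals[time_levels(d)[:depth]] += amount
    return dict(totals)


by_year = roll_up(sales, 1)
by_quarter = roll_up(sales, 2)
by_month = roll_up(sales, 3)
```

Each deeper level splits the totals of the level above it, which is what lets a user move between Year, Quarter, Month and Day views of the same measure.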

Page 115: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

How people analyse their data

So, we need to summarise all of this:

Measures

Dimensions

Hierarchies

Page 116: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Squashed Octopus

[Diagram: measures Delay, Profit and Quantity, with one arm per dimension listing its hierarchy levels]

Employee: Region, Employee

Product: Warehouse, Class, Item

Time: Year, Quarter, Month, WeekDay

Customer: Country, Region, County, Name

Page 117: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Squashed Octopus

The squashed octopus (SO) is a logical model

What about a physical model?

(Recap why the relational model is slow for analytical querying)

Joins

Functions

Aggregations

Page 118: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Relational

Page 119: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

Page 120: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

What is in the fact table? Facts

Page 122: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

What is in a dimension table?

Dimensional information

Hierarchical information

Page 124: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

How do dimension and fact tables work together?
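One way to picture the answer is a small star schema in SQLite, driven from Python's standard `sqlite3` module. This is a sketch under assumptions: the table names (`dim_time`, `dim_product`, `fact_sales`), their columns and all rows are invented for illustration. The fact table holds only keys and measures. The dimension tables supply the labels used to filter and group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, class TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    profit REAL
);
INSERT INTO dim_time VALUES (1, 2004, 3), (2, 2004, 4);
INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware'), (11, 'Gadget', 'Hardware');
INSERT INTO fact_sales VALUES
    (1, 10, 5, 20.0), (1, 11, 2, 9.0), (1, 10, 3, 12.0), (2, 10, 4, 16.0);
""")

# "Sales in March 2004 by Product": filter on the time dimension,
# group by the product dimension, aggregate the measures in the fact table.
march_2004_by_product = conn.execute("""
    SELECT p.name, SUM(f.quantity), SUM(f.profit)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_key = f.time_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE t.year = 2004 AND t.month = 3
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

Every join goes from the fact table straight to a dimension table, so the query never has to walk a long chain of normalised tables.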

Page 127: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

Is it faster?

Page 128: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Query                             Time (Relational)    Time (Star Schema)
Monthly totals                    70                   60
Sales in March by Product         18                   6
Sales in March 2004 by Product    12                   2

Page 129: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

Is it faster? Yes

How can we make it even faster? Aggregation
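Aggregation here means computing the sums once, at load time, and storing them in their own table so that queries read a few pre-summed rows instead of scanning the fact table. A minimal sketch with Python's `sqlite3` module follows. The table `agg_sales_month_product` and the data are invented, and the fact table carries year, month and product inline purely to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (year INTEGER, month INTEGER, product TEXT, amount INTEGER);
INSERT INTO fact_sales VALUES
    (2004, 3, 'Widget', 5), (2004, 3, 'Widget', 3),
    (2004, 3, 'Gadget', 2), (2004, 4, 'Widget', 4),
    (2003, 3, 'Widget', 1);

-- Aggregation table: built once, one row per (year, month, product).
CREATE TABLE agg_sales_month_product AS
    SELECT year, month, product, SUM(amount) AS amount
    FROM fact_sales
    GROUP BY year, month, product;
""")

# Same question answered two ways: summing detail rows at query time...
from_facts = conn.execute(
    "SELECT product, SUM(amount) FROM fact_sales "
    "WHERE year = 2004 AND month = 3 GROUP BY product ORDER BY product"
).fetchall()

# ...or reading the pre-computed rows.
from_aggregate = conn.execute(
    "SELECT product, amount FROM agg_sales_month_product "
    "WHERE year = 2004 AND month = 3 ORDER BY product"
).fetchall()
```

Both queries return the same rows. The cost is that the aggregation table has to be rebuilt or updated whenever the fact table changes, and a separate table is needed for each combination of levels worth pre-computing.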

Page 131: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Query                             Time (Relational)    Time (Star Schema)    Time (Aggregated Star Schema)
Monthly totals                    70                   60                    <1
Sales in March by Product         18                   6                     <1
Sales in March 2004 by Product    12                   2                     <1

Page 132: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

If we leave this as a set of tables then it is ROLAP – Relational OLAP (OLAP – On-Line Analytical Processing)

But it is a pain to manage: all those aggregation tables

Page 133: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

So, the answer is MOLAP (Multi-dimensional OLAP)

Page 134: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Star Schema

Finally HOLAP (Hybrid OLAP)

Page 135: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

On-Line Analytical Processing (OLAP)

OLAP Operations

Cube slicing – come up with 2-D view of data

Drill-down – going from summary to more detailed views

Page 136: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Figure 11-22: Slicing a data cube

Page 137: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Example of drill-down

Summary report

Drill-down with color added

Page 138: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Summary

We need fast answers to analytical questions

Relational model may not be the answer

We can restructure data specifically for analysis

First we need to find out how people analyse data

Page 139: NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant

Summary

Analysing how people analyse data reveals the importance of measures and dimensions

So we structure the data with that in mind

The star schema is the physical structure that emerges

We can implement this as ROLAP, MOLAP and HOLAP

We achieve our objective – rapid analytical processing