NLS/IITB/DWH 1 Data Warehouse : Design and Lifecycle N. L. Sarda Professor, IIT Bombay [email protected]


NLS/IITB/DWH 1

Data Warehouse : Design and Lifecycle

N. L. Sarda

Professor, IIT Bombay

[email protected]

NLS/IITB/DWH 2

Outline

• Introduction

• Warehouse structure

• A case study

• Lifecycle for development

• Dimensional analysis

• Technical architecture

• Conclusions

NLS/IITB/DWH 3

Introduction

• DW is a single, complete and consistent store of data from different sources to understand & analyze the business

• Contains historical data

• Typical decision support requires data to be correlated and aggregated in an interactive manner

• Warehouse to facilitate browsing, navigating, aggregating and visualizing related data to understand performance, problems, customer preferences, trends, etc.

NLS/IITB/DWH 4

Introduction...

• Conventional MIS/reporting applications lacked interactivity and flexibility

• Warehouse data organized by important business subjects (customer, product, etc…)

NLS/IITB/DWH 5

Warehouse Structure

• Organized to facilitate ease of access and aggregation

• Warehouse structure is decomposed into dimensions and facts

– Dimensions, like ‘independent variables’, represent entities for analysis

– A fact represents business data and relates to a set of dimensions

– Eg : customer, time and type of account are dimensions, and balances are facts

NLS/IITB/DWH 6

Warehouse Structure...

• The complex network of business entities and their relationships as depicted in an operational DB (using, say, ER model) is difficult for navigation and analysis

• A ‘2-level’ structure defined by a ‘star schema’ is preferred, where a fact is at the center and dimensions form the ‘spokes’

• Data not stored in ‘normalized’ form

NLS/IITB/DWH 7

Star Schema

• Contains a fact table and for each dimension one dimension table

[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) is linked to the Time, Prod, Cust and City dimension tables]
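The star schema above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (time_dim, cust_dim, prod_dim, sales_fact), their columns and the sample rows are assumptions, and the City dimension is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table per dimension (hypothetical columns)
    CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
    CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- The fact table holds the dimension keys plus a numeric measure
    CREATE TABLE sales_fact (
        date   TEXT    REFERENCES time_dim(date),
        custno INTEGER REFERENCES cust_dim(custno),
        prodno INTEGER REFERENCES prod_dim(prodno),
        amount REAL
    );
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [("2024-01-15", "2024-01", 2024), ("2024-02-10", "2024-02", 2024)])
conn.executemany("INSERT INTO cust_dim VALUES (?, ?, ?)",
                 [(1, "Asha", "engineer"), (2, "Ravi", "teacher")])
conn.executemany("INSERT INTO prod_dim VALUES (?, ?, ?)",
                 [(10, "Savings", "deposit"), (20, "Home loan", "loan")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [("2024-01-15", 1, 10, 100.0),
                  ("2024-01-15", 2, 20, 250.0),
                  ("2024-02-10", 1, 10, 50.0)])

# Typical star query: group by dimension attributes, aggregate the fact
rows = conn.execute("""
    SELECT t.month, p.category, SUM(f.amount)
    FROM sales_fact f
    JOIN time_dim t ON f.date = t.date
    JOIN prod_dim p ON f.prodno = p.prodno
    GROUP BY t.month, p.category
    ORDER BY t.month, p.category
""").fetchall()
print(rows)
```

Every query follows the same shape: the fact table is joined to each dimension it needs through a single key, which is what makes the structure easy to navigate.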

NLS/IITB/DWH 8

Dimensions

• Stored as a database table

• Contains many descriptive attributes for analysis

• Small and slowly changing data

• Data often group-able for analysis

– Customers by age, occupation, income level

– Time by weeks, months, years

– Branches as rural, suburban or by size

• Thus, dimension data is viewable as a hierarchy

• For analysis, dimension data is joined with facts

NLS/IITB/DWH 9

Dimensions...

• Joins very frequent; efficient access to dimension (through multiple indexes) and computation of join required

• Heavily used in constraints and GROUP-BY

NLS/IITB/DWH 10

Facts

• Contain business activity data

• May be at detailed level or status level; called transaction-oriented or snapshot-oriented

• Deciding on granularity : every sale, or total sales of a day?

• Often contain numeric attributes for aggregation (additive, semi-additive, …)

• Also contain dimension table keys

NLS/IITB/DWH 11

Snowflake Schema

• Hierarchies not captured explicitly in a star schema

• Snowflake schema represents hierarchy directly

• Saves on storage but requires more joins

NLS/IITB/DWH 12

Snowflake Schema

• Represent dimensional hierarchy directly by normalizing tables.

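[Figure: snowflake schema. The fact table (date, custno, prodno, cityname, ...) is linked to the time, prod, cust and city dimension tables; city is further linked to a region table]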
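A small sketch of the city-to-region hierarchy as a snowflake, again using sqlite3. The columns (regionname, manager, amount) and the rows are assumptions made for illustration. Note that reaching a region attribute from the fact takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The hierarchy is normalized: region attributes live in their own table
    CREATE TABLE region (regionname TEXT PRIMARY KEY, manager TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY,
                       regionname TEXT REFERENCES region(regionname));
    CREATE TABLE fact (date TEXT, custno INTEGER, prodno INTEGER,
                       cityname TEXT REFERENCES city(cityname), amount REAL);
    INSERT INTO region VALUES ('West', 'M1'), ('South', 'M2');
    INSERT INTO city VALUES ('Mumbai', 'West'), ('Pune', 'West'), ('Chennai', 'South');
    INSERT INTO fact VALUES
        ('2024-01-15', 1, 10, 'Mumbai', 100.0),
        ('2024-01-15', 2, 10, 'Pune', 40.0),
        ('2024-01-16', 3, 20, 'Chennai', 70.0);
""")

# fact -> city -> region: the extra join is the price of normalization
by_manager = conn.execute("""
    SELECT r.manager, SUM(f.amount)
    FROM fact f
    JOIN city c ON f.cityname = c.cityname
    JOIN region r ON c.regionname = r.regionname
    GROUP BY r.manager
    ORDER BY r.manager
""").fetchall()
print(by_manager)
```

In a star schema the region attributes would be repeated on every city row, trading storage for one join fewer.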

NLS/IITB/DWH 13

Fact Constellation

• Fact Constellation

– Multiple fact tables that share many dimension tables

– Booking and Checkout may share many dimension tables in the hotel industry

[Figure: fact constellation. The Booking and Checkout fact tables share the Hotels, Travel Agents, Promotion, Room Type and Customer dimension tables]

NLS/IITB/DWH 14

Data Mart

• A subset of warehouse for use by individuals or departments

• Contents may be differently structured; may contain limited history; may be coarser / aggregated

• Lightens load on central warehouse

• Users primarily use marts with OLAP tools for analysis and decision support

• Refreshed periodically from central warehouse

NLS/IITB/DWH 15

Aggregates

• An aggregate is a fact table representing a summarization of base-level fact table data

• Aggregates are pre-calculated summaries stored in the data warehouse to improve query performance

• Aggregates can speed up queries by a factor of 100 or even 1000

• The IS owners of a data warehouse should exhaust the potential for using aggregates before investing in new hardware
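As a rough illustration of an aggregate, the sketch below derives a month-by-product summary table from a base-level fact table and checks that a monthly query returns the same answer from either table. The table names, columns and rows are assumptions; the speed-up itself is not measurable on data this small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base-level fact table at daily grain (hypothetical data)
    CREATE TABLE sales_fact (date TEXT, prodno INTEGER, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2024-01-05', 10, 100.0), ('2024-01-20', 10, 50.0),
        ('2024-01-20', 20, 30.0),  ('2024-02-03', 10, 70.0);
    -- Pre-calculated summary at month x product grain
    CREATE TABLE sales_month_agg AS
        SELECT substr(date, 1, 7) AS month, prodno,
               SUM(amount) AS amount, COUNT(*) AS row_count
        FROM sales_fact
        GROUP BY substr(date, 1, 7), prodno;
""")

from_base = conn.execute("""
    SELECT substr(date, 1, 7) AS month, SUM(amount)
    FROM sales_fact GROUP BY substr(date, 1, 7) ORDER BY 1
""").fetchall()
from_aggregate = conn.execute("""
    SELECT month, SUM(amount)
    FROM sales_month_agg GROUP BY month ORDER BY month
""").fetchall()
print(from_base, from_aggregate)
```

The aggregate answers the same question while scanning one row per month and product rather than one row per sale.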

NLS/IITB/DWH 16

Warehouse Architecture

• Building a single organization-wide WH that integrates all data from legacy systems is a very challenging task

• data marts are subject/dept-wise and easier to build

• multiple data marts must be relatable and inter-operable across depts or business areas

• Kimball proposes DW with a ‘bus architecture’; he proposes an architecture phase followed by construction of data marts independently and asynchronously

NLS/IITB/DWH 17

WH Architecture ...

• As marts come on-line, they fit with each other properly

• this approach natural in most cases as extraction of data for WH building is often source-wise and needs to be done independently

NLS/IITB/DWH 18

Conformed Dimensions and Facts

• Goal is to produce a master suite of conformed dimensions and to standardize facts

• The resulting dimensions and facts form the ‘bus’

• A conformed dimension means the same thing with every fact table (eg, customer, time, geography)

• It may contain data brought together from many sources

• Without conformed dimensions, a WH cannot function as a whole

NLS/IITB/DWH 19

WH Architecture ...

• Getting conformed dimensions represents 80% of the up-front architecture effort

• The rest goes to conformed facts, which ensure the same terminology across data marts so that ‘drill across’ can be done (eg, price, profit)

• ensures same units and meaning, same time durations and geographies across marts
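A minimal sketch of ‘drill across’, assuming two hypothetical marts (revenue and cost) that share one conformed product dimension. All names and figures are invented for illustration.

```python
from collections import defaultdict

# Conformed product dimension shared by both marts (hypothetical rows)
product_dim = {
    1: {"name": "Savings", "category": "deposit"},
    2: {"name": "Term deposit", "category": "deposit"},
    3: {"name": "Home loan", "category": "loan"},
}

# Fact rows from two separately built marts: (product_key, measure)
revenue_fact = [(1, 120.0), (2, 80.0), (3, 300.0)]
cost_fact = [(1, 20.0), (2, 30.0), (3, 100.0)]

def rollup(fact_rows):
    """Sum a fact by the conformed 'category' attribute."""
    totals = defaultdict(float)
    for product_key, value in fact_rows:
        totals[product_dim[product_key]["category"]] += value
    return dict(totals)

revenue = rollup(revenue_fact)
cost = rollup(cost_fact)

# Drill across: align the two result sets on the shared row header
profit = {category: revenue[category] - cost.get(category, 0.0)
          for category in revenue}
print(profit)
```

The two result sets can be lined up only because ‘category’ has the same values and meaning in both marts.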

NLS/IITB/DWH 20

WH Architecture ...

• Advantages of conformed dimensions

– a single dimension table can be used against multiple fact tables in the same WH

– user interfaces and data content are consistent whenever the dimension is used

– there is consistent interpretation of attributes and rollups across marts

– a new data mart can be created such that it can co-exist with the others

• Use of conformed dimensions must be supported at the highest executive level

NLS/IITB/DWH 21

Financial Services : A Case Study

• A bank offers various products/services like saving/checking accounts, mortgage loans, personal loans, TD, credit cards, etc…

• Purpose : track various a/c, customer profiles, etc…, for marketing and offering new services

• Requirements:

– Get end-of-month summary of a/c for last 5 years

– Valid snapshot as of yesterday for current month (with full details)

– Ability to group a/c in various ways & compare balances

– Demographic behavior

NLS/IITB/DWH 22

Case Study ...

• Each account type has some unique attributes (requiring customized dimension and facts for each)

• Old data (a/c & customers ) may be incomplete or even different

• The warehouse data may come from multiple sources :

– Loan processing system (customer, loan, dues, payment)

– Fixed deposit system (customer, TD, …)

– Front-office system (customer, account, transaction, ...)

– Credit-card system (customer, transactions, interest, ...)

NLS/IITB/DWH 23

Case Study ...

• Must plan extraction, correlation, consistent representation,…

• Let us consider a possible warehouse design for the indicated requirements

• Core fact table : balance in each account, # of transactions, grain : month

• Dimensions : a/c, household, branch, product, status, time

• A/c and household separate : many accounts per family; household definitions change

NLS/IITB/DWH 24

Case Study ...

• Product dimension permits hierarchy and defining specific attributes; separate because it changes

• Status : active or not, closed, etc. with reasons

• Account contains customer’s data; for historical reasons, customer to accounts relationship not well maintained

NLS/IITB/DWH 25

The household data warehouse (star schema):

Household Facts (fact table): account_key, household_key, branch_key, product_key, status_key, time_key, primary_balance, transaction_count

Account dimension: account_key, primary_name, secondary_name, account_address, account_city, account_state, account_zip, date_opened, primary_age, primary_sex, primary_marital

Household dimension: household_key, household_head_name, household_address, household_city, household_state, household_zip, household_income, household_type

Branch dimension: branch_key, branch_name, branch_address, branch_city, branch_state, branch_zip, branch_type

Product dimension: product_key, product_description, type, category

Status dimension: status_key, status_description, status_reason, new_account_flag, closed_account_flag

Time dimension: time_key, month, year, fiscal_quarter

NLS/IITB/DWH 26

Case Study ...

• Balance is semi-additive : can not be added across time

• Products highly heterogeneous : different attributes characterize different accounts (balance, deposit options, interest rate, over draft limit,..)

• Can’t combine all in a dimension as many not applicable to all products
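The semi-additive behavior of balances can be illustrated with a few made-up month-end rows. Summing across accounts within a month is meaningful; across months an average or the period-end value is used instead. The account names and amounts are assumptions.

```python
# Month-end snapshot rows: (month, account, balance); values are made up
balances = [
    ("2024-01", "A", 100.0), ("2024-01", "B", 200.0),
    ("2024-02", "A", 150.0), ("2024-02", "B", 250.0),
]

# Valid: add balances across accounts within one month
total_by_month = {}
for month, _account, balance in balances:
    total_by_month[month] = total_by_month.get(month, 0.0) + balance

# Not valid: sum(total_by_month.values()) would double count the same money.
# Across time, use an average or the period-end value instead.
average_total = sum(total_by_month.values()) / len(total_by_month)
closing_total = total_by_month[max(total_by_month)]
print(total_by_month, average_total, closing_total)
```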

NLS/IITB/DWH 27

Case Study ...

• Solution: create many facts, customized for each product, and one core fact with a product dimension having common attributes; leads to 100% replication, but facilitates clarifications, browsing, etc. and avoids join of customized and core facts

• When many facts are to be stored together, go for periodic (eg, monthly) snapshots

NLS/IITB/DWH 28

Case Study ...

• Transaction-grained facts usually have a single fact (eg, amount) that is directly involved in the transaction; we need a transaction dimension to represent these amounts

• In a transaction-grained fact table, we do not need customized fact tables per product; instead we create customized dimension tables

NLS/IITB/DWH 29

[Figure: Data Warehouse Life Cycle. Boxes shown: Project Planning; Business Requirement Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; Data Staging Design & Development; End-User Application Specification; End-User Application Development; Deployment; Maintenance & Growth; Project Management]

NLS/IITB/DWH 30

Life Cycle Phases

• Project planning

– Life cycle begins with project planning, which addresses the scoping of the project

– Focuses on resource and skill-level staffing requirements, project task assignments, and duration

• Business requirements definition

– Success of the project depends on a sound understanding of the business users and their requirements

– Data warehouse designers must understand the key factors driving the business requirements and translate them into design considerations

NLS/IITB/DWH 31

Phases ...

• Dimensional modeling

– Dimensional modeling is performed by combining data analysis with our earlier understanding of business requirements (represented as a matrix)

– This step identifies the fact table grain, associated dimensions, attributes and hierarchical drill paths, and facts

• Physical design

– The primary elements in this phase are defining the naming standards and setting up the database environment

– It focuses on defining the physical structures necessary to support the logical database design

NLS/IITB/DWH 32

Phases ...

• Data staging design and development

– The data staging process has three major steps

– Extraction

• It exposes data quality issues within the operational system

– Transformation

• Consists of data re-structuring and type conversions (eg, from the EBCDIC character set to ASCII)

– Load

• Load the prepared data into the target tables
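The three staging steps can be sketched in a few lines. This is an illustration under stated assumptions: the record layout ("NAME,AMOUNT"), the sample values and the use of code page 037 as the EBCDIC variant are all hypothetical; cp037 is one of Python's standard codecs.

```python
# Extract: raw records as they might arrive from a mainframe feed.
# They are built here by encoding text as EBCDIC (code page 037, an assumption).
raw_records = [text.encode("cp037") for text in ("ASHA,0100", "RAVI,0250", "BAD,xx")]

def transform(record: bytes):
    """Decode EBCDIC to text, tidy the name and convert the amount to int."""
    name, amount = record.decode("cp037").split(",")
    return name.title(), int(amount)

# Load: append clean rows to the target; set aside rows that fail conversion
target_table, rejects = [], []
for record in raw_records:
    try:
        target_table.append(transform(record))
    except ValueError:
        # Extraction and transformation expose data quality issues
        rejects.append(record.decode("cp037"))

print(target_table, rejects)
```

Keeping rejected rows rather than dropping them silently gives the team a list of source-system quality problems to follow up.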

NLS/IITB/DWH 33

Phases ...

• Technical Architecture Design

– It specifies the tools and techniques we will need to make the DW happen

• Product Selection and Installation

– Architectural components such as hardware platforms, DBMS, and data staging tools

• End user application specification

– Application specifications describe the report template, user-driven parameters, and required calculations

• End user application development

NLS/IITB/DWH 34

Phases ...

• Deployment

– It is the convergence of technology, data, and end user applications accessible from the business user’s desktop

– Business user education integrating all aspects of the convergence must be developed and delivered

• Maintenance and growth

– Data warehouse acceptance and performance metrics should be measured over time and the maintenance plan should include a communication strategy

– Prioritization processes must be established to deal with user demands for evolution and growth

NLS/IITB/DWH 35

Phases ...

• Project management

– Project management ensures that the business dimensional life cycle activities remain on track and synchronized

– These activities occur throughout the life cycle

– It focuses on monitoring the project status, issue tracking, and change control to preserve scope

– It includes the development of a comprehensive project communication plan that addresses both the business and information system organization

• Use a good project management tool

NLS/IITB/DWH 36

Life Cycle : summary

• Project planning

• Business requirements definition

• Data track

– Dimensional modeling

– Physical design

– Data staging design and development

• Technology track

– Technical architectural design

– Product selection and installation

NLS/IITB/DWH 37

Life Cycle...

• Application track

– End user application specification

– End user application development

• Deployment

• Maintenance and growth

• Project management

NLS/IITB/DWH 38

Assess Your Readiness

• Strong business management sponsors

• Compelling business motivation

• IS/Business partnership

• Current analytic culture

• Feasibility

NLS/IITB/DWH 39

Core Project Team

• Business system analyst

• Data modeler

• Data warehouse database administrator

• Data staging system designer

• End user application developers

• Data warehouse educator

NLS/IITB/DWH 40

Special Teams

• Technical/security architect

• Technical support specialists

• Data staging programmer

• Data administrator

• Data warehouse quality assurance analyst

NLS/IITB/DWH 41

Develop the Project Plan

• Integrated and detailed

• Resources

• Original estimated effort

• Start date

• Original estimated completion date

• Current estimated completion date

• Status

• Effort to complete

• Dependencies

• Late flags

NLS/IITB/DWH 42

Develop Communication Plan

• To manage expectations at all levels

• Within project team : share scope, plans, status

• Face-to-face communications with sponsors

• Business user community : inform what is there for them : capabilities, limitations, timeframes

• Communication with other interested parties

– Executive management

– IS organization - to enable integration with existing and proposed systems

– Organization at large

NLS/IITB/DWH 43

Collecting Requirements

[Figure: Business Requirements shown in relation to Project Planning & Management, Dimensional Modeling, Technical Architecture Design, Physical Design, Data Staging Design, End-User Application Specification, Deployment Planning, and Maintenance and Growth]

NLS/IITB/DWH 44

Collecting Requirements...

• Interviews/write-ups

• Requirements findings document

– Project overview

– Review of business objectives

– Analytic and information requirements

– Preliminary source systems analysis

– Preliminary success criteria

• Prepare and publish the requirements

• Agree on next step after collecting requirements

• Facilitation for conforming and prioritization

NLS/IITB/DWH 45

Collecting Data about Existing Systems

• Understanding the candidate data sources

• Source data ownership

• Data providers

• Detailed criteria for selecting the data sources

– Data accessibility

– Longevity of the feed

– Data accuracy

– Project scheduling

• Customer matching and house-holding

• Browsing and data content

• Mapping data from source to target

NLS/IITB/DWH 46

Designing the Data Warehouse / Data Marts

• Identifying marts and dimensions

• Identify marts based on facts likely to be used together, as a mart is a kind of subject area or application (divide-and-conquer strategy)

• Often based on a single business process or a single source

• 10 to 30 marts common for a large organization

• Build a matrix of marts versus dimensions

NLS/IITB/DWH 47

Designing a Fact

• Choose a data mart : start with single source data marts

• Define fact grain based on the basic business facts stored in legacy systems

• Choose dimensions and match them with granularity of facts

• Combine as many facts as possible within the context of the defined granularity

NLS/IITB/DWH 48

Detailed Design Tips

• Labels which name data marts, dimensions and attributes should be chosen carefully to refer to corresponding business entities

• An attribute (in a dimension) is not replicated, but a fact may be present in many fact tables

• If a dimension occurs multiple times (eg, time), it is playing multiple roles; name them uniquely

• A single field in the underlying source data can have one or more logical columns associated with it (eg, product having code, description, etc)

• Every fact should have a default aggregation rule so that it is not aggregated wrongly

NLS/IITB/DWH 49

Data Modeling Tool

• The advantages of a data modeling tool are

– Integrates the data warehouse model with other corporate data models

– Helps assure consistency in naming

– Creates good documentation

– Generates physical schema

– Provides a reasonably intuitive user interface for entering comments about objects

NLS/IITB/DWH 50

Dimensional Modeling

• Strengths of dimensional modeling

– It is a predictable and standard framework

– It makes the user interfaces more understandable and processing more efficient

– The predictable framework of a dimensional model allows both database systems and end user query tools to make strong assumptions about the data that aid in presentation and performance

– It is gracefully extensible to accommodate unexpected new data elements and new design decisions

– There are a number of standard approaches for handling common modeling situations in the business world

NLS/IITB/DWH 51

Dimension Attributes

• The quality of the data warehouse is measured by the quality of the dimension attributes

• The user interface responses and final reports are restricted to the precise contents of the dimension table attributes

• Properties

– Verbose, descriptive, complete

– Quality assured, indexed

– Equally available, documented

NLS/IITB/DWH 52

Time Dimension

• Every data warehouse fact table is a time series of some observations

• We always seem to have one or more time dimensions in our fact table designs

• Provides useful hierarchies : week, month, quarter, year, etc

• Represents calendar with many useful attributes like day of week, day of month, week#, day#, quarter, weekday-flag, last-day-of-month-flag, holiday flag, etc.
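A time dimension is usually generated rather than extracted. The sketch below builds one row per day for 2024 with several of the attributes listed above; the column names are assumptions, and the holiday flag is omitted because it needs an organization-specific calendar.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_dimension_row(d: date) -> dict:
    """Build one time-dimension row (hypothetical column names)."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "date_key": d.isoformat(),
        "day_of_week": DAY_NAMES[d.weekday()],
        "day_of_month": d.day,
        "day_number_in_year": d.timetuple().tm_yday,
        "week_number": d.isocalendar()[1],      # ISO week number
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "weekday_flag": d.weekday() < 5,        # Monday to Friday
        "last_day_of_month_flag": d.day == last_day,
    }

start = date(2024, 1, 1)
time_dim = [time_dimension_row(start + timedelta(days=i)) for i in range(366)]
print(time_dim[0])
```

Because every attribute is stored on the row, queries can constrain or group by any of them without date arithmetic.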

NLS/IITB/DWH 53

Slowly Changing Dimensions

• The production key or customer key does not change, but the description of the product or customer does

• The data warehouse has three options for the above changes

– Overwrite the dimension record with the new values, thereby losing history

• It is used whenever the old value of the attribute has no significance

• The correction of any error falls into this category

NLS/IITB/DWH 54

Slowly Changing Dimensions...

– Create a new additional dimension record using a new value of the surrogate key

• is primary technique for accurately tracking a change in an attribute within a dimension

• requires use of a surrogate key

• a slowly changing dimension is used when a true physical change to the dimension entity has taken place

– Create an “old” field in the dimension record to store the immediate previous attribute value

• It is used when a change is tentative
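The three options can be sketched on a tiny customer dimension held as a list of dictionaries. The column names (cust_key, cust_no, city, old_city) and the "current" flag are assumptions added for illustration, as are the sample values.

```python
from itertools import count

next_key = count(1)  # surrogate key generator

def new_row(cust_no, city):
    return {"cust_key": next(next_key), "cust_no": cust_no, "city": city,
            "old_city": None, "current": True}

def overwrite(dim, cust_no, city):
    """Option 1: overwrite in place; history is lost."""
    for row in dim:
        if row["cust_no"] == cust_no and row["current"]:
            row["city"] = city

def add_version(dim, cust_no, city):
    """Option 2: add a record with a new surrogate key; history is kept."""
    for row in dim:
        if row["cust_no"] == cust_no and row["current"]:
            row["current"] = False
    dim.append(new_row(cust_no, city))

def keep_previous(dim, cust_no, city):
    """Option 3: keep the immediately previous value in an 'old' field."""
    for row in dim:
        if row["cust_no"] == cust_no and row["current"]:
            row["old_city"], row["city"] = row["city"], city

dim = [new_row("C100", "Pnue")]          # loaded with a typing error
overwrite(dim, "C100", "Pune")           # correction: no history needed
add_version(dim, "C100", "Mumbai")       # true change: customer moved
keep_previous(dim, "C100", "Bombay")     # tentative relabeling
print(dim)
```

Old fact rows keep pointing at surrogate key 1 and new ones at key 2, which is how the second option partitions history.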

NLS/IITB/DWH 55

Time Stamping the Changes

• The design of slowly changing dimension may be established by adding begin and end time stamps and a transaction description in each instance of a dimension record

• This design allows very precise time slicing of the dimension by itself

NLS/IITB/DWH 56

Large Dimensions

• Data warehouses that store extremely granular data may require some extremely large dimensions

• To support large dimensions we must choose indexing technologies and data design approaches that :

– support rapid browsing of the unconstrained dimension, especially for low cardinality attributes

– support efficient browsing of cross-constrained values in the dimension table

– find and suppress duplicate entries in the dimension

NLS/IITB/DWH 57

Foreign Key, Primary Key, Surrogate Key

• All dimensional tables have single keys, which, by definition, are primary keys

• All data warehouse keys must be meaningless surrogate keys; you must not use the original production keys

• A four byte integer makes a good surrogate key

• Surrogate date keys

• Avoid smart keys

• Avoid production keys
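One way to assign surrogate keys during staging is a lookup that hands out sequential integers the first time each production key is seen. The class name, the production codes and the amounts below are hypothetical.

```python
from itertools import count

class SurrogateKeyMap:
    """Assigns meaningless sequential integers to production keys."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, production_key: str) -> int:
        # Reuse the key already assigned, otherwise take the next integer
        if production_key not in self._keys:
            self._keys[production_key] = next(self._next)
        return self._keys[production_key]

product_keys = SurrogateKeyMap()

# Source rows carry production codes; fact rows carry only surrogate keys
source_rows = [("SB-0042", 100.0), ("HL-0007", 250.0), ("SB-0042", 50.0)]
fact_rows = [(product_keys.key_for(code), amount) for code, amount in source_rows]
print(fact_rows)
```

A four byte signed integer allows a little over two billion distinct keys, far more than most dimensions need.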

NLS/IITB/DWH 58

Heterogeneous Product Schemas

• Multiple fact tables are needed when a business has heterogeneous products

• The global view needs a single core fact table crossing all lines of business, whereas local view focuses on specific product

• There are many attributes and facts which apply only to a specific product; a single fact table is not feasible

• Create a customized fact table and (product) dimension table for each product, and build a core fact table with attributes that make sense across all lines of business; this allows a single portfolio (of products) to be created for each customer

NLS/IITB/DWH 59

Transaction Schema

• Every data mart needs two separate models

– Transaction version

– Periodic snapshot version

• ‘rolling’ snapshot containing averages across time

• Snapshots allow us to quickly measure the status of the enterprise

• The Transaction schema

– Low-level transactions in the organization make for a good dimensional framework

– The fact record for an individual transaction frequently contains only a single value

NLS/IITB/DWH 60

Transaction Schema..

• The transaction-based WH is commonly used in

– Time of day analysis

– Queue analysis

– Fraud detection

– Basket analysis

– Current status

NLS/IITB/DWH 61

Factless Fact Tables

• Useful to describe events and their coverage

• An event fact table records occurrence of an event; has only flag and dimension keys (eg, student attendance)

• A coverage fact table is frequently needed when a primary fact table in a dimensional data warehouse is sparse; eg, the primary fact table will not show items which were on promotion but did not sell; the coverage table, containing only dimension keys, lists all items on promotion
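The promotion example can be sketched with sets: the coverage table holds only keys, and subtracting the keys that appear in the sales fact leaves the promoted items that did not sell. The keys and quantities are made up.

```python
# Coverage (factless) table: dimension keys only, (date, product, store)
promotion_coverage = {
    ("2024-01-15", "P1", "S1"),
    ("2024-01-15", "P2", "S1"),
    ("2024-01-15", "P3", "S1"),
}

# Primary fact table: a row exists only when something was sold
sales_fact = [
    ("2024-01-15", "P1", "S1", 5),
    ("2024-01-15", "P3", "S1", 2),
    ("2024-01-15", "P9", "S1", 1),   # sold, but not on promotion
]

sold = {(day, product, store) for day, product, store, _qty in sales_fact}

# Promoted but not sold: in the coverage table, absent from the sales fact
promoted_but_unsold = sorted(promotion_coverage - sold)
print(promoted_but_unsold)
```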

NLS/IITB/DWH 62

Facts of Different Granularity

• The dimensional model gains power as the individual fact records become more and more atomic

• At the lowest level of individual transactions, the design is most powerful because

– More of the descriptive attributes have single values

– The design withstands surprise in the form of new facts, new dimensions, or new attributes within existing dimensions

– More expressiveness at the lowest levels of granularity

NLS/IITB/DWH 63

[Figure: Technical Architecture, divided into The Back Room and The Front Room. Components shown: Source System, Data Staging Services, Data Staging Area, Metadata Catalog, Presentation Servers (Dimensional Data Marts with Only Aggregated Data; Dimensional Data Marts Including Atomic Data), Query Services, Desktop Data Access Tools, Application Models, Standard Reporting Tools, Operational System. Key: data element, service element]

NLS/IITB/DWH 64

The Technical Architecture...

• Data staging services

– Extract

– Transformation

– Load

– Job control

• Query services

– Warehouse browsing

– Access and security

– Query management

– Standard reporting

– Activity monitor

It describes the flow of data from the source systems to the decision makers

NLS/IITB/DWH 65

Metadata Catalog

• It is an integral part of the overall architecture

• It contains information that describes the warehouse and plays an active role in its creation, use, and maintenance

• Contains source system metadata (data and processes), data staging metadata (dimensions, transformations, aggregations), DBMS metadata (tables, indexes, stored procedures), and front-room metadata (users, applications)

NLS/IITB/DWH 66

Technical Architecture Features

• Metadata driven

– Metadata provides flexibility by buffering the various components of the system from each other

– The metadata catalog provides parameters and information that allow the applications to perform their tasks

• Flexible services layers

– The data staging services and data query services add to the flexibility of the architecture

NLS/IITB/DWH 67

Back Room : Data Staging Area

• It is the construction site for the warehouse

• The central role of the staging area is to evolve the source system of record for all downstream DSS and reporting environments

• Data staging data models

– The data models can be designed for performance and ease of development

– Third normal form often appears in the data staging area because the source systems are duplicated

NLS/IITB/DWH 68

Data Staging Area...

• Atomic data marts hold the lowest level of detail necessary to meet most of the high-value business requirements

– Atomic data mart storage type should be relational rather than OLAP because of the extreme level of detail, the number of dimensions, and size

– Atomic data mart data model built around the dimensional model, not an ER model

NLS/IITB/DWH 69

Transformation Services

• It is a process of transforming the data from source systems into something presentable to the end users and valuable to the business

• Different transformation services :

– Integration

– Slowly changing dimension maintenance

– Referential integrity checking

– Data type conversion

– Aggregation

– Data content audit

– Pre- and post-step exits

NLS/IITB/DWH 70

Front Room Architecture

• It is the public face of the warehouse: the part that business users see and work with day-to-day

• The presentation servers are machines on which the data warehouse data is organized for direct querying by the end users and report writers

• The major types of activities here :

– Warehouse or metadata browsing

– Access and security

– Activity monitoring

– Query management

– Standard reporting

NLS/IITB/DWH 71

Warehouse Browsing

• Using the browsing tools to find and access the information needed by the user

• The warehouse browser should be dynamically linked to the metadata catalog

• It should be able to pull the definition and derivations of the various data elements and to show a set of standard reports

• Browsing tools

– Visual Basic

– Microsoft Access, etc

NLS/IITB/DWH 72

Access and Security Services

• Access and security services facilitate a user’s connection to the data base

• It relies on authorization and authentication services where the user is identified and access rights are determined or access is refused

• The level of authentication depends on how sensitive the data is

NLS/IITB/DWH 73

Activity Monitoring Services

• Capturing the information about the use of the data warehouse

• The capabilities are :

– Performance

– User support

– Marketing

– Planning

NLS/IITB/DWH 74

Query Management Services

• Query management services are the set of capabilities that manage the execution of the query, and return of the result set to the desktop

• The major query management services are :

– Query reformulation

– Query re-targeting and multi-pass SQL

– Aggregate awareness

– Query governing
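Aggregate awareness can be pictured as a lookup that redirects a query to the smallest table able to answer it. The sketch below is a simplified assumption of how such a service might choose; the table names, attribute sets and row counts are invented, and real products also rewrite the SQL itself.

```python
# Catalog of available tables: name -> (attributes it can answer, approx. rows).
# The base fact can answer every attribute through its dimensions.
tables = {
    "sales_fact": ({"date", "month", "product", "store"}, 10_000_000),
    "sales_by_month_product": ({"month", "product"}, 50_000),
    "sales_by_month": ({"month"}, 60),
}

def choose_table(needed_attributes: set) -> str:
    """Return the smallest table that covers all requested attributes."""
    candidates = [(row_count, name)
                  for name, (attributes, row_count) in tables.items()
                  if needed_attributes <= attributes]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed_attributes)}")
    return min(candidates)[1]

print(choose_table({"month"}))
print(choose_table({"month", "product"}))
print(choose_table({"month", "store"}))
```

Because the choice is made by the query service, users and front-end tools can keep addressing the base fact table while still benefiting from the aggregates.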

NLS/IITB/DWH 75

Standard Reporting Services

• It has an ability to create a fixed-format report requiring limited user interaction, and regular execution schedules

• Requirements for standard reporting tools are :

– Report development environment

– Report execution server

– Time- and event-based scheduling of report execution

– Iterative execution

– Flexible report definition

– Flexible report delivery

– Report library with browsing capability

NLS/IITB/DWH 76

Back Room infrastructure factors

• Infrastructure for the data warehouse includes the hardware, network, and lower-level functions, such as security etc…

• The data base server is the biggest hardware platform decision for most data warehouse projects

NLS/IITB/DWH 77

Back Room Infrastructure Factors...

• The major factors in determining requirements for the server platforms are :

– Data size

• Most data warehouse/data mart projects tend to start out with no more than 200 GB

• Warehouses of less than 100 GB are considered small, those from 100 GB to 500 GB typical, and those with more than 500 GB large

– Volatility

• It measures the dynamic nature of the database; it includes how often the data base will be updated, how much data is replaced each time

NLS/IITB/DWH 78

Back Room Infrastructure Factors...

– Number of users

• How active the users are, how many are active concurrently, and their geographical distribution etc. are important factors in selecting a platform

– Number of business processes

• It increases the complexity of the data warehouse

• Separate hardware platforms for each business process

– Nature of use

• It depends on the front-end tools, implication on platform selection, types of queries etc..

NLS/IITB/DWH 79

Technical Factors

• Platforms

– NT servers for medium-sized warehouse

• The NT is cost-effective platform for smaller warehouses or data marts

– Open system servers

• The open system, or Unix, servers are the primary platform for most medium-sized or larger warehouse

• If the data warehouse is based on a Unix environment, the warehouse team will need to know administrative tools, basic Unix commands and utilities to be able to develop and manage the warehouse

NLS/IITB/DWH 80

Technical Factors...

• Disks

– Disk drives can have a major impact on the performance, flexibility, and scalability of the warehouse platform

• Memory

– More memory is better for data warehousing

– Transaction requests are small and typically don’t need much memory; decision support queries require more memory and involve large tables

– If the table can fit in memory the performance can improve 10 to 100 times

NLS/IITB/DWH 81

Technical Factors...

• Database platform

– Data warehouses are implemented using mainframe-based database products

– Some data warehouses are implemented using specialized multidimensional database products called MOLAP (multidimensional on-line analytical processing) engines

– MOLAP engines came about in response to three main user requirements: simple data access, cross-tab-style reports, fast response time

– The significant benefit of using a MOLAP engine is the end user query performance

NLS/IITB/DWH 82

Physical Design

• In the physical design, the data warehouse team is required to estimate the warehouse’s size

• In data warehouses, the size of dimension tables is insignificant compared to the size of the fact tables and the size of the indexes on the fact tables

NLS/IITB/DWH 83

Initial Sizing Estimates...

• Preliminary sizing estimates include

– Estimate row length

– Estimate number of rows

– Count and sizes of indexes

– Temp space

– space for metadata tables

– Considerable space for aggregate tables
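The sizing items above combine into simple arithmetic. In the sketch below every ratio (index, aggregate and temp space as a fraction of the fact table) and the metadata allowance are assumed planning factors chosen for illustration, not figures from the slides.

```python
def estimate_warehouse_gb(rows, row_bytes, index_ratio=1.0, aggregate_ratio=1.0,
                          temp_ratio=0.5, metadata_gb=0.1):
    """Rough size estimate in GB; every ratio is an assumed planning factor."""
    fact_gb = rows * row_bytes / 1e9          # rows x estimated row length
    parts = {
        "fact": fact_gb,
        "indexes": fact_gb * index_ratio,
        "aggregates": fact_gb * aggregate_ratio,
        "temp": fact_gb * temp_ratio,
        "metadata": metadata_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

# Example: 500 million fact rows of about 40 bytes each
estimate = estimate_warehouse_gb(rows=500_000_000, row_bytes=40)
for name, size_gb in estimate.items():
    print(f"{name:<10} {size_gb:6.1f} GB")
```

Dimension tables are left out of the estimate because their size is insignificant next to the fact tables and their indexes.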

NLS/IITB/DWH 84

Indexes and Query Strategies

• To develop an index plan, it’s important to understand how the RDBMS’s query optimizer and indexes work

– The B-tree index

– The bitmapped index

– The hash index

– Other index types

– Star schema optimization

• Indexing the fact tables, Dimension tables, and indexing for loads

NLS/IITB/DWH 85

[Figure: End User Application. Labels shown: Nature of use (Strategic, Operational); Customer Type (ad hoc power user, push-button knowledge workers, standard report consumers); Information Interface (desktop tools for do-it-yourself queries, end user application, operational reporting environment); Value; Migration path; Reporting/Analysis examples: assured reference points, low effort, current business view, flexible]

NLS/IITB/DWH 86

End User Application Template

• It provides the layout and structure of a report that is driven by a set of parameters

• This approach allows users to generate a number of similarly structured reports from a single template

• Through the drill-down capabilities, a user could produce reports on other attributes; this action results in changing the actual template structure

• Many data access tools provide this functionality transparently

NLS/IITB/DWH 87

Typical Analysis Cycle

• How’s business?

• What are the trends?

• What’s unusual?

• What is driving those exceptions?

• What if…?

• Make a business decision

• Implement the decision

NLS/IITB/DWH 88

The Desktop Installation Readiness

• The back room architecture and infrastructure will be established long before deployment as it is needed for development activities

• The technology residing on user’s desktop is the last piece that must be put in place prior to the deployment

NLS/IITB/DWH 89

The Desktop Installation Readiness...

• Checklist of activities that should occur well before the deployment

– Determine the client configuration requirement

– Determine LAN addresses

– Conduct a physical audit

– Complete the contract and procurement process

– Acquire user logons and security approval

– Test installation procedures on a variety of machines

– Schedule the installation

– Install the desktop hardware and/or software

– Complete installation testing

NLS/IITB/DWH 90

End User Education Strategy

• A robust education strategy for business end users is a prerequisite for data warehouse success

• Integrate and tailor education content

• Education for business users must address three key aspects of the data warehouse

– Data content

– End user application

– The data access tool

NLS/IITB/DWH 91

The End User Education Strategy…

• Data education content

– Provide an overview of structures, hierarchies, business rules, and definitions

– Before deployment, identify, document, and communicate these data to the business users

– Factors causing discrepancy between data from the warehouse and previously reported information are :

• The data warehouse information is incorrect

• The warehouse information has a different or new business definition or meaning

• The previously reported information was incorrect

NLS/IITB/DWH 92

An End User Support Strategy

• The user support strategies vary by organization and culture, based largely on the expectations of senior business management

• Determine the support organization structure

– A centralized team of support resources handles the more global data warehouse maintenance and responsibility

– The team typically serves as a second line of defense, and provides a pool of advanced application development resources

NLS/IITB/DWH 93

An End User Support Strategy...

• Establish support communication and feedback

– Communication with your users should, at a minimum, consist of general information and status updates

– Success stories can help motivate

• Provide support documentation

• Create a Warehouse web site

NLS/IITB/DWH 94

Conclusion

• Building a corporate-wide data warehouse is a challenging task

• A systematic methodology is essential

• Plan the architecture globally but build it incrementally

• Keep user requirements at the core of all development activities