
NIC Exposure Level Training

Vijayendra Gururao, Business Intelligence Consultant

Agenda

Data Warehousing Concepts - Day 1
Govt Case Study - Day 2
Defense HR Case Study - Day 3
Manufacturing Case Study - Day 4
Data Mining - Day 5

The Evolution of Business Intelligence

[Figure: from 1986 to 2001, analysis moves from passive to active and from human to technology; the action evolves from Report ("what's important") through Summarize (MIS) and Focus (EIS) ("what it means") and Analyze (OLAP) ("what to do about it") to Recommend (Data Mining) and Act (Intelligent Agents).]

Data Warehousing

Introduction:

Definitions
Legacy systems
Dimensions
Data dependencies model
Dimensional model

An ER Model

[ER diagram with entities: SalesOrder, OrderItem, Product, ProductLine, ProductGroup, Customer, Cust. Locat., Contact, Contact Locat., Ship To, Shipper, ShipType, Contract, ContractType, Credit, District, SalesRep, SalesDistrict, SalesRegion, SalesDivision.]

Why Data Warehouses?

To meet the long-sought goal of providing the user with more flexible databases containing data that can be accessed "every which way."

OLTP vs. OLAP

OLTP (online transaction processing) has been the mainstay of IS and data processing for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.

OLAP (online analytical processing) is a set of procedures for defining and using a dimensional framework for decision support.

The Goals for and Characteristics of a DW

Make organizational data accessible
Facilitate consistency
Adaptable and yet resilient to change
Secure and reliable
Designed with a focus on supporting decision making

The Goals for and Characteristics of a DW

Generate an environment in which data can be sliced and diced in multiple ways

It is more than data; it is a set of tools to query, analyze, and present information

The DW is the place where operational data is published (cleaned up, assembled, etc.)

Data Warehousing is Changing!

ERP
Campaign Management
Supply Chain
Customer Relationship Management
E-commerce
Target Marketing
Knowledge Management
Call Center

Application requirements--not just data requirements--are now driving need.

Organization of data in the presentation area of the data warehouse

Data in the warehouse are dimensional, not normalized relations. However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs.

Data should be atomic someplace in the warehouse, even if the presentation is aggregate.

Uses the bus architecture to support a decentralized set of data marts.

Updates to a data warehouse

For many years, the dogma stated that data warehouses are never updated. This is unrealistic, since labels, titles, etc. change.

Some components will, therefore, be changed, albeit via a managed load (as opposed to transactional updates).

Basic elements of the data warehouse

[Figure: Operational Source Systems --extract--> Data Staging Area --load--> Data Presentation Area --access--> Data Access Tools]

Data Staging Area
Services: clean, combine, and standardize; conform dimensions; no user query services
Data store: flat files and relational tables
Processing: sorting and sequential processing

Data Presentation Area
Data Mart #1: dimensional; atomic and summary data; based on a single business process
Data Mart #2: similar design
DW Bus: conformed facts and dimensions

Data Access Tools
Ad hoc query tools
Report writers
Analytical applications
Modeling: forecasting, scoring, data mining

Data Staging Area

Extract-Transformation-Load

Extract: reading the source data and copying the data to the staging area.
Transformation: cleaning, combining, de-duplicating, and assigning keys.
Load: presenting the data to the bulk loading facilities of the data mart.
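The deck has no code, so here is a minimal Python sketch of the three steps, not the course's actual tooling; the orders feed, field names, and cleaning rules are all hypothetical.

```python
# A minimal ETL sketch, assuming a hypothetical orders feed.

def extract():
    """Extract: read the source data (hard-coded here in place of a real
    source system) and copy it into the staging area."""
    return [
        {"order_id": " A1 ", "customer": "acme ltd"},
        {"order_id": "A1",   "customer": "Acme Ltd"},   # duplicate of A1
        {"order_id": "A2",   "customer": "zenith co"},
    ]

def transform(rows):
    """Transform: clean, de-duplicate, and assign surrogate keys."""
    seen, staged, next_key = set(), [], 1
    for row in rows:
        business_key = row["order_id"].strip()      # cleaning
        if business_key in seen:                    # de-duplication
            continue
        seen.add(business_key)
        staged.append({
            "order_sur": next_key,                  # assigned surrogate key
            "order_id": business_key,
            "customer": row["customer"].strip().title(),
        })
        next_key += 1
    return staged

def load(staged):
    """Load: hand the staged rows to the data mart's bulk loader
    (printing stands in for the bulk loading facility)."""
    for row in staged:
        print(row)

load(transform(extract()))
```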

Dimensional Modeling Terms and Concepts

Fact tables
Dimension tables

Fact Tables

Fact table: a table in the data warehouse that contains
Numerical performance measures
Foreign keys that tie the fact table to the dimension tables

Fact Tables

Each row records a measurement describing a transaction

Where? When? Who? How much? How many?

The level of detail represented by this data is referred to as the grain of the data warehouse

Questions can only be asked down to a level corresponding with the grain of the data warehouse

Dimension tables

Tables containing textual descriptors of the business
Dimension tables are usually wide (e.g., 100 columns)
Dimension tables are usually shallow (hundreds of thousands or a few million rows)
Values in the dimensions usually provide:
Constraints on queries (e.g., view customer by region)
Report headings

Dimension tables

The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes

Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations (e.g., product part, product, brand)

Dimensions are usually highly denormalized

Dimension tables

The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support

Bringing together facts and dimensions: building the dimensional model

Start with the normalized ER model.
Group the ER diagram components into segments based on common business processes and model each as a unit.
Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table.
Denormalize the other tables as needed and designate one field as a primary key (see the sketch below).
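To make the denormalization step concrete, here is a small Python sketch (hypothetical tables and names) that collapses a normalized ProductGroup → ProductLine → Product hierarchy into one wide dimension row per product.

```python
# Normalized source tables (hypothetical), as they might come from the ER model.
product_groups = {1: "Beverages"}
product_lines  = {10: {"name": "Soft Drinks", "group_id": 1}}
products       = {100: {"name": "Cola 330ml", "line_id": 10}}

# Denormalize: one wide row per product, with the hierarchy flattened into columns.
product_dim = []
for product_key, p in products.items():
    line = product_lines[p["line_id"]]
    product_dim.append({
        "product_key": product_key,            # designated primary key
        "product": p["name"],
        "product_line": line["name"],
        "product_group": product_groups[line["group_id"]],
    })

print(product_dim[0])
# {'product_key': 100, 'product': 'Cola 330ml',
#  'product_line': 'Soft Drinks', 'product_group': 'Beverages'}
```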

A Dimensional Model

Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
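As a sketch only, the same model can be built in SQLite through Python's standard sqlite3 module; the sample rows and the brand-by-quarter query are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day_of_week TEXT,
                          month TEXT, quarter TEXT, year INTEGER, holiday_flag INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          address TEXT, floor_plan_type TEXT);
CREATE TABLE sales_fact  (time_key INTEGER REFERENCES time_dim,
                          product_key INTEGER REFERENCES product_dim,
                          store_key INTEGER REFERENCES store_dim,
                          dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
""")

# A few hypothetical rows.
con.execute("INSERT INTO time_dim VALUES (1,'Mon','Jan','Q1',2004,0)")
con.execute("INSERT INTO product_dim VALUES (1,'Cola 330ml','Acme','Beverages')")
con.execute("INSERT INTO store_dim VALUES (1,'Downtown','1 Main St','compact')")
con.execute("INSERT INTO sales_fact VALUES (1,1,1,120.0,40,70.0)")

# Dimensional query: constrain by dimension attributes, aggregate the facts.
for row in con.execute("""
    SELECT p.brand, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t    ON t.time_key    = f.time_key
    GROUP BY p.brand, t.quarter"""):
    print(row)   # ('Acme', 'Q1', 120.0)
```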

Kimball Methodology

Conformed Dimensions

Review: A Private Data Mart

A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.

What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?

Understand Conformed Dimensions

[Figure: Sales Fact, Distribution Fact, and Order Fact all share the conformed dimensions Location, Customer, Product, Time, Distributor, and Promotion.]

A conformed dimension is a dimension that is standardized across all data marts.

Advantages of Conformed Dimensions

Deliver incremental data marts in a short period of time.

Independent data marts become part of a fully integrated data warehouse.

Deliver a consistent view across your business.

Conformed Dimensions Within Bus Architecture

[Bus matrix: one row per fact table (Sales Fact, Distribution Fact, Order Fact), one column per conformed dimension (Location, Customer, Product, Time, Distributor, Promotion); an X marks each dimension that a fact table uses.]

Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.

Design of Conformed Dimensions

A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.

Lay out a broad dimensional map for the enterprise.
Define conformed dimensions at the most granular (atomic) level possible.
Conformed dimensions should always use surrogate keys.
Define standard definitions for dimension and fact attributes.

Granularity in Conformed Dimensions

D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
D_Time (Day): Day Id, Day, Month, Year
D_Customer: Customer Id, Last Name, First Name, Address
Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered

Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.

Flexibility of Conformed Dimensions

Product: Product Id, Description, Product Type, Product Line
Customer: Customer Id, Last Name, First Name, Address
Time (Day): Day Id, Day, Month Id, Period
Time (Month) view or snowflake table: Month Id, Month, Period

Conformed dimensions are usually designed within star schema data marts. For fact tables of multiple granularities, higher-level views of dimensions can be used (or a snowflake table):

Order Fact (day grain): Day Id, Product Id, Customer Id, Cost, Number Ordered
Sales Fact (month grain): Month Id, Product Id, Customer Id, Amount Sold, Revenue
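A minimal sketch of the view approach in SQLite (hypothetical names): a month-level view is derived from the day-grain time dimension so a month-grain fact table has something to join to.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE d_time (day_id INTEGER PRIMARY KEY, day TEXT,
                     month_id INTEGER, month TEXT, period TEXT);
INSERT INTO d_time VALUES (1,'2004-03-01',200403,'2004-03','Q1'),
                          (2,'2004-03-02',200403,'2004-03','Q1');

-- Higher-level view of the conformed Time dimension for month-grain facts.
CREATE VIEW d_time_month AS
SELECT DISTINCT month_id, month, period FROM d_time;
""")

print(con.execute("SELECT * FROM d_time_month").fetchall())
# [(200403, '2004-03', 'Q1')]
```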

So, What is a DW?

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions

W.H. Inmon (the father of DW)

Subject Oriented

Data in a data warehouse are organized around the major subjects of the organization

Integrated

Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment

Non-Volatile

Once added to the DW, data are not changed (barring the existence of major errors)

Time Variant

The DW captures data at a specific moment; thus, it is a snapshot view of the organization at that moment in time. As these snapshots accumulate, the analyst is able to examine the organization over time (a time series!)

The snapshot is called a production data extract.

Need for Data Warehousing

Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)

[Table: comparison of operational and informational systems.]

Data Warehouse Architectures

Generic two-level architecture
Independent data mart
Dependent data mart and operational data store
Logical data mart and @ctive warehouse
Three-layer architecture

All involve some form of extraction, transformation and loading (ETL).

Generic two-level architecture

One, company-wide warehouse fed by ETL.
Periodic extraction, so data is not completely current in the warehouse.

Independent Data Mart

Data marts: mini-warehouses, limited in scope.
Separate ETL for each independent data mart.
Data access complexity due to multiple data marts.

Dependent data mart with operational data store

Single ETL for the enterprise data warehouse (EDW).
Simpler data access.
The ODS provides an option for obtaining current data.
Dependent data marts loaded from the EDW.

Logical data mart and @ctive data warehouse

Near-real-time ETL for the @ctive data warehouse.
ODS and data warehouse are one and the same.
Data marts are NOT separate databases, but logical views of the data warehouse, making it easier to create new data marts.

Three-layer architecture

DW Design

Mainly consists of logical design and physical design.

Logical Design of DW

Identification of:
Entities
Relationships
Attributes
Unique identifiers

Conceptual and abstract; results in fact and dimension tables.

Created using pen and paper or modeling tools.

Physical Design of DW

Conversion of the data gathered in logical design into a physical database structure.

Mainly driven by query performance.

[Figure: the logical model is converted into the physical model.]

Data Characteristics: Status vs. Event Data

[Figure: example of a DBMS log entry, showing the before-image status, the event, and the after-image status.]

Event = a database action (create/update/delete) that results from a transaction.

Data Characteristics: Transient vs. Periodic Data

Figure 11-8: Transient operational data. Changes to existing records are written over previous records, thus destroying the previous data content.

Data Characteristics: Transient vs. Periodic Data

Periodic warehouse data: data are never physically altered or deleted once they have been added to the store.

Data Reconciliation

Typical operational data is:
Transient – not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scope – not comprehensive
Sometimes poor quality – inconsistencies and errors

After ETL, data should be:
Detailed – not summarized yet
Historical – periodic
Denormalized
Comprehensive – enterprise-wide perspective
Quality controlled – accurate with full integrity

Extract Transform Load

Extract data from the operational system, transform it, and load it into the data warehouse.

Why ETL? Will your warehouse produce correct information with the current data? How can I ensure warehouse credibility?

Excuses for NOT Transforming Legacy Data

Old data works fine; new data will work as well.
Data will be fixed at the point of entry through the GUI.
If needed, data will be cleaned after the new system is populated, after a proof-of-concept pilot.
Keys join the data most of the time.
Users will not agree to modifying or standardizing their data.

Levels of Migration Problem

Existing metadata is insufficient and unreliable:
Metadata must hold for all occurrences
Metadata must represent business and technical attributes

Data values incorrectly typed and accessed:
Value form extracted from storage
Value meaning inferred from content

Entity keys unreliable or unavailable:
Inferred from related values

Metadata Challenge

Metadata gets out of synch with the details it summarizes.
Business grows faster than the systems designed to capture business info.
Not at the right level of detail:
Multiple values in a single field
Multiple meanings for a single field
No fixed format for values
Expressed in awkward or limited terms:
Program/compiler view rather than business view

Character-level Challenge

Value instance level:
Spelling, aliases
Abbreviations, truncations, transpositions
Inconsistent storage formats

Named type level:
Multiple meanings, contextual meanings
Synonyms, homonyms

Entity level:
No common keys or representation
No integrated view across records, files, systems

The ETL Process

Capture
Scrub (data cleansing)
Transform
Load and index

ETL = Extract, transform, and load

The ETL Process

[Figure: Source Systems --extract--> Staging Area --transform, load--> Presentation System]

Source Data

Record the name, location, and data that exist in the TPS environment:
File names and location
Layout
Attribute meaning

[Template columns: Source | Business Owner | IS Owner | Platform | Location | Data Source Description]

Extraction

Copy specific data directly from the source tables into a working dataset in the staging area.

[Template columns: Target Table | Target Column | Data Type | Len | Target Column Description | Source System | Source Table / File | Source Col / Field | Data Txform Notes]

Transformation (Dimension Tables)

Generate surrogate keys in a primary-surrogate table. Make this permanent.
Insert the surrogate key into the working dimension tables.
Conduct any editing/cleaning operations you need (usually on the working table).
Generate any derived attributes you need.
Generate and retain process logs.
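A minimal Python sketch of the surrogate-key steps (all names hypothetical): the business-key-to-surrogate mapping lives in its own permanent structure, and each working dimension row gets its surrogate, its cleaning pass, and a log entry.

```python
# Permanent primary-to-surrogate mapping table (kept across loads).
key_map = {}          # business key -> surrogate key
next_surrogate = 1

def surrogate_for(business_key):
    """Look up, or assign, the surrogate for a business key."""
    global next_surrogate
    if business_key not in key_map:
        key_map[business_key] = next_surrogate
        next_surrogate += 1
    return key_map[business_key]

# Working dimension table from staging (hypothetical rows).
working_dim = [
    {"cust_code": "SA 1 11", "name": " safeway "},
    {"cust_code": "LO 2 22", "name": "Loblaws"},
]

log = []
for row in working_dim:
    row["cust_sur"] = surrogate_for(row["cust_code"])  # insert surrogate key
    row["name"] = row["name"].strip().title()          # editing/cleaning
    log.append(f"assigned {row['cust_sur']} to {row['cust_code']}")  # process log

print(working_dim)
print(log)
```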

Transformation (Fact Tables)

Join all dimensions to the fact table (using original primary keys).
Insert surrogate keys.
Generate derived facts.
Generate indicator flags.
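And the matching fact-side sketch (again hypothetical): incoming fact rows are joined to the dimension on the original business key, surrogates are swapped in, and a derived fact and an indicator flag are generated.

```python
# Dimension rows after the dimension transform (hypothetical).
customer_dim = [
    {"cust_sur": 10, "cust_code": "SA 1 11", "name": "Safeway"},
    {"cust_sur": 20, "cust_code": "LO 2 22", "name": "Loblaws"},
]
by_code = {d["cust_code"]: d["cust_sur"] for d in customer_dim}

# Incoming fact rows, still keyed by the original business key.
facts = [{"cust_code": "SA 1 11", "revenue": 120.0, "cost": 70.0}]

for f in facts:
    f["cust_sur"] = by_code[f["cust_code"]]   # join on business key, keep surrogate
    f["margin"] = f["revenue"] - f["cost"]    # derived fact
    f["profitable_flag"] = f["margin"] > 0    # indicator flag
    del f["cust_code"]                        # fact table keeps only surrogates

print(facts)
# [{'revenue': 120.0, 'cost': 70.0, 'cust_sur': 10,
#   'margin': 50.0, 'profitable_flag': True}]
```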

[Template columns: Chg Flag | Fact Group | Derived Fact Name | Derived Fact Description | Type | Agg Rule | Formula | Constraints | Transformations]

Target Data

Describe the presentation data structure:
Model
Metadata
Usage and constraints

[Template columns: Table Name | Column Name | Data Type | Len | Nulls? | Column Description | PK | PK Order | FK]

Flow Documentation

DFD for the ETL process
ERDs for the source, staging, and target databases
Metadata usage notes

Steps in data reconciliation

Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Steps in data reconciliation (continued)

Scrub = cleanse… uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
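A toy scrub pass in Python, with a hypothetical lookup table of known misspellings: fix spellings, reformat dates, and drop duplicates.

```python
from datetime import datetime

SPELLING_FIXES = {"Dalas": "Dallas"}   # hypothetical table of known errors

def scrub(rows):
    seen, clean = set(), []
    for r in rows:
        city = SPELLING_FIXES.get(r["city"], r["city"])          # fix misspellings
        date = datetime.strptime(r["date"], "%m/%d/%Y").date()   # reformat dates
        if r["order_id"] in seen:                                # drop duplicates
            continue
        seen.add(r["order_id"])
        clean.append({"order_id": r["order_id"], "city": city,
                      "date": date.isoformat()})
    return clean

rows = [{"order_id": "A1", "city": "Dalas", "date": "01/05/1999"},
        {"order_id": "A1", "city": "Dallas", "date": "01/05/1999"}]
print(scrub(rows))
# [{'order_id': 'A1', 'city': 'Dallas', 'date': '1999-01-05'}]
```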

Steps in data reconciliation (continued)

Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)

Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse

Single-field transformation

In general – some transformation function translates data from old form to new form

Algorithmic transformation uses a formula or logical expression

Table lookup – another approach

Multi-field transformation

M:1 – from many source fields to one target field
1:M – from one source field to many target fields
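All four field-level cases fit in a few lines of Python; the fields and rules here are hypothetical illustrations, not the chapter's examples.

```python
# Single-field, algorithmic: a formula converts the old form to the new.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Single-field, table lookup: a code is translated via a lookup table.
STATE_NAMES = {"TX": "Texas", "NY": "New York"}

# Multi-field, M:1 - many source fields feed one target field.
def full_name(first, last):
    return f"{first} {last}"

# Multi-field, 1:M - one source field is split into many target fields.
def split_phone(phone):            # e.g. "214-555-0101"
    area, exchange, number = phone.split("-")
    return {"area": area, "exchange": exchange, "number": number}

print(fahrenheit_to_celsius(212))        # 100.0
print(STATE_NAMES["TX"])                 # Texas
print(full_name("Mary", "Smith"))        # Mary Smith
print(split_phone("214-555-0101"))       # {'area': '214', ...}
```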

Derived Data

Objectives:
Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad hoc query support
Data mining capabilities

Characteristics:
Detailed (mostly periodic) data
Aggregate (for summary)
Distributed (to departmental servers)

Most common data model = star schema (also called the "dimensional model")

Components of a star schema:
Fact tables contain factual or quantitative data

Dimension tables contain descriptions about the subjects of the business

1:N relationship between dimension tables and fact tables

Excellent for ad-hoc queries, but bad for online transaction processing

Dimension tables are denormalized to maximize performance

Star schema example

Fact table provides statistics for sales broken down by product, period and store dimensions

Star schema with sample data

Advanced concepts

Slowly changing dimensions
Ragged hierarchies

What if Our Data is not Static?

Small occasional changes in dimension data are normal in business. Examples of these changes include:
Addition of new members (a new product is launched)
Changing of relationships within the dimension (a sales rep moves to another branch)
Properties of members changed (a product is reformulated or renamed)
Deletion of members (this is rare in data warehousing)

Understand Surrogates

A 4-byte integer key (can hold more than 2 billion positive integers):
Internally assigned and meaningless
Ensures uniqueness
Always known

Used in conjunction with business keys:
The business key is often mnemonic; for example, OTA used for the Ottawa office
The surrogate key is numeric; for example, 000128

Surrogate keys are never used in reports. They are used to link dimension tables to fact tables.

Understand Surrogate Keys Used In Operational Systems

Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.

A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.

You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).

Operational surrogate keys are usually considered business keys in the data mart.

Understand Natural Keys: Example

Fact Table:
Prod Code | Cust Code | Measures
PR X 002 39 | SA 1 11 | …
PR X 003 40 | LO 2 22 | …
PR Y 003 40 | SE 5 55 | …

Product Dimension:
Prod Code | Name
PR X 002 39 | Soup
PR X 003 40 | Beans
PR Y 003 40 | Peas

Customer Dimension:
Cust Code | Name
SA 1 11 | Safeway
LO 2 22 | Loblaws
SE 5 55 | 7-11

Understand Surrogate Keys: Example

Fact Table:
Prod Sur | Cust Sur | Measures
1 | 10 | …
2 | 20 | …
3 | 30 | …

Product Dimension:
Prod Sur | Prod Code | Name
1 | PR X 002 39 | Soup
2 | PR X 003 40 | Beans
3 | PR Y 003 40 | Peas

Customer Dimension:
Cust Sur | Cust Code | Name
10 | SA 1 11 | Safeway
20 | LO 2 22 | Loblaws
30 | SE 5 55 | 7-11

Track Dimensional Changes Over Time

Operational systems tend to contain data about the current state of the business.

A data warehouse is expected to hold data for five to 10 years.

Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January/1999 and December/1999?).

If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?

Understand Slowly Changing Dimensions (SCD)

Operational dimensional data may often be thought of as static. It may only need to reflect the current state.

Data warehouse dimensional data often must show how the dimensional data changes over time. It is not static.

The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.

Understand Issues With Slowly Changing Dimensions

Maintaining SCDs can be complex without surrogates. Without them, the natural key in the fact table must grow with each tracked attribute: * (Emp. No + Branch), ** (Emp. No + Branch + Position), *** (Emp. No + Branch + Position + Salary). Imagine the effect of having such a large natural key in the fact table.

[Example: Emp. No is the business key; Branch, Position, and Salary are Type 2 columns; Name and Hire Date are normal columns.]

Surrogate | Emp. No | Name | Branch | Position | Hire Date | Salary
1 | 10001 | Jack | OTA | VP | Jan88 | 50K
2 | 10002 | Jane | ARL | MK | Jan92 | 40K
3 | 10003 | Tom | NY | SS | Jan93 | 35K
4 | 10001 | Jack | SJ | VP | Jan88 | 50K
5 | 10001 | Jack | SJ | S-VP | Jan88 | 50K
6 | 10001 | Jack | SJ | S-VP | Jan88 | 60K

Use Different Methods of Handling Dimensional Changes

The two most commonly used types of SCDs (according to Kimball):
Type 1: Overwrite the old value with the new value (do not track the changes).
Type 2: Add a new dimension record with a new surrogate key (track changes over time).

A single row may have a combination of columns of different types.

Type 1: Overwrite the Original Value

Sales Fact Table:
Rep Key | Order Date | Cust Key
00128 | 1/1/1999 | 12345
00128 | 2/1/1999 | 12345

Sales Rep Dimension Table (before):
Rep Key | Name | Marital Status | Office
00128 | Mary Smith | Single | Dallas

Sales Rep Dimension Table (after):
Rep Key | Name | Marital Status | Office
00128 | Mary Jones | Married | Dallas

The organization may choose not to track certain data changes because:
the original data may have been incorrect
the change is not considered relevant for tracking

Type 2: Add a New Dimension Record

When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
Usually, an effective begin/end date is also updated on the new and old rows.
Multiple rows may have the same business key, but they will always have unique surrogate keys.

Sales Rep Dimension Table:
Rep Sur Key | Rep Key | Name | Office | Eff Date
11111 | 00128 | Mary Smith | Dallas | 9901
11112 | 00128 | Mary Smith | NYC | 9903

Sales Fact Table:
Rep Sur Key | Order Date | Cust Key
11111 | 01/01/1999 | 12345
11111 | 02/01/1999 | 12345
11112 | 03/01/1999 | 12345
11112 | 04/01/1999 | 12345
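A compact Python sketch of both techniques against the Mary Smith example (the row structure is hypothetical): Type 1 overwrites in place, while Type 2 closes off the current row and adds a new one under a fresh surrogate key.

```python
dim = [{"rep_sur": 11111, "rep_key": "00128", "name": "Mary Smith",
        "office": "Dallas", "eff_from": "1999-01", "eff_to": None}]
next_sur = 11112

def scd_type1(rep_key, **changes):
    """Type 1: overwrite the current row in place; history is lost."""
    current = next(r for r in dim if r["rep_key"] == rep_key and r["eff_to"] is None)
    current.update(changes)

def scd_type2(rep_key, eff_from, **changes):
    """Type 2: expire the current row, add a new row with a new surrogate key."""
    global next_sur
    current = next(r for r in dim if r["rep_key"] == rep_key and r["eff_to"] is None)
    current["eff_to"] = eff_from                      # close off the old row
    dim.append({**current, **changes,
                "rep_sur": next_sur, "eff_from": eff_from, "eff_to": None})
    next_sur += 1

scd_type1("00128", name="Mary Jones")        # untracked change: name corrected
scd_type2("00128", "1999-03", office="NYC")  # tracked change: office move
for row in dim:
    print(row)
```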

Balanced and Ragged Hierarchies

Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).

Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.

In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.

Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.

Parent-Child Relationships

[Figure: an Employees table with a recursive "Reports To" relationship, linked to Orders.]

Parent-child relationships are recursive relationships.
The levels of the hierarchy are determined by rows of the same table.

Ragged Hierarchies

[Employee hierarchy figure: Andrew Fuller at the top; Nancy Davolio, Janet Leverling, Margaret Peacock, Laura Callahan (all leaves), and Steven Buchanan report to him; Michael Suyama, Robert King, and Anne Dodsworth (leaves) report to Steven Buchanan.]

Leaf nodes have no children. DecisionStream fact builds only look for leaf nodes at the lowest level.

Resolve Ragged Hierarchies: Step 1

Create an auto-level hierarchy to obtain the number of levels.
Create a dimension build to create a physical table that identifies, for each row, the level it belongs to.

Resolve Ragged Hierarchies: Step 1 (cont'd): Use Auto-Level Hierarchies

[Figure: Andrew Fuller is at the top level (Level 1); employees who report to Andrew Fuller are at Level 2; employees who report to Steven Buchanan are at Level 3.]

The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy.
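DecisionStream's internals aren't shown in the slides, but the level computation itself is a short recursion; this Python sketch uses the slide's employee data to assign each row its level.

```python
# Parent-child (recursive) employee table: employee -> who they report to.
reports_to = {
    "Andrew Fuller": None,            # top of the hierarchy
    "Nancy Davolio": "Andrew Fuller",
    "Steven Buchanan": "Andrew Fuller",
    "Michael Suyama": "Steven Buchanan",
}

def level(emp):
    """Level 1 = top level; each step down the Reports To chain adds one."""
    parent = reports_to[emp]
    return 1 if parent is None else 1 + level(parent)

for emp in reports_to:
    print(emp, "-> Level", level(emp))
# Andrew Fuller -> Level 1
# Nancy Davolio -> Level 2
# Steven Buchanan -> Level 2
# Michael Suyama -> Level 3
```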

Issues Regarding Star Schema

Dimension table keys must be surrogate (non-intelligent and non-business related), because:
Keys may change over time
Length/format consistency

Granularity of the fact table: what level of detail do you want?
Transactional grain – finest level
Aggregated grain – more summarized
Finer grain → better market basket analysis capability
Finer grain → more dimension tables, more rows in the fact table

The User Interface: Metadata (data catalog)

Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise data warehouses, including derivation rules
Indicate how data is derived from the operational data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g., drill-down)
Identify responsible people

Q & A

Multi-dimensional data

On-Line Analytical Processing (OLAP): the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

Relational OLAP (ROLAP): traditional relational representation
Multidimensional OLAP (MOLAP): cube structure

OLAP Operations

Cube slicing – come up with a 2-D view of the data
Drill-down – going from summary to more detailed views
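A small pandas illustration with invented data: the pivot is a 2-D slice of the cube, and adding month to the grouping is a drill-down from the yearly summary.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2004, 2004, 2004, 2004],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["Cola", "Peas", "Cola", "Peas"],
    "units":   [40, 25, 35, 30],
})

# Cube slicing: a 2-D view (product x month) of the units measure.
print(sales.pivot_table(values="units", index="product",
                        columns="month", aggfunc="sum"))

# Drill-down: from a yearly summary to a year/month breakdown.
print(sales.groupby("year")["units"].sum())
print(sales.groupby(["year", "month"])["units"].sum())
```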

Overall Plan

We need fast answers to analytical questions

Relational model may not be the answer

We can restructure data specifically for analysis

First we need to find out how people analyse data

Overall Plan

Analysing how people analyse data reveals the importance of measures and dimensions

So we structure the data with that in mind

The star schema is the physical structure that emerges

We can implement this as ROLAP, MOLAP and HOLAP

We achieve our objective – rapid analytical processing

How do we make databases faster? Lots of ways:

Indexing
Query design
Application design
Care with locking
Data structuring

Data structuring – the relational model

Pros:
Data integrity in the face of user updates
Small data volume
Good for transactional queries

Cons:
Poor analytical query performance

Why are relational databases slow?

Joins
Functions
Aggregations

The result: poor analytical query performance.

So, there is a tension between:
Transactions
Analytical querying

Solution: split them up. Take a copy of the transactional database and structure it in a totally different way, one that is optimised for analytical querying.

Structure for analytical querying

Great idea, but first we need to find out how people analyse their data

How people analyse their data

People analyse their data in terms of:
Graphs
Grids
Reports

Do these have anything in common?

How people analyse their data

Do these have anything in common?

Measures:
Numerical values
Typically plotted on the Y axis

Dimensions:
Discontinuous variables
Slice the measures into aggregated groups
Typically plotted on the X axis

How people analyse their data

Dimensions are often hierarchical. People want to analyse:
Time by Year, Quarter, Month, Day
Product by Warehouse, Type, Product
Customer by Country, County, Person

How people analyse their data

So, we need to summarise all of this: measures, dimensions, hierarchies.

[Figure: the "Squashed Octopus" logical model. The measures (Delay, Profit, Quantity) sit at the centre, with dimension hierarchies radiating outwards: Time (Year, Quarter, Month, Week, Day), Product (Class, Warehouse, Item), Customer (Country, Region, County, Name), and Employee (Region, Employee).]

Squashed Octopus

The SO is a logical model. What about a physical model? (Recap why the relational model is slow for analytical querying: joins, functions, aggregations.)

[Figures: the relational schema restructured as a star schema.]

Star Schema

What is in the fact table? Facts

Star Schema

What is in a dimension table? Dimensional information and hierarchical information.

Star Schema

How do dimension and fact tables work together?

Star Schema

Is it faster?

Query | Relational time | Star schema time
Monthly totals | 70 | 60
Sales in March by Product | 18 | 6
Sales in March 2004 by Product | 12 | 2

Star Schema

Is it faster? Yes.
How can we make it even faster? Aggregation.

Query | Relational time | Star schema time | Aggregated time
Monthly totals | 70 | 60 | <1
Sales in March by Product | 18 | 6 | <1
Sales in March 2004 by Product | 12 | 2 | <1
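Aggregation here just means precomputing group-bys into their own tables. A sketch in SQLite with an invented schema: the monthly query then scans a tiny aggregate table instead of the fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_fact (month TEXT, product TEXT, units INTEGER);
INSERT INTO sales_fact VALUES ('2004-03','Cola',40), ('2004-03','Peas',25),
                              ('2004-04','Cola',35);

-- Precomputed aggregate table: one row per month instead of one per sale.
CREATE TABLE sales_by_month AS
SELECT month, SUM(units) AS units FROM sales_fact GROUP BY month;
""")

print(con.execute("SELECT * FROM sales_by_month").fetchall())
# [('2004-03', 65), ('2004-04', 35)]
```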

Star Schema

If we leave this as a set of tables then it is ROLAP – Relational OLAP.
(OLAP – On-Line Analytical Processing)
But it is a pain to manage: all those aggregation tables.

Star Schema

So, the answer is MOLAP (Multi-dimensional OLAP)

Star Schema

Finally HOLAP (Hybrid OLAP)

On-Line Analytical Processing (OLAP)

OLAP Operations:
Cube slicing – come up with a 2-D view of the data
Drill-down – going from summary to more detailed views

Figure 11-22: Slicing a data cube

Example of drill-down

Summary report

Drill-down with color added

Summary

We need fast answers to analytical questions

Relational model may not be the answer

We can restructure data specifically for analysis

First we need to find out how people analyse data

Summary

Analysing how people analyse data reveals the importance of measures and dimensions

So we structure the data with that in mind

The star schema is the physical structure that emerges

We can implement this as ROLAP, MOLAP and HOLAP

We achieve our objective – rapid analytical processing
