DBM630: Data Mining and Data Warehousing MS.IT. Rangsit University 1 Semester 2/2011 Lecture 2&3 Data Warehouse and OLAP Technology by Kritsada Sriphaew (sriphaew.k AT gmail.com)


Page 1: Dbm630_Lecture02-03

DBM630: Data Mining and

Data Warehousing

MS.IT. Rangsit University

1

Semester 2/2011

Lecture 2&3

Data Warehouse and OLAP Technology

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630_Lecture02-03

Outline

2

Part I: Basic Knowledge about Data Warehousing

Dimensional Modeling

Data Cube

Architecture

Part II: OLAP and Cube Computations

OLAP, Data Cube and Data Analysis

Cube Computations

Demo PivotTable (bring laptop with MS Excel, if you have)

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 3: Dbm630_Lecture02-03

What is Data Warehouse?

3

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization’s operational DB

Support information processing by providing a solid platform of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (definition by W. H. Inmon)

Data warehousing:

Process of constructing and using data warehouses


Page 4: Dbm630_Lecture02-03

Four Properties of Data Warehouses


subject-oriented: data is stored by subject

integrated: data is consolidated into a single, consistent format

time-variant: data is kept along the time dimension

non-volatile: data is stable and is not lost

Page 5: Dbm630_Lecture02-03

Subject-Oriented

5

Subject-Oriented Property Organized around major subjects, such as customer, product, sales.

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

OPERATIONAL DB                 DATA WAREHOUSE
(an application orientation)   (a subject orientation)
• Loans                        • Customer
• Savings                      • Vendor
• Bank card                    • Product
• Trust                        • Activity

Page 6: Dbm630_Lecture02-03

Integrated

6

Integrated Property Constructed by integrating multiple, heterogeneous data sources

Relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources e.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.


Page 7: Dbm630_Lecture02-03

Time Variant/Non-Volatile

7

Time Variant Property

The time horizon for the data warehouse is significantly longer than that of operational systems.

Operational database: current value data.

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).

Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a “time element”.

Non-Volatile Property

A physically separated store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.

Does not require transaction processing, recovery, and concurrency control mechanisms.

Requires only two operations in data accessing: initial loading of data and access of data.

Page 8: Dbm630_Lecture02-03

Operational DBMS vs. Data Warehouse (OLTP vs. OLAP)

8

OLTP (on-line transaction processing) Major task of traditional relational DBMS

Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing) Major task of data warehouse system

Data analysis and decision making

Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Design: ER + application vs. star + subject

View: current, local vs. evolutionary, integrated

Access patterns: create/select/insert/delete/update vs. read-only but complex queries


Page 9: Dbm630_Lecture02-03

OLTP vs. OLAP

9

                     OLTP                                     OLAP
users                clerk, IT professional                   knowledge worker
function             day to day operations                    decision support
DB design            application-oriented                     subject-oriented
data                 current, up-to-date, detailed,           historical, summarized, multidimensional,
                     flat relational, isolated                integrated, consolidated
usage                repetitive                               ad-hoc
access               read/write, index/hash on primary key    lots of scans
unit of work         short, simple transaction                complex query
# records accessed   tens                                     millions
# users              thousands                                tens
DB size              100MB-GB                                 100GB-TB
metric               transaction throughput                   query throughput, response

Page 10: Dbm630_Lecture02-03

Heterogeneous DBMS vs. Data Warehouse

10

Traditional heterogeneous DB integration Build wrappers/mediators on top of heterogeneous DBs Query-driven approach

When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set

Complex information filtering, compete for resources

Data warehouse (DW) Another database for high-performance analysis Update-driven approach

Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query/analysis


Page 11: Dbm630_Lecture02-03

Heterogeneous DBMS vs. Data Warehouse

11

[Figure: Query-driven approach: a query passes through wrappers/mediators to the heterogeneous operational (OLTP) databases for Sales, Purchasing, and Production. Update-driven approach: the heterogeneous operational (OLTP) databases for Sales, Purchasing, and Production feed a data warehouse, which is used for OLAP.]

Page 12: Dbm630_Lecture02-03

Why Separate Data Warehouse?

12

High performance for both systems:

DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)

Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)

Different functions and different data:

Missing data: decision support (OLAP) requires historical data which operational DBs (OLTP) do not typically maintain

Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources

Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Page 13: Dbm630_Lecture02-03


From Tables and Spreadsheets to Data Cubes

13

A data warehouse is based on a multidimensional data model which views data in the form of a data cube.

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.

Fact table: contains measures (such as units_sold, dollars_sold) and keys to each of the related dimension tables.

Dimension tables: such as location (branch, country, continent), or time (day, week, month, quarter, year).

Dimension Table

Branch    Country    Continent
London    UK         Europe
Glasgow   UK         Europe
Berlin    Germany    Europe
Bangkok   Thailand   Asia
Phuket    Thailand   Asia
Tokyo     Japan      Asia

Fact Table

Date        Branch    Item     Buyer           Units sold   Dollars sold
1/1/2008    London    VCD      First Company   20           5000
1/1/2008    Bangkok   TV       First Company   30           9000
10/1/2008   London    Ham      First Company   20           1000
4/2/2008    London    Milk     First Company   80           1600
15/2/2008   Bangkok   VCD      Best Company    30           7500
2/5/2008    Bangkok   Orange   Best Company    20           500

* A multidimensional data model
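The relationship between the fact table and a dimension table can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the rows are copied from the slide, the branch name serves as the key into the location dimension, and the names `LOCATION`, `FACTS`, and `dollars_by_country` are made up for the example.

```python
from collections import defaultdict

# Location dimension table from the slide: branch -> (country, continent).
LOCATION = {
    "London": ("UK", "Europe"),
    "Glasgow": ("UK", "Europe"),
    "Berlin": ("Germany", "Europe"),
    "Bangkok": ("Thailand", "Asia"),
    "Phuket": ("Thailand", "Asia"),
    "Tokyo": ("Japan", "Asia"),
}

# Fact table rows from the slide:
# (date, branch, item, buyer, units_sold, dollars_sold).
FACTS = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]


def dollars_by_country(facts, location):
    """Look up each fact row's branch in the dimension table and sum dollars_sold per country."""
    totals = defaultdict(int)
    for _date, branch, _item, _buyer, _units, dollars in facts:
        country, _continent = location[branch]
        totals[country] += dollars
    return dict(totals)


print(dollars_by_country(FACTS, LOCATION))
```

The measure (dollars sold) lives only in the fact table; the dimension table supplies the higher level (country) at which it is summarized.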

Page 14: Dbm630_Lecture02-03

Three Concepts in Data Cubes

14

Three concepts in data cubes are (1) Multidimension, (2) Hierarchy, and (3) Measure.

The example below has 4 dimensions (date, branch, item, buyer) and 2 measures (units sold, dollars sold).

Product Dimension Table (3 levels)

Item     Subcategory    Category
VCD      Electric       Non-Food
TV       Electric       Non-Food
Shirt    Clothes        Non-Food
Ham      Process food   Food
Milk     Fresh food     Food
Orange   Fresh food     Food

Location Dimension Table (3 levels)

Branch    Country    Continent
London    UK         Europe
Glasgow   UK         Europe
Berlin    Germany    Europe
Bangkok   Thailand   Asia
Phuket    Thailand   Asia
Tokyo     Japan      Asia

Customer Dimension Table (2 levels)

Buyer            Buyer Group
First Company    Group 1
Second Company   Group 1
Third Company    Group 1
Best Company     Group 2
Good Company     Group 2

Fact Table

Date        Branch    Item     Buyer           Units sold   Dollars sold
1/1/2008    London    VCD      First Company   20           5000
1/1/2008    Bangkok   TV       First Company   30           9000
10/1/2008   London    Ham      First Company   20           1000
4/2/2008    London    Milk     First Company   80           1600
15/2/2008   Bangkok   VCD      Best Company    30           7500
2/5/2008    Bangkok   Orange   Best Company    20           500

Page 15: Dbm630_Lecture02-03

Cuboids in Data Cubes

15

A cuboid is one group-by of the data, identified by the subset of dimensions it keeps; a k-D cuboid keeps k dimensions.

The top most 0-D cuboid, which holds the highest-level of summarization, is called an apex cuboid.

In data warehousing literature, an n-D base cube where n is the total number of dimensions is called a base cuboid.

The lattice of cuboids forms a data cube.
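As a rough sketch of the lattice, the cuboids of an n-dimensional cube (ignoring hierarchy levels) are simply all subsets of the dimension set, so there are 2^n of them. The function name `cuboid_lattice` is an assumption made for this example.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size: 0-D (apex) up to n-D (base)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "customer"])
for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")
```

For four dimensions this gives 1, 4, 6, 4, and 1 cuboids at the 0-D to 4-D levels, 16 in total.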


Page 16: Dbm630_Lecture02-03

An Example of Cuboids (Dimension)

(add vs. delete dimensions)

16

Moving down the list adds a dimension; moving up deletes a dimension.

0-D (apex) cuboid: Sales(all)

Sales
24000

1-D cuboid: Sales(Q1), Sales(Q2), Sales(Q3), Sales(Q4)

Time   Sales
Q1     5000
Q2     4000
Q3     8000
Q4     7000

2-D cuboid: Sales(Q1, Thailand), Sales(Q1, Japan), ..., Sales(Q4, Thailand), Sales(Q4, Japan)

Time   Thailand   Japan
Q1     2000       3000
Q2     1500       2500
Q3     2000       6000
Q4     3000       4000

3-D cuboid: Sales(Q1, Thailand, Food), Sales(Q1, Thailand, NonFood), ..., Sales(Q4, Japan, Food), Sales(Q4, Japan, NonFood)

Time   Thailand            Japan
       Food     NonFood    Food     NonFood
Q1     1500     500        2000     1000
Q2     900      600        1500     1000
Q3     1200     800        4000     2000
Q4     2000     1000       2500     1500
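Deleting a dimension amounts to summing over it. The sketch below starts from the 3-D cuboid of this example (quarter, country, category) and derives the 2-D, 1-D, and apex cuboids; the helper name `delete_dimension` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

QUARTERS = ["Q1", "Q2", "Q3", "Q4"]
COLUMNS = [("Thailand", "Food"), ("Thailand", "NonFood"),
           ("Japan", "Food"), ("Japan", "NonFood")]
ROWS = [
    [1500, 500, 2000, 1000],
    [900, 600, 1500, 1000],
    [1200, 800, 4000, 2000],
    [2000, 1000, 2500, 1500],
]

# 3-D cuboid: (quarter, country, category) -> sales.
CUBOID_3D = {
    (quarter, country, category): value
    for quarter, row in zip(QUARTERS, ROWS)
    for (country, category), value in zip(COLUMNS, row)
}


def delete_dimension(cuboid, axis):
    """Sum out one axis, producing the cuboid one level up the lattice."""
    out = defaultdict(int)
    for key, value in cuboid.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


cuboid_2d = delete_dimension(CUBOID_3D, 2)  # keys: (quarter, country)
cuboid_1d = delete_dimension(cuboid_2d, 1)  # keys: (quarter,)
apex = delete_dimension(cuboid_1d, 0)       # single key: ()

print(cuboid_2d[("Q3", "Japan")], cuboid_1d[("Q4",)], apex[()])
```

Because sum() is distributive, each higher cuboid can be computed from the one just below it rather than from the base data.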


Page 17: Dbm630_Lecture02-03

Data Cube: A Lattice of Cuboids (Dimension)

17

0-D (apex) cuboid: all

1-D cuboids: time; item; location; customer

2-D cuboids: (time, item); (time, location); (time, customer); (item, location); (item, customer); (location, customer)

3-D cuboids: (time, item, location); (time, item, customer); (time, location, customer); (item, location, customer)

4-D (base) cuboid: (time, item, location, customer)

Page 18: Dbm630_Lecture02-03

Designing Data Warehouses


Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy

Physical Modeling (Data sources, Data storage, OLAP engine)

Page 19: Dbm630_Lecture02-03

Conceptual Modeling of Data Warehouses

19

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a set of dimension tables

Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
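A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is a cut-down illustration, not the full schema of the example slides: it keeps only two dimension tables (`item`, `branch`) and a `sales` fact table, and the inserted rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item   (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
    CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE sales  (                      -- fact table in the middle of the star
        item_key     INTEGER REFERENCES item(item_key),
        branch_key   INTEGER REFERENCES branch(branch_key),
        units_sold   INTEGER,
        dollars_sold INTEGER
    );
    -- Illustrative rows only.
    INSERT INTO item   VALUES (1, 'TV', 'Non-Food'), (2, 'Milk', 'Food');
    INSERT INTO branch VALUES (1, 'London'), (2, 'Bangkok');
    INSERT INTO sales  VALUES (1, 2, 30, 9000), (2, 1, 80, 1600), (1, 1, 5, 1500);
""")

# A typical star join: the fact table joined to each dimension, then grouped.
rows = conn.execute("""
    SELECT b.branch_name, i.type, SUM(s.dollars_sold)
    FROM sales s
    JOIN item i   ON s.item_key = i.item_key
    JOIN branch b ON s.branch_key = b.branch_key
    GROUP BY b.branch_name, i.type
    ORDER BY b.branch_name, i.type
""").fetchall()
print(rows)
```

In a snowflake schema the same query would need one more join per normalized level (for example, item to supplier), which is the usual trade-off against the smaller dimension tables.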


Page 20: Dbm630_Lecture02-03

Example of Star Schema

20

Sales Fact Table (center of the star)

  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables

  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country

Page 21: Dbm630_Lecture02-03

Example of Snowflake Schema

21

Sales Fact Table

  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables

  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key

Normalized (snowflaked) tables

  supplier: supplier_key, supplier_type (referenced by item)
  city:     city_key, city, province_or_street, country (referenced by location)

Page 22: Dbm630_Lecture02-03

Example of Fact Constellation

22

Sales Fact Table

  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table

  Keys:     time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped

Dimension tables

  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type

The time, item, and location dimension tables are shared by both fact tables.

Page 23: Dbm630_Lecture02-03

Measures: Three Categories

23

Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.

e.g., count(), sum(), min(), max().

Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

e.g., avg(), standard_deviation().

Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.

e.g., median(), mode(), rank().
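The three categories can be checked on a small partitioned data set. In the sketch below, the grouping of values by country is an assumption chosen to agree with the sums, counts, and medians shown on the concept hierarchy slides; the variable names are made up for the example.

```python
from statistics import median

# City-level values grouped by country (assumed grouping, see lead-in).
PARTITIONS = {
    "Germany": [400, 300, 200],
    "Spain": [100, 400],
    "Canada": [400, 200],
    "Mexico": [100],
}
all_values = [v for values in PARTITIONS.values() for v in values]

# Distributive: the sum of the per-partition sums equals the sum over all data.
partial_sums = {c: sum(v) for c, v in PARTITIONS.items()}
total = sum(partial_sums.values())

# Algebraic: avg is derived from two distributive aggregates, sum and count.
partial_counts = {c: len(v) for c, v in PARTITIONS.items()}
average = total / sum(partial_counts.values())

# Holistic: combining per-partition medians does not give the true median in general.
median_of_medians = median(median(v) for v in PARTITIONS.values())
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the median of the four country medians differs from the median of all eight values, which is why a holistic measure cannot be rolled up from fixed-size sub-aggregates.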


Page 24: Dbm630_Lecture02-03

A Concept Hierarchy: Dimension (location)

24

Levels (top to bottom): all > region > country > city > office

all: all

region: Europe, ..., North_America

country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)

city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)

office: L. Chan, ..., M. Wind (Vancouver)

Page 25: Dbm630_Lecture02-03

A Concept Hierarchy: Dimension

(Distributive - count(), sum(), min(), max())

25

[Figure: the location hierarchy (all > region > country > city) annotated with sum() at each level.]

Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain)

Eight city-level values: 400, 200, 400, 300, 100, 200, 400, 100

Country level: Canada sum = 600; Mexico sum = 100; Germany sum = 900; Spain sum = 500

Region level: North_America sum = 700; Europe sum = 1400

all: sum = 2100

Page 26: Dbm630_Lecture02-03

A Concept Hierarchy: Dimension

(Algebraic - avg(), standard_deviation())

26

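[Figure: the location hierarchy (all > region > country > city) annotated with avg() and count() at each level.]

Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain)

Eight city-level values: 400, 200, 400, 300, 100, 200, 400, 100

Country level: Canada avg = 300 (count = 2); Mexico avg = 100 (count = 1); Germany avg = 300 (count = 3); Spain avg = 250 (count = 2)

Region level: North_America avg = 233.33 (count = 3); Europe avg = 280 (count = 5)

all: avg = 262.5 (count = 8)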

Page 27: Dbm630_Lecture02-03

A Concept Hierarchy: Dimension

(Holistic - median(), mode(), rank())

27

[Figure: the location hierarchy (all > region > country > city) annotated with median() at each level.]

Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain)

Eight city-level values: 400, 200, 400, 300, 100, 200, 400, 100

Country level: Canada median = 300; Mexico median = 100; Germany median = 300; Spain median = 250

Region level: North_America median = 200; Europe median = 300

all: median = 250

Page 28: Dbm630_Lecture02-03

Multidimensional Data

28

Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time

Hierarchical summarization paths:

Product: Industry > Category > Product

Location: Region > Country > City > Office

Time: Year > Quarter > Month > Day (Week is an alternative level above Day)

Page 29: Dbm630_Lecture02-03

A Sample Data Cube

29

[Figure: a sample 3-D data cube with dimensions Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]

Page 30: Dbm630_Lecture02-03

Browsing a Data Cube

30

Visualization OLAP capabilities Interactive manipulation


Page 31: Dbm630_Lecture02-03

Typical OLAP Operations

31

Drill up (roll up): summarize data by climbing up a hierarchy or by dimension reduction.

Drill down (roll down): reverse of drill up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.

Slice and dice: project and select. Slice selects on 1 dimension; dice selects on 2 or more dimensions.

Pivot (rotate): reorient the cube for visualization, e.g., 3D to a series of 2D planes.

Other operations:

Drill across: involves (across) more than one fact table.

Drill through: goes through the bottom level of the cube to its back-end relational tables (using SQL).
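The basic operations can be mimicked on a tiny in-memory cube. This is a minimal sketch, assuming the cube is a dictionary keyed by (quarter, country, category); the function names are hypothetical, and the cell values are the Q1 and Q2 figures from the quarterly sales example in these slides.

```python
from collections import defaultdict

# (quarter, country, category) -> sales
CUBE = {
    ("Q1", "Thailand", "Food"): 1500, ("Q1", "Thailand", "NonFood"): 500,
    ("Q1", "Japan", "Food"): 2000, ("Q1", "Japan", "NonFood"): 1000,
    ("Q2", "Thailand", "Food"): 900, ("Q2", "Thailand", "NonFood"): 600,
    ("Q2", "Japan", "Food"): 1500, ("Q2", "Japan", "NonFood"): 1000,
}


def roll_up(cube, axis):
    """Drill up by dimension reduction: sum out one axis."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_(cube, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}


def dice(cube, conditions):
    """Dice: select on two or more dimensions (axis -> set of allowed values)."""
    return {k: v for k, v in cube.items()
            if all(k[a] in allowed for a, allowed in conditions.items())}


def pivot(cube, order):
    """Pivot: reorder the axes; cell values are unchanged."""
    return {tuple(k[a] for a in order): v for k, v in cube.items()}


print(roll_up(CUBE, 2))
print(slice_(CUBE, 1, "Japan"))
print(dice(CUBE, {0: {"Q1"}, 2: {"Food"}}))
print(pivot(CUBE, (2, 1, 0)))
```

Drill down has no counterpart here because it needs finer-grained data (for example, monthly cells) than the cube holds.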


Page 32: Dbm630_Lecture02-03

A Star-Net Query Model

32

[Figure: star-net query model. Each radial line is a dimension and each circle on a line is called a footprint, marking one abstraction level; every line ends at All. Dimensions and levels shown: Time (Daily, Quarterly, Annually); Location (City, Country, Region); Organization (Salesperson, District, Division); Product (Item, Product line, Product group); Customer Orders (Order, Contracts); Shipping Method (Shipping type, Shipping type group); Promotion (Promotion type); Customer Sex (Male/Female).]


Page 34: Dbm630_Lecture02-03

OLAP Operations (Drill down vs. Drill up)

34

Dimension hierarchies (lowest to highest level):

Location: branch < country < continent < all

Time: day < month < quarter < year < all

Product: item < subcategory < category < all

Customer: buyer < buyer group < all

Sales in 2008 by quarter (all customers):

Time         Thailand            Japan               Total
             Food     NonFood    Food     NonFood
Q1           1500     500        2000     1000       5000
Q2           900      600        1500     1000       4000
Q3           1200     800        4000     2000       8000
Q4           2000     1000       2500     1500       7000
Total 2008   5600     2900       10000    5500       24000

Drill down (roll down) on Time, from quarter to month, for Q1 (all customers):

Time         Thailand            Japan               Total
             Food     NonFood    Food     NonFood
January      700      200        800      500        2200
February     500      100        700      200        1500
March        300      200        500      300        1300
Total Q1     1500     500        2000     1000       5000

Drill up (roll up) on Time, from quarter to year (all customers):

Time         Thailand            Japan               Total
             Food     NonFood    Food     NonFood
2006         2400     2200       11000    5000       20600
2007         4500     3200       12000    6000       25700
2008         5600     2900       10000    5500       24000
Total        12500    8300       33000    16500      70300

Page 35: Dbm630_Lecture02-03

OLAP Operations (Slice and Dice)

35

Dimension hierarchies (lowest to highest level):

Location: branch < country < continent < all

Time: day < month < quarter < year < all

Product: item < subcategory < category < all

Customer: buyer < buyer group < all

Starting table (all customers):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    2400     2200       11000    5000       20600
2007    4500     3200       12000    6000       25700
2008    5600     2900       10000    5500       24000
Total   12500    8300       33000    16500      70300

Slice (Customer = Buyer Group 1):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    1100     1500       7000     2000       11600
2007    1200     1200       8000     3000       13400
2008    3000     1200       5000     1800       11000
Total   5300     3900       20000    6800       36000

Dice (Customer = Buyer Group 1 and Time = January):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    200      200        800      300        1500
2007    100      100        900      400        1500
2008    300      200        400      100        1000
Total   600      500        2100     800        4000

Page 36: Dbm630_Lecture02-03

OLAP Operations (Pivot)

36

Dimension hierarchies (lowest to highest level):

Location: branch < country < continent < all

Time: day < month < quarter < year < all

Product: item < subcategory < category < all

Customer: buyer < buyer group < all

Before pivot (all customers):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    2400     2200       11000    5000       20600
2007    4500     3200       12000    6000       25700
2008    5600     2900       10000    5500       24000
Total   12500    8300       33000    16500      70300

After pivot (all customers), with Product on the rows and Time on the columns:

Product   Thailand                Japan                    Total
          2006    2007    2008    2006     2007     2008
Food      2400    4500    5600    11000    12000    10000    45500
NonFood   2200    3200    2900    5000     6000     5500     24800
Total     4600    7700    8500    16000    18000    15500    70300

Page 37: Dbm630_Lecture02-03

OLAP Operations (Drill Across)

37

Drill across moves between cubes built on different fact tables that share the Time, Product, and Location dimensions.

Sales (all customers):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    2400     2200       11000    5000       20600
2007    4500     3200       12000    6000       25700
2008    5600     2900       10000    5500       24000
Total   12500    8300       33000    16500      70300

Purchase (all customers):

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    1500     1500       6000     3000       12000
2007    3500     2500       8000     4000       18000
2008    4000     1500       5000     3000       13500
Total   9000     5500       19000    10000      43500

Page 38: Dbm630_Lecture02-03

OLAP Operations (Drill Through)

38

Sales cube (all customers), with dimensions Time, Product, and Location:

Time    Thailand            Japan               Total
        Food     NonFood    Food     NonFood
2006    2400     2200       11000    5000       20600
2007    4500     3200       12000    6000       25700
2008    5600     2900       10000    5500       24000
Total   12500    8300       33000    16500      70300

Drill through to the back-end relational table:

Date        Branch    Item       Buyer           Units sold   Dollars sold
1/1/2008    Phuket    VCD        First Company   10           250
1/1/2008    Bangkok   TV         First Company   30           900
10/1/2008   Phuket    TV         First Company   10           300
4/2/2008    Phuket    Stereo     First Company   40           200
15/2/2008   Bangkok   VCD        Best Company    30           750
2/5/2008    Bangkok   Computer   Best Company    20           600

In order to see the detail at the relational table, we can perform ‘drill through’.


Page 39: Dbm630_Lecture02-03

An Example of OLAP tools


Page 40: Dbm630_Lecture02-03

An Example of OLAP tools

40

Dimension

Measures Time (year level)

Organization (continent level)

2 Dimensions: (1) Time (2) Organization


Page 41: Dbm630_Lecture02-03

An Example of OLAP tools


Page 42: Dbm630_Lecture02-03

An Example of OLAP tools


Page 43: Dbm630_Lecture02-03

Designing Data Warehouses


Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy

Physical Modeling (Data sources, Data storage, OLAP engine)

Page 44: Dbm630_Lecture02-03

Multi-Tiered Architecture

44

[Figure: multi-tiered data warehouse architecture. Bottom tier: Data Sources (operational DBs, other sources) are extracted, transformed, and loaded, with a monitor & integrator and a metadata repository, into Data Storage (data warehouse, data marts). Middle tier: the data storage serves the OLAP server (OLAP engine). Top tier: Front-End Tools (analysis, query, reports, data mining).]

Page 45: Dbm630_Lecture02-03

Three Physical Data Warehouse Models

45

Enterprise Warehouse

Collects all of the information about subjects spanning the entire organization.

Data Mart

A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

Independent vs. dependent (directly from warehouse) data mart.

Virtual Warehouse

A set of views over operational databases.

Only some of the possible summary views may be materialized.

Easy to build but requires excess capacity on the operational DB server.

Page 46: Dbm630_Lecture02-03

OLAP Server Architectures

46

Relational OLAP (ROLAP)

Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.

Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools/services.

Greater scalability; we may keep summaries in the relational DB.

Multidimensional OLAP (MOLAP)

Array-based multidimensional storage engine (sparse matrix techniques).

Fast indexing to pre-computed summarized data.

Hybrid OLAP (HOLAP)

User flexibility, e.g., low level: relational, high level: array.

Specialized SQL servers

Specialized support for SQL queries over star/snowflake schemas in a read-only environment.

Page 47: Dbm630_Lecture02-03

Efficient Data Cube Computation

47

Data cube can be viewed as a lattice of cuboids.

The bottom-most cuboid is the base cuboid.

The top-most cuboid (apex) contains only one cell.

How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?

$$T = \prod_{i=1}^{n} (L_i + 1)$$

The extra 1 for each dimension accounts for the top level, all.

Materialization of data cube:

Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization).

Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
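The cuboid count, T = (L_1 + 1)(L_2 + 1)...(L_n + 1), where L_i is the number of levels of dimension i excluding the top level all, is easy to evaluate. The sketch below uses a hypothetical helper, `total_cuboids`, and takes its level counts from the hierarchies used in the OLAP operation examples of these slides.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """T = product of (L_i + 1); the +1 stands for the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# time (day, month, quarter, year), location (branch, country, continent),
# product (item, subcategory, category), customer (buyer, buyer group).
with_hierarchies = total_cuboids([4, 3, 3, 2])

# One level per dimension reduces the formula to 2**n.
without_hierarchies = total_cuboids([1, 1, 1, 1])

print(with_hierarchies, without_hierarchies)
```

With these hierarchies the four-dimensional cube has 5 x 4 x 4 x 3 = 240 cuboids instead of 16, which is why full materialization quickly becomes expensive.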


Page 48: Dbm630_Lecture02-03

Cube: A Lattice of Cuboids (Dimension)

48

0-D (apex) cuboid: all

1-D cuboids: time; item; location; customer

2-D cuboids: (time, item); (time, location); (time, customer); (item, location); (item, customer); (location, customer)

3-D cuboids: (time, item, location); (time, item, customer); (time, location, customer); (item, location, customer)

4-D (base) cuboid: (time, item, location, customer)

Page 49: Dbm630_Lecture02-03

Multi-way Array Aggregation for Cube Computation

49

Partition arrays into chunks (a small subcube which fits in memory).

Compressed sparse array addressing: (chunk_id, offset)

Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.

The size of the dimensions A, B, and C is 6000, 1000, and 100000. What is the best traversing order to do multi-way aggregation?

[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]


Page 52: Dbm630_Lecture02-03

Multi-Way Array Aggregation for Cube Computation (Cont.)

52

Method: the planes should be sorted and computed according to their size in ascending order.

Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane

Limitation of the method: computing well only for a small number of dimensions

If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored. Iceberg cubes store only cube partitions where the aggregate value is above some min. support.
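The iceberg condition can be illustrated with a deliberately naive sketch: count every group-by cell, then keep only the cells whose count reaches the minimum support. This is not the bottom-up computation method itself, which prunes cells during computation instead of filtering afterwards. The rows are the branch, item, and buyer columns of the example fact table in these slides; `iceberg_cube` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

DIMS = ("branch", "item", "buyer")
ROWS = [
    ("London", "VCD", "First Company"),
    ("Bangkok", "TV", "First Company"),
    ("London", "Ham", "First Company"),
    ("London", "Milk", "First Company"),
    ("Bangkok", "VCD", "Best Company"),
    ("Bangkok", "Orange", "Best Company"),
]


def iceberg_cube(rows, dims, min_support):
    """Count all group-by cells, then keep those with count >= min_support."""
    cells = Counter()
    for row in rows:
        for k in range(len(dims) + 1):
            for axes in combinations(range(len(dims)), k):
                # A cell is identified by the (dimension, value) pairs it fixes.
                cell = tuple((dims[a], row[a]) for a in axes)
                cells[cell] += 1
    return {cell: n for cell, n in cells.items() if n >= min_support}


result = iceberg_cube(ROWS, DIMS, min_support=3)
for cell, count in result.items():
    print(cell, count)
```

With a minimum support of 3, only five cells survive, including the apex cell `()`, so the stored cube is far smaller than the full one.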


Page 53: Dbm630_Lecture02-03

Multi-Way Array Aggregation (An Example)

53

Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking, respectively.

If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.

The number of cells in the computed cube is $6 \times 10^3 \times 10^3 \times 5 \times 10^4 = 3 \times 10^{11}$ cells.

The total size of the computed cube is $4 \times 3 \times 10^{11} = 1.2 \times 10^{12}$ bytes.

State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.


Page 54: Dbm630_Lecture02-03

Multi-Way Array Aggregation for Cube Computation (Cont.)

54

[Figure: the chunked cube with axes A, B, and C.]

One chunk includes:

A: 6000/6 = 1000

B: 1000/5 = 200

C: 50000/1000 = 50

The space we need for computing the cube is:

(1) All elements of plane AB: 6000 x 1000 cells = 6,000,000 cells = 24,000,000 bytes

(2) One row of plane BC: 1000 x 50 cells = 50,000 cells = 200,000 bytes

(3) One chunk of plane AC: 1000 x 50 cells = 50,000 cells = 200,000 bytes

(4) One cube (chunk) of ABC: 1000 x 200 x 50 cells = 10,000,000 cells = 40,000,000 bytes

Total memory for keeping the result = 64,400,000 bytes = $6.44 \times 10^7$ bytes.
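The arithmetic of this example can be reproduced with a short script. It is only a sketch of the calculation above (whole AB plane, one row of BC, one chunk of AC, plus the current ABC chunk); the constant and function names are assumptions.

```python
CELL_BYTES = 4
SIZE = {"A": 6000, "B": 1000, "C": 50000}
PARTS = {"A": 6, "B": 5, "C": 1000}
CHUNK = {d: SIZE[d] // PARTS[d] for d in SIZE}  # A: 1000, B: 200, C: 50


def memory_cells():
    """Cells held in memory while scanning one chunk at a time."""
    whole_ab = SIZE["A"] * SIZE["B"]                  # smallest plane, kept whole
    row_bc = SIZE["B"] * CHUNK["C"]                   # one row of the BC plane
    chunk_ac = CHUNK["A"] * CHUNK["C"]                # one chunk of the largest plane
    chunk_abc = CHUNK["A"] * CHUNK["B"] * CHUNK["C"]  # the chunk being read
    return whole_ab + row_bc + chunk_ac + chunk_abc


total_bytes = memory_cells() * CELL_BYTES
base_cuboid_bytes = SIZE["A"] * SIZE["B"] * SIZE["C"] * CELL_BYTES

print(total_bytes, base_cuboid_bytes)
```

The working memory (about 6.44 x 10^7 bytes) is several orders of magnitude smaller than the dense base cuboid (1.2 x 10^12 bytes), which is the point of chunked multi-way aggregation.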


Page 55: Dbm630_Lecture02-03

Efficient Processing OLAP Queries

55

Determine which operations should be performed on the available cuboids:

transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection

Determine to which materialized cuboid(s) the relevant operations should be applied.

Exploring indexing structures and compressed vs. dense array structures in MOLAP


Page 56: Dbm630_Lecture02-03

Metadata Repository

56

Meta data is the data defining warehouse objects. It has the following kinds

The algorithms used for summarization

The mapping from operational environment to the data warehouse

Data related to system performance

warehouse schema, view and derived data definitions

Business data

business terms and definitions, ownership of data, charging policies


Page 57: Dbm630_Lecture02-03

Data Warehouse Back-End Tools and Utilities

57

Extraction: get data from multiple, heterogeneous, and external sources

Transformation: convert data from legacy or host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Cleaning: detect errors in the data and rectify them when possible

Refresh: propagate the updates from the data sources to the warehouse


Page 58: Dbm630_Lecture02-03

From OLAP to On Line Analytical Mining (OLAM)

58

Why online analytical mining?

High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data

Available information processing structure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools

OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks

An OLAM Architecture (Online Analytical Mining)

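[Figure: OLAM architecture in four layers. Layer 1, Data Repository: databases and the data warehouse, fed through data cleaning, data integration and filtering. Layer 2, MDDB: the multidimensional database with its metadata, reached through the Database API (filtering and integration). Layer 3, OLAP/OLAM: the OLAM engine and the OLAP engine, reached through the Data Cube API. Layer 4, User Interface: the User GUI API, which accepts mining queries and returns mining results.]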

Major Applications of Data Warehouses

Applications

Online Analytical Processing (OLAP)

Data Mining

Customer Relationship Management (CRM)

Some commercial tools for Data Warehouse

Microsoft Excel’s PivotTable

Business Objects’ Crystal Reports

IBM Cognos

Microsoft SQL Server Analysis Services

Oracle Express

Microsoft ProClarity

Etc.

Exercise 1

Knowledge Management and Discovery © Kritsada Sriphaew

Given the following fact table:

Date        Branch   Item    Buyer           Units sold  Dollars sold
1/01/2008   London   Chair   First Company   20          5000
1/01/2008   Bangkok  TV      First Company   30          9000
10/01/2008  London   Ham     Second Company  20          1000
4/02/2008   London   Milk    First Company   80          1600
15/02/2008  Bangkok  VCD     Best Company    30          7500
2/05/2008   Berlin   Orange  Best Company    20          500
12/06/2008  London   VCD     Good Company    20          5000
14/06/2008  Bangkok  TV      First Company   10          12000
16/07/2008  Phuket   VCD     Good Company    20          1000
24/07/2008  London   Orange  First Company   30          6000
2/08/2008   Bangkok  Table   Best Company    30          7500
12/08/2008  Bangkok  Ham     Best Company    20          500
12/10/2008  London   Table   Good Company    20          5000
14/11/2008  Bangkok  TV      First Company   8           2000
16/11/2008  Phuket   Ham     Good Company    5           2000
24/12/2008  London   Milk    First Company   30          6000

Given the following dimensions of the data, with levels listed from the most detailed to the most general:

Time: day, month, quarter, year, all

Product: item, subcategory, category, all

Customer: buyer, buyer group, all

Location: branch, country, continent, all

Question 1: Write concept hierarchies of time, location, product and customer.

Buyer groups are:

Group1: Quality Group; Good Company, Best Company

Group2: Number Group; First Company, Second Company

Product categories are:

Cat1: Food; subcategories are drink and non-drink

Cat2: NonFood; subcategories are electronic and non-electronic
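To illustrate how a concept hierarchy is used, the sketch below rolls the Customer dimension up from buyer to buyer group and sums dollars sold. The (buyer, dollars) pairs are copied from the fact table of Exercise 1 and the group mapping from the groups listed above; the function name roll_up is hypothetical, and the sketch shows only the mechanics of a roll-up on one dimension, not a full answer to the questions.

```python
from collections import defaultdict

# (buyer, dollars sold) pairs copied from the Exercise 1 fact table, in row order.
sales = [
    ("First Company", 5000), ("First Company", 9000), ("Second Company", 1000),
    ("First Company", 1600), ("Best Company", 7500), ("Best Company", 500),
    ("Good Company", 5000), ("First Company", 12000), ("Good Company", 1000),
    ("First Company", 6000), ("Best Company", 7500), ("Best Company", 500),
    ("Good Company", 5000), ("First Company", 2000), ("Good Company", 2000),
    ("First Company", 6000),
]

# Concept hierarchy for the Customer dimension: buyer -> buyer group.
buyer_group = {
    "Good Company": "Quality Group",
    "Best Company": "Quality Group",
    "First Company": "Number Group",
    "Second Company": "Number Group",
}


def roll_up(sales, hierarchy):
    """Sum the measure after mapping each member to its parent level."""
    totals = defaultdict(int)
    for member, dollars in sales:
        totals[hierarchy[member]] += dollars
    return dict(totals)


by_group = roll_up(sales, buyer_group)
grand_total = sum(by_group.values())  # rolling up once more, to the "all" level
print(by_group)
print(grand_total)
```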

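Question 2: Write the result table for this star-net query model, using the sum of dollars sold.

[Figure: star-net query model with four dimension lines: Time (day, month, quarter, year, all), Product (item, subcategory, category, all), Customer (buyer, buyer group, all), Location (branch, country, continent, all). The levels selected on the original figure are not reproduced in this text.]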

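Question 3: Write the result table for this star-net query model, using the average of units sold.

[Figure: star-net query model with four dimension lines: Time (day, month, quarter, year, all), Product (item, subcategory, category, all), Customer (buyer, buyer group, all), Location (branch, country, continent, all). The levels selected on the original figure are not reproduced in this text.]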

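Question 4: Write the result table for this star-net query model, using the sum of dollars sold. Slice on the Bangkok location only.

[Figure: star-net query model with four dimension lines: Time (day, month, quarter, year, all), Product (item, subcategory, category, all), Customer (buyer, buyer group, all), Location (branch, country, continent, all). The levels selected on the original figure are not reproduced in this text.]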

Exercise 2

(Data Cube Computation) Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 20,000, |B| = 8,000, |C| = 1,000. Also suppose that the dimensions A, B and C are partitioned into 10, 8 and 4 portions for chunking, respectively.

Question 1: If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.

The number of cells in the computed cube?

The total size (bytes) of the computed cube?

Question 2: State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.