DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 2&3
Data Warehouse and OLAP Technology
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Outline
Part I: Basic Knowledge about Data Warehousing
Dimensional Modeling
Data Cube
Architecture
Part II: OLAP and Cube Computations
OLAP, Data Cube and Data Analysis
Cube Computations
Demo: PivotTable (bring a laptop with MS Excel, if you have one)
Data Warehousing and Data Mining by Kritsada Sriphaew
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization’s operational DB
Support information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (definition by W. H. Inmon)
Data warehousing:
Process of constructing and using data warehouses
Four Properties of Data Warehouses
subject-oriented: data is stored by subject
integrated: data is consolidated into a single, uniform format
time-variant: data is kept along the time dimension
non-volatile: data is stable and is not lost
Subject-Oriented
Subject-Oriented Property: organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
OPERATIONAL DB (an application orientation) | DATA WAREHOUSE (a subject orientation)
Loans | Customer
Savings | Vendor
Bank card | Product
Trust | Activity
Integrated
Integrated Property: constructed by integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources, e.g., a hotel price may differ in currency, tax, whether breakfast is covered, etc.
When data is moved to the warehouse, it is converted.
Time Variant/Non-Volatile
Time Variant Property
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a “time element”.
Non-Volatile Property
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and access of data.
Operational DBMS vs. Data Warehouse (OLTP vs. OLAP)
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: create/select/insert/delete/update vs. read-only but complex queries
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write; index/hash on primary key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | tens
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Heterogeneous DBMS vs. Data Warehouse
Traditional heterogeneous DB integration
Build wrappers/mediators on top of heterogeneous DBs
Query-driven approach
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
Requires complex information filtering, and queries compete for resources
Data warehouse (DW)
Another database for high-performance analysis
Update-driven approach
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query/analysis
Heterogeneous DBMS vs. Data Warehouse
[Figure: two approaches side by side. Query-driven approach: a query passes through wrappers/mediators to the heterogeneous operational databases (OLTP: Sales, Purchasing, Production). Update-driven approach: data from the heterogeneous operational databases (OLTP: Sales, Purchasing, Production) is loaded into a data warehouse (OLAP).]
Why Separate Data Warehouse?
High performance for both systems
DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
Different functions and different data:
Missing data: decision support (OLAP) requires historical data which operational DBs (OLTP) do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Fact table: contains measures (such as units_sold, dollars_sold) and keys to each of the related dimension tables.
Dimension tables: such as location (branch, country, continent), or time (day, week, month, quarter, year).
Dimension Table
Branch | Country | Continent
London | UK | Europe
Glasgow | UK | Europe
Berlin | Germany | Europe
Bangkok | Thailand | Asia
Phuket | Thailand | Asia
Tokyo | Japan | Asia
Fact Table
Date | Branch | Item | Buyer | Units sold | Dollars sold
1/1/2008 | London | VCD | First Company | 20 | 5000
1/1/2008 | Bangkok | TV | First Company | 30 | 9000
10/1/2008 | London | Ham | First Company | 20 | 1000
4/2/2008 | London | Milk | First Company | 80 | 1600
15/2/2008 | Bangkok | VCD | Best Company | 30 | 7500
2/5/2008 | Bangkok | Orange | Best Company | 20 | 500
* A multidimensional data model
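The relationship between the fact table and a dimension table can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the rows are the ones shown on the slide, while the names `location`, `facts`, `LEVELS`, and `roll_up` are made up for the example. Each fact row is joined to the location dimension through its branch, and both measures are summed at the chosen hierarchy level.

```python
from collections import defaultdict

# Location dimension table from the slide: branch -> (country, continent).
location = {
    "London": ("UK", "Europe"),
    "Glasgow": ("UK", "Europe"),
    "Berlin": ("Germany", "Europe"),
    "Bangkok": ("Thailand", "Asia"),
    "Phuket": ("Thailand", "Asia"),
    "Tokyo": ("Japan", "Asia"),
}

# Fact table rows: (date, branch, item, buyer, units_sold, dollars_sold).
facts = [
    ("1/1/2008", "London", "VCD", "First Company", 20, 5000),
    ("1/1/2008", "Bangkok", "TV", "First Company", 30, 9000),
    ("10/1/2008", "London", "Ham", "First Company", 20, 1000),
    ("4/2/2008", "London", "Milk", "First Company", 80, 1600),
    ("15/2/2008", "Bangkok", "VCD", "Best Company", 30, 7500),
    ("2/5/2008", "Bangkok", "Orange", "Best Company", 20, 500),
]

# Position of each hierarchy level inside the dimension tuple (hypothetical helper).
LEVELS = {"country": 0, "continent": 1}


def roll_up(level):
    """Sum units_sold and dollars_sold per member of the given location level."""
    totals = defaultdict(lambda: [0, 0])
    for _date, branch, _item, _buyer, units, dollars in facts:
        member = location[branch][LEVELS[level]]
        totals[member][0] += units
        totals[member][1] += dollars
    return {member: tuple(sums) for member, sums in totals.items()}


by_country = roll_up("country")
by_continent = roll_up("continent")
print(by_continent)
```

The fact table stores only the branch key; the country and continent levels come from the dimension table, which is why the same facts can be summarized at any level of the hierarchy.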
Three Concepts in Data Cubes
Three concepts in data cubes are (1) Multidimension, (2) Hierarchy, and (3) Measure.
The example below has 4 dimensions and 2 measures.
Product Dimension Table (3 levels)
Item | Subcategory | Category
VCD | Electric | Non-Food
TV | Electric | Non-Food
Shirt | Clothes | Non-Food
Ham | Processed food | Food
Milk | Fresh food | Food
Orange | Fresh food | Food
Location Dimension Table (3 levels)
Branch | Country | Continent
London | UK | Europe
Glasgow | UK | Europe
Berlin | Germany | Europe
Bangkok | Thailand | Asia
Phuket | Thailand | Asia
Tokyo | Japan | Asia
Customer Dimension Table (2 levels)
Buyer | Buyer Group
First Company | Group 1
Second Company | Group 1
Third Company | Group 1
Best Company | Group 2
Good Company | Group 2
Fact Table
Date | Branch | Item | Buyer | Units sold | Dollars sold
1/1/2008 | London | VCD | First Company | 20 | 5000
1/1/2008 | Bangkok | TV | First Company | 30 | 9000
10/1/2008 | London | Ham | First Company | 20 | 1000
4/2/2008 | London | Milk | First Company | 80 | 1600
15/2/2008 | Bangkok | VCD | Best Company | 30 | 7500
2/5/2008 | Bangkok | Orange | Best Company | 20 | 500
Cuboids in Data Cubes
A cuboid is a group-by over a subset of the dimensions; a k-D cuboid uses k of them.
The top-most 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
In data warehousing literature, the n-D cuboid, where n is the total number of dimensions, is called the base cuboid.
The lattice of cuboids forms a data cube.
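As a small illustration (not from the lecture), the lattice of cuboids can be enumerated as every subset of the dimension list. The function name `cuboid_lattice` is hypothetical, and concept hierarchies within a dimension are ignored here, so n dimensions give 2^n cuboids.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: list of k-D cuboids}, one cuboid per subset of the dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "customer"])
total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("total cuboids:", total)
```

The single 0-D entry (the empty tuple) is the apex cuboid, and the single 4-D entry is the base cuboid.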
An Example of Cuboids (Dimension)
(add vs. delete dimensions)
Moving down the list adds a dimension; moving up deletes a dimension.
0-D (apex) cuboid: Sales(all)
Sales
24000
1-D cuboid: Sales(Q1), Sales(Q2), Sales(Q3), Sales(Q4)
Time | Sales
Q1 | 5000
Q2 | 4000
Q3 | 8000
Q4 | 7000
2-D cuboid: Sales(Q1, Thailand), Sales(Q1, Japan), ..., Sales(Q4, Thailand), Sales(Q4, Japan)
Time | Thailand | Japan
Q1 | 2000 | 3000
Q2 | 1500 | 2500
Q3 | 2000 | 6000
Q4 | 3000 | 4000
3-D cuboid: Sales(Q1, Thailand, Food), Sales(Q1, Thailand, NonFood), ..., Sales(Q4, Japan, Food), Sales(Q4, Japan, NonFood)
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood
Q1 | 1500 | 500 | 2000 | 1000
Q2 | 900 | 600 | 1500 | 1000
Q3 | 1200 | 800 | 4000 | 2000
Q4 | 2000 | 1000 | 2500 | 1500
Data Cube: A Lattice of Cuboids (Dimension)
0-D (apex) cuboid: all
1-D cuboids: (time); (item); (location); (customer)
2-D cuboids: (time, item); (time, location); (time, customer); (item, location); (item, customer); (location, customer)
3-D cuboids: (time, item, location); (time, item, customer); (time, location, customer); (item, location, customer)
4-D (base) cuboid: (time, item, location, customer)
Designing Data Warehouses
Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy
Physical Modeling (Data sources, Data storage, OLAP engine)
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table
Dimension keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
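A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is illustrative only: it keeps two of the four dimensions and a few attributes, the table names `time_dim`, `location_dim`, and `sales_fact` are assumptions, and the rows are made-up sample data rather than lecture data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table in the middle, referencing two dimension tables (reduced schema).
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, city TEXT, country TEXT
);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2008), (2, "Q2", 2008)])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "Bangkok", "Thailand"), (2, "Tokyo", "Japan")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 30, 9000.0), (1, 2, 10, 3000.0),
                  (2, 1, 20, 500.0), (2, 2, 5, 2500.0)])

# A typical star join: fact table joined to each dimension, then grouped.
rows = conn.execute("""
SELECT t.quarter, l.country, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN location_dim AS l ON f.location_key = l.location_key
GROUP BY t.quarter, l.country
ORDER BY t.quarter, l.country
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake variant, `location_dim` would hold a `city_key` pointing to a separate city table, at the cost of one more join per query.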
Example of Snowflake Schema
Sales Fact Table
Dimension keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Example of Fact Constellation
Sales Fact Table
Dimension keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
Dimension keys: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Measures: Three Categories
Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
e.g., count(), sum(), min(), max().
Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
e.g., avg(), standard_deviation().
Holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.
e.g., median(), mode(), rank().
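The three categories can be checked with a short Python sketch. The per-country values below are illustrative assumptions, chosen to be consistent with the totals in the lecture's concept hierarchy slides; the variable names are made up. The point is which aggregates can be rebuilt from sub-aggregates alone.

```python
from statistics import median

# Illustrative city-level values per country (assumed for this example).
partitions = {
    "Canada": [400, 200],
    "Mexico": [100],
    "Germany": [400, 300, 200],
    "Spain": [400, 100],
}
all_values = [v for values in partitions.values() for v in values]

# Distributive: the sum of the partial sums equals the sum over all data.
total = sum(sum(values) for values in partitions.values())

# Algebraic: avg is computed from two distributive sub-aggregates, sum and count.
count = sum(len(values) for values in partitions.values())
avg = total / count

# Holistic: the median of the partial medians is generally NOT the true median.
median_of_medians = median(median(values) for values in partitions.values())
true_median = median(all_values)

print(total, avg, median_of_medians, true_median)
```

Because a holistic measure needs the underlying values rather than a fixed-size summary, it is the most expensive kind to maintain in a data cube.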
A Concept Hierarchy: Dimension (location)
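[Figure: concept hierarchy for the location dimension, shown as a tree.]
Levels, from most general to most specific: all > region > country > city > office
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind (Vancouver)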
A Concept Hierarchy: Dimension
(Distributive - count(), sum(), min(), max())
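[Figure: location hierarchy (all > region > country > city) annotated with sum() at each level. Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain). City-level values shown: 400, 200, 400, 300, 100, 200, 400, 100.]
Level | Member | sum
country | Canada | 600
country | Mexico | 100
country | Germany | 900
country | Spain | 500
region | North_America | 700
region | Europe | 1400
all | all | 2100
Each sum is obtained from the sums one level below, e.g., North_America = 600 + 100 = 700.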
A Concept Hierarchy: Dimension
(Algebraic - avg(), standard_deviation())
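[Figure: location hierarchy (all > region > country > city) annotated with count and avg() at each level. Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain). City-level values shown: 400, 200, 400, 300, 100, 200, 400, 100.]
Level | Member | count | avg
country | Canada | 2 | 300
country | Mexico | 1 | 100
country | Germany | 3 | 300
country | Spain | 2 | 250
region | North_America | 3 | 233.33
region | Europe | 5 | 280
all | all | 8 | 262.5
Each avg is derived from two distributive measures, sum and count, e.g., Europe = 1400 / 5 = 280.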
A Concept Hierarchy: Dimension
(Holistic - median(), mode(), rank())
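[Figure: location hierarchy (all > region > country > city) annotated with median() at each level. Cities: Vancouver, Toronto (Canada); Mexico (Mexico); Frankfurt, Berlin, Aachen (Germany); Segovia, Madrid (Spain). City-level values shown: 400, 200, 400, 300, 100, 200, 400, 100.]
Level | Member | median
country | Canada | 300
country | Mexico | 100
country | Germany | 300
country | Spain | 250
region | North_America | 200
region | Europe | 300
all | all | 250
A median at a higher level cannot in general be derived from the medians below it; it has to be computed from the underlying values.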
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
A Sample Data Cube
[Figure: a 3-D data cube with dimensions Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
Typical OLAP Operations
Drill up (roll up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of drill up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions
Slice and dice: project and select
Slice: select on one dimension
Dice: select on two or more dimensions
Pivot (rotate): reorient the cube for visualization, e.g., turn a 3-D cube into a series of 2-D planes
Other operations
Drill across: involving (across) more than one fact table
Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
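The operations above can be sketched on a tiny in-memory cube. This is an illustration under simple assumptions, not how an OLAP server is implemented: the cell values are the 3-D cuboid figures (quarter, country, category) from the cuboid example earlier in the lecture, and the helper names `roll_up` and `dice` are hypothetical.

```python
from collections import defaultdict

DIMS = ("time", "country", "category")

# Cells of the 3-D cuboid: (time, country, category) -> sales.
cube = {
    ("Q1", "Thailand", "Food"): 1500, ("Q1", "Thailand", "NonFood"): 500,
    ("Q1", "Japan", "Food"): 2000, ("Q1", "Japan", "NonFood"): 1000,
    ("Q2", "Thailand", "Food"): 900, ("Q2", "Thailand", "NonFood"): 600,
    ("Q2", "Japan", "Food"): 1500, ("Q2", "Japan", "NonFood"): 1000,
    ("Q3", "Thailand", "Food"): 1200, ("Q3", "Thailand", "NonFood"): 800,
    ("Q3", "Japan", "Food"): 4000, ("Q3", "Japan", "NonFood"): 2000,
    ("Q4", "Thailand", "Food"): 2000, ("Q4", "Thailand", "NonFood"): 1000,
    ("Q4", "Japan", "Food"): 2500, ("Q4", "Japan", "NonFood"): 1500,
}


def roll_up(cells, keep):
    """Roll up by dimension reduction: sum out every dimension not in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in positions)] += value
    return dict(out)


def dice(cells, **selected):
    """Select cells whose coordinates lie in the given value sets.

    With one dimension selected this is a slice; with two or more, a dice.
    """
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] in values for d, values in selected.items())
    }


by_time_country = roll_up(cube, keep=("time", "country"))   # 2-D cuboid
apex = roll_up(cube, keep=())                                # 0-D cuboid
japan_slice = dice(cube, country={"Japan"})                  # slice
q1_food = dice(cube, time={"Q1"}, category={"Food"})         # dice

print(by_time_country[("Q3", "Japan")], apex, sum(japan_slice.values()), q1_food)
```

A pivot would not change any of these values; it only changes which dimensions are laid out as rows and which as columns when the result is displayed.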
A Star-Net Query Model
[Figure: star-net query model. Radial lines represent the dimensions Time (daily, quarterly, annually), Location (city, country, region), Product (item, product line, product group), Organization (salesperson, district, division), Customer (sex: male/female), Customer Orders (order, contracts), Shipping Method (shipping type, shipping type group), and Promotion (promotion type); each line ends at an "All" level.]
Each circle is called a footprint.
OLAP Operations (Drill down vs. Drill up)
[Figure: star-net with dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all); footprints mark the level in use.]
Starting view: quarter level, year 2008, all customers
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
Q1 | 1500 | 500 | 2000 | 1000 | 5000
Q2 | 900 | 600 | 1500 | 1000 | 4000
Q3 | 1200 | 800 | 4000 | 2000 | 8000
Q4 | 2000 | 1000 | 2500 | 1500 | 7000
Total 2008 | 5600 | 2900 | 10000 | 5500 | 24000
Drill down (roll down) from quarter to month: Q1, all customers
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
January | 700 | 200 | 800 | 500 | 2200
February | 500 | 100 | 700 | 200 | 1500
March | 300 | 200 | 500 | 300 | 1300
Total Q1 | 1500 | 500 | 2000 | 1000 | 5000
Drill up (roll up) from quarter to year: all customers
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 2400 | 2200 | 11000 | 5000 | 20600
2007 | 4500 | 3200 | 12000 | 6000 | 25700
2008 | 5600 | 2900 | 10000 | 5500 | 24000
Total | 12500 | 8300 | 33000 | 16500 | 70300
OLAP Operations (Slice and Dice)
[Figure: star-net with dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all); footprints mark the level in use.]
Starting view: all customers
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 2400 | 2200 | 11000 | 5000 | 20600
2007 | 4500 | 3200 | 12000 | 6000 | 25700
2008 | 5600 | 2900 | 10000 | 5500 | 24000
Total | 12500 | 8300 | 33000 | 16500 | 70300
Slice: Buyer Group 1
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 1100 | 1500 | 7000 | 2000 | 11600
2007 | 1200 | 1200 | 8000 | 3000 | 13400
2008 | 3000 | 1200 | 5000 | 1800 | 11000
Total | 5300 | 3900 | 20000 | 6800 | 36000
Dice: Buyer Group 1, January
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 200 | 200 | 800 | 300 | 1500
2007 | 100 | 100 | 900 | 400 | 1500
2008 | 300 | 200 | 400 | 100 | 1000
Total | 600 | 500 | 2100 | 800 | 4000
OLAP Operations (Pivot)
[Figure: star-net with dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all); footprints mark the level in use.]
Before pivot: time on rows, all customers
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 2400 | 2200 | 11000 | 5000 | 20600
2007 | 4500 | 3200 | 12000 | 6000 | 25700
2008 | 5600 | 2900 | 10000 | 5500 | 24000
Total | 12500 | 8300 | 33000 | 16500 | 70300
After pivot: product on rows, all customers
Product | Thailand 2006 | Thailand 2007 | Thailand 2008 | Japan 2006 | Japan 2007 | Japan 2008 | Total
Food | 2400 | 4500 | 5600 | 11000 | 12000 | 10000 | 45500
NonFood | 2200 | 3200 | 2900 | 5000 | 6000 | 5500 | 24800
Total | 4600 | 7700 | 8500 | 16000 | 18000 | 15500 | 70300
OLAP Operations (Drill Across)
Drill across moves between cubes built on different fact tables that share the dimensions Time, Product, and Location.
Sales (all customers)
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 2400 | 2200 | 11000 | 5000 | 20600
2007 | 4500 | 3200 | 12000 | 6000 | 25700
2008 | 5600 | 2900 | 10000 | 5500 | 24000
Total | 12500 | 8300 | 33000 | 16500 | 70300
Purchase (all customers)
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 1500 | 1500 | 6000 | 3000 | 12000
2007 | 3500 | 2500 | 8000 | 4000 | 18000
2008 | 4000 | 1500 | 5000 | 3000 | 13500
Total | 9000 | 5500 | 19000 | 10000 | 43500
OLAP Operations (Drill Through)
In order to see the detail in the relational table, we can perform "drill through".
Sales cube (all customers)
Time | Thailand Food | Thailand NonFood | Japan Food | Japan NonFood | Total
2006 | 2400 | 2200 | 11000 | 5000 | 20600
2007 | 4500 | 3200 | 12000 | 6000 | 25700
2008 | 5600 | 2900 | 10000 | 5500 | 24000
Total | 12500 | 8300 | 33000 | 16500 | 70300
Back-end relational table reached by drill through
Date | Branch | Item | Buyer | Units sold | Dollars sold
1/1/2008 | Phuket | VCD | First Company | 10 | 250
1/1/2008 | Bangkok | TV | First Company | 30 | 900
10/1/2008 | Phuket | TV | First Company | 10 | 300
4/2/2008 | Phuket | Stereo | First Company | 40 | 200
15/2/2008 | Bangkok | VCD | Best Company | 30 | 750
2/5/2008 | Bangkok | Computer | Best Company | 20 | 600
An Example of OLAP tools
[Figures: screenshots of an OLAP tool. One view is annotated with its measures and its 2 dimensions: (1) Time (year level) and (2) Organization (continent level).]
Designing Data Warehouses
Conceptual Modeling (Star schema, Snowflake, Fact constellations), Concept Hierarchy
Physical Modeling (Data sources, Data storage, OLAP engine)
Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture.]
Bottom tier (data storage): data sources (operational DBs and other sources) are extracted, transformed, and loaded into the data warehouse and data marts, supported by a monitor & integrator and a metadata repository.
Middle tier (OLAP server): the OLAP engine, served by the data warehouse and data marts.
Top tier (front-end tools): analysis, query, reports, and data mining.
Three Physical Data Warehouse Models
Enterprise Warehouse
Collects all of the information about subjects spanning the entire organization
Data Mart
A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
Independent vs. dependent (sourced directly from the warehouse) data mart
Virtual Warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
Easy to build but requires excess capacity on the operational DB servers
OLAP Server Architectures
Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools/services
Greater scalability; summaries may be kept in the relational DB
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high level: array
Specialized SQL servers
Specialized support for SQL queries over star/snowflake schemas in read-only environments
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
$T = \prod_{i=1}^{n} (L_i + 1)$
The extra 1 per dimension accounts for the top level "all".
Materialization of data cube
Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
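The count of cuboids is easy to evaluate in code. In this sketch the function name `total_cuboids` is made up, and the level counts are taken from the lecture's running example (time: day, month, quarter, year; location: branch, country, continent; product: item, subcategory, category; customer: buyer, buyer group), with the top level "all" counted by the +1 rather than as a level.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Levels per dimension in the running example: time, location, product, customer.
with_hierarchies = total_cuboids([4, 3, 3, 2])

# With a single level per dimension the formula reduces to 2^n.
without_hierarchies = total_cuboids([1, 1, 1, 1])

print(with_hierarchies, without_hierarchies)
```

The count grows multiplicatively with every extra level, which is one reason partial materialization is usually chosen over full materialization.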
Multi-way Array Aggregation for Cube Computation
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost.
The sizes of the dimensions A, B, and C are 6000, 1000, and 100000. What is the best traversing order to do multi-way aggregation?
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]
Multi-Way Array Aggregation for Cube Computation (Cont.)
Method: the planes should be sorted and computed according to their size in ascending order.
Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it works well only for a small number of dimensions
If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored. Iceberg cubes store only those cube partitions whose aggregate value is above some minimum support threshold.
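The iceberg condition itself can be shown with a naive sketch. This is not the bottom-up computation algorithm, which prunes while it computes; it simply counts every group-by cell and then keeps the cells that meet the threshold. The function name `iceberg_cells` is hypothetical, the measure is assumed to be count, and the rows are the (branch, item) pairs from the lecture's small fact table.

```python
from collections import Counter
from itertools import combinations


def iceberg_cells(rows, dims, min_support):
    """Return {cell: count} for all group-by cells with count >= min_support."""
    kept = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in group) for row in rows)
            for values, count in counts.items():
                if count >= min_support:
                    cell = tuple(zip((dims[i] for i in group), values))
                    kept[cell] = count
    return kept


rows = [
    ("London", "VCD"), ("Bangkok", "TV"), ("London", "Ham"),
    ("London", "Milk"), ("Bangkok", "VCD"), ("Bangkok", "Orange"),
]
cells = iceberg_cells(rows, dims=("branch", "item"), min_support=2)

for cell, count in cells.items():
    print(cell, count)
```

With a threshold of 2, none of the (branch, item) cells survive, so the stored iceberg cube is much smaller than the full cube.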
Multi-Way Array Aggregation (An Example)
Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking, respectively.
If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.
The number of cells in the computed cube is $(6 \times 10^{3}) \times 10^{3} \times (5 \times 10^{4}) = 3 \times 10^{11}$ cells.
The total size of the computed cube is $4 \times 3 \times 10^{11} = 1.2 \times 10^{12}$ bytes.
State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.
Multi-Way Array Aggregation for Cube Computation (Cont.)
[Figure: the 3-D array with axes A, B, and C, partitioned into chunks.]
One chunk includes
A: 6000/6 = 1000
B: 1000/5 = 200
C: 50000/1000 = 50
The space we need for computing the cube is
(1) All elements of plane AB: 6000 x 1000 cells = 6,000,000 cells = 24,000,000 bytes
(2) One row of plane BC: 1000 x 50 cells = 50,000 cells = 200,000 bytes
(3) One chunk of plane AC: 1000 x 50 cells = 50,000 cells = 200,000 bytes
(4) One chunk of cube ABC: 1000 x 200 x 50 cells = 10,000,000 cells = 40,000,000 bytes
Total memory for keeping the result = 64,400,000 bytes = $6.44 \times 10^{7}$ bytes.
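The arithmetic above can be reproduced, and the traversal order checked, with a short script. This is a sketch under the slide's accounting: for a scan order (fastest, middle, slowest), one whole 2-D plane, one row of a second plane, and one chunk of the third plane are held in memory, plus one 3-D chunk. The function and variable names are made up for the example.

```python
from itertools import permutations

CELL_BYTES = 4
sizes = {"A": 6000, "B": 1000, "C": 50000}
parts = {"A": 6, "B": 5, "C": 1000}
chunks = {d: sizes[d] // parts[d] for d in sizes}  # A: 1000, B: 200, C: 50


def plane_cells(order):
    """Cells held for the three 2-D planes when chunks are scanned in `order`."""
    fastest, middle, slowest = order
    whole_plane = sizes[fastest] * sizes[middle]  # plane aggregated over `slowest`
    one_row = sizes[fastest] * chunks[slowest]    # plane aggregated over `middle`
    one_chunk = chunks[middle] * chunks[slowest]  # plane aggregated over `fastest`
    return whole_plane + one_row + one_chunk


best_order = min(permutations("ABC"), key=plane_cells)
plane_bytes = CELL_BYTES * plane_cells(best_order)
chunk_bytes = CELL_BYTES * chunks["A"] * chunks["B"] * chunks["C"]
total_bytes = plane_bytes + chunk_bytes

print(best_order, plane_bytes, chunk_bytes, total_bytes)
```

Under these assumptions the cheapest order varies B fastest, then A, then C, so that the largest plane (AC) needs only one chunk in memory and the smallest plane (AB) is the one kept whole.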
Efficient Processing OLAP Queries
Determine which operations should be performed on the available cuboids:
transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied.
Explore indexing structures and compressed vs. dense array structures in MOLAP
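Choosing a materialized cuboid for a query can be sketched as follows. This is a simplification for illustration: a cuboid is treated as usable if it contains every dimension the query needs, hierarchy levels are ignored, and the cuboid list and cell counts are made-up numbers. The name `pick_cuboid` is hypothetical.

```python
def pick_cuboid(materialized, needed_dims):
    """Return the smallest materialized cuboid covering all needed dimensions."""
    candidates = [
        cuboid for cuboid in materialized
        if set(needed_dims) <= set(cuboid["dims"])
    ]
    if not candidates:
        return None  # the query must fall back to the base data
    return min(candidates, key=lambda cuboid: cuboid["cells"])


# Hypothetical materialized cuboids with made-up sizes.
materialized = [
    {"dims": ("time", "item", "location"), "cells": 1_000_000},
    {"dims": ("time", "location"), "cells": 20_000},
    {"dims": ("item",), "cells": 500},
]

by_location = pick_cuboid(materialized, ["location"])
by_item_location = pick_cuboid(materialized, ["item", "location"])
by_customer = pick_cuboid(materialized, ["customer"])

print(by_location["dims"], by_item_location["dims"], by_customer)
```

Answering from the smallest covering cuboid means fewer cells are scanned and aggregated than when starting from a larger one.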
Metadata Repository
Metadata is the data defining warehouse objects. It has the following kinds:
The algorithms used for summarization
The mapping from the operational environment to the data warehouse
Data related to system performance: warehouse schema, view and derived data definitions
Business data: business terms and definitions, ownership of data, charging policies
Data Warehouse Back-End Tools and Utilities
Extraction: get data from multiple, heterogeneous, and external sources
Transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Cleaning: detect errors in the data and rectify them when possible
Refresh: propagate the updates from the data sources to the warehouse
From OLAP to On Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM Architecture (Online Analytical Mining)
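[Figure: OLAM architecture in four layers.]
Layer 1, Data Repository: databases and the data warehouse, populated through data cleaning, data integration, and filtering; accessed through the Database API.
Layer 2, MDDB: multidimensional database with metadata, built by filtering and integration; accessed through the Data Cube API.
Layer 3, OLAP/OLAM: the OLAM engine and the OLAP engine.
Layer 4, User Interface: the User GUI API, which accepts a mining query and returns a mining result.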
Major Applications of Data Warehouses
Applications
Online Analytical Processing (OLAP)
Data Mining
Customer Relationship Management (CRM)
Some commercial tools for Data Warehouse
Microsoft Excel’s PivotTable
Crystal Reports’ Business Objects
IBM Cognos
Microsoft SQL Server Analysis Services
Oracle Express
Microsoft Proclarity
Etc.
Exercise 1
Knowledge Management and Discovery © Kritsada Sriphaew
Given the following fact table
Date | Branch | Item | Buyer | Units sold | Dollars sold
1/01/2008 | London | Chair | First Company | 20 | 5000
1/01/2008 | Bangkok | TV | First Company | 30 | 9000
10/01/2008 | London | Ham | Second Company | 20 | 1000
4/02/2008 | London | Milk | First Company | 80 | 1600
15/02/2008 | Bangkok | VCD | Best Company | 30 | 7500
2/05/2008 | Berlin | Orange | Best Company | 20 | 500
12/06/2008 | London | VCD | Good Company | 20 | 5000
14/06/2008 | Bangkok | TV | First Company | 10 | 12000
16/07/2008 | Phuket | VCD | Good Company | 20 | 1000
24/07/2008 | London | Orange | First Company | 30 | 6000
2/08/2008 | Bangkok | Table | Best Company | 30 | 7500
12/08/2008 | Bangkok | Ham | Best Company | 20 | 500
12/10/2008 | London | Table | Good Company | 20 | 5000
14/11/2008 | Bangkok | TV | First Company | 8 | 2000
16/11/2008 | Phuket | Ham | Good Company | 5 | 2000
24/12/2008 | London | Milk | First Company | 30 | 6000
Exercise 1 (cont.)
Given the following dimensions of the data
[Figure: star-net with dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all).]
Exercise 1 (cont.)
Question 1: Write concept hierarchies of time, location, product and customer.
Buyer groups are
Group1: Quality Group; good company, best company
Group2: Number Group; first company, second company
Product’s categories are
Cat1: Food; subcategories are drink and non-drink
Cat2: NonFood; subcategories are electronic and non-electronic
Exercise 1 (cont.)
Question 2: Write a result table of this star-net query model with sum dollars sold.
[Figure: star-net query model over the dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all), with footprints marking the levels to query.]
Exercise 1 (cont.)
Question 3: Write a result table of this star-net query model with average units sold.
[Figure: star-net query model over the dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all), with footprints marking the levels to query.]
Exercise 1 (cont.)
Question 4: Write a result table of this star-net query model with sum dollars sold. Slice only Bangkok location.
[Figure: star-net query model over the dimension lines Time (day, month, quarter, year, all), Location (branch, country, continent, all), Product (item, subcategory, category, all), and Customer (buyer, buyer group, all), with footprints marking the levels to query.]
Exercise 2
(Data Cube Computation) Suppose that a base cuboid has three dimensions A, B, C with the following numbers of cells: |A| = 20,000, |B| = 8,000, |C| = 1,000. Also suppose that the dimensions A, B and C are partitioned into 10, 8 and 4 portions for chunking, respectively.
Question 1: If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is very dense? That is, calculate the size of a base cuboid.
What is the number of cells in the computed cube?
What is the total size (in bytes) of the computed cube?
Question 2: State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.