Naeem Ahmed
Department of Software Engineering
Mehran University of Engineering and Technology Jamshoro
Email: [email protected]

Data Warehouse and Data Mining
Lecture No. 08-15
OLAP and Multi-Dimensional Data

Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures




On-Line Analytical Processing

•  A decision support system (DSS) that supports ad hoc querying, i.e. enables managers and analysts to manipulate data interactively.

•  Analysis of information in a database for the purpose of making management decisions

•  The idea is to allow users to manipulate and visualize the data easily and quickly through multidimensional views (i.e. different perspectives)

•  OLAP analyzes historical data (terabytes) using complex queries


On-Line Analytical Processing

•  OLAP Council definition:

–  A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user

•  OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.


On-Line Analytical Processing

•  OLAP primarily involves aggregating large amounts of diverse data

•  OLAP functionality provides dynamic multi-dimensional analysis, supporting analytical and navigational activities

•  OLAP functionality is provided by the OLAP Server

•  OLAP Council defines OLAP Server as:

–  ‘A high capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.’


Data Dimensionality


Data Dimensionality: Cube

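[Figure: sales data cube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (Pakistan, China, India, sum), and a third axis listing the products TV, VCR, PC and sum. Highlighted cells: 1st Qtr sales of TV in Pakistan; total annual sales of TV in Pakistan; total annual sales of TV, PC & VCR in India.]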

Cube: A group of data cells arranged by the dimensions of the data.


Data Dimensionality

Possible Views of Sale

•  How many Products were sold at a specific Time to specific Customer(s)?

•  How many Customers bought the Product(s) at a specific Time?

•  At which Time(s) did the Customer(s) buy the specific Product(s)?

[Figure: Sale at the center, linked to the three dimensions Products, Time and Customers]


Multi-dimensional Data

•  Measures - numerical data being tracked
•  Dimensions - business parameters that define a transaction
•  Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
•  Dimensional modeling is a technique for structuring data around the business concepts
•  ER models describe “entities” and “relationships”
•  Dimensional models describe “measures” and “dimensions”


Multi-dimensional Model

“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”

[Figure: a fact table for measures with the columns Prod Code, Time Code, Store Code (key columns joining the fact table to the dimension tables) and Sales Qty (numerical measure), linked to the dimension tables Store Info, Product Info, Time Info, . . .]


Multi-dimensional Model

•  Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables
•  Forms ‘star-like’ structure, which is called a star schema or star join
•  Dimensions are organized into hierarchies
   –  E.g., Time dimension: days → weeks → quarters
   –  E.g., Product dimension: product → product line → brand
•  Dimensions have attributes
   –  e.g., owner city and county of store
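The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (product_dim, store_dim, time_dim, sales), the attribute columns and all rows are assumptions made up for the example, not taken from the lecture.

```python
import sqlite3

# Hypothetical star schema: one fact table whose composite primary key
# is made of foreign keys into three small dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (prod_code  TEXT PRIMARY KEY, product_line TEXT);
CREATE TABLE store_dim   (store_code TEXT PRIMARY KEY, city TEXT);
CREATE TABLE time_dim    (time_code  INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE sales (
    prod_code  TEXT    REFERENCES product_dim(prod_code),
    time_code  INTEGER REFERENCES time_dim(time_code),
    store_code TEXT    REFERENCES store_dim(store_code),
    sales_qty  INTEGER,                              -- numerical measure
    PRIMARY KEY (prod_code, time_code, store_code)   -- composite key
);
INSERT INTO product_dim VALUES ('p1', 'beverages'), ('p2', 'dairy');
INSERT INTO store_dim   VALUES ('s1', 'Karachi'), ('s2', 'Jamshoro');
INSERT INTO time_dim    VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO sales VALUES ('p1', 1, 's1', 12), ('p2', 1, 's1', 11),
                         ('p1', 2, 's1', 44), ('p1', 2, 's2', 4);
""")

# "Sales by product line": join the fact table to one dimension table
# and aggregate the measure over the dimension attribute.
sales_by_line = conn.execute("""
    SELECT p.product_line, SUM(f.sales_qty)
    FROM sales AS f
    JOIN product_dim AS p ON f.prod_code = p.prod_code
    GROUP BY p.product_line
    ORDER BY p.product_line
""").fetchall()
print(sales_by_line)  # [('beverages', 60), ('dairy', 11)]
```

The same pattern (fact table joined to whichever dimension supplies the grouping attribute) would answer "sales by store" or "sales by quarter".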


Dimension Hierarchies

Store Dimension: Stores → District → Region → Total

Product Dimension: Products → Brand → Manufacturer → Total


Operations in Multidimensional Data Model

•  Aggregation (roll-up)
   –  dimension reduction: e.g., total sales by city
   –  summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
•  Selection (slice) defines a sub-cube
   –  e.g., sales where city = Palo Alto and date = 20/1/2014
•  Navigation to detailed data (drill-down)
   –  e.g., (sales - expense) by city, top 3% of cities by average income
•  Visualization Operations (e.g., Pivot)


A Visual Operation: Pivot

[Figure: cube with the axes Product (Juice, Cola, Milk, Cream), Date (3/1, 3/2, 3/3, 3/4) and Region; the visible cells hold the values 10, 47, 30 and 12]

A pivot is a two-dimensional layout of the summary data.

The x and y axes are the dimensions, and the cell at the intersection of any two dimension values contains the value of the measure.
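A pivot as described here can be sketched in a few lines of Python. The records below are invented for illustration (the slide's figure does not say which cell belongs to which date), and the function name pivot is an assumption.

```python
from collections import defaultdict

# Hypothetical (product, date, quantity) records; the values are made up.
records = [
    ("Juice", "3/1", 10),
    ("Cola", "3/1", 47),
    ("Milk", "3/2", 30),
    ("Cream", "3/2", 12),
    ("Juice", "3/2", 5),
]

def pivot(rows):
    """Lay the measure out in two dimensions: product (rows) by date (columns)."""
    table = defaultdict(lambda: defaultdict(int))
    for product, date, qty in rows:
        table[product][date] += qty  # cell = summed measure at the intersection
    return {product: dict(cols) for product, cols in table.items()}

layout = pivot(records)
print(layout["Juice"])  # {'3/1': 10, '3/2': 5}
```

Swapping the roles of product and date in the loop would rotate the same data into the other orientation.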


Drill-Down and Roll-Up


Multi-dimensionality: Cube


Multi-dimensionality


On-Line Analytical Processing

•  OLAP tools are market driven. That is, no standards exist, either academic or from an organization

•  A common modeling approach is to use Star or Snowflake Database Schemata (common in Data Warehouse Modeling)

•  End users look for the following characteristics, independent of tool architecture or vendor:


On-Line Analytical Processing

•  Interactivity – How easily does the end user interact with the tool?

•  Customization – How easily can the end user change the data representation provided by the tool?

•  Security – How easily can the end user access unauthorized data?

•  Visualization – How easily does the tool provide multi-dimensional graphical representations?


OLAP Servers

•  Two possibilities for OLAP servers
   –  Relational OLAP (ROLAP)
      •  Relational and specialized relational DBMS to store and manage warehouse data
      •  OLAP middleware to support missing pieces
   –  Multidimensional OLAP (MOLAP)
      •  Array-based storage structures
      •  Direct access to array data structures
      •  No SQL (Structured Query Language)
         –  Special language provided by the vendor (e.g. Multidimensional Expressions (MDX) of Microsoft)


OLAP Taxonomy

•  Multi-dimensional OLAP (MOLAP)
   –  ‘A k-dimensional matrix based on a non relational storage structure.’ Agrawal et al.
•  Relational OLAP (ROLAP)
   –  ‘A relational back-end wherein operations of the data are translated to relational queries.’ Agrawal et al.
•  Hybrid OLAP (HOLAP)
   –  Integration of MOLAP and ROLAP
•  Desktop OLAP (DOLAP)
   –  Provides a specific cube for analysis. Simplified version of MOLAP or ROLAP


Multi-dimensional OLAP

•  Multi-dimensional data management in Multi-Dimensional Database Management Systems (MDDBMS)

•  A special-purpose server that directly implements multidimensional data and operations

•  Advantages: Fast data access, many dimensions, performance

•  Further Research on storage techniques and realization of transactional concepts


MOLAP: Dimensional Modeling Using the Multi Dimensional Model

•  MDDB: a special-purpose data model, MOLAP = “Cubes”
•  Facts stored in multi-dimensional arrays
•  The Database system builds most of the aggregates within a non-relational data store
•  Dimensions used to index array
•  Sometimes on top of relational DB
•  Products: Pilot, Arbor Essbase, Gentia
•  Limitations: Memory


Relational OLAP

•  A multi-dimensional user view on relational data storage using Star or Snowflake Database Schemata

[Figure, Star Schema: a central Sales table linked to the Product, Time, Region and Customer dimensions]

[Figure, Snowflake Schema: a central Sales table linked to the Product, Year, Country and Customer dimensions, which branch further into Product Kind, Month, Region and Customer Characteristics]


Relational OLAP

•  An extended relational DBMS that maps operations on multidimensional data to standard relational operators (i.e., iterators like joins, loops, nested joins etc.)

•  Fact tables are too big to query directly, so ROLAP incorporates aggregate tables
   –  Aggregate tables are built by running summarizing queries that join the fact table with one or more dimensions and saving the result set
   –  Users don’t need to specify the aggregate table; vendors provide automatic support for aggregate tables in the data warehouse

•  Advantages: Easy to understand, easy to model, easy to implement
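Building an aggregate table can be sketched with sqlite3 as follows. The sale rows and the store-to-region mapping are reused from the aggregation slides later in this lecture; the names store_dim and agg_sale_by_region are assumptions for the example, and real ROLAP products create and select such tables automatically rather than by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);

-- Hypothetical store dimension carrying the region attribute.
CREATE TABLE store_dim (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store_dim VALUES ('s1','A'), ('s2','B'), ('s3','B');

-- Aggregate table: run the summarizing query once and save its result set.
CREATE TABLE agg_sale_by_region AS
    SELECT d.region AS region, f.prodId AS prodId, SUM(f.amt) AS amt
    FROM sale AS f
    JOIN store_dim AS d ON f.storeId = d.storeId
    GROUP BY d.region, f.prodId;
""")

# Later queries read the small aggregate table instead of the fact table.
agg_rows = conn.execute(
    "SELECT region, prodId, amt FROM agg_sale_by_region ORDER BY region, prodId"
).fetchall()
print(agg_rows)  # [('A', 'p1', 56), ('A', 'p2', 11), ('B', 'p1', 54), ('B', 'p2', 8)]
```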


ROLAP: Dimensional Modeling Using Relational DBMS

•  Special schema design: star, snowflake
•  Special indexes: bitmap, multi-table join
•  Special tuning: maximize query throughput
•  Proven technology (relational model, DBMS), tends to outperform specialized MDDB especially on large data sets
•  Products
   –  IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
•  Limitations: Maintenance, Performance


Hybrid OLAP

•  ‘A system, which supports (and integrates) multi-dimensional and relational storage for data in an equivalent manner in order to benefit from the corresponding characteristics and optimization techniques.’ Dinter et al.

•  Advantages: use of best techniques introduced on MOLAP and ROLAP, transparency between MOLAP and ROLAP systems

•  Further Research on storage systems, on global multi-dimensional schema, on common interface and mutual integration of MOLAP and ROLAP


Desktop OLAP (DOLAP)

•  All processing work is done on the desktop
   –  E.g., bring data into Excel and build a pivot table

•  DOLAP can be inexpensive, easy and fast to set up, but only on small data sets (thousands of rows)


OLTP versus OLAP

OLTP                               OLAP
Operational processing             Informational processing
Transaction-oriented               Analysis-oriented
For operational staff              For managers, executives & analysts
Daily operations                   Decision support
Current, up-to-date data           Historical data
Primitive, highly detailed data    Summarized, consolidated data
Detailed, flat relational views    Summarized, multi-dimensional views
Short, simple transactions         Complex aggregate queries
Read/write                         Mostly read only
Index on keys                      Many scans
Many users                         Small number of users
Large databases                    Very large databases


OLTP versus OLAP

•  On-Line Transaction Processing
   –  Transfer $100 from my savings account to my checking account

•  On-Line Analytical Processing
   –  What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
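The contrast between the two workloads can be illustrated with sqlite3. The account table, its columns and its balances are invented for this sketch; only the two kinds of request come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical accounts; ids, owners and balances are made up.
CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, type TEXT, balance INTEGER);
INSERT INTO account VALUES (1, 'me',    'saving',   500),
                           (2, 'me',    'checking', 200),
                           (3, 'other', 'saving',   300);
""")

# OLTP: a short read/write transaction touching two rows by key.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")

# OLAP: a read-only aggregate query scanning all rows.
avg_by_type = conn.execute(
    "SELECT type, AVG(balance) FROM account GROUP BY type ORDER BY type"
).fetchall()
print(avg_by_type)  # [('checking', 300.0), ('saving', 350.0)]
```

Grouping by more columns (customer group, area, account manager) would give the other combinations the slide mentions.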


Aggregate

•  A whole formed or calculated by the combination of many separate units or items – Total
•  Operators: sum, count, max, min, median, avg
   –  Example: Add up amounts by day
   –  Example in SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

ans
date  sum
1     81
2     48
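The query above can be checked against the slide's sale table with Python's sqlite3 module. The ORDER BY clause is an addition so that the row order is deterministic; everything else follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Add up amounts by day.
ans = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()
print(ans)  # [(1, 81), (2, 48)]
```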


Aggregate

•  Add up amounts by day, product
•  SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Result
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Roll-up: from the detailed sale table to the summary. Drill-down: from the summary back to the detailed rows.
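As a small sketch of how the two groupings relate, the code below computes the (product, day) summary with sqlite3 and then rolls it up to the per-day totals in plain Python. The variable names are assumptions; the data is the slide's sale table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Finer level: amounts by day and product.
by_date_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
print(by_date_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]

# Roll-up: drop the product dimension by summing over it.
by_date = {}
for _prod, date, total in by_date_product:
    by_date[date] = by_date.get(date, 0) + total
print(by_date)  # {1: 81, 2: 48}
```

Drill-down is the reverse direction: starting from by_date and asking for the by_date_product rows behind one of its totals.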


MOLAP Cube

dimensions = 2

Fact table view:

sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube:

      s1   s2   s3
p1    12        50
p2    11   8

dimensions = 3

Fact table view:

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Multi-dimensional cube:

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2
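The step from the fact table view to the cube can be sketched with nested Python lists standing in for array-based storage. This is a toy illustration under the assumption of a dense array with None for empty cells; real MOLAP servers use their own storage and compression schemes.

```python
# Fact table rows from the slide: (prodId, storeId, date, amt).
sale = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]

products = ["p1", "p2"]
stores = ["s1", "s2", "s3"]
days = [1, 2]

# Each dimension value maps to an array position ("dimensions index the array").
p_idx = {p: i for i, p in enumerate(products)}
s_idx = {s: i for i, s in enumerate(stores)}
d_idx = {d: i for i, d in enumerate(days)}

# Dense 3-dimensional array cube[day][product][store]; None marks an empty cell.
cube = [[[None] * len(stores) for _ in products] for _ in days]
for prod, store, day, amt in sale:
    cube[d_idx[day]][p_idx[prod]][s_idx[store]] = amt

def cell(prod, store, day):
    """Direct access to one cell, with no search through fact rows."""
    return cube[d_idx[day]][p_idx[prod]][s_idx[store]]

print(cell("p1", "s3", 1))  # 50
print(cube[d_idx[2]])       # [[44, 4, None], [None, None, None]]
```

The empty cells show why memory is listed as a limitation: the array reserves space for every combination of dimension values, sold or not.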


Example: Cube

[Figure: cube with the axes Product (Juice, Milk, Coke, Cream, Soap, Bread), Time (M T W Th F S S) and Store (NY, SF, LA); the visible row of cells holds 10, 34, 56, 32, 12, 56. Highlighted cell: 56 units of bread sold in LA on M. Arrows mark roll-up to week, roll-up to brand and roll-up to region.]

Dimensions: Time, Product, Store

Attributes: Product (upc, price, …) Store … …

Hierarchies:
Product → Brand → …
Day → Week → Quarter
Store → Region → Country


Cube Aggregation

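Example: computing sums

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2

Sum over days:
      s1   s2   s3
p1    56   4    50
p2    11   8

Sum over days and products:
      s1   s2   s3
sum   67   12   50

Sum over days and stores:
      sum
p1    110
p2    19

Sum over everything: 129

. . .

Roll-up moves from the detailed cube toward the sums; drill-down moves from the sums back toward the detail.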
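The sums in this example can be reproduced with a small roll-up helper over a dictionary that stands in for the cube. The function name roll_up and the key layout (product, store, day) are assumptions for this sketch.

```python
from collections import defaultdict

# Cube cells keyed by (product, store, day), taken from the slide.
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def roll_up(cells, keep):
    """Sum out every dimension whose position is not listed in `keep`."""
    out = defaultdict(int)
    for key, amt in cells.items():
        out[tuple(key[i] for i in keep)] += amt
    return dict(out)

by_prod_store = roll_up(cube, keep=(0, 1))  # sum over days
by_store = roll_up(cube, keep=(1,))         # sum over days and products
by_prod = roll_up(cube, keep=(0,))          # sum over days and stores
total = roll_up(cube, keep=())              # sum over everything

print(by_prod_store[("p1", "s1")])  # 56
print(by_store)                     # {('s1',): 67, ('s3',): 50, ('s2',): 12}
print(by_prod)                      # {('p1',): 110, ('p2',): 19}
print(total)                        # {(): 129}
```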


Aggregation Using Hierarchies

Hierarchy: store → region → country
(store s1 in Region A; stores s2, s3 in Region B)

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2

Rolled up from store to region (summed over days):
      region A   region B
p1    56         54
p2    11         8
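Rolling up along the store hierarchy can be sketched by mapping each store to its parent region before summing. The dictionary region_of encodes the assignment stated on the slide; the other names are assumptions.

```python
from collections import defaultdict

# Cube cells keyed by (product, store, day), taken from the slide.
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

# One level of the hierarchy: store s1 in Region A; stores s2, s3 in Region B.
region_of = {"s1": "A", "s2": "B", "s3": "B"}

totals = defaultdict(int)
for (prod, store, _day), amt in cube.items():
    # Replace the store by its region, then sum (days are summed out as well).
    totals[(prod, region_of[store])] += amt
by_region = dict(totals)

print(by_region[("p1", "A")], by_region[("p1", "B")])  # 56 54
print(by_region[("p2", "A")], by_region[("p2", "B")])  # 11 8
```

A second mapping from region to country would take the same cells one level higher.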


Slicing

•  Slicing means taking a slice out of a cube by fixing the values of a selected set of dimensions
   –  e.g., sales where city = ‘Karachi’ and date = ‘20/1/2014’

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2

Slice with TIME = day 1:
      s1   s2   s3
p1    12        50
p2    11   8
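The TIME = day 1 slice can be reproduced with a short helper over the same dictionary-based cube. The function name slice_cube and the key layout (product, store, day) are assumptions for this sketch.

```python
# Cube cells keyed by (product, store, day), taken from the slide.
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def slice_cube(cells, dim, value):
    """Fix dimension number `dim` to `value` and drop it from the keys."""
    return {key[:dim] + key[dim + 1:]: amt
            for key, amt in cells.items() if key[dim] == value}

day1 = slice_cube(cube, dim=2, value=1)
print(day1)
# {('p1', 's1'): 12, ('p2', 's1'): 11, ('p1', 's3'): 50, ('p2', 's2'): 8}
```

The result has one dimension fewer than the cube, which is what distinguishes a slice from a general selection.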


Dicing

•  Dicing means viewing the slices from different angles.
   –  Example: Revenue for different products within a given state, or revenue for different states for a given product

•  Dicing is more of a zoom feature: it selects a subset over all the dimensions, but for specific values of each dimension

•  One form of Slicing and Dicing is called pivoting
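Read as a selection of specific values on every dimension, dicing can be sketched as follows. The cube cells are the sale data used on the earlier aggregation slides; the function name dice and the chosen value sets are assumptions for the example.

```python
# Cube cells keyed by (product, store, day), from the earlier sale table.
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def dice(cells, allowed):
    """Keep the cells whose value on each listed dimension is in its allowed set."""
    return {key: amt for key, amt in cells.items()
            if all(key[dim] in values for dim, values in allowed.items())}

# Zoom in on product p1, stores s1 and s2, days 1 and 2.
sub_cube = dice(cube, {0: {"p1"}, 1: {"s1", "s2"}, 2: {1, 2}})
print(sub_cube)
# {('p1', 's1', 1): 12, ('p1', 's1', 2): 44, ('p1', 's2', 2): 4}
```

Unlike a slice, the result keeps all three dimensions; it is a smaller cube rather than a lower-dimensional one.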


Dicing
