Design of DW. Remember item, city, year, and sales_in_Euro (item)(city) () (year) (city, item)(city, year)(item, year) (city, item, year)

Design of DW

Remember

item, city, year, and sales_in_Euro

(item)(city)

()

(year)

(city, item) (city, year) (item, year)

(city, item, year)

2 leveles of hierarchies for each dimension Item(part,color) i1,i2 City(downtown,suburb) c1,c2

Year(good_year,bad_year) y1,y2

For a 3-dimensional data cube, where Li is the number of all levels (L1,2,3=2), the total number of cuboids that can be generated is

€

(2 +1) = 33 = 27i=1

3

∏

{(), (c1),(c2),(i1),(i2),(y1),(y2),

(c1,i1),(c1,i2),(c2,i1),(c2,i2), (c1,y1),(c1,y2),(c2,y1),(c2,y2), (i1,y1),(i1,y2),(i2,y1),(i2,y2),

(c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}

DMQL Data Mining Query Language Relational database schema Translation into SQL query

Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models

DMQL DMQL: A Data Mining Query Language

for Relational Databases (Han et al, Simon Fraser University)

Data warehouses and data marts can be defined by cube definition and dimension definition

DMQL Create and manipulate data mining models

through a SQL-based interface (“Command-driven” data mining)

Abstract away the data mining particulars Data mining should be performed on data in the

database (should not need to export to a special-purpose environment)

Approaches differ on what kinds of models should be created, and what operations we should be able to perform

Cube Definition Syntax in DMQL

Cube Definition (Fact Table)define cube <cube_name> [<dimension_list>]:

<measure_list>

Dimension Definition (Dimension Table)define dimension <dimension_name> as

(<attribute_or_subdimension_list>)

Example of Star Schema time_key

dayday_of_the_weekmonthquarteryear

time

location_keystreetcitystate_or_provincecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

The star schema contains two measures dollars_sold and units_sold

How are the DMQL commands interpreted to generate a specified data cube?

Relational database schematime(time_key,day_of_week,month,quater,year)

item(item_key,item_name,brand,type,supplier_type)

branch(branch_key,branch_name,branch_type)

location(location_key,street,city,province_or_state,country)

sales(time_key,item_key,branch_key,location_key,number_of_units_sold,price)

The DMQL specification is translated into the following SQL query which generates the base cuboid

SELECT s.time_key,s.item_key,s.branch_key,s.location_key, SUM(s.number_of_units_sold*s.price), SUM(s.number_of_units_sold)FROM time t, item i, branch b, location l, sales s,WHERE s.time_key=t.time_key AND s.item_key=i.item_key AND s.branch_key=b.branch_key AND

s.location_key=l.location_keyGROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key);

The granularity (resolution) of each dimension is at the join key level

A join key is the key that links a fact table and the dimension table

The fact table associated with a base cuboid is sometimes referred as base fact table

By changing GROUP BY we may generate other cuboids

The apex cuboid representing the total sum of dollars_sold and total count of units_sold is generated by GROUP BY ();

Other cuboids may be generated by applying selection and projection operations on the base cuboid

To generate a data cube we may as well use GROUP BY CUBE

(s.time_key,s.item_key,s.branch_key,s.location_key);

Defining Snowflake Schemain DMQL

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining Fact Constellationin DMQL

define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location in cube sales,

shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales

Example (Exercício 7) Suponha um datawarehouse que contém as seguintes

quatro dimensões: Data, Espectador, Localizaçaõ e Jogo e o facto “preço” que consiste no valor, em euros

Valor pago por um espectador quando assiste a um determinado jogo numa data

Espectadores podem ser estudantes, adultos ou séniores, e que cada uma destas categorias tem o seu preço de bilhete

Dimensão data contenha o dia, mês e ano; que a dimensão Localização contenha o nome do estádio, e que a dimensão Jogo contenha o nome das duas equipas defrontadas

Diagrama em estrelapara este DW

jogo(jogoId, equipa1, equipa2) data (dataId, dia, mes, ano) espectador (espId, nome, categoria) localizacao(localId, estadio) factos(jogoId, dataId, espId, localId, preco)

Escreva em SQL a interrogaçã o que devolve o preço total pago por espectadores estudantes para assistir ao jogo que se realizou no Estádio da Luz no dia 1 de Março de 2005

The corresponding SQL querry

SELECT SUM(preco)

FROM Factos F, Data D, Localizacao L, Espectador E

WHERE F.dataId = D.dataId

AND F.espId = E.espId

AND F.localId = L.localId

AND D.dia = 1

AND D.mes = 3

AND D.ano = 2005

AND L.estadio = ‘Estadio da Luz’

AND E.categoria= ‘Estudante’;

Esquema relacional que modele a mesma informação e uma interrogaçã o SQL que devolva a mesma informação:

jogo(jogoId, equipa1, equipa2, localId, data)localizacao(localId, estadio)espectador(espId, nome, categoriaId)categoria (categoriaId, nomeC, preco)jogoEspectador(jogoId, espId)

The corresponding SQL querry

SELECT SUM (C.preco)

FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J, Localizacao L

WHERE C.nomeC = ‘Estudante’

AND J.data = 1/3/2005

AND L.estadio = ‘Estadio da Luz’

AND C.categoriaId = E.categoriaId

AND E.espId = JE.espId

AND JE.jogoId = J.jogoId

AND J.localId = L.localId;

Difference between both approaches

Com o modelo em estrela, existe três joins, da tabela Factos com cada uma das 3 dimensões relevantes

Com o esquema relacional, existem 4 joins

In the star schema has less joins, corresponding to the relevant dimensions

In multidimensional model the base cuboid is already precomputed

MDX Multidimensional Expressions (MDX) as a

Language

MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications

Provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products)

We are asked, moreover, to provide the information in a two-dimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns

SELECT{[Time].[1997],[Time].[1998]} ON COLUMNS,{[Measures].[Warehouse Sales],[Measures].[Warehouse

Cost]} ON ROWSFROM WarehouseWHERE ([Store].[All Stores].[USA])

The cube that is targeted by the query (the query scope) appears in the FROM clause of the query

The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query

The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE.

Important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar

A query has one or more dimensions. The query above has two. (The first three dimensions (=axes) that are found in MDX queries are known as rows, columns and pages)

SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON

ROWSFROM WarehouseWHERE ([Measures].[Warehouse Sales])

Curled brackets "{}" are used in MDX to represent a set of members of a dimension or group of dimensions

Complex Aggregation at Multiple Granularities Multifeature cubes compute complex queries

involving multiple aggregates at multiple granularities (resolution)

Example item is purchased in a sales region on a business

day (year,month,day) The shelf life in months of a given item is stored in

shelf The item price and sales is stored in price and sales

Find the total sales in 2000, broken down by item, region, and month with subtotal for each dimension

A data cube is constructed {(item,region,month),(item,region),(item,month),

(month,region),(item),(month),(region,()}

Simple data cube, since it does not involve any dependent aggregates

What are dependent aggregates?

Example

Grouping by all subsets (cuboids) {item,region,month} (=data cube)

Find maximum price for each group (cuboid) in 2000

Among the maximum price tuples find the minimum and maximum shelf lives

Multifeature cube graph for the example query

R0 cube

R1 cube {=MAX(price}

R2 cube {=MIN(R1.shelf)} R3 cube {=MAX(R1.shelf)}

The multifeature graph illustrates the aggregate dependencies

R0,R1,R2,R3 are the grouping variables The grouping variables R2,R3 are

dependent on R1 In extended SQL

R2 in R1 R3 in R1

Query in extended SQL

SELECT item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf)

FROM Purchases

WHERE year=2000

CUBE BY item,region,month:R1,R2,R3

SUCH THAT R1.price=MAX(price) AND

R2 IN R1 and R2.shelf=MIN(R1.shelf) AND

R3 IN R1 and R3.shelf=MAX(R1.shelf);

Design of Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse Top-down view

• allows selection of the relevant information necessary for the data warehouse

Data source view• exposes the information being captured, stored, and managed by

operational systems

Data warehouse view• consists of fact tables and dimension tables

Business query view • sees the perspectives of data in the warehouse from the view of end-

user

Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid)

From software engineering point of view Waterfall: structured and systematic analysis at each step before

proceeding to the next Spiral: rapid generation of increasingly functional systems, short

turn around time, quick turn around

Lifecycle planning Translation from user requirements into

software requirements Transformation of the software requirements

into software design Implementation of the design into programming

code The sequence of this steps is defined by the

lifecyle model

A software lifecycle model must be defined for every project!

The lifecycle model you choose has as much influence over your project’s success as any other planning decision you make!

Pure Waterfall modelSoftware Concept

RequirementsAnalysis

ArchitecuralDesign

DetailedDesign

Coding andDebugging

System Testing

Pure Waterfal model Document driven model which means that the main

work products that are carried from phase to phase are documents

The disadvantage of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design works has been done and before any code has been written

Salmon modelSoftware Concept

RequirementsAnalysis

ArchitecuralDesign

DetailedDesign

Coding andDebugging

System Testing

Code-and-Fix I

Code-And-FixSystem Specification

(maybe)Release

Code-and-Fix IIAdvantages

No overhead, you don’t spend any time on planning, documentation, quality assurance, enforcement, or other activities than pure coding

Since you jump right into coding, you can show signs of progress immediately

It requires little expertise

Code-and-Fix III

Maintainability and reliability decrease with the complexity and the time

For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress, you just code until you’re done

Spiral I

Risk analysis

Prototypes

Simulation models

StartStart

Determine objectives,Alternatives, andconstraints

Review

Evaluate

DevelopmentPlan

Reqirements

codetestRealse

I II IIIIV

Spiral II

The basic idea behind the diagram is that you start on a small scale in the middle of the spine, explore the risks, make a plan to handle the risks, and then commit to an approach of the next iteration. Each iteration moves your project to a larger scale

Spiral III

The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed

Spiral IV Determine objectives, and constraints Identify and resolve risks Evaluate alternatives Develop the deliverables for that iteration, and verify

that they are correct Plan next iteration Commit to an approach for the next iteration

One of the most important advantages of the spiral model is that as costs increase, risk decrease. The more time and money you

spend, the less risk your’re taking

Data Warehouse: A Multi-Tiered ArchitectureData Warehouse: A Multi-Tiered Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

Othersources

Data Storage

OLAP Server

Three Data Warehouse Models

Enterprise warehouse collects all of the information about subjects spanning the entire

organization

Data Mart a subset of corporate-wide data that is of value to a specific groups

of users. Its scope is confined to specific, selected groups, such as marketing data mart

• Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized

Data Warehouse Usage

Three kinds of data warehouse applications Information processing

• supports querying, basic statistical analysis, and reporting using

crosstabs, tables, charts and graphs

Analytical processing

• multidimensional analysis of data warehouse data

• supports basic OLAP operations, slice-dice, drilling, pivoting

Data mining

• knowledge discovery from hidden patterns

• supports associations, constructing analytical models, performing

classification and prediction, and presenting the mining results using

visualization tools

Data Warehouse Back-End Tools and Utilities

Data extraction get data from multiple, heterogeneous, and external sources

Data cleaning detect errors in the data and rectify them when possible

Data transformation convert data from legacy or host format to warehouse format

Load sort, summarize, consolidate, compute views, check integrity, and build

indicies and partitions Refresh

propagate the updates from the data sources to the warehouse

DMQL Data Mining Query Language Relational database schema Translation into SQL query

Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models

Data Cleaning

(De)normalization(?) Missing Values ...

Documents

Design of DW. Remember item, city, year, and sales_in_Euro (item)(city) () (year) (city, item)(city, year)(item, year) (city, item, year)