Upload
nina-cable
View
221
Download
2
Embed Size (px)
Citation preview
Design of DW
Remember
item, city, year, and sales_in_Euro
(item)(city)
()
(year)
(city, item) (city, year) (item, year)
(city, item, year)
2 leveles of hierarchies for each dimension Item(part,color) i1,i2 City(downtown,suburb) c1,c2
Year(good_year,bad_year) y1,y2
For a 3-dimensional data cube, where Li is the number of all levels (L1,2,3=2), the total number of cuboids that can be generated is
€
(2 +1) = 33 = 27i=1
3
∏
{(), (c1),(c2),(i1),(i2),(y1),(y2),
(c1,i1),(c1,i2),(c2,i1),(c2,i2), (c1,y1),(c1,y2),(c2,y1),(c2,y2), (i1,y1),(i1,y2),(i2,y1),(i2,y2),
(c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}
DMQL Data Mining Query Language Relational database schema Translation into SQL query
Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models
DMQL DMQL: A Data Mining Query Language
for Relational Databases (Han et al, Simon Fraser University)
Data warehouses and data marts can be defined by cube definition and dimension definition
DMQL Create and manipulate data mining models
through a SQL-based interface (“Command-driven” data mining)
Abstract away the data mining particulars Data mining should be performed on data in the
database (should not need to export to a special-purpose environment)
Approaches differ on what kinds of models should be created, and what operations we should be able to perform
Cube Definition Syntax in DMQL
Cube Definition (Fact Table)define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition (Dimension Table)define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Example of Star Schema time_key
dayday_of_the_weekmonthquarteryear
time
location_keystreetcitystate_or_provincecountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
The star schema contains two measures dollars_sold and units_sold
How are the DMQL commands interpreted to generate a specified data cube?
Relational database schematime(time_key,day_of_week,month,quater,year)
item(item_key,item_name,brand,type,supplier_type)
branch(branch_key,branch_name,branch_type)
location(location_key,street,city,province_or_state,country)
sales(time_key,item_key,branch_key,location_key,number_of_units_sold,price)
The DMQL specification is translated into the following SQL query which generates the base cuboid
SELECT s.time_key,s.item_key,s.branch_key,s.location_key, SUM(s.number_of_units_sold*s.price), SUM(s.number_of_units_sold)FROM time t, item i, branch b, location l, sales s,WHERE s.time_key=t.time_key AND s.item_key=i.item_key AND s.branch_key=b.branch_key AND
s.location_key=l.location_keyGROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key);
The granularity (resolution) of each dimension is at the join key level
A join key is the key that links a fact table and the dimension table
The fact table associated with a base cuboid is sometimes referred as base fact table
By changing GROUP BY we may generate other cuboids
The apex cuboid representing the total sum of dollars_sold and total count of units_sold is generated by GROUP BY ();
Other cuboids may be generated by applying selection and projection operations on the base cuboid
To generate a data cube we may as well use GROUP BY CUBE
(s.time_key,s.item_key,s.branch_key,s.location_key);
Defining Snowflake Schemain DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining Fact Constellationin DMQL
define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location in cube sales,
shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales
Example (Exercício 7) Suponha um datawarehouse que contém as seguintes
quatro dimensões: Data, Espectador, Localizaçaõ e Jogo e o facto “preço” que consiste no valor, em euros
Valor pago por um espectador quando assiste a um determinado jogo numa data
Espectadores podem ser estudantes, adultos ou séniores, e que cada uma destas categorias tem o seu preço de bilhete
Dimensão data contenha o dia, mês e ano; que a dimensão Localização contenha o nome do estádio, e que a dimensão Jogo contenha o nome das duas equipas defrontadas
Diagrama em estrelapara este DW
jogo(jogoId, equipa1, equipa2) data (dataId, dia, mes, ano) espectador (espId, nome, categoria) localizacao(localId, estadio) factos(jogoId, dataId, espId, localId, preco)
Escreva em SQL a interrogaçã o que devolve o preço total pago por espectadores estudantes para assistir ao jogo que se realizou no Estádio da Luz no dia 1 de Março de 2005
The corresponding SQL querry
SELECT SUM(preco)
FROM Factos F, Data D, Localizacao L, Espectador E
WHERE F.dataId = D.dataId
AND F.espId = E.espId
AND F.localId = L.localId
AND D.dia = 1
AND D.mes = 3
AND D.ano = 2005
AND L.estadio = ‘Estadio da Luz’
AND E.categoria= ‘Estudante’;
Esquema relacional que modele a mesma informação e uma interrogaçã o SQL que devolva a mesma informação:
jogo(jogoId, equipa1, equipa2, localId, data)localizacao(localId, estadio)espectador(espId, nome, categoriaId)categoria (categoriaId, nomeC, preco)jogoEspectador(jogoId, espId)
The corresponding SQL querry
SELECT SUM (C.preco)
FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J, Localizacao L
WHERE C.nomeC = ‘Estudante’
AND J.data = 1/3/2005
AND L.estadio = ‘Estadio da Luz’
AND C.categoriaId = E.categoriaId
AND E.espId = JE.espId
AND JE.jogoId = J.jogoId
AND J.localId = L.localId;
Difference between both approaches
Com o modelo em estrela, existe três joins, da tabela Factos com cada uma das 3 dimensões relevantes
Com o esquema relacional, existem 4 joins
In the star schema has less joins, corresponding to the relevant dimensions
In multidimensional model the base cuboid is already precomputed
MDX Multidimensional Expressions (MDX) as a
Language
MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications
Provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products)
We are asked, moreover, to provide the information in a two-dimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns
SELECT{[Time].[1997],[Time].[1998]} ON COLUMNS,{[Measures].[Warehouse Sales],[Measures].[Warehouse
Cost]} ON ROWSFROM WarehouseWHERE ([Store].[All Stores].[USA])
The cube that is targeted by the query (the query scope) appears in the FROM clause of the query
The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query
The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE.
Important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar
A query has one or more dimensions. The query above has two. (The first three dimensions (=axes) that are found in MDX queries are known as rows, columns and pages)
SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON
ROWSFROM WarehouseWHERE ([Measures].[Warehouse Sales])
Curled brackets "{}" are used in MDX to represent a set of members of a dimension or group of dimensions
Complex Aggregation at Multiple Granularities Multifeature cubes compute complex queries
involving multiple aggregates at multiple granularities (resolution)
Example item is purchased in a sales region on a business
day (year,month,day) The shelf life in months of a given item is stored in
shelf The item price and sales is stored in price and sales
Find the total sales in 2000, broken down by item, region, and month with subtotal for each dimension
A data cube is constructed {(item,region,month),(item,region),(item,month),
(month,region),(item),(month),(region,()}
Simple data cube, since it does not involve any dependent aggregates
What are dependent aggregates?
Example
Grouping by all subsets (cuboids) {item,region,month} (=data cube)
Find maximum price for each group (cuboid) in 2000
Among the maximum price tuples find the minimum and maximum shelf lives
Multifeature cube graph for the example query
R0 cube
R1 cube {=MAX(price}
R2 cube {=MIN(R1.shelf)} R3 cube {=MAX(R1.shelf)}
The multifeature graph illustrates the aggregate dependencies
R0,R1,R2,R3 are the grouping variables The grouping variables R2,R3 are
dependent on R1 In extended SQL
R2 in R1 R3 in R1
Query in extended SQL
SELECT item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf)
FROM Purchases
WHERE year=2000
CUBE BY item,region,month:R1,R2,R3
SUCH THAT R1.price=MAX(price) AND
R2 IN R1 and R2.shelf=MIN(R1.shelf) AND
R3 IN R1 and R3.shelf=MAX(R1.shelf);
Design of Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse Top-down view
• allows selection of the relevant information necessary for the data warehouse
Data source view• exposes the information being captured, stored, and managed by
operational systems
Data warehouse view• consists of fact tables and dimension tables
Business query view • sees the perspectives of data in the warehouse from the view of end-
user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view Waterfall: structured and systematic analysis at each step before
proceeding to the next Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
Lifecycle planning Translation from user requirements into
software requirements Transformation of the software requirements
into software design Implementation of the design into programming
code The sequence of this steps is defined by the
lifecyle model
A software lifecycle model must be defined for every project!
The lifecycle model you choose has as much influence over your project’s success as any other planning decision you make!
Pure Waterfall modelSoftware Concept
RequirementsAnalysis
ArchitecuralDesign
DetailedDesign
Coding andDebugging
System Testing
Pure Waterfal model Document driven model which means that the main
work products that are carried from phase to phase are documents
The disadvantage of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design works has been done and before any code has been written
Salmon modelSoftware Concept
RequirementsAnalysis
ArchitecuralDesign
DetailedDesign
Coding andDebugging
System Testing
Code-and-Fix I
Code-And-FixSystem Specification
(maybe)Release
Code-and-Fix IIAdvantages
No overhead, you don’t spend any time on planning, documentation, quality assurance, enforcement, or other activities than pure coding
Since you jump right into coding, you can show signs of progress immediately
It requires little expertise
Code-and-Fix III
Maintainability and reliability decrease with the complexity and the time
For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress, you just code until you’re done
Spiral I
Risk analysis
Prototypes
Simulation models
StartStart
Determine objectives,Alternatives, andconstraints
Review
Evaluate
DevelopmentPlan
Reqirements
codetestRealse
I II IIIIV
Spiral II
The basic idea behind the diagram is that you start on a small scale in the middle of the spine, explore the risks, make a plan to handle the risks, and then commit to an approach of the next iteration. Each iteration moves your project to a larger scale
Spiral III
The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed
Spiral IV Determine objectives, and constraints Identify and resolve risks Evaluate alternatives Develop the deliverables for that iteration, and verify
that they are correct Plan next iteration Commit to an approach for the next iteration
One of the most important advantages of the spiral model is that as costs increase, risk decrease. The more time and money you
spend, the less risk your’re taking
Data Warehouse: A Multi-Tiered ArchitectureData Warehouse: A Multi-Tiered Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
AnalysisQueryReportsData mining
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational DBs
Othersources
Data Storage
OLAP Server
Three Data Warehouse Models
Enterprise warehouse collects all of the information about subjects spanning the entire
organization
Data Mart a subset of corporate-wide data that is of value to a specific groups
of users. Its scope is confined to specific, selected groups, such as marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized
Data Warehouse Usage
Three kinds of data warehouse applications Information processing
• supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools
Data Warehouse Back-End Tools and Utilities
Data extraction get data from multiple, heterogeneous, and external sources
Data cleaning detect errors in the data and rectify them when possible
Data transformation convert data from legacy or host format to warehouse format
Load sort, summarize, consolidate, compute views, check integrity, and build
indicies and partitions Refresh
propagate the updates from the data sources to the warehouse
DMQL Data Mining Query Language Relational database schema Translation into SQL query
Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models
Data Cleaning
(De)normalization(?) Missing Values ...