
11/16/15  


Introduction to Data Warehouses

Helena Galhardas DEI/IST

References

•  A. Vaisman and E. Zimányi, Data Warehouse Systems: Design and Implementation, Springer, 2014 (chpts. 1 and 3)

•  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001 (chpt. 2)

•  A. Doan, A. Halevy, Z. Ives, Data Integration Principles, Morgan Kaufmann, 2012 (chpt. 10)

•  C. Ciferri, R. Ciferri, L.I. Gómez, M. Schneider, A.A. Vaisman, E. Zimányi, Cube algebra: a generic user-centric model and query language for OLAP cubes. Int. J. Data Warehousing Mining 9(2), 39–65, 2013

•  A. Wichert, H. Galhardas, SAD slides, MEIC/IST


Outline

•  Introduction
   –  Motivation for data warehousing
   –  Definition of data warehouse
   –  New domains and challenges
•  The multidimensional model
•  Typical data warehouse architecture
•  OLAP operations

Introduction

•  Organizations face increasingly complex challenges in achieving their operational goals, so they need analysis tools for decision support
•  Business intelligence (BI): methodologies, processes, architectures, and technologies that transform raw data into useful information for decision making
   –  Collect and summarize vast amounts of data
•  Extraction, transformation, integration, and cleansing processes take data from the sources and store them in a common repository, the data warehouse
   –  Data warehouse (DW): an integral part of decision-support systems; provides an infrastructure that enables users to get efficient, accurate responses to complex queries


Exploiting data in a DW

•  A wide variety of systems and tools exploit the data in a warehouse
•  Online Analytical Processing (OLAP)
   –  Allows users to interactively query and aggregate data in a warehouse
   –  Decision makers can analyze information at various levels of detail
•  Data mining extracts interesting knowledge hidden in data warehouses
•  Typical techniques that exploit a data warehouse:
   –  Reporting: dashboards, alerts
   –  Performance management: metrics, key performance indicators (KPIs), dashboards
   –  Analytics: OLAP, data mining, time series analysis, text mining, web analytics, data visualization


Motivation

•  Traditional operational or transactional databases do not satisfy the requirements for data analysis
   –  Designed and optimized to support daily business operations; primary concern: concurrent access and recovery techniques to guarantee data consistency
   –  Contain detailed data, do not include historical data, and perform poorly for complex queries that involve many tables or aggregate large volumes of data
•  To analyze the behavior of an organization, data from several operational systems must be integrated
   –  Difficult to accomplish due to many differences in data definition and content
•  Data warehouse: proposed as a solution to the growing demands of decision-making users


Definition of Data Warehouse (DW)

•  Collection of subject-oriented, integrated, nonvolatile, and time-varying data to support management decisions (Inmon's definition)


Data Warehouse: Subject-Oriented

•  Organized around major subjects, such as customer, product, sales

•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

•  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Data Warehouse: Integrated

•  Constructed by integrating multiple, heterogeneous data sources
   –  Relational databases, flat files, online transaction records
•  Data cleaning and data integration techniques are applied
   –  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      •  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
   –  When data is moved to the warehouse, it is converted

Data Warehouse: Nonvolatile

•  A physically separate store of data transformed from the operational environment
•  Operational update of data does not occur in the data warehouse environment
   –  Does not require transaction processing, recovery, and concurrency control mechanisms
   –  Requires only three operations in data accessing: initial loading of data, access of data, and periodic data refreshment


Data Warehouse: Time Variant

•  The time horizon for the data warehouse is significantly longer than that of operational systems –  Operational database: current value data

–  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

•  Every key structure in the data warehouse –  Contains an element of time, explicitly or implicitly

–  But the key of operational data may or may not contain “time element”

Normalized vs non-normalized data

•  Relational databases: highly normalized to guarantee consistency under frequent updates
   –  Usually achieved at a higher cost of querying (normalization partitions data into multiple tables)
   –  This is not appropriate for data warehouses
•  Data warehouses must deliver good performance for the complex queries needed for analysis tasks
   –  A lower degree of normalization is required => multidimensional modeling


Multidimensional modeling

•  Views data as consisting of facts linked to dimensions
•  Facts represent the focus of analysis (e.g., analysis of sales in stores)
   –  Measures quantify facts; usually numeric values, e.g., amount or number of sales
•  Dimensions are used to analyze measures from several perspectives
   –  E.g., Time dimension to analyze changes in sales over various periods of time
   –  E.g., Location dimension to analyze sales according to the geographic distribution of stores
•  Dimensions include attributes that form hierarchies, which enable decision-making users to explore measures at various levels of detail, e.g.:
   –  month → quarter → year in the time dimension
   –  city → state → country in the location dimension
•  Aggregation of measures occurs when a hierarchy is traversed, e.g., moving from month to year yields aggregated values of sales for the various years
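
The traversal of a hierarchy described above can be sketched in a few lines of Python. This is only an illustration: the monthly sales figures and the helper names (quarter_of, roll_up) are made up for this example and are not part of the course material.

```python
from collections import defaultdict

# Hypothetical monthly sales amounts, keyed by (year, month).
monthly_sales = {
    (2014, 1): 100, (2014, 2): 120, (2014, 4): 90,
    (2015, 1): 80, (2015, 7): 110,
}

def quarter_of(month):
    """Map a month number (1-12) to its quarter number (1-4)."""
    return (month - 1) // 3 + 1

def roll_up(values, parent_of):
    """Sum the values of all members that share the same parent member."""
    totals = defaultdict(int)
    for member, amount in values.items():
        totals[parent_of(member)] += amount
    return dict(totals)

# month -> quarter: the key becomes (year, quarter).
by_quarter = roll_up(monthly_sales, lambda ym: (ym[0], quarter_of(ym[1])))
# quarter -> year: the key becomes the year alone.
by_year = roll_up(by_quarter, lambda yq: yq[0])

print(by_quarter)
print(by_year)
```

Each step up the hierarchy reuses the totals of the level below, which works here because the sales amount is aggregated with addition.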


Star and snowflake schemas

•  At the logical level, the multidimensional model is usually represented by relational tables organized in:
   –  Star schemas: use a single table for each dimension, even in the presence of hierarchies (yields denormalized dimension tables)
   –  Snowflake schemas: use normalized tables for dimensions and their hierarchies
•  Over this relational representation of a data warehouse, an OLAP server builds a data cube, which provides a multidimensional view of the data


Example of Star Schema

Sales fact table
   –  Dimension keys: time_key, item_key, branch_key, location_key
   –  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
   –  time: time_key, day, day_of_the_week, month, quarter, year
   –  item: item_key, item_name, brand, type, supplier_type
   –  branch: branch_key, branch_name, branch_type
   –  location: location_key, street, city, state_or_province, country

Example of Snowflake Schema

Sales fact table
   –  Dimension keys: time_key, item_key, branch_key, location_key
   –  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
   –  time: time_key, day, day_of_the_week, month, quarter, year
   –  item: item_key, item_name, brand, type, supplier_key
   –  supplier: supplier_key, supplier_type
   –  branch: branch_key, branch_name, branch_type
   –  location: location_key, street, city_key
   –  city: city_key, city, state_or_province, country
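
To make the star layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. It keeps only a few of the columns from the example schema, and every row is invented for illustration; it is not meant as a reference implementation of a warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two denormalized dimension tables (subset of the example's columns).
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
-- Made-up sample rows.
INSERT INTO time     VALUES (1, 1, 'Q1', 2015), (2, 4, 'Q2', 2015);
INSERT INTO location VALUES (1, 'Lisboa', 'Portugal'), (2, 'Porto', 'Portugal');
INSERT INTO sales    VALUES (1, 1, 10, 100.0), (1, 2, 5, 50.0), (2, 1, 20, 200.0);
""")

# A typical star join: the fact table joined to each dimension it needs,
# with the measures aggregated at the (quarter, country) granularity.
rows = conn.execute("""
    SELECT t.quarter, l.country,
           SUM(s.units_sold)   AS units,
           SUM(s.dollars_sold) AS dollars
    FROM sales s
    JOIN time t     ON t.time_key = s.time_key
    JOIN location l ON l.location_key = s.location_key
    GROUP BY t.quarter, l.country
    ORDER BY t.quarter
""").fetchall()

for row in rows:
    print(row)
```

In the snowflake variant, reaching country would need one more join (location to city), which is the querying cost that normalized dimensions introduce.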


•  Query language
   –  Once a data warehouse has been implemented, analytical queries can be submitted
   –  MDX (MultiDimensional eXpressions): de facto standard language for querying a multidimensional database
•  Physical level: concerned with implementation issues
•  Three techniques are normally used for improving system performance:
   –  Materialized views
   –  Indexing
   –  Data partitioning


ETL Process

•  Extracts data from several source systems, transforms data to fit the data warehouse model, and loads transformed data into the data warehouse
•  Crucial for the success of a data warehousing project
   –  About 80% of the total cost
   –  Still no consensus on a methodology for ETL design, and most problems are solved ad hoc
   –  Several proposals regarding ETL conceptual design


Data exploitation

•  Data analytics is the process of exploiting the contents of a data warehouse in order to provide essential information to the decision-making process
•  Three main tools:
   –  Data mining: a series of techniques that analyze the data in a warehouse in order to discover hidden useful knowledge
   –  Key performance indicators (KPIs): measurable organizational objectives used for monitoring how an organization is performing
   –  Dashboards: interactive reports that present the data in a warehouse, including the KPIs, in a visual way, providing an overview of the performance of an organization for decision-support purposes


New domains and challenges

•  The enormous amount of data now available (big data) calls for a shift in data warehouse and BI practices
•  BI practices are gaining acceptance in many emerging domains, such as social networks or geospatial data analytics
•  New database architectures are gaining momentum:
   –  Parallelism: a must for large data warehouses
   –  Column-store databases: MonetDB and Vertica
   –  In-memory databases: SAP HANA
   –  The MapReduce programming model is increasingly popular and challenges traditional parallel DBMSs (e.g., the Facebook data warehouse was built using Hadoop, an open source implementation of MapReduce)


New domains and challenges (cont.)

•  The web is changing the way in which data warehouses are being designed, used, and exploited
   –  For some data analysis tasks (like the worldwide price evolution of some product), data in a conventional data warehouse may not suffice
   –  External data sources (e.g., the web) can provide useful multidimensional information, although usually too volatile to be permanently stored
•  The semantic web aims at representing web content in a machine-processable way
   –  The basic layer of the data representation for the semantic web is the Resource Description Framework (RDF)
   –  Domain ontologies (in RDF or OWL) define a common terminology for the concepts involved in a particular domain
   –  Semantic annotations are especially useful for describing unstructured, semi-structured, and textual data
•  Large repositories of semantically annotated data are currently available, opening new opportunities for enhancing current decision-support systems (Semantic Web data warehouses)


Outline

•  Introduction
   –  Motivation for data warehousing
   –  Definition of data warehouse
   –  New domains and challenges
→  The multidimensional model
•  Typical data warehouse architecture
•  OLAP operations


What is OLAP?

•  The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993
   –  Interactive process of creating, managing, analyzing, and reporting on data
   –  Analyzing large quantities of data in real time

OLAP vs OLTP

•  Traditional database systems are designed and tuned to support day-to-day operations:
   –  Ensure fast, concurrent access to data, transaction processing, and concurrency control
   –  Focus on online updates and data consistency
   –  Known as operational databases or online transaction processing (OLTP) systems
•  OLTP DB data characteristics:
   –  Detailed data
   –  Do not include historical data
   –  Highly normalized
   –  Poor performance on complex queries that include joins and aggregation
•  Data analysis requires a new paradigm: online analytical processing (OLAP)
   –  Typical OLTP query: pending orders for customer c
   –  Typical OLAP query: total sales amount by product and by customer


OLAP characteristics

•  The OLTP paradigm is focused on transactions, whereas OLAP is focused on analytical queries
   –  Normalization is not good for analytical queries: reconstructing the data requires a high number of joins
•  OLAP databases support a heavy query load
•  OLTP indexing techniques are not efficient in OLAP: they are oriented to accessing few records
   –  OLAP queries typically include aggregation
•  The need for a different database model to support OLAP was clear; this led to
   –  Data warehouse: a (usually) large repository that consolidates data from different sources, is updated offline, follows the multidimensional data model, and is designed and optimized to efficiently support OLAP queries


OLAP

•  Data is perceived and manipulated as if it were stored in a multidimensional array

•  But ideas are explained in terms of conventional relational tables


Data Grouping and Aggregation

•  Data can be grouped and aggregated in many different ways
•  The number of possible groupings quickly becomes large
   –  The user has to consider all groupings
   –  Analytical processing problem

Example: OLAP-style Queries for Supplier-and-Parts Database

1)  Get the total shipment quantity
2)  Get total shipment quantities by supplier
3)  Get total shipment quantities by part
4)  Get total shipment quantities by supplier and part


Supplier-Parts

•  SP

S#    P#    QTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200

Get the total shipment quantity

1. SELECT SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY ()

TOTQTY
1600


Get total shipment quantities by supplier

2. SELECT S#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#

S#    TOTQTY
S1    500
S2    700
S3    200
S4    200

Get total shipment quantities by part

3. SELECT P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY P#

P#    TOTQTY
P1    600
P2    1000


Get total shipment quantities by supplier and part

4. SELECT S#, P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#, P#

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
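
The four queries can be tried out with Python's built-in sqlite3 module, as in the sketch below. Two details are adaptations for SQLite rather than part of the original queries: the column names S# and P# are written as quoted identifiers, and the empty grouping GROUP BY () of query 1 is simply left out, since SQLite does not accept it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Quoted identifiers let us keep the column names S# and P#.
conn.execute('CREATE TABLE SP ("S#" TEXT, "P#" TEXT, QTY INTEGER)')
conn.executemany(
    "INSERT INTO SP VALUES (?, ?, ?)",
    [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
     ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)],
)

def run(sql):
    """Execute one query and return all result rows."""
    return conn.execute(sql).fetchall()

# 1) total quantity (no GROUP BY clause: one group containing every row)
total = run("SELECT SUM(QTY) AS TOTQTY FROM SP")
# 2) by supplier
by_supplier = run('SELECT "S#", SUM(QTY) FROM SP GROUP BY "S#" ORDER BY "S#"')
# 3) by part
by_part = run('SELECT "P#", SUM(QTY) FROM SP GROUP BY "P#" ORDER BY "P#"')
# 4) by supplier and part
by_both = run('SELECT "S#", "P#", SUM(QTY) FROM SP '
              'GROUP BY "S#", "P#" ORDER BY "S#", "P#"')

print(total, by_supplier, by_part, by_both, sep="\n")
```

Four separate statements mean four passes over SP, even though every result could be derived from the most detailed one (query 4).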

Drawbacks

•  Formulating so many similar but distinct queries is tedious
•  Executing the queries is expensive
•  Make life easier
   –  More efficient computation
•  Single query
   –  GROUPING SETS, ROLLUP, CUBE options
   –  Added to the SQL standard in 1999


GROUPING SETS

•  Execute several queries simultaneously

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );

•  Single results table
•  Not a relation!
•  null: missing information

S#    P#    TOTQTY
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000

SELECT CASE GROUPING(S#) WHEN 1 THEN '??' ELSE S# END AS S#,
       CASE GROUPING(P#) WHEN 1 THEN '!!' ELSE P# END AS P#,
       SUM(QTY) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( (S#), (P#) );

S#    P#    TOTQTY
S1    !!    500
S2    !!    700
S3    !!    200
S4    !!    200
??    P1    600
??    P2    1000
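
On a system without GROUPING SETS, the same result table can be obtained by gluing the individual GROUP BY queries together with UNION ALL. The sketch below does this in SQLite (which, as far as I know, lacks GROUPING SETS) through Python's sqlite3 module; it illustrates what the construct computes, not how a DBMS evaluates it internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE SP ("S#" TEXT, "P#" TEXT, QTY INTEGER)')
conn.executemany(
    "INSERT INTO SP VALUES (?, ?, ?)",
    [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
     ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)],
)

# One SELECT per grouping set; the column that is not grouped on is
# padded with NULL so that all branches have the same three columns.
rows = conn.execute("""
    SELECT "S#", NULL AS "P#", SUM(QTY) AS TOTQTY FROM SP GROUP BY "S#"
    UNION ALL
    SELECT NULL, "P#", SUM(QTY) FROM SP GROUP BY "P#"
""").fetchall()

for row in rows:
    print(row)
```

Replacing the NULL literals with '!!' and '??' gives the effect of the GROUPING() variant, because each branch knows which column it did not group on.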


ROLLUP operation

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY ROLLUP (S#, P#);

Equivalent to:

GROUP BY GROUPING SETS ( (S#, P#), (S#), () )

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  null  1600

ROLLUP definition

•  The quantities have been rolled up for each supplier
•  Rolled up along the supplier dimension

GROUP BY ROLLUP (A, B, ..., Z) groups by:
   (A, B, ..., Z)
   (A, B, ...)
   (A, B)
   (A)
   ()

GROUP BY ROLLUP (A, B) is not symmetric in A and B!
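
The expansion of ROLLUP into grouping sets can be sketched in plain Python. The function names rollup_grouping_sets and aggregate are made up for this example, and the code assumes that the data itself contains no nulls (None is used only as the "rolled up" marker).

```python
def rollup_grouping_sets(columns):
    """Grouping sets of ROLLUP(columns): every prefix, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

def aggregate(rows, columns, grouping_sets, measure="QTY"):
    """Sum the measure once per grouping set; columns outside the set become None."""
    result = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            result[key] = result.get(key, 0) + row[measure]
    return result

SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300}, {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S2", "P#": "P1", "QTY": 300}, {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200}, {"S#": "S4", "P#": "P2", "QTY": 200},
]

sets_sp = rollup_grouping_sets(["S#", "P#"])   # (S#, P#), (S#), ()
sets_ps = rollup_grouping_sets(["P#", "S#"])   # (P#, S#), (P#), ()
result = aggregate(SP, ["S#", "P#"], sets_sp)

print(sets_sp)
print(sets_ps)
print(result[("S1", None)], result[(None, None)])
```

The two lists of grouping sets differ, which is the asymmetry noted above: ROLLUP (S#, P#) yields per-supplier subtotals but no per-part subtotals, and ROLLUP (P#, S#) does the opposite.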


CUBE operation

SELECT S#, P#, SUM(QTY) AS TOTQTY
FROM SP
GROUP BY CUBE (S#, P#);

Equivalent to:

GROUP BY GROUPING SETS ( (S#, P#), (S#), (P#), () )

S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    null  500
S2    null  700
S3    null  200
S4    null  200
null  P1    600
null  P2    1000
null  null  1600

CUBE

•  Confusing term CUBE (?)
   –  Derived from the fact that in multidimensional terminology, data values are stored in cells of a multidimensional array or a hypercube
•  The actual physical storage may differ
   –  In our example
      •  The cube has just two dimensions (supplier, part)
      •  The two dimensions are unequal (the array is a rectangle, not a square)
•  Means group by all possible subsets of the set {A, B, ..., Z}


CUBE

•  Means group by all possible subsets of the set {A, B, ..., Z}
   –  M = {A, B, ..., Z}, |M| = n
   –  Power set (algebra)
   –  P(M) := {N | N ⊆ M}, |P(M)| = 2^n (proof by induction)
•  Each subset represents a different degree of summarization
•  In data mining, such a subset is called a cuboid
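
The power-set view of CUBE can be checked with a short Python sketch. The helper names cube_grouping_sets and grouping_sets_clause are hypothetical, and the generated clause is plain text meant only to show the correspondence with GROUPING SETS.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the grouping columns (the power set), largest first."""
    columns = list(columns)
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

def grouping_sets_clause(grouping_sets):
    """Render a list of grouping sets as a GROUP BY GROUPING SETS clause."""
    inner = ", ".join("(" + ", ".join(gs) + ")" for gs in grouping_sets)
    return "GROUP BY GROUPING SETS (" + inner + ")"

cuboids = cube_grouping_sets(["S#", "P#"])
clause = grouping_sets_clause(cuboids)

print(cuboids)
print(clause)
# The number of cuboids doubles with every extra dimension: 2^n.
for n in range(1, 6):
    print(n, len(cube_grouping_sets("ABCDE"[:n])))
```

With n = 2 this gives the four grouping sets of the supplier-parts example; with ten dimensions there would already be 1024 cuboids, which is why materializing a full cube gets expensive quickly.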

Multidimensional model

•  Views data in an n-dimensional space: the data cube
   –  Composed of dimensions and facts
•  Dimensions: perspectives used to analyze the data
   –  Example: a 3-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity
•  Attributes describe dimensions
   –  The Product dimension may have attributes ProductNumber and UnitPrice (not shown)
•  Cells or facts have associated numeric values called measures
   –  Each cell of the data cube represents the Quantity of units sold by category, quarter, and customer's city

[Figure: three-dimensional data cube with dimensions Product (Category: Beverages, Condiments, Produce, Seafood), Time (Quarter: Q1 to Q4), and Customer (City: Paris, Lyon, Berlin, Köln); the cells hold the measure values]


Characteristics of a data cube

•  Data granularity: level of detail at which measures are represented for each dimension of the cube
   –  Example: sales figures aggregated to the granularities Category, Quarter, and City
•  Instances of a dimension are called members
   –  Example: Seafood and Beverages are members of the Product dimension at the granularity Category
•  A data cube may contain several measures
   –  E.g., amount, indicating the total sales amount (not shown)
•  A data cube may be sparse (typical case) or dense
   –  Example: not all customers may have ordered products of all categories during all quarters


Hierarchies

•  Allow viewing data at several granularities
   –  Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
   –  The lower level is called the child and the higher level is called the parent
   –  The hierarchical structure of a dimension is called the dimension schema
   –  A dimension instance comprises all members at all levels in a dimension
•  Example
   –  Hierarchies of the Product, Time, and Customer dimensions

[Figure: dimension hierarchies, from the most detailed level to the most general. Product: Product → Category → All. Time: Day → Month → Quarter → Semester → Year → All. Customer: Customer → City → State → Country → Continent → All]


Members of hierarchy

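•  Members of the hierarchy Product → Category

[Figure: members of the hierarchy Product → Category. Level All: all. Level Category: Beverages, ..., Seafood. Level Product: Chai, Chang, ... (under Beverages); Ikura, Konbu, ... (under Seafood)]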

Classification of measures

•  Each measure is associated with an aggregation function that combines several measure values into a single one
   –  Aggregation of measures takes place when we change the level of detail at which data in a cube is visualized
•  Measures can be classified according to the way they can be aggregated:
   –  Additive: can be meaningfully summarized along all the dimensions, using addition (most common type)
   –  Semiadditive: can be meaningfully summarized using addition along some dimensions only (example: inventory quantities, which cannot be added along the Time dimension)
   –  Nonadditive: cannot be meaningfully summarized using addition across any dimension (examples: item price, cost per unit, and exchange rate)
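
A small Python sketch of the semiadditive case, using invented month-end stock levels. The warehouse names, the figures, and the helper aggregate are assumptions for illustration; averaging along Time is one common choice of aggregation function, not the only possible one.

```python
# Hypothetical month-end stock levels: (month, warehouse) -> units in stock.
inventory = {
    ("2015-01", "W1"): 100, ("2015-01", "W2"): 50,
    ("2015-02", "W1"): 80,  ("2015-02", "W2"): 70,
}

def aggregate(cells, axis, combine):
    """Collapse the dimension at position `axis`, combining the values per remaining key."""
    groups = {}
    for key, value in cells.items():
        kept = key[:axis] + key[axis + 1:]
        groups.setdefault(kept, []).append(value)
    return {kept: combine(values) for kept, values in groups.items()}

def mean(values):
    return sum(values) / len(values)

# Adding across warehouses is meaningful: total stock held in each month.
stock_per_month = aggregate(inventory, axis=1, combine=sum)
# Adding across months is not (the same units would be counted twice),
# so along Time we average instead.
avg_per_warehouse = aggregate(inventory, axis=0, combine=mean)
naive_sum_over_time = aggregate(inventory, axis=0, combine=sum)

print(stock_per_month)
print(avg_per_warehouse)
print(naive_sum_over_time)
```

The naive sum reports 180 units for W1, although W1 never held more than 100 units at any point in time.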


Another Classification of Measures

•  Another classification of measures:
   –  Distributive: defined by an aggregation function that can be computed in a distributed way; count, sum, minimum, and maximum are distributive, distinct count is not (example: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned into the subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8} gives a distinct count of 2 + 3 + 3 = 8, while the answer over the original set is 5)
   –  Algebraic: defined by an aggregation function that can be expressed as a scalar function of distributive ones; example: average, computed by dividing the sum by the count
   –  Holistic: cannot be computed from other subaggregates (e.g., median, rank)
•  Most large data cube applications require efficient computation of distributive and algebraic measures
   –  It is difficult to efficiently compute holistic measures
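
The three classes can be checked on the set S from the example above. The sketch uses only the Python standard library; the variable names are made up for illustration.

```python
from statistics import median

S = [3, 3, 4, 5, 8, 4, 7, 3, 8]
parts = [S[0:3], S[3:6], S[6:9]]        # {3, 3, 4}, {5, 8, 4}, {7, 3, 8}

# Distributive: combining the partial results gives the global result.
total_from_parts = sum(sum(p) for p in parts)
max_from_parts = max(max(p) for p in parts)

# Distinct count is not distributive: partial counts cannot simply be added.
distinct_from_parts = sum(len(set(p)) for p in parts)
distinct_overall = len(set(S))

# Algebraic: average = (sum of partial sums) / (sum of partial counts).
avg_from_parts = total_from_parts / sum(len(p) for p in parts)

# Holistic: the median of the partial medians is not the overall median.
median_of_medians = median(median(p) for p in parts)
median_overall = median(S)

print(total_from_parts, sum(S))
print(distinct_from_parts, distinct_overall)
print(avg_from_parts, median_of_medians, median_overall)
```

For distributive and algebraic measures a cube can therefore be built bottom-up from precomputed subaggregates, whereas a holistic measure such as the median needs access to the detailed values again.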


More about measures

•  When defining a measure we must determine the associated aggregation functions
   –  For example, a semiadditive measure representing inventory quantities can be aggregated using average along the Time dimension, and using addition along other dimensions
•  Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
•  Summarizability conditions:
   –  Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets
   –  Completeness: all instances are included in the hierarchy and each instance is related to one parent in the next level
   –  Correctness: refers to the correct use of the aggregation functions
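
The first two summarizability conditions can be tested mechanically on a child-to-parent mapping. The sketch below is one possible check, with a hypothetical function name; the product Tofu and the second category assigned to Ikura are deliberately faulty, made-up entries.

```python
def check_summarizability(children, links):
    """Return (children with several parents, children with no parent).

    children: the members of the lower level
    links:    (child, parent) pairs relating the two levels
    """
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)
    not_disjoint = sorted(c for c in children if len(parents.get(c, ())) > 1)
    incomplete = sorted(c for c in children if not parents.get(c))
    return not_disjoint, incomplete

products = ["Chai", "Chang", "Ikura", "Konbu", "Tofu"]
links = [
    ("Chai", "Beverages"), ("Chang", "Beverages"),
    ("Ikura", "Seafood"), ("Konbu", "Seafood"),
    ("Ikura", "Produce"),      # faulty on purpose: Ikura has two parents
]                              # Tofu has no parent at all

not_disjoint, incomplete = check_summarizability(products, links)
print(not_disjoint, incomplete)
```

If the quantities were rolled up to Category with this mapping, the sales of Ikura would be counted twice and the sales of Tofu would be lost, so the category totals would not add up to the grand total.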


Outline

•  Introduction
   –  Motivation for data warehousing
   –  Definition of data warehouse
   –  New domains and challenges
•  The multidimensional model
→  Typical data warehouse architecture
•  OLAP operations

Data Warehouse Architecture

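[Figure: data warehouse architecture in four tiers. Data sources (operational databases, internal sources, external sources) feed the back-end tier (ETL process, data staging); the data warehouse tier holds the enterprise data warehouse, the data marts, and the metadata; the OLAP tier contains the OLAP server; the front-end tier contains OLAP tools, reporting tools, statistical tools, and data mining tools]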


Components of the DW architecture

•  Back-end tier:
   –  Extraction, Transformation, and Loading (ETL) process: feeds data into the data warehouse from operational databases and other data sources
   –  Data Staging Area (DSA): intermediate database where all the data integration and transformation processes are run prior to the loading of the data into the data warehouse
•  Data warehouse tier:
   –  Enterprise data warehouse and/or several data marts
   –  Metadata repository storing information about the data warehouse and its contents
•  OLAP tier:
   –  OLAP server, which provides a multidimensional view of the data, regardless of the actual way in which data are stored
•  Front-end tier: used for data analysis and visualization
   –  Contains client tools such as OLAP tools, reporting tools, statistical tools, and data mining tools


Back-End Tier

•  It is a 3-step process: Extraction, Transformation, and Loading
   1.  Extraction gathers data from multiple, heterogeneous data sources internal or external to the organization
   2.  Transformation modifies the data from the format of the data sources to the warehouse format; this includes:
       –  Cleaning: removes errors and inconsistencies in the data and converts it into a standardized format
       –  Integration: reconciles data from different data sources, both at the schema and at the data level
       –  Aggregation: summarizes the data obtained from data sources according to the granularity of the data warehouse
   3.  Loading feeds the data warehouse with the transformed data, including refreshing the data warehouse, that is, propagating updates from the data sources to the data warehouse at a specified frequency
•  Data staging area (also called operational data store): a database where data extracted from the sources undergoes successive modifications before being loaded into the data warehouse
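
The three steps can be sketched as three small Python functions. Everything in this example is hypothetical: the two sources, their field names, the hotel names, and the fixed exchange rate are invented, and a plain dictionary stands in for the warehouse.

```python
# Hypothetical sources with different field names and currencies.
source_a = [{"hotel": "Lisboa Inn ", "price_eur": 100.0},
            {"hotel": "lisboa inn", "price_eur": 120.0}]
source_b = [{"name": "Porto House", "price_usd": 110.0}]
USD_TO_EUR = 0.9  # assumed fixed rate, for illustration only

def extract():
    """Extraction: gather the raw records, remembering their origin."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]

def transform(records):
    """Transformation: integrate, clean, then aggregate per hotel."""
    unified = []
    for origin, record in records:
        # Integration: reconcile the two schemas and the two currencies.
        if origin == "a":
            name, price = record["hotel"], record["price_eur"]
        else:
            name, price = record["name"], record["price_usd"] * USD_TO_EUR
        # Cleaning: standardize the spelling of the hotel name.
        unified.append((name.strip().title(), price))
    # Aggregation: average price at the granularity "hotel".
    sums = {}
    for name, price in unified:
        total, count = sums.get(name, (0.0, 0))
        sums[name] = (total + price, count + 1)
    return {name: total / count for name, (total, count) in sums.items()}

def load(warehouse, facts):
    """Loading: feed (or refresh) the warehouse with the transformed data."""
    warehouse.update(facts)

warehouse = {}
load(warehouse, transform(extract()))
print(warehouse)
```

In a real project the intermediate list built inside transform would live in the data staging area, and load would be re-run at the refresh frequency chosen for the warehouse.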


DW Tier

•  Enterprise data warehouse, centralized and encompassing an entire organization
•  Several data marts: specialized departmental data warehouses
•  Metadata
   –  Business metadata describes the semantics of the data, organizational rules, policies, and constraints related to the data
   –  Technical metadata describes how data are structured and stored in a computer system, and the applications and processes that manipulate the data
•  Metadata repository may contain information such as:
   –  Metadata describing the structure of the data warehouse and the data marts, at the conceptual/logical level (facts, dimensions, hierarchies, ...) and at the physical level (indexes, partitions, ...)
   –  Security information (user authorization and access control), and monitoring information (usage statistics, error reports, audit trails)
   –  Metadata describing data sources: schemas, ownership, update frequencies, legal limitations, access methods
   –  Metadata describing the ETL: data lineage, data extraction, cleaning, transformation rules, etc.


OLAP Tier

•  OLAP server, which presents business users with multidimensional data from data warehouses or data marts
   –  Products include OLAP extensions and tools for building, querying, and navigating cubes, and for analysis and reporting
•  There is not yet a standardized language for defining and manipulating data cubes
   –  MDX (MultiDimensional eXpressions): query language for OLAP databases, a de facto standard for querying OLAP systems
   –  SQL extended with analytical capabilities: SQL/OLAP


OLAP Server Architectures

•  Relational OLAP (ROLAP)
   –  Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
   –  Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
   –  Greater scalability
•  Multidimensional OLAP (MOLAP)
   –  Sparse array-based multidimensional storage engine
   –  Fast indexing to precomputed summarized data
•  Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
   –  Flexibility, e.g., low level: relational, high level: array
•  Specialized SQL servers (e.g., Red Brick)
   –  Specialized support for SQL queries over star/snowflake schemas



Front-End Tier

•  Client tools that allow users to exploit the content of the data warehouse
   –  OLAP tools allow interactive exploration and manipulation of the warehouse data and formulation of complex ad hoc queries
   –  Reporting tools enable the production, delivery, and management of reports, which can be paper-based, interactive, or web-based
      •  Reports use predefined queries: queries asking for specific information in a specific format, performed on a regular basis
   –  Statistical tools are used to analyze and visualize the cube data using statistical methods
   –  Data mining tools allow users to analyze data in order to discover valuable knowledge such as patterns and trends, and to make predictions based on current data


Variations of the architecture

1.  Only an enterprise data warehouse without data marts or, alternatively, an enterprise data warehouse does not exist
2.  An OLAP server does not exist and/or the client tools directly access the data warehouse
3.  Neither a data warehouse nor an OLAP server exists: a virtual data warehouse (virtual data integration) defines a set of views over operational databases that are materialized for efficient access
    –  It does not contain historical data, centralized metadata, etc.
4.  A data staging area may not be needed when the data in the source systems conforms very closely to the data in the warehouse

Outline

•  Introduction
   –  Motivation for data warehousing
   –  New domains and challenges
   –  Definition of data warehouse
•  The multidimensional model
•  Typical data warehouse architecture
•  OLAP operations


What is OLAP?

•  The term OLAP ("Online Analytical Processing") was coined in a white paper written for Arbor Software Corp. in 1993
   –  Interactive process of creating, managing, analyzing, and reporting on data
   –  Analyzing large quantities of data in real time

Data Grouping and Aggregation

•  Data grouping and aggregation in many different ways
•  The number of possible groupings quickly becomes large
   –  The user has to consider all groupings
   –  Analytical processing problem
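To see how quickly the number of groupings grows, the sketch below enumerates every subset of a list of grouping attributes; each subset is one candidate grouping, so n attributes give 2^n groupings. The attribute names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical attribute names, chosen only for illustration.
attributes = ["Product", "Time", "Customer"]

def all_groupings(attrs):
    """Return every subset of attrs; each subset is one possible grouping."""
    return [combo
            for size in range(len(attrs) + 1)
            for combo in combinations(attrs, size)]

groupings = all_groupings(attributes)
print(len(groupings))  # 2 ** 3 groupings for three attributes
for grouping in groupings:
    print(grouping)
```

With ten attributes the same function would already return 1024 groupings, which is why the user cannot be expected to inspect them all by hand.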


Multidimensional model

•  Views data in an n-dimensional space: data cube
   –  Composed of dimensions and facts
•  Dimensions: perspectives used to analyze the data
   –  Example: a 3-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity
•  Attributes describe dimensions
   –  The Product dimension may have attributes ProductNumber and UnitPrice (not shown)
•  Cells or facts have associated numeric values called measures
   –  Each cell of the data cube represents the Quantity of units sold by category, quarter, and customer's city

[Figure: three-dimensional Sales cube with dimensions Product (Category: Beverages, Condiments, Produce, Seafood), Time (Quarter: Q1 to Q4), and Customer (City: Paris, Lyon, Berlin, Köln); the cells hold the measure values.]
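One simple way to picture such a cube in code is a dictionary whose keys are cell coordinates and whose values are the measure. This is only a sketch: the dictionary layout, the helper `members`, and the quantities are assumptions made for illustration, not values taken from the figure.

```python
# Hypothetical quantities; the slide's actual cell values are not reproduced here.
# Each key is one cell: (Product category, Time quarter, Customer city).
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Beverages", "Q1", "Lyon"): 12,
    ("Seafood", "Q2", "Berlin"): 30,
    ("Produce", "Q2", "Paris"): 18,
}
DIMENSIONS = ("Product", "Time", "Customer")

def members(cube, dimension):
    """Distinct members of one dimension that occur in the cube."""
    axis = DIMENSIONS.index(dimension)
    return sorted({coords[axis] for coords in cube})

print(members(sales, "Customer"))
print(sales[("Beverages", "Q1", "Paris")])  # the Quantity measure of one cell
```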

Hierarchies

•  Allow viewing data at several granularities
   –  Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
   –  The lower level is called the child and the higher level is called the parent
   –  The hierarchical structure of a dimension is called the dimension schema
   –  A dimension instance comprises all members at all levels in a dimension
•  Example: hierarchies of the Product, Time, and Customer dimensions

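[Figure: hierarchies of the three dimensions, from the most detailed level to the most general. Product: Product → Category → All. Time: Day → Month → Quarter → Semester → Year → All. Customer: Customer → City → State → Country → Continent → All.]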


Classification of measures

•  Each measure is associated with an aggregation function that combines several measure values into a single one
   –  Aggregation of measures takes place when we change the level of detail at which data in a cube is visualized
•  Measures can be classified according to the way they can be aggregated:
   –  Additive: can be meaningfully summarized along all the dimensions, using addition (most common type)
   –  Semiadditive: can be meaningfully summarized using addition along some dimensions only (example: inventory quantities, which cannot be added along the Time dimension)
   –  Nonadditive: cannot be meaningfully summarized using addition across any dimension (examples: item price, cost per unit, and exchange rate)
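The semiadditive case can be illustrated with a small inventory example. The warehouse names, months, and stock levels below are hypothetical; the point is only that summing across warehouses is meaningful while summing across months is not, so an average is used along Time instead (taking the last value would be another common choice).

```python
# Hypothetical month-end inventory levels keyed by (month, warehouse).
inventory = {
    ("Jan", "Lisbon"): 100, ("Jan", "Porto"): 40,
    ("Feb", "Lisbon"): 90,  ("Feb", "Porto"): 60,
}

def total_per_month(levels):
    """Adding across warehouses is meaningful: total stock held each month."""
    totals = {}
    for (month, _warehouse), qty in levels.items():
        totals[month] = totals.get(month, 0) + qty
    return totals

def average_over_time(levels, warehouse):
    """Along Time, average the levels instead of summing them."""
    values = [qty for (_month, wh), qty in levels.items() if wh == warehouse]
    return sum(values) / len(values)

print(total_per_month(inventory))
print(average_over_time(inventory, "Lisbon"))
```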

OLAP Operations: definition

•  Allow the user to view data from different perspectives and at several levels of detail by exploiting dimensions and their hierarchies
•  Provide an interactive data analysis environment


OLAP Operations (1)

[Figure: four Sales cubes with axes Product (Category), Time, and Customer: the original cube at the Quarter and City levels (Q1 to Q4; Paris, Lyon, Berlin, Köln), the cube at the Customer Country level (France, Germany), the cube at the Time Month level (Jan, Feb, Mar, ..., Dec), and a cube with the Product members reordered.]

OLAP Operations (2)

[Figure: four Sales cubes: the original cube, a cube with rotated axes (Time on the horizontal axis, Product on the depth axis), a two-dimensional cube with only Product (Category) and Time (Quarter), and a subcube restricted to the cities Paris and Lyon and the quarters Q1 and Q2.]


OLAP Operations (3)

[Figure: four Sales cubes, each with axes Product (Category: Beverages, Condiments, Produce, Seafood), Time (Quarter: Q1 to Q4), and Customer (City: Paris, Lyon, Berlin, Köln).]

OLAP Operations (4)

[Figure: the original Sales cube next to the result of SUM BY Time, Customer, a two-dimensional table of total quantities by Time (Quarter) and Customer (City), and the result of max() by quarter and city.]

•  According to the authors of the book, aggregation functions can be classified as:
   –  Cumulative: compute the measure value of a cell from several other cells (e.g., SUM, COUNT, AVG)
   –  Filtering: filter the members of a dimension that appear in the result (e.g., MIN, MAX); they must compute not only the aggregated value but also determine the dimension members that belong to the result


OLAP Operations (5)

[Figure: two Sales cubes at the Time Month level (Jan, Feb, Mar, ..., Dec), the original quarter-level cube, and a quarter-level cube whose Customer (City) dimension additionally includes Bilbao and Madrid.]

Algebra of OLAP Operations

•  There is not yet a standard definition of OLAP operations comparable to the relational algebra
•  Many proposals for an OLAP algebra exist in the literature
•  We adopt the one proposed in [Ciferri et al. 2013]


Algebra of OLAP Operations – roll-up

•  Roll-up: aggregates measures along a dimension hierarchy (using an aggregate function) to obtain measures at a coarser granularity
   –  ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)

[Figure: the Sales cube at the Customer City level (Paris, Lyon, Berlin, Köln) and the result of rolling up to the Customer Country level (France, Germany).]

ROLLUP(Sales2012, Customer → Country, SUM(Quantity))
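The roll-up from City to Country can be sketched in a few lines of Python. The cube is modeled as a dictionary keyed by (category, quarter, city); this layout, the function name, and the quantities are assumptions for illustration only. The city-to-country mapping plays the role of the child-to-parent step in the Customer hierarchy.

```python
from collections import defaultdict

# Hypothetical cells keyed by (category, quarter, city), with made-up quantities.
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Beverages", "Q1", "Lyon"): 12,
    ("Beverages", "Q1", "Berlin"): 33,
    ("Beverages", "Q1", "Köln"): 8,
}
# Child-to-parent mapping for the City -> Country step of the Customer hierarchy.
CITY_TO_COUNTRY = {"Paris": "France", "Lyon": "France",
                   "Berlin": "Germany", "Köln": "Germany"}

def rollup_customer_to_country(cube):
    """Sum Quantity over all cities that share the same country."""
    result = defaultdict(int)
    for (category, quarter, city), qty in cube.items():
        result[(category, quarter, CITY_TO_COUNTRY[city])] += qty
    return dict(result)

by_country = rollup_customer_to_country(sales)
print(by_country)
```

Drill-down is the inverse navigation: since the aggregated cube no longer holds the city-level values, a system would typically answer it by going back to the more detailed data.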

Algebra of OLAP Operations – drill-down

•  Drill-down moves from a more general level to a more detailed level in a hierarchy
   –  DRILLDOWN(CubeName, (Dimension → Level)*)

[Figure: the Sales cube at the Time Quarter level (Q1 to Q4) and the result of drilling down to the Time Month level (Jan, Feb, Mar, ..., Dec).]

DRILLDOWN(Sales2012, Time → Month)


Algebra of OLAP Operations – sort

•  Sort returns a cube where the members of a dimension have been sorted
   –  SORT(CubeName, Dimension, Expression [ASC | DESC])
   –  where the members of Dimension are sorted according to the value of Expression

[Figure: the Sales cube before and after sorting the members of the Product dimension.]

SORT(Sales2012, Product, Category)
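As a rough illustration, sorting only changes the order in which the members of a dimension are presented; the cell values stay the same. The member order below and the helper `sort_members` are hypothetical.

```python
# Hypothetical member order, as it might appear before sorting.
product_members = ["Produce", "Seafood", "Beverages", "Condiments"]

def sort_members(members, key=None, descending=False):
    """Return the dimension members ordered by key (by name when key is None)."""
    return sorted(members, key=key, reverse=descending)

ascending = sort_members(product_members)
descending = sort_members(product_members, descending=True)
print(ascending)
print(descending)
```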

Algebra of OLAP Operations – pivot

•  Pivot (or rotate): rotates the axes of a cube to provide an alternative presentation of data
   –  PIVOT(CubeName, (Dimension → Axis)*)
   –  where the axes are specified as {X, Y, Z, X1, Y1, Z1, ...}

[Figure: the Sales cube before and after the pivot; in the result Time (Quarter) lies on the X axis, Customer (City) on the Y axis, and Product (Category) on the Z axis.]

PIVOT(Sales, Time → X, Customer → Y, Product → Z)
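In a dictionary-based sketch of a cube, a pivot amounts to reordering the coordinates of every cell while leaving the values untouched. The cube layout, the `pivot` helper, and the quantities are assumptions made for illustration.

```python
# Hypothetical cells keyed by (Product, Time, Customer).
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Seafood", "Q2", "Berlin"): 30,
}
AXES = ("Product", "Time", "Customer")

def pivot(cube, axes, new_order):
    """Re-key every cell so its coordinates follow new_order; values are unchanged."""
    positions = [axes.index(name) for name in new_order]
    return {tuple(coords[p] for p in positions): value
            for coords, value in cube.items()}

# Time on the first axis, Customer on the second, Product on the third.
pivoted = pivot(sales, AXES, ("Time", "Customer", "Product"))
print(pivoted)
```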


Algebra of OLAP Operations – slice

•  Slice: removes a dimension from a cube, so that a cube of n-1 dimensions is obtained from a cube of n dimensions
   –  SLICE(CubeName, Dimension, Level = Value)
•  The Dimension is dropped by fixing a single Value in the Level; the other dimensions remain unchanged
•  Slice assumes that the granularity of the cube is at the specified level of the dimension

[Figure: the three-dimensional Sales cube and the two-dimensional result of the slice, with axes Product (Category) and Time (Quarter).]

SLICE(Sales, Customer, City = 'Paris')
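A minimal sketch of slice on a dictionary-based cube: keep the cells whose coordinate on the chosen dimension equals the fixed value, then drop that coordinate, so the result has one dimension fewer. The cube layout, the function name `slice_cube`, and the quantities are hypothetical.

```python
# Hypothetical cells keyed by (category, quarter, city).
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Beverages", "Q1", "Lyon"): 12,
    ("Seafood", "Q2", "Paris"): 18,
}

def slice_cube(cube, axis, value):
    """Keep cells whose coordinate at `axis` equals value, then drop that coordinate."""
    return {coords[:axis] + coords[axis + 1:]: qty
            for coords, qty in cube.items()
            if coords[axis] == value}

# Customer is the third coordinate (index 2).
paris = slice_cube(sales, axis=2, value="Paris")
print(paris)
```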

Algebra of OLAP Operations – dice

•  Dice: keeps the cells of a cube that satisfy a Boolean condition Φ
   –  DICE(CubeName, Φ)
•  Φ is a Boolean condition over dimension levels, attributes, and measures

[Figure: the Sales cube and the subcube obtained by the dice, restricted to the cities Paris and Lyon and the quarters Q1 and Q2.]

DICE(Sales, (Customer.City = 'Paris' OR Customer.City = 'Lyon') AND (Time.Quarter = 'Q1' OR Time.Quarter = 'Q2'))
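Dice can be sketched as a filter over the cells: unlike slice, the number of dimensions does not change. The cube layout, helper names, and quantities below are assumptions for illustration; the condition mirrors the one in the expression above.

```python
# Hypothetical cells keyed by (category, quarter, city).
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Beverages", "Q3", "Paris"): 25,
    ("Seafood", "Q2", "Lyon"): 18,
    ("Seafood", "Q2", "Berlin"): 30,
}

def dice(cube, condition):
    """Keep only the cells for which condition(coords, value) is true."""
    return {coords: qty for coords, qty in cube.items() if condition(coords, qty)}

def paris_or_lyon_first_half(coords, _qty):
    """City is Paris or Lyon, and quarter is Q1 or Q2."""
    _category, quarter, city = coords
    return city in ("Paris", "Lyon") and quarter in ("Q1", "Q2")

diced = dice(sales, paris_or_lyon_first_half)
print(diced)
```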


Algebra of OLAP Operations – drill-across

•  Drill-across: combines cells from two data cubes that have the same schema
   –  DRILLACROSS(CubeName1, CubeName2, [Condition])

[Figure: the Sales2011 and Sales2012 cubes and the cube resulting from the drill-across, all with axes Product (Category), Time (Quarter), and Customer (City).]

Sales2011-2012 ← DRILLACROSS(Sales2011, Sales2012)
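The sketch below joins two cubes on identical cell coordinates, so that each resulting cell carries both measures. Joining on equal coordinates when no Condition is given is an assumption of this example, as are the cube layout, the measure names, and the quantities.

```python
# Hypothetical cubes with the same schema, keyed by (category, quarter, city).
sales_2011 = {("Beverages", "Q1", "Paris"): 20, ("Seafood", "Q1", "Lyon"): 10}
sales_2012 = {("Beverages", "Q1", "Paris"): 25, ("Seafood", "Q1", "Lyon"): 8}

def drill_across(cube_a, cube_b, names=("Quantity2011", "Quantity2012")):
    """Join cells with identical coordinates; each result cell holds both measures."""
    shared = cube_a.keys() & cube_b.keys()
    return {coords: {names[0]: cube_a[coords], names[1]: cube_b[coords]}
            for coords in shared}

sales_2011_2012 = drill_across(sales_2011, sales_2012)
print(sales_2011_2012)
```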

Algebra of OLAP Operations – ADD-MEASURE

•  Add measure: adds new measures to a cube
   –  ADDMEASURE(CubeName, (NewMeasure = Expression, [AggFct])*)
•  Drop measure: deletes a measure from a cube schema
   –  DROPMEASURE(CubeName, Measure*)

[Figure: the Sales2011-2012 cube before and after adding the PercChange measure.]

ADDMEASURE(Sales2011-2012, PercChange = (Quantity2011-Quantity2012)/Quantity2011)
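As a sketch, adding a measure means computing a new value in every cell from the measures already there. The cell layout (one small dictionary of measures per cell), the helper names, and the quantities are hypothetical; the expression is the one given in the slide.

```python
# Hypothetical result of a drill-across: each cell holds two measures.
sales_2011_2012 = {
    ("Beverages", "Q1", "Paris"): {"Quantity2011": 20, "Quantity2012": 25},
    ("Seafood", "Q1", "Lyon"): {"Quantity2011": 10, "Quantity2012": 8},
}

def add_measure(cube, name, expression):
    """Return a new cube where each cell gains measure `name` = expression(cell)."""
    return {coords: {**cell, name: expression(cell)}
            for coords, cell in cube.items()}

def perc_change(cell):
    # Same expression as in the slide: (Quantity2011 - Quantity2012) / Quantity2011.
    return (cell["Quantity2011"] - cell["Quantity2012"]) / cell["Quantity2011"]

with_change = add_measure(sales_2011_2012, "PercChange", perc_change)
print(with_change)
```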


Algebra of OLAP Operations – ADD-MEASURE

•  Another example:
   –  Computes the value of a cell by aggregating the measures of several nearby cells

[Figure: the Sales cube at the Time Month level (Jan, Feb, Mar, ..., Dec) before and after adding the MovAvg measure.]

ADDMEASURE(Sales, MovAvg = AVG(Quantity) OVER Time 2 CELLS PRECEDING)
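A moving average over the Time dimension can be sketched for a single series of monthly values. This example assumes that the window contains the current cell plus up to two preceding cells, as with SQL window frames; the monthly quantities and the function name are hypothetical.

```python
# Hypothetical monthly quantities for one (category, city) pair, in calendar order.
monthly_quantity = [("Jan", 10), ("Feb", 20), ("Mar", 30), ("Apr", 40)]

def moving_average(series, preceding=2):
    """Average each value with up to `preceding` earlier values in the series."""
    result = []
    for i, (month, _qty) in enumerate(series):
        window = [q for _m, q in series[max(0, i - preceding): i + 1]]
        result.append((month, sum(window) / len(window)))
    return result

mov_avg = moving_average(monthly_quantity)
print(mov_avg)
```

The first two months have fewer than two preceding cells, so their averages are taken over the cells that exist.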

Algebra of OLAP Operations – aggregate functions

•  Aggregation functions in OLAP are also needed at the current granularity, that is, without performing a roll-up
   –  AggFunction(CubeName, Measure) [BY Dimension*]
   –  Cumulative: compute the measure value of a cell from several other cells; examples are SUM, COUNT, and AVG
   –  Filtering: filter the members of a dimension that appear in the result; examples are MIN and MAX. Filtering functions compute not only the aggregated value, but also the members of the dimension that belong to the result

[Figure: the Sales cube and the result of SUM BY Time, Customer, a two-dimensional table of total quantities by Time (Quarter: Q1 to Q4) and Customer (City: Paris, Lyon, Berlin, Köln).]

SUM(Sales, Quantity) BY Time, Customer


Algebra of OLAP Operations – aggregate functions

•  Another example: maximum sales by quarter and city

[Figure: the Sales cube before and after applying MAX by quarter and city.]

MAX(Sales, Quantity) BY Time, Customer
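The difference between a cumulative and a filtering aggregation can be sketched as follows. SUM returns one total per (quarter, city); MAX returns the maximum together with the Product member that achieves it. The cube layout, function names, and quantities are hypothetical, and on ties this sketch keeps only the first member it encounters.

```python
from collections import defaultdict

# Hypothetical cells keyed by (category, quarter, city).
sales = {
    ("Beverages", "Q1", "Paris"): 21,
    ("Seafood", "Q1", "Paris"): 30,
    ("Beverages", "Q2", "Lyon"): 12,
    ("Produce", "Q2", "Lyon"): 7,
}

def sum_by_time_customer(cube):
    """Cumulative: add up Quantity over all products for each (quarter, city)."""
    totals = defaultdict(int)
    for (_category, quarter, city), qty in cube.items():
        totals[(quarter, city)] += qty
    return dict(totals)

def max_by_time_customer(cube):
    """Filtering: per (quarter, city), keep the maximum and the product achieving it."""
    best = {}
    for (category, quarter, city), qty in cube.items():
        key = (quarter, city)
        if key not in best or qty > best[key][1]:
            best[key] = (category, qty)
    return best

totals = sum_by_time_customer(sales)
maxima = max_by_time_customer(sales)
print(totals)
print(maxima)
```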

Algebra of OLAP Operations – union, difference, drill-through

•  Union merges two cubes that have the same schema but disjoint instances
   –  Example: if SalesSpain is a cube with the same schema as the original cube but containing only the sales to Spanish customers, we can perform:
      UNION(Sales, SalesSpain)
•  Difference removes the cells in a cube that belong to another one; the two cubes must have the same schema
•  Drill-through allows moving from data at the bottom level in a cube to data in the operational systems from which the cube was derived; it can be used when trying to determine the reason for outlier values in a data cube

[Figure: the Sales cube and the result of the union, whose Customer (City) dimension additionally includes Bilbao and Madrid.]
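Union and difference can be sketched on dictionary-based cubes. This example assumes that two cells are "the same" when they have the same coordinates; the cube layout, function names, and quantities are hypothetical.

```python
# Hypothetical cubes with the same schema, keyed by (category, quarter, city).
sales = {("Beverages", "Q1", "Paris"): 21, ("Seafood", "Q1", "Lyon"): 10}
sales_spain = {("Beverages", "Q1", "Madrid"): 15, ("Seafood", "Q1", "Bilbao"): 9}

def union(cube_a, cube_b):
    """Merge two cubes; their cell coordinates must not overlap."""
    overlap = cube_a.keys() & cube_b.keys()
    if overlap:
        raise ValueError(f"instances are not disjoint: {sorted(overlap)}")
    return {**cube_a, **cube_b}

def difference(cube_a, cube_b):
    """Remove from cube_a every cell whose coordinates also occur in cube_b."""
    return {coords: qty for coords, qty in cube_a.items() if coords not in cube_b}

all_sales = union(sales, sales_spain)
print(all_sales)
print(difference(all_sales, sales_spain))
```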


Next Lecture

•  Conceptual Data Warehouse Design


•  Is Slice (city = lisbon or city = porto) a slice or a dice, assuming we start from a cube with three dimensions?

•  City = lisbon and quarter = Q1