11/16/15
Introduction to Data Warehouses
Helena Galhardas DEI/IST
References
• A. Vaisman and E. Zimányi, Data Warehouse Systems: Design and Implementation, Springer, 2014 (chpts. 1 and 3)
• J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001 (chpt. 2)
• A. Doan, A. Halevy, Z. Ives, Data Integration Principles, Morgan Kaufmann, 2012 (chpt. 10)
• C. Ciferri, R. Ciferri, L.I. Gómez, M. Schneider, A.A. Vaisman, E. Zimányi, Cube algebra: a generic user-centric model and query language for OLAP cubes. Int. J. Data Warehousing Mining 9(2), 39–65, 2013
• A. Wichert, H. Galhardas, SAD slides, MEIC/IST
Outline
• Introduction
  – Motivation for data warehousing
  – Definition of data warehouse
  – New domains and challenges
• The multidimensional model
• Typical data warehouse architecture
• OLAP operations
Introduction
• Organizations face increasingly complex challenges in achieving their operational goals, so they need analysis tools for decision support
• Business intelligence (BI): methodologies, processes, architectures, and technologies to transform raw data into useful information for decision making
  – Collect and summarize vast amounts of data
• Extraction, transformation, integration, and cleansing processes take data from the sources and store them in a common repository, the data warehouse (DW)
  – Integral part of decision-support systems
  – Provides an infrastructure that enables users to get efficient, accurate responses to complex queries
Exploiting data in a DW
• A wide variety of systems and tools exploit the data in a warehouse
• Online Analytical Processing (OLAP)
  – Allows users to interactively query and aggregate data in a warehouse
  – Decision makers can analyze information at various levels of detail
• Data mining extracts interesting knowledge hidden in data warehouses
• Typical techniques that exploit a data warehouse:
  – Reporting: dashboards, alerts
  – Performance management: metrics, key performance indicators (KPIs), dashboards
  – Analytics: OLAP, data mining, time series analysis, text mining, web analytics, data visualization
Motivation
• Traditional operational or transactional databases do not satisfy the requirements for data analysis
  – Designed/optimized to support daily business operations; primary concern: concurrent access and recovery techniques to guarantee data consistency
  – Contain detailed data, do not include historical data, and perform poorly for complex queries that involve many tables or aggregate large volumes of data
• To analyze the behavior of an organization, data from several operational systems must be integrated
  – Difficult to accomplish due to many differences in data definition and content
• Data warehouse: proposed as a solution to the growing demands of decision-making users
Definition of Data Warehouse (DW)
• Collection of subject-oriented, integrated, nonvolatile, and time-varying data to support management decisions (Inmon's definition)
Data Warehouse: Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – Relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted
Data Warehouse: Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only three operations in data accessing:
    • initial loading of data, access of data, and periodic data refreshment
Data Warehouse: Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational database: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a "time element"
Normalized vs non-normalized data
• Relational databases: highly normalized to guarantee consistency under frequent updates
  – Usually achieved at a higher cost of querying (normalization partitions data into multiple tables)
  – This is not appropriate for data warehouses
• Data warehouses must deliver good performance for the complex queries needed for analysis tasks
  – A lower degree of normalization is required => multidimensional modeling
Multidimensional modeling
• Views data as consisting of facts linked to dimensions
• Facts represent the focus of analysis (e.g., analysis of sales in stores)
  – Measures quantify facts; usually numeric values, e.g., amount or number of sales
• Dimensions are used to analyze measures from several perspectives
  – E.g., Time dimension to analyze changes in sales over various periods of time
  – E.g., Location dimension to analyze sales according to the geographic distribution of stores
• Dimensions include attributes that form hierarchies, which enable decision-making users to explore measures at various levels of detail, e.g.:
  – month → quarter → year in the time dimension
  – city → state → country in the location dimension
• Aggregation of measures occurs when a hierarchy is traversed, e.g., moving from month to year yields aggregated values of sales for the various years
Star and snowflake schemas
• At the logical level, the multidimensional model is usually represented by relational tables organized in:
  – Star schemas: use a single table for each dimension, even in the presence of hierarchies (yields denormalized dimension tables)
  – Snowflake schemas: use normalized tables for dimensions and their hierarchies
• Over this relational representation of a data warehouse, an OLAP server builds a data cube, which provides a multidimensional view of the data
Example of Star Schema
• Sales fact table
  – Dimension keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city, state_or_province, country
Example of Snowflake Schema
• Sales fact table
  – Dimension keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  – time: time_key, day, day_of_the_week, month, quarter, year
  – item: item_key, item_name, brand, type, supplier_key
    • supplier: supplier_key, supplier_type
  – branch: branch_key, branch_name, branch_type
  – location: location_key, street, city_key
    • city: city_key, city, state_or_province, country
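A minimal sketch of how a fact table joined to its dimension tables could be queried, using Python's built-in sqlite3 module. It keeps only the time and item dimensions of the example schemas; the table names (time_dim, item_dim, sales_fact) and all rows are invented for illustration and are not taken from the slides.

```python
import sqlite3

# In-memory database holding a reduced star schema (two dimensions only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
-- Hypothetical rows, for illustration only.
INSERT INTO time_dim VALUES (1, 1, 'Q1', 2015), (2, 4, 'Q2', 2015);
INSERT INTO item_dim VALUES (10, 'Chai', 'BrandA', 'Beverages'), (11, 'Ikura', 'BrandB', 'Seafood');
INSERT INTO sales_fact VALUES (1, 10, 3, 30.0), (1, 11, 1, 25.0), (2, 10, 2, 20.0);
""")

# Typical analytical query: join the fact table to its dimensions
# and aggregate the measures by quarter and item type.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales_fact s
    JOIN time_dim t ON t.time_key = s.time_key
    JOIN item_dim i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake schema the same query would need one extra join per normalized level (for example item to supplier), which is the querying cost that the denormalized star schema avoids.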
• Query language
  – Once a data warehouse has been implemented, analytical queries can be submitted
  – MDX (MultiDimensional eXpressions): de facto standard language for querying a multidimensional database
• Physical level: concerned with implementation issues
• Three techniques are normally used for improving system performance:
  – Materialized views
  – Indexing
  – Data partitioning
ETL Process
• Extracts data from several source systems, transforms data to fit the data warehouse model, and loads transformed data into the data warehouse
• Crucial for the success of a data warehousing project
  – Accounts for about 80% of the total cost
  – Still no consensus on a methodology for ETL design, and most problems are solved ad hoc
  – Several proposals regarding ETL conceptual design
Data exploitation
• Data analytics is the process of exploiting the contents of a data warehouse in order to provide essential information to the decision-making process
• Three main tools:
  – Data mining: a series of techniques that analyze the data in a warehouse in order to discover hidden useful knowledge
  – Key performance indicators (KPIs): measurable organizational objectives used for monitoring how an organization is performing
  – Dashboards: interactive reports that present the data in a warehouse, including the KPIs, in a visual way, providing an overview of the performance of an organization for decision-support purposes
New domains and challenges
• Enormous amounts of data (big data) are calling for a shift in data warehouse and BI practices
• Many emerging domains where BI practices are gaining acceptance, such as social networks or geospatial data analytics
• New database architectures are gaining momentum:
  – Parallelism: a must for large data warehouses
  – Column-store databases: MonetDB and Vertica
  – In-memory databases: SAP HANA
  – The MapReduce programming model is increasingly popular and challenges traditional parallel DBMSs (e.g., the Facebook data warehouse was built using Hadoop, an open source implementation of MapReduce)
New domains and challenges (cont.)
• The web is changing the way in which data warehouses are being designed, used, and exploited
  – For some data analysis tasks (like worldwide price evolution of some product), data in a conventional data warehouse may not suffice
  – External data sources (e.g., the web) can provide useful multidimensional information, although usually too volatile to be permanently stored
• The semantic web aims at representing web content in a machine-processable way
  – The basic layer of the data representation for the semantic web is the Resource Description Framework (RDF)
  – Domain ontologies (in RDF or OWL) define a common terminology for the concepts involved in a particular domain
  – Semantic annotations are especially useful for describing unstructured, semi-structured, and textual data
• Large repositories of semantically annotated data are currently available, opening new opportunities for enhancing current decision-support systems (Semantic Web Data Warehouses)
What is OLAP?
• The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993
  – Interactive process of creating, managing, analyzing, and reporting on data
  – Analyzing large quantities of data in real time
OLAP vs OLTP
• Traditional database systems are designed and tuned to support day-to-day operation:
  – Ensure fast, concurrent access to data, transaction processing, and concurrency control
  – Focus on online updates and data consistency
  – Known as operational databases or online transaction processing (OLTP) databases
• OLTP DB data characteristics:
  – Detailed data
  – Do not include historical data
  – Highly normalized
  – Poor performance on complex queries including joins and aggregation
• Data analysis requires a new paradigm: online analytical processing (OLAP)
  – Typical OLTP query: pending orders for customer c
  – Typical OLAP query: total sales amount by product and by customer
OLAP characteristics
• The OLTP paradigm is focused on transactions, OLAP on analytical queries
  – Normalization is not good for analytical queries: reconstructing data requires a high number of joins
• OLAP databases support a heavy query load
• OLTP indexing techniques are not efficient in OLAP: they are oriented to accessing few records
  – OLAP queries typically include aggregation
• The need for a different database model to support OLAP was clear; it led to the
  – Data warehouse: (usually) large repository that consolidates data from different sources, is updated offline, follows the multidimensional data model, and is designed and optimized to efficiently support OLAP queries
OLAP
• Data is perceived and manipulated as if it were stored in a multidimensional array
• But the ideas are explained in terms of conventional relational tables
Data Grouping and Aggregation
• Data grouping and aggregation in many different ways
• The number of possible groupings quickly becomes large
  – The user has to consider all groupings
  – Analytical processing problem
Example: OLAP-style Queries for Supplier-and-Parts Database
1) Get the total shipment quantity
2) Get total shipment quantities by supplier
3) Get total shipment quantities by part
4) Get the shipment quantities by supplier and part
Supplier-Parts
• SP

  S#   P#   QTY
  S1   P1   300
  S1   P2   200
  S2   P1   300
  S2   P2   400
  S3   P2   200
  S4   P2   200
Get the total shipment quantity

1. SELECT SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY ()

  TOTQTY
  1600
Get total shipment quantities by supplier

2. SELECT S#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#

  S#   TOTQTY
  S1   500
  S2   700
  S3   200
  S4   200
Get total shipment quantities by part

3. SELECT P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY P#

  P#   TOTQTY
  P1   600
  P2   1000
Get the shipment quantities by supplier and part

4. SELECT S#, P#, SUM(QTY) AS TOTQTY
   FROM SP
   GROUP BY S#, P#

  S#   P#   TOTQTY
  S1   P1   300
  S1   P2   200
  S2   P1   300
  S2   P2   400
  S3   P2   200
  S4   P2   200
Drawbacks
• Formulating so many similar but distinct queries is tedious
• Executing the queries is expensive
• To make life easier and computation more efficient, use a single query
  – GROUPING SETS, ROLLUP, CUBE options
  – Added to the SQL standard in 1999 (SQL:1999)
GROUPING SETS
• Execute several queries simultaneously

  SELECT S#, P#, SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY GROUPING SETS ( (S#), (P#) );

• Single results table, but not a relation: the nulls stand for missing information

  S#     P#     TOTQTY
  S1     null   500
  S2     null   700
  S3     null   200
  S4     null   200
  null   P1     600
  null   P2     1000

• The GROUPING function can replace these nulls with explicit markers

  SELECT CASE GROUPING(S#) WHEN 1 THEN '??' ELSE S# END AS S#,
         CASE GROUPING(P#) WHEN 1 THEN '!!' ELSE P# END AS P#,
         SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY GROUPING SETS ( (S#), (P#) );

  S#   P#   TOTQTY
  S1   !!   500
  S2   !!   700
  S3   !!   200
  S4   !!   200
  ??   P1   600
  ??   P2   1000
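The effect of GROUPING SETS can be imitated in a few lines of Python. This is only an illustrative sketch over the SP rows shown earlier, not how a DBMS evaluates the clause; the function name grouping_sets is hypothetical, and None plays the role of null.

```python
from collections import defaultdict

# SP relation from the slides: (S#, P#, QTY)
SP = [
    ("S1", "P1", 300), ("S1", "P2", 200),
    ("S2", "P1", 300), ("S2", "P2", 400),
    ("S3", "P2", 200), ("S4", "P2", 200),
]
COLUMNS = ("S#", "P#")


def grouping_sets(rows, sets):
    """Return (S#, P#, TOTQTY) rows, one per group of each grouping set.

    Columns that are not part of the current grouping set are reported as None.
    """
    result = []
    for gset in sets:
        totals = defaultdict(int)
        for s, p, qty in rows:
            values = {"S#": s, "P#": p}
            key = tuple(values[c] if c in gset else None for c in COLUMNS)
            totals[key] += qty
        # Sort groups within one grouping set; None is mapped to "" for ordering.
        ordered = sorted(totals.items(), key=lambda kv: tuple(v or "" for v in kv[0]))
        result.extend(key + (total,) for key, total in ordered)
    return result


table = grouping_sets(SP, [("S#",), ("P#",)])
for row in table:
    print(row)
```

Passing [("S#", "P#"), ("S#",), ()] as the second argument instead yields the rows of the ROLLUP (S#, P#) result.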
ROLLUP operation
  SELECT S#, P#, SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY ROLLUP (S#, P#);

• Equivalent to: GROUP BY GROUPING SETS ( (S#, P#), (S#), () )

  S#     P#     TOTQTY
  S1     P1     300
  S1     P2     200
  S2     P1     300
  S2     P2     400
  S3     P2     200
  S4     P2     200
  S1     null   500
  S2     null   700
  S3     null   200
  S4     null   200
  null   null   1600
ROLLUP definition
• The quantities have been rolled up for each supplier
• Rolled up along the supplier dimension
• GROUP BY ROLLUP (A, B, ..., Z) generates the grouping sets
  (A, B, ..., Z), (A, B, ...), ..., (A, B), (A), ()
• GROUP BY ROLLUP (A, B) is not symmetric in A and B!
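The list of grouping sets behind ROLLUP is simply the list of prefixes of its column list, which also makes the asymmetry visible. A small sketch (rollup_sets is a hypothetical helper name, not part of SQL or of any library):

```python
def rollup_sets(columns):
    """Grouping sets generated by ROLLUP: every prefix of the column list, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


# ROLLUP (S#, P#) groups by (S#, P#), then (S#), then ().
print(rollup_sets(["S#", "P#"]))
# Swapping the arguments keeps (P#) instead of (S#): ROLLUP is order-sensitive.
print(rollup_sets(["P#", "S#"]))
```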
CUBE operation
  SELECT S#, P#, SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY CUBE (S#, P#);

• Equivalent to: GROUP BY GROUPING SETS ( (S#, P#), (S#), (P#), () )

  S#     P#     TOTQTY
  S1     P1     300
  S1     P2     200
  S2     P1     300
  S2     P2     400
  S3     P2     200
  S4     P2     200
  S1     null   500
  S2     null   700
  S3     null   200
  S4     null   200
  null   P1     600
  null   P2     1000
  null   null   1600
CUBE
• Confusing term CUBE (?)
  – Derived from the fact that in multidimensional terminology, data values are stored in cells of a multidimensional array or hypercube
  – The actual physical storage may differ
• In our example
  – The cube has just two dimensions (supplier, part)
  – The two dimensions are unequal in size (a rectangle, not a square)
• GROUP BY CUBE (A, B, ..., Z) means group by all possible subsets of the set {A, B, ..., Z}
CUBE
• Means group by all possible subsets of the set {A, B, ..., Z}
  – M = {A, B, ..., Z}, |M| = n
  – Power set (algebra): P(M) := {N | N ⊆ M}, |P(M)| = 2^n (proof by induction)
• Each subset represents a different degree of summarization
• In data mining, such a subset is called a cuboid
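The power-set reading of CUBE can be checked directly. The sketch below enumerates the grouping sets with itertools.combinations; cube_sets is a hypothetical helper name used only for illustration.

```python
from itertools import combinations


def cube_sets(columns):
    """Grouping sets generated by CUBE: every subset of the columns, largest first."""
    return [
        subset
        for k in range(len(columns), -1, -1)
        for subset in combinations(columns, k)
    ]


# CUBE (S#, P#) produces 2^2 = 4 grouping sets (cuboids).
print(cube_sets(["S#", "P#"]))
# With three columns there are 2^3 = 8 of them.
print(len(cube_sets(["A", "B", "C"])))
```

The exponential growth in the number of cuboids is why cube computation gets expensive as dimensions are added.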
Multidimensional model
• Views data in an n-dimensional space: the data cube
  – Composed of dimensions and facts
• Dimensions: perspectives used to analyze the data
  – Example: a 3-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity
• Attributes describe dimensions
  – Product dimension may have attributes ProductNumber and UnitPrice (not shown)
• Cells or facts have associated numeric values called measures
  – Each cell of the data cube represents the Quantity of units sold by category, quarter, and customer's city
[Figure: three-dimensional data cube of measure values with dimensions Product (Category: Beverages, Condiments, Seafood, Produce), Time (Quarter: Q1 to Q4), and Customer (City: Paris, Lyon, Köln, Berlin)]
Characteristics of a data cube
• Data granularity: level of detail at which measures are represented for each dimension of the cube
  – Example: sales figures aggregated to granularities Category, Quarter, and City
• Instances of a dimension are called members
  – Ex: Seafood and Beverages are members of the Product dimension at the Category granularity
• A data cube may contain several measures
  – E.g., Amount, indicating the total sales amount (not shown)
• A data cube may be sparse (typical case) or dense
  – Ex: not all customers may have ordered products of all categories during all quarters
Hierarchies
• Allow viewing data at several granularities
  – Define a sequence of mappings relating lower-level, detailed concepts to higher-level ones
  – The lower level is called the child and the higher level is called the parent
  – The hierarchical structure of a dimension is called the dimension schema
  – A dimension instance comprises all members at all levels in a dimension
• Example: hierarchies of the Product, Time, and Customer dimensions
[Figure: hierarchies of the three dimensions]
  – Product: Product → Category → All
  – Time: Day → Month → Quarter → Semester → Year → All
  – Customer: Customer → City → State → Country → Continent → All
Members of hierarchy
• Members of the hierarchy Product → Category
[Figure: tree of members at each level]
  – All: all
  – Category: Beverages, Seafood, ...
  – Product: Chai, Chang, ... (under Beverages); Ikura, Konbu, ... (under Seafood)
Classification of measures
• Each measure is associated with an aggregation function that combines several measure values into a single one
  – Aggregation of measures takes place when we change the level of detail at which data in a cube is visualized
• Measures can be classified according to the way they can be aggregated:
  – Additive: can be meaningfully summarized along all the dimensions, using addition (most common type)
  – Semiadditive: can be meaningfully summarized using addition along some dimensions only (example: inventory quantities, which cannot be added along the Time dimension)
  – Nonadditive: cannot be meaningfully summarized using addition across any dimension (Ex: item price, cost per unit, and exchange rate)
Another Classification of Measures
• Another classification of measures:
  – Distributive: defined by an aggregation function that can be computed in a distributed way; count, sum, minimum, and maximum are distributive, distinct count is not
    • Ex: S = {3, 3, 4, 5, 8, 4, 7, 3, 8} partitioned into subsets {3, 3, 4}, {5, 8, 4}, {7, 3, 8}: adding the distinct counts of the subsets gives 8, while the distinct count over the original set is 5
  – Algebraic: defined by an aggregation function that can be expressed as a scalar function of distributive ones; example: average, computed by dividing the sum by the count
  – Holistic: cannot be computed from other subaggregates (e.g., median, rank)
• Most large data cube applications require efficient computation of distributive and algebraic measures
  – It is difficult to efficiently compute holistic measures
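The three classes can be tried out on the partitioned set from the example above. This is an illustrative sketch using only the Python standard library; the variable names are arbitrary.

```python
from statistics import median

# The set from the slide and its three partitions.
S = [3, 3, 4, 5, 8, 4, 7, 3, 8]
parts = [[3, 3, 4], [5, 8, 4], [7, 3, 8]]

# Distributive: the sum of the partial sums equals the sum over the whole set.
total_from_parts = sum(sum(p) for p in parts)

# Not distributive: adding per-partition distinct counts over-counts shared values.
distinct_from_parts = sum(len(set(p)) for p in parts)
distinct_overall = len(set(S))

# Algebraic: the average is a scalar function of two distributive aggregates.
avg_from_parts = sum(sum(p) for p in parts) / sum(len(p) for p in parts)

# Holistic: the median of the partition medians is not the overall median.
median_of_medians = median(median(p) for p in parts)
median_overall = median(S)

print(total_from_parts, sum(S))
print(distinct_from_parts, distinct_overall)
print(avg_from_parts)
print(median_of_medians, median_overall)
```

Distributive and algebraic measures can therefore be rolled up from stored subaggregates, while a holistic measure generally needs the detailed data again.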
More about measures
• When defining a measure we must determine the associated aggregation functions
  – For example, a semiadditive measure representing inventory quantities can be aggregated using average along the Time dimension, and using addition along other dimensions
• Summarizability refers to the correct aggregation of cube measures along dimension hierarchies
• Summarizability conditions:
  – Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets
  – Completeness: all instances are included in the hierarchy and each instance is related to one parent in the next level
  – Correctness: refers to the correct use of the aggregation functions
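The first two conditions can be checked mechanically on a child-to-parent mapping. The sketch below is one possible reading of them (no child with two parents, no child without a parent); the function check_hierarchy and the city/country links are hypothetical examples, not taken from the slides.

```python
def check_hierarchy(children, parent_links):
    """Check disjointness and completeness of one hierarchy step.

    children: set of members of the lower level.
    parent_links: iterable of (child, parent) pairs.
    Returns (disjoint, complete).
    """
    parents_of = {c: set() for c in children}
    for child, parent in parent_links:
        parents_of.setdefault(child, set()).add(parent)
    # Disjointness: no member is grouped under more than one parent.
    disjoint = all(len(ps) <= 1 for ps in parents_of.values())
    # Completeness: every member of the level has at least one parent.
    complete = all(len(parents_of[c]) >= 1 for c in children)
    return disjoint, complete


cities = {"Paris", "Lyon", "Köln", "Berlin"}
links = [("Paris", "France"), ("Lyon", "France"), ("Köln", "Germany"), ("Berlin", "Germany")]
print(check_hierarchy(cities, links))

# A city assigned to two countries would be counted twice when rolling up.
print(check_hierarchy(cities, links + [("Köln", "France")]))

# A city with no country would be lost when rolling up.
print(check_hierarchy(cities | {"Bilbao"}, links))
```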
Data Warehouse Architecture
[Figure: typical data warehouse architecture]
  – Data sources: operational databases, internal sources, external sources
  – Back-end tier: ETL process, data staging
  – Data warehouse tier: enterprise data warehouse, data marts, metadata
  – OLAP tier: OLAP server
  – Front-end tier: OLAP tools, reporting tools, statistical tools, data mining tools
Components of the DW architecture
• Back-end tier:
  – Extraction, Transformation, and Loading (ETL) process: feeds data into the data warehouse from operational databases and other data sources
  – Data Staging Area (DSA): intermediate database where all the data integration and transformation processes are run prior to the loading of the data into the data warehouse
• Data warehouse tier:
  – Enterprise data warehouse and/or several data marts
  – Metadata repository storing information about the data warehouse and its contents
• OLAP tier:
  – OLAP server, which provides a multidimensional view of the data, regardless of the actual way in which data are stored
• Front-end tier: used for data analysis and visualization
  – Contains client tools such as OLAP tools, reporting tools, statistical tools, and data-mining tools
Back-End Tier
• It is a 3-step process: Extraction, Transformation, and Loading
  1. Extraction gathers data from multiple, heterogeneous data sources internal or external to the organization
  2. Transformation modifies the data from the format of the data sources to the warehouse format; this includes:
     – Cleaning: removes errors and inconsistencies in the data and converts it into a standardized format
     – Integration: reconciles data from different data sources, both at the schema and at the data level
     – Aggregation: summarizes the data obtained from data sources according to the granularity of the data warehouse
  3. Loading feeds the data warehouse with the transformed data, including refreshing the data warehouse, that is, propagating updates from the data sources to the data warehouse at a specified frequency
• Data staging area (also called operational data store): a database where data extracted from the sources undergoes successive modifications before being loaded into the data warehouse
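The three steps can be sketched on toy data. Everything here is an assumption made for illustration: the two sources, their field names, the cleaning rule (drop rows with a missing amount), and the in-memory dictionary standing in for the warehouse. Real ETL tools are far more involved.

```python
from collections import defaultdict

# Extraction: rows as they might arrive from two hypothetical sources
# that use different field names, spellings, and units.
source_a = [{"city": "Paris ", "amount_eur": "100.0"}, {"city": "lyon", "amount_eur": "50.0"}]
source_b = [{"town": "PARIS", "amount_cents": 2500}, {"town": "Berlin", "amount_cents": None}]


def transform(rows_a, rows_b):
    """Clean, integrate, and aggregate the extracted rows into sales per city."""
    cleaned = []
    for r in rows_a:
        # Cleaning: normalize city spelling and convert text to a number.
        cleaned.append((r["city"].strip().title(), float(r["amount_eur"])))
    for r in rows_b:
        if r["amount_cents"] is None:
            continue  # Cleaning: drop rows with a missing amount (an assumed rule).
        # Integration: reconcile field names and units (cents to euros).
        cleaned.append((r["town"].strip().title(), r["amount_cents"] / 100))
    # Aggregation: summarize to the warehouse granularity (one value per city).
    totals = defaultdict(float)
    for city, amount in cleaned:
        totals[city] += amount
    return dict(totals)


def load(target, facts):
    """Loading: add the transformed facts to the target store."""
    for city, amount in facts.items():
        target[city] = target.get(city, 0.0) + amount


warehouse = {}
load(warehouse, transform(source_a, source_b))
print(warehouse)
```

In this sketch the list `cleaned` plays the part of the staging area: data sits there between extraction and loading while it is being modified.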
DW Tier
• Enterprise data warehouse: centralized and encompassing an entire organization
• Several data marts: specialized departmental data warehouses
• Metadata
  – Business metadata describes the semantics of the data, organizational rules, policies, and constraints related to the data
  – Technical metadata describes how data are structured and stored in a computer system, and the applications and processes that manipulate the data
• Metadata repository may contain information such as:
  – Metadata describing the structure of the data warehouse and the data marts, at the conceptual/logical level (facts, dimensions, hierarchies, ...) and at the physical level (indexes, partitions, ...)
  – Security information (user authorization and access control), and monitoring information (usage statistics, error reports, audit trails)
  – Metadata describing data sources: schemas, ownership, update frequencies, legal limitations, access methods
  – Metadata describing the ETL: data lineage, data extraction, cleaning, transformation rules, etc.
OLAP Tier
• OLAP server, which presents business users with multidimensional data from data warehouses or data marts
  – Products include OLAP extensions and tools allowing building, querying, and navigating cubes, analysis, and reporting
• Not yet a standardized language for defining and manipulating data cubes
  – MDX (MultiDimensional eXpressions): query language for OLAP databases, a de facto standard for querying OLAP systems
  – SQL extended for providing analytical capabilities: SQL/OLAP
OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  – Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability
• Multidimensional OLAP (MOLAP)
  – Sparse array-based multidimensional storage engine
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  – Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Red Brick)
  – Specialized support for SQL queries over star/snowflake schemas
Front-End Tier
• Client tools that allow users to exploit the content of the data warehouse
  – OLAP tools: allow interactive exploration and manipulation of the warehouse data and formulation of complex ad hoc queries
  – Reporting tools: enable the production, delivery, and management of reports, which can be paper-based, interactive, or web-based
    • Reports use predefined queries, that is, queries asking for specific information in a specific format, performed on a regular basis
  – Statistical tools: used to analyze and visualize the cube data using statistical methods
  – Data mining tools: allow users to analyze data in order to discover valuable knowledge such as patterns and trends, and also to make predictions based on current data
Variations of the architecture
1. Only an enterprise data warehouse without data marts or, alternatively, an enterprise data warehouse does not exist
2. An OLAP server does not exist and/or the client tools directly access the data warehouse
3. Neither a data warehouse nor an OLAP server: a virtual data warehouse (virtual data integration), which defines a set of views over operational databases that are materialized for efficient access
   – Does not contain historical data, centralized metadata, etc.
4. Data staging area may not be needed when the data in the source systems conforms very closely to the data in the warehouse
OLAP Operations: definition
• Allow the user to view data from different perspectives and at several levels of detail by exploiting dimensions and their hierarchies
• Provide an interactive data analysis environment
OLAP Operations (1)
[Figure: example cubes over Product (Category), Time, and Customer: the original cube at the Quarter and City levels (Paris, Lyon, Köln, Berlin), a cube at the Customer (Country) level (France, Germany), a cube at the Time (Month) level (Jan to Dec), and a cube with its Product members in a different order]
OLAP Operations (2)
[Figure: example cubes derived from the original Product (Category) × Time (Quarter) × Customer (City) cube: one with its axes rotated, one without the Customer dimension, and one restricted to the cities Paris and Lyon and the quarters Q1 and Q2]
OLAP Operations (3)
[Figure: four example cubes over Product (Category), Time (Quarter), and Customer (City)]
OLAP Operations (4)
[Figure: example cubes over Product (Category), Time (Quarter), and Customer (City), together with two aggregated results shown as Time (Quarter) × Customer (City) tables: SUM by Time and Customer, and max() by quarter and city]
• According to the authors of the book, aggregation functions can be classified as:
– Cumulative: compute the measure value of a cell from several other cells (e.g., SUM, COUNT, AVG)
– Filtering: filter the members of a dimension that appear in the result (e.g., MIN, MAX); they must compute not only the aggregated value but also determine the dimension members that belong to the result
OLAP Operations (5)
[Figure: four cubes with axes Product (Category), Time, and Customer (City): two with Time at the Month level and two with Time at the Quarter level, the last one extended with the cities Bilbao and Madrid.]
Algebra of OLAP Operations
• There is not yet a standard definition of OLAP operations comparable to the relational algebra
• Many OLAP algebras have been proposed in the literature
• We adopt the one proposed in [Ciferri et al. 2013]
Algebra of OLAP Operations - rollup
• Roll-up: aggregates measures along a dimension hierarchy (using an aggregate function) to obtain measures at a coarser granularity
– ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)
[Figure: the cube at the Customer (Country) level, with members France and Germany, next to the cube at the Customer (City) level, with members Paris, Lyon, Köln, and Berlin; the other axes are Product (Category) and Time (Quarter).]
ROLLUP(Sales2012, Customer → Country, SUM(Quantity))
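The roll-up above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cube is modelled as a dictionary from (city, quarter, category) to Quantity, the quantities are invented, and `country_of` and `rollup_customer_to_country` are hypothetical names, not part of the algebra.

```python
from collections import defaultdict

# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Lyon", "Q1", "Beverages"): 5,
    ("Berlin", "Q1", "Beverages"): 7,
    ("Köln", "Q1", "Beverages"): 3,
    ("Paris", "Q2", "Beverages"): 4,
}

# Assumed City -> Country hierarchy of the Customer dimension.
country_of = {"Paris": "France", "Lyon": "France",
              "Berlin": "Germany", "Köln": "Germany"}

def rollup_customer_to_country(cube):
    """SUM the quantities of all cities that belong to the same country."""
    out = defaultdict(int)
    for (city, quarter, category), qty in cube.items():
        out[(country_of[city], quarter, category)] += qty
    return dict(out)

by_country = rollup_customer_to_country(sales)
```

The Time and Product coordinates are untouched; only the Customer coordinate moves to the coarser level.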
Algebra of OLAP Operations – drill-down
• Drill-down: moves from a more general level to a more detailed level in a hierarchy
– DRILLDOWN(CubeName, (Dimension → Level)*)
[Figure: the cube at the Time (Quarter) level next to the cube at the Time (Month) level, with months Jan, Feb, Mar, ..., Dec; the other axes are Product (Category) and Customer (City).]
DRILLDOWN(Sales2012, Time → Month)
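A drill-down cannot be computed from the coarser cube alone; the finer-grained data must still be available. The sketch below assumes that base facts are kept at the Month level, so that both the Quarter view and the Month view are derived from them. The data and the names `base`, `quarter_of`, and `at_time_level` are invented for illustration.

```python
from collections import defaultdict

# Base facts at the finest granularity: (city, month, category) -> quantity.
base = {
    ("Paris", "Jan", "Beverages"): 3,
    ("Paris", "Feb", "Beverages"): 4,
    ("Paris", "Mar", "Beverages"): 5,
    ("Paris", "Apr", "Beverages"): 6,
}

# Assumed Month -> Quarter hierarchy of the Time dimension (partial).
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def at_time_level(facts, level):
    """Aggregate the base facts at the Month or Quarter level of Time."""
    out = defaultdict(int)
    for (city, month, category), qty in facts.items():
        member = month if level == "Month" else quarter_of[month]
        out[(city, member, category)] += qty
    return dict(out)

quarterly = at_time_level(base, "Quarter")  # coarse view
monthly = at_time_level(base, "Month")      # drill-down: recomputed from base facts
```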
Algebra of OLAP Operations – sort
• Sort: returns a cube where the members of a dimension have been sorted
– SORT(CubeName, Dimension, Expression [ASC | DESC])
– The members of Dimension are sorted according to the value of Expression
[Figure: two cubes with axes Product (Category), Time (Quarter), and Customer (City).]
SORT(Sales2012, Product, Category)
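As a rough sketch, sorting only changes the order in which the members of one dimension are presented; the cells are untouched. The example below orders the Product members by name and, as a second variant, by total quantity in descending order. The cube contents and the helper names are assumptions made for illustration.

```python
# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Seafood"): 8,
    ("Paris", "Q1", "Beverages"): 10,
    ("Paris", "Q1", "Produce"): 2,
    ("Paris", "Q1", "Condiments"): 6,
}

PRODUCT = 2  # position of the Product coordinate in the cell key

def sorted_members(cube, axis, key=None, descending=False):
    """Return the members of one dimension in sorted order."""
    members = {coords[axis] for coords in cube}
    return sorted(members, key=key, reverse=descending)

def total_quantity(category):
    """Sort expression: total quantity sold for a category."""
    return sum(q for coords, q in sales.items() if coords[PRODUCT] == category)

by_name = sorted_members(sales, PRODUCT)
by_quantity_desc = sorted_members(sales, PRODUCT, key=total_quantity, descending=True)
```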
Algebra of OLAP Operations – pivot
• Pivot (or rotate): rotates the axes of a cube to provide an alternative presentation of the data
– PIVOT(CubeName, (Dimension → Axis)*)
– The axes are specified as {X, Y, Z, X1, Y1, Z1, ...}
[Figure: two cubes over Product (Category), Time (Quarter), and Customer (City), with the dimensions assigned to different axes.]
PIVOT(Sales, Time → X, Customer → Y, Product → Z)
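In a dictionary-based model of a cube, a pivot amounts to permuting the coordinates of every cell key. The following is a minimal sketch with invented data; `pivot` and the dimension-name tuples are hypothetical helpers, not part of the algebra.

```python
# Toy cube with cell keys in the order (Customer, Time, Product). Values invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Lyon", "Q2", "Seafood"): 4,
}

def pivot(cube, dims, new_order):
    """Reorder the coordinates of every cell from `dims` to `new_order`."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(coords[i] for i in idx): v for coords, v in cube.items()}

# Time on X, Customer on Y, Product on Z.
pivoted = pivot(sales,
                dims=("Customer", "Time", "Product"),
                new_order=("Time", "Customer", "Product"))
```

The measure values are unchanged; only the presentation order of the dimensions differs.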
Algebra of OLAP Operations – slice
• Slice: removes a dimension from a cube, so a cube of n-1 dimensions is obtained from a cube of n dimensions
– SLICE(CubeName, Dimension, Level = Value)
• The Dimension is dropped by fixing a single Value in the Level; the other dimensions are unchanged
• Slice assumes that the granularity of the cube is at the specified level of the dimension
[Figure: a two-dimensional cube over Product (Category) and Time (Quarter), next to the three-dimensional cube that also has Customer (City).]
SLICE(Sales, Customer, City = 'Paris')
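A slice can be sketched as a filter on one coordinate followed by dropping that coordinate from the cell key, which is what reduces the dimensionality. The data and the function name `slice_cube` are invented for illustration.

```python
# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Paris", "Q2", "Seafood"): 4,
    ("Lyon", "Q1", "Beverages"): 5,
}

CUSTOMER = 0  # position of the Customer coordinate in the cell key

def slice_cube(cube, axis, value):
    """Keep the cells where coords[axis] == value and drop that coordinate."""
    return {
        coords[:axis] + coords[axis + 1:]: v
        for coords, v in cube.items()
        if coords[axis] == value
    }

paris = slice_cube(sales, CUSTOMER, "Paris")  # keys are now (quarter, category)
```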
Algebra of OLAP Operations – dice
• Dice: keeps the cells of a cube that satisfy a Boolean condition Φ
– DICE(CubeName, Φ)
• Φ is a Boolean condition over dimension levels, attributes, and measures
[Figure: a cube restricted to the cities Paris and Lyon and the quarters Q1 and Q2, next to the full cube with axes Product (Category), Time (Quarter), and Customer (City).]
DICE(Sales, (Customer.City = 'Paris' OR Customer.City = 'Lyon') AND (Time.Quarter = 'Q1' OR Time.Quarter = 'Q2'))
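Unlike a slice, a dice keeps every dimension and only filters cells. A minimal sketch, with invented data and a hypothetical `dice` helper that takes the condition Φ as a Python predicate:

```python
# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Lyon", "Q2", "Beverages"): 5,
    ("Berlin", "Q1", "Beverages"): 7,
    ("Paris", "Q3", "Beverages"): 4,
}

def dice(cube, condition):
    """Keep the cells for which condition(coords, value) is true."""
    return {coords: v for coords, v in cube.items() if condition(coords, v)}

# (City = 'Paris' OR City = 'Lyon') AND (Quarter = 'Q1' OR Quarter = 'Q2')
diced = dice(
    sales,
    lambda coords, v: coords[0] in {"Paris", "Lyon"} and coords[1] in {"Q1", "Q2"},
)
```

The result still has three coordinates per cell, so the cube keeps its dimensionality.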
Algebra of OLAP Operations – drill-across
• Drill-across: combines cells from two data cubes that have the same schema
– DRILLACROSS(CubeName1, CubeName2, [Condition])
[Figure: three cubes with axes Product (Category), Time (Quarter), and Customer (City).]
Sales2011-2012 ← DRILLACROSS(Sales2011, Sales2012)
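One way to picture drill-across is as a join of two cubes on their shared coordinates, producing cells that carry the measures of both. The sketch below assumes the default case without a Condition and matches cells with identical coordinates; the data and the measure names in the result are illustrative.

```python
# Two toy cubes with the same schema: (city, quarter, category) -> quantity.
sales2011 = {
    ("Paris", "Q1", "Beverages"): 20,
    ("Lyon", "Q1", "Beverages"): 10,
}
sales2012 = {
    ("Paris", "Q1", "Beverages"): 15,
    ("Lyon", "Q1", "Beverages"): 12,
}

def drill_across(cube_a, cube_b, name_a, name_b):
    """Join two cubes on identical coordinates, keeping both measures."""
    shared = cube_a.keys() & cube_b.keys()
    return {c: {name_a: cube_a[c], name_b: cube_b[c]} for c in shared}

sales2011_2012 = drill_across(sales2011, sales2012, "Quantity2011", "Quantity2012")
```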
Algebra of OLAP Operations – ADD-MEASURE
• Add measure: adds new measures to a cube
– ADDMEASURE(CubeName, (NewMeasure = Expression, [AggFct])*)
• Drop measure: deletes a measure from a cube schema
– DROPMEASURE(CubeName, Measure*)
[Figure: two cubes with axes Product (Category), Time (Quarter), and Customer (City).]
ADDMEASURE(Sales2011-2012, PercChange = (Quantity2011-Quantity2012)/Quantity2011)
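The PercChange example can be sketched as a function that evaluates an expression over the existing measures of each cell and stores the result under a new name. The cube contents are invented and `add_measure` is a hypothetical helper; the expression follows the formula shown above.

```python
# Toy result of a drill-across: each cell holds both yearly quantities (invented).
sales2011_2012 = {
    ("Paris", "Q1", "Beverages"): {"Quantity2011": 20, "Quantity2012": 15},
    ("Lyon", "Q1", "Beverages"): {"Quantity2011": 10, "Quantity2012": 12},
}

def add_measure(cube, name, expression):
    """Return a new cube whose cells carry one extra measure."""
    return {c: {**m, name: expression(m)} for c, m in cube.items()}

with_change = add_measure(
    sales2011_2012,
    "PercChange",
    lambda m: (m["Quantity2011"] - m["Quantity2012"]) / m["Quantity2011"],
)
```

With this formula a drop in sales gives a positive value (Paris) and a rise gives a negative one (Lyon).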
Algebra of OLAP Operations – ADD-MEASURE
• Another example: computes the value of a cell by aggregating the measures of several nearby cells
[Figure: two cubes with axes Product (Category), Time (Month), and Customer (City), with months Jan, Feb, Mar, ..., Dec.]
ADDMEASURE(Sales, MovAvg = AVG(Quantity) OVER Time 2 CELLS PRECEDING)
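A small sketch of the moving average, reading "2 CELLS PRECEDING" as a window made of the current month plus the two months before it. That reading, the shorter window used for the first two months, and the monthly quantities are all assumptions for illustration.

```python
# Monthly quantities for one (city, category) pair; values are invented.
months = ["Jan", "Feb", "Mar", "Apr"]
quantity = {"Jan": 3, "Feb": 6, "Mar": 9, "Apr": 12}

def moving_avg(series, order, preceding=2):
    """Average of each cell and up to `preceding` cells before it along Time."""
    out = {}
    for i, member in enumerate(order):
        window = order[max(0, i - preceding): i + 1]
        out[member] = sum(series[w] for w in window) / len(window)
    return out

mov_avg = moving_avg(quantity, months)
```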
Algebra of OLAP Operations – aggregate functions
• Aggregation functions in OLAP are also needed at the current granularity, that is, without performing a roll-up
– AggFunction(CubeName, Measure) [BY Dimension*]
– Cumulative: compute the measure value of a cell from several other cells; examples are SUM, COUNT, and AVG
– Filtering: filter the members of a dimension that appear in the result; examples are MIN and MAX. Filtering functions compute not only the aggregated value, but also the members of the dimension that belong to the result
[Figure: "SUM BY Time, Customer" shown as a two-dimensional table of Time (Quarter) by Customer (City), next to the cube with axes Product (Category), Time (Quarter), and Customer (City).]
SUM(Sales, Quantity) BY Time, Customer
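As a sketch of a cumulative function, SUM ... BY Time, Customer can be read as summing over every dimension that is not listed after BY (here, Product). The data and the helper `sum_by` are invented for illustration.

```python
from collections import defaultdict

# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Paris", "Q1", "Seafood"): 8,
    ("Lyon", "Q1", "Beverages"): 5,
}

CUSTOMER, TIME = 0, 1  # positions of the coordinates in the cell key

def sum_by(cube, keep_axes):
    """Sum the measure over all dimensions that are not in keep_axes."""
    out = defaultdict(int)
    for coords, v in cube.items():
        out[tuple(coords[i] for i in keep_axes)] += v
    return dict(out)

totals = sum_by(sales, (TIME, CUSTOMER))  # keys are (quarter, city)
```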
Algebra of OLAP Operations – aggregate functions
• Another example: max sales by quarter and city
[Figure: two cubes with axes Product (Category), Time (Quarter), and Customer (City).]
MAX(Sales, Quantity) BY Time, Customer
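A filtering function such as MAX has to report which cell produced the value, not just the value. The sketch below keeps, for each (quarter, city) pair, the full coordinates of the winning cell together with its quantity. Data and names are invented, and ties are resolved by keeping the first cell encountered, which is an arbitrary choice.

```python
# Toy cube: (city, quarter, category) -> quantity. All values are invented.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Paris", "Q1", "Seafood"): 8,
    ("Lyon", "Q1", "Beverages"): 5,
    ("Lyon", "Q1", "Produce"): 9,
}

CUSTOMER, TIME = 0, 1  # positions of the coordinates in the cell key

def max_by(cube, keep_axes):
    """For each group, return (coordinates of the maximal cell, its value)."""
    best = {}
    for coords, v in cube.items():
        key = tuple(coords[i] for i in keep_axes)
        if key not in best or v > best[key][1]:
            best[key] = (coords, v)
    return best

top = max_by(sales, (TIME, CUSTOMER))
```

Here the result keeps the Product member (Beverages for Paris, Produce for Lyon), which a cumulative function like SUM would discard.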
Algebra of OLAP Operations – union, difference, drill-through
• Union: merges two cubes having the same schema but disjoint instances
– Ex: if SalesSpain is a cube having the same schema as the original cube but containing only the sales to Spanish customers, we can perform: UNION(Sales, SalesSpain)
• Difference: removes the cells in a cube that belong to another one; the two cubes must have the same schema
• Drill-through: moves from data at the bottom level of a cube to the data in the operational systems from which the cube was derived; it can be used, for example, to determine the reason for outlier values in a data cube
[Figure: the cube with cities Paris, Lyon, Köln, and Berlin next to the cube extended with Bilbao and Madrid; the other axes are Product (Category) and Time (Quarter).]
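Union and difference can be sketched as set-like operations on the cells of two cubes with the same schema. The data are invented, and rejecting overlapping cells in `union` is one possible way to enforce the "disjoint instances" requirement, not something the slides prescribe.

```python
# Two toy cubes with the same schema: (city, quarter, category) -> quantity.
sales = {
    ("Paris", "Q1", "Beverages"): 10,
    ("Berlin", "Q1", "Beverages"): 7,
}
sales_spain = {
    ("Madrid", "Q1", "Beverages"): 6,
    ("Bilbao", "Q1", "Beverages"): 2,
}

def union(cube_a, cube_b):
    """Merge two cubes; their instances must be disjoint."""
    if cube_a.keys() & cube_b.keys():
        raise ValueError("cubes overlap: union requires disjoint instances")
    return {**cube_a, **cube_b}

def difference(cube_a, cube_b):
    """Remove from cube_a the cells that also belong to cube_b."""
    return {c: v for c, v in cube_a.items() if c not in cube_b}

all_sales = union(sales, sales_spain)
without_spain = difference(all_sales, sales_spain)
```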
Next Lecture
• Conceptual Data Warehouse Design
• Is Slice (city = lisbon or city = porto) a slice or a dice, assuming we start from a cube with three dimensions?
• And city = lisbon and quarter = Q1?