8/13/2019 Olap-Oct2013 Data Mining
Unit - 2
Online Analytical Processing (OLAP)
G. K. Gupta, Second Edition
An OLAP dimension is an attribute, or an ordinate within a multidimensional structure, with a list of values.
On-line Analytical Processing (OLAP) is a technique for providing management decision support using historical and summarized data that is consolidated in the data warehouse.
A fact is a value determined by a combination of dimension values.
A fact table is a flattened representation of a multidimensional cube; it holds the values of the measures.
The SQL GROUP BY clause is used as the aggregation operator.
OLAP systems are data warehouse front ends that provide aggregate data.
Data warehouses (DW) and OLAP are based on a multidimensional conceptual view of enterprise data.
University student data with the dimensions degree, country, scholarship and year is given below (multidimensional view):

Country \ Degree   BSc  LLB  MBBS  B.com  BIT  ALL
Australia            5   20    15     50   11  101
India               10    0    15     25   17   67
Malaysia             5    1    10     12   23   51

The above table is for the year 2000. We have collected data for other years as well.
Spreadsheets have a scalability problem: it is difficult to work with millions of rows or thousands of formulas.
Data cubes generalize spreadsheets to any number of dimensions.
1. Definition - E. F. Codd
OLAP is the dynamic enterprise analysis required to create, manipulate, animate and synthesise information from exegetical, contemplative and formulaic data analysis models.
Information is manipulated from the point of view of a manager (exegetical), from the point of view of someone who has thought about it (contemplative) and according to some formula (formulaic).
OLAP Features
Four enterprise data models:
Categorical - comparison of historical values
Exegetical - discovering reasons for what the categorical model found
Contemplative - what-if analysis of the data
Formulaic - how to reach a desired goal
2. Characteristics of OLAP systems
1. Users: OLTP systems are designed for office workers, while OLAP systems are designed for decision makers.
2. Functions: OLTP systems are mission-critical and support the enterprise's day-to-day operations; OLAP systems are management-critical and support the enterprise's decision-support functions.
3. Nature: OLTP systems are designed to process one record at a time; OLAP systems deal with many customer records at a time to provide summary or aggregate data to a manager.
4. Design: OLTP systems are application-oriented and view enterprise data as a collection of tables; OLAP systems are subject-oriented and view enterprise data as multidimensional.
5. Data: OLTP systems deal only with the current status of information (e.g. details of an employee who left three years ago may no longer be available); OLAP requires historical data over several years.
6. Kind of use: OLTP systems are used for read and write operations; OLAP systems do not update the data.
                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day to day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical, summarized,
                     detailed, flat relational,    multidimensional,
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100MB-GB                      100GB-TB
metric               transaction throughput        query throughput, response
FASMI Characteristics
The name is derived from the first letters of the following characteristics of OLAP systems:
Fast: OLAP queries should be answered quickly, like a search engine. Such performance is difficult to achieve, so good data structures, suitable hardware and precomputation of the most commonly queried aggregates are needed.
Analytic: OLAP queries should be answered without any programming. The vendor tool should cope with any query relevant to the application and the user.
Shared: An OLAP system is a shared resource, although it is not shared by many people; it is accessed only by a group of managers and used by selected users. Concurrency control is needed if users write or update data in the database.
Multidimensional: OLAP should provide a multidimensional conceptual view of the data, presenting the data as a cube whose dimensions are shown as parent/child relationships.
Information: OLAP obtains its information from a data warehouse, so it must be able to handle large amounts of input data.
Codd's OLAP Characteristics
Codd, in his 1993 paper, lists 12 rules for evaluating OLAP products. The characteristics covered here are:
1. Multidimensional conceptual view: makes a variety of manipulations (e.g. slice and dice) relatively easy.
2. Accessibility: the OLAP system should sit between the data sources and an OLAP front end.
3. Batch extraction vs interpretive: the OLAP system should provide multidimensional data staging plus partial pre-calculation of aggregates.
4. Multi-user support: the system should provide normal database operations such as retrieval, update, concurrency control, integrity and security.
5. Storing OLAP results: results should be kept separate from source data. Read-write OLAP applications should not be implemented directly on transaction data.
6. Extraction of missing values: missing values should be distinguished from zero values; otherwise aggregates will be computed incorrectly. Large cubes may contain a large number of zeroes.
7. Treatment of missing values: all missing values should be ignored, regardless of their source.
8. Uniform reporting performance: reporting performance should remain consistent as the number of dimensions grows.
9. Generic dimensionality: different dimensions should not be treated differently.
10. Unlimited dimensions and aggregation levels: some applications need as many as 15-20 dimensions, so the system should allow unlimited dimensions.
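Characteristics 6 and 7 can be illustrated with a small sketch. The counts below are made up for illustration, and the two helper functions are hypothetical names, not part of any OLAP product; the point is only that treating a missing value as zero changes an average.

```python
from statistics import mean

# Hypothetical enrolment counts for one degree across four countries.
# None marks a missing value (not reported); 0 is a genuine count of zero.
counts = [10, 0, None, 20]

def average_ignoring_missing(values):
    """Average over reported values only (missing values are ignored)."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None

def average_missing_as_zero(values):
    """Average that wrongly treats every missing value as a zero."""
    return mean(0 if v is None else v for v in values)

correct = average_ignoring_missing(counts)  # (10 + 0 + 20) / 3
skewed = average_missing_as_zero(counts)    # (10 + 0 + 0 + 20) / 4
print(correct, skewed)
```

The genuine zero still takes part in the first average, while the missing value does not; in the second, the missing value silently pulls the average down.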
3. Multidimensional view and Data Cube
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
Consider the following database:
Student(sid, name1, stu_name, country, DOB, address)
Enrolment(sid, Degree_id, SSemester)
Degree(Degree_id, Degree_name, Degree_length, Fee, Dept)
Detailed example on page 413. We consider a two-dimensional view, i.e. country X degree.
Country \ Degree   BSc  LLB  MBBS  B.com  BIT  ALL
Australia            5   20    15     50   11  101
India               10    0    15     25   17   67
Malaysia             5    1    10     12   23   51
For year 2000

Country \ Degree   BSc  LLB  MBBS  B.com  BIT  ALL
Australia            7   10    16     53   10   96
India                9    0    17     22   13   61
Malaysia             5    1    19     19   20   64
For year 2001

Country \ Degree   BSc  LLB  MBBS  B.com  BIT  ALL
Australia           12   30    31    103   21  197
India               19    0    32     47   30  128
Malaysia            10    2    29     31   43  115
Aggregates for both semesters
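The aggregate table can be reproduced from the 2000 and 2001 tables by adding them cell by cell and then totalling each row. The following is a minimal sketch using plain dictionaries; the function names `combine` and `with_all` are illustrative choices, not taken from the slides.

```python
degrees = ["BSc", "LLB", "MBBS", "B.com", "BIT"]

# Counts per country, in the order of the degrees list (from the tables above).
year_2000 = {
    "Australia": [5, 20, 15, 50, 11],
    "India": [10, 0, 15, 25, 17],
    "Malaysia": [5, 1, 10, 12, 23],
}
year_2001 = {
    "Australia": [7, 10, 16, 53, 10],
    "India": [9, 0, 17, 22, 13],
    "Malaysia": [5, 1, 19, 19, 20],
}

def combine(first, second):
    """Add two country -> counts tables cell by cell."""
    return {c: [a + b for a, b in zip(first[c], second[c])] for c in first}

def with_all(table):
    """Append the ALL column (the row total) to each row."""
    return {c: row + [sum(row)] for c, row in table.items()}

aggregate = with_all(combine(year_2000, year_2001))
print("Country", degrees + ["ALL"])
for country, row in aggregate.items():
    print(country, row)
```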
4. Data Cube
Number of students as a function of country, degree and semester.
Dimensions: country, degree, semester
Hierarchical summarization paths:
continent -> region -> country
school -> ug/pg -> degree
year -> semester
Each edge of the cube is called a dimension.
A user therefore has a multidimensional conceptual view of the data,
which is represented by the cube.
The points inside a cube provide aggregations. For example, a point
may provide the number of students from Malaysia admitted to BCom
in year 1998.
The cube is not always three-dimensional; it may have any number of dimensions.
Each dimension may be associated with a table that describes the
dimension.
For example, a dimension table for country would contain the country
names and could contain other information e.g. category.
Other dimensions, such as time, do not naturally have such a table of
information.
A cube is represented in three dimensions: country X degree X semester for any country (x), any degree (y) and any start semester (z).
[Figure: data cube of student counts with axes degree (BSc, LLB, MBBS, B.com, BIT, All), country (Australia, India, Malaysia, All) and semester (2000, 2001, 2002, sum); the grand total is 863.]
For the query:
SELECT degree_id, count(*) FROM enrolment GROUP BY degree_id
the result is:

BSc  LLB  MBBS  B.com  BIT
 71   68   192    315  217
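The GROUP BY query above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the rows are made up for illustration and are not the slide's data, and the ORDER BY clause is an addition so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Enrolment (sid INTEGER, Degree_id TEXT, SSemester TEXT)")

# Made-up rows for illustration only; they are not the slide's data.
rows = [
    (1, "BSc", "2000"), (2, "BSc", "2001"), (3, "LLB", "2000"),
    (4, "BIT", "2000"), (5, "BIT", "2001"), (6, "BIT", "2001"),
]
conn.executemany("INSERT INTO Enrolment VALUES (?, ?, ?)", rows)

# Count the students enrolled in each degree.
query = (
    "SELECT Degree_id, COUNT(*) FROM Enrolment "
    "GROUP BY Degree_id ORDER BY Degree_id"
)
per_degree = conn.execute(query).fetchall()
print(per_degree)
```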
Each edge of the cube represents a dimension; for example, the members of the degree dimension are BSc, LLB, MBBS, B.com and BIT.
The ALL cells give the total number of students who joined each course in the respective country.
Measures that cannot be meaningfully combined across dimensions are called semi-additive or non-additive.
A data cube allows data to be modeled and viewed in multiple dimensions.
Dimension tables describe the dimensions, e.g. item (item_name, brand, type) or time (day, week, month, quarter, year).
The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
The eight types of aggregations or queries possible are: null, degree, semester, country, degree & semester, semester & country, degree & country, and all.
In general, 2^n aggregations are possible in n dimensions.
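The count of 2^n follows because each dimension is either grouped on or not. The sketch below enumerates every subset of the three dimensions; `all_group_bys` is an illustrative helper name, not part of any OLAP tool.

```python
from itertools import combinations

dimensions = ["degree", "country", "semester"]

def all_group_bys(dims):
    """Return every subset of dims; each subset is one possible grouping."""
    subsets = []
    for size in range(len(dims) + 1):
        subsets.extend(combinations(dims, size))
    return subsets

group_bys = all_group_bys(dimensions)
for g in group_bys:
    # The empty subset groups on nothing, i.e. it yields the grand total.
    print(g if g else "(no grouping: grand total)")
print(len(group_bys), "groupings for", len(dimensions), "dimensions")
```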
5. Data cube implementations
Solutions for aggregating and storing the data are:
1. Pre-compute and store all: millions of aggregates need to be computed and stored, and indexing such a large amount of data is also expensive.
2. Pre-compute (and store) none: aggregates are computed when a query is executed; no extra space is needed for storing the cube, but query response time is very poor.
3. Pre-compute and store some: pre-compute and store the most frequently queried aggregates.
Let a be the degree dimension, b the country and c the starting semester. The queries will then be based on (ALL, ALL, ALL), (a, ALL, ALL), (ALL, ALL, c), (ALL, b, ALL), (a, ALL, c), (ALL, b, c), (a, b, ALL) and (a, b, c).
Data cube implementations use many techniques for pre-computing aggregates and storing them.
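The "pre-compute and store some" strategy can be sketched as a small store of ready-made aggregates with a fallback that computes on demand. The facts, the choice of which aggregates count as frequently queried, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical base facts: one (degree, country, semester) tuple per student.
facts = [
    ("BSc", "India", "2000"), ("BSc", "India", "2001"),
    ("BIT", "Malaysia", "2000"), ("LLB", "Australia", "2000"),
    ("BIT", "India", "2001"),
]
DIMS = ("degree", "country", "semester")

def aggregate(group_by):
    """Count facts per combination of the dimensions named in group_by."""
    idx = [DIMS.index(d) for d in group_by]
    return Counter(tuple(f[i] for i in idx) for f in facts)

# Assumed to be the most frequently queried aggregates; computed up front.
precomputed = {g: aggregate(g) for g in [("degree",), ("country",)]}

def query(group_by):
    """Serve from the precomputed store if possible, else compute on demand."""
    if group_by in precomputed:
        return precomputed[group_by], "precomputed"
    return aggregate(group_by), "computed on demand"

by_degree, source_1 = query(("degree",))
by_degree_semester, source_2 = query(("degree", "semester"))
print(source_1, dict(by_degree))
print(source_2, dict(by_degree_semester))
```

Storing everything or nothing corresponds to putting all groupings, or none, into `precomputed`; the trade-off is storage space against query response time.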
November 13 GKGupta 21
Example of fact and dimension tables

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_street, country)
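A fact table is typically queried by joining it to a dimension table on the shared key and aggregating a measure. The following sketch uses Python's sqlite3 module with a cut-down version of the sales schema (only the item and branch dimensions and a few columns); all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE sales (item_key INTEGER, branch_key INTEGER,
                    units_sold INTEGER, dollars_sold REAL);
-- Made-up rows for illustration only.
INSERT INTO item VALUES (1, 'pen', 'A'), (2, 'book', 'B');
INSERT INTO branch VALUES (1, 'north'), (2, 'south');
INSERT INTO sales VALUES (1, 1, 10, 20.0), (2, 1, 2, 30.0), (1, 2, 5, 15.0);
""")

# Join the fact table to the item dimension and total a measure per brand.
dollars_by_brand = conn.execute("""
    SELECT item.brand, SUM(sales.dollars_sold)
    FROM sales JOIN item ON sales.item_key = item.item_key
    GROUP BY item.brand
    ORDER BY item.brand
""").fetchall()
print(dollars_by_brand)
```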
OLAP implementation models
Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.
Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services.
The data warehouse provides multidimensional capabilities by representing data in fact tables and dimension tables.
Advantage: it is more easily used with an existing RDBMS, and the data can be stored efficiently in tables, since zero-valued facts need not be stored.
Disadvantage: poor query performance.
Multidimensional OLAP (MOLAP)
Based on Multidimensional DBMS (top-down approach)
No standard approach to store and maintain data.
Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
Data Cube
In the two-dimensional situation, we do not just want to find out the number of students for any country (x) and any degree (y). We may have many other queries, e.g.
1. How many students are doing MIT?
2. How many students are from Thailand?
3. How many Asian students are doing Law degrees?
Thus there is a kind of hierarchy that we wish to use: for example, the world, the continents, the regions, the countries, etc. For degrees, we may want a hierarchy of university, schools, UG and PG, and individual degrees.
Data Cube operations
A number of operations may be applied to data cubes. The common ones are:
- roll-up (increasing the level of abstraction)
- drill-down (increasing detail)
- slice and dice (selection and projection)
- pivot (re-orienting the view)
Roll-up (less detail)
Zooming out on the data cube, i.e. performing further aggregation on the data.
Used for further abstraction (i.e. less detail), e.g. from single degree programs to all programs offered by a school, from single countries to continents, or from three dimensions to two dimensions.
Drill-down (increasing detail)
The reverse of roll-up, used when we wish to partition more finely or to focus on particular values of certain dimensions.
Drill-down adds more detail to the data; it may involve adding another dimension.
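Roll-up and drill-down can be sketched on a small set of cells. The counts are the BSc and LLB figures from the year 2000 table; the country-to-continent mapping and the function names are assumptions introduced for illustration.

```python
from collections import defaultdict

# Cells keyed by (country, degree); counts from the year 2000 table.
cells = {
    ("Australia", "BSc"): 5, ("Australia", "LLB"): 20,
    ("India", "BSc"): 10, ("India", "LLB"): 0,
    ("Malaysia", "BSc"): 5, ("Malaysia", "LLB"): 1,
}
# Assumed hierarchy for the country dimension (country -> continent).
continent_of = {"Australia": "Oceania", "India": "Asia", "Malaysia": "Asia"}

def roll_up_to_continent(cells):
    """Roll up along the hierarchy: aggregate countries into continents."""
    out = defaultdict(int)
    for (country, degree), n in cells.items():
        out[(continent_of[country], degree)] += n
    return dict(out)

def roll_up_drop_degree(cells):
    """Roll up by removing the degree dimension (two dimensions to one)."""
    out = defaultdict(int)
    for (country, _degree), n in cells.items():
        out[country] += n
    return dict(out)

def drill_down(cells, continent):
    """Return the finer, country-level cells behind one continent."""
    return {k: v for k, v in cells.items() if continent_of[k[0]] == continent}

by_continent = roll_up_to_continent(cells)
by_country = roll_up_drop_degree(cells)
asia_detail = drill_down(cells, "Asia")
print(by_continent)
print(by_country)
print(asia_detail)
```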
Slice and dice (selection and projection)
Slice operation performs a selection on one dimension of the cube
(e.g. degree = MIT).
The dice operation performs a selection on two or more dimensions
(e.g. degree = BIT and country = Australia or India)
Pivot (re-orienting the view)
An alternate presentation of the data e.g. rotating the axes in a
3-D cube.
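Slice, dice and pivot can be sketched as simple filters and key re-orderings over cube cells. The cells below are a handful of values from the 2000 and 2001 tables; the dictionary layout and function names are illustrative assumptions.

```python
# Cube cells keyed by (country, degree, semester); a few values from the tables.
cube = {
    ("Australia", "BIT", "2000"): 11, ("Australia", "BIT", "2001"): 10,
    ("India", "BIT", "2000"): 17, ("India", "BSc", "2000"): 10,
    ("Malaysia", "BIT", "2000"): 23, ("Malaysia", "BSc", "2001"): 5,
}

def slice_cube(cube, degree):
    """Slice: select on one dimension (here, a single degree)."""
    return {k: v for k, v in cube.items() if k[1] == degree}

def dice_cube(cube, degree, countries):
    """Dice: select on two dimensions (a degree and a set of countries)."""
    return {k: v for k, v in cube.items()
            if k[1] == degree and k[0] in countries}

def pivot(cube):
    """Pivot: re-orient the view by swapping the country and degree axes."""
    return {(d, c, s): v for (c, d, s), v in cube.items()}

bit_slice = slice_cube(cube, "BIT")
bit_dice = dice_cube(cube, "BIT", {"Australia", "India"})
pivoted = pivot(cube)
print(bit_slice)
print(bit_dice)
print(pivoted)
```

The dice call corresponds to the example in the text, degree = BIT and country = Australia or India.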
Guidelines for OLAP
1. Vision: develop a clear vision in consultation with the users, including clearly defined and understood business objectives that are shared by the stakeholders.
2. Senior management support: the project should be supported by senior managers.
3. Selecting an OLAP tool: be familiar with the ROLAP and MOLAP tools required for the enterprise. Sometimes a combination of ROLAP and MOLAP is cost-effective.
4. Corporate strategy: the OLAP strategy should fit with the enterprise strategy and business objectives.
5. Focus on the users: the project should be based on the users, whether technical or non-technical, and on their personal skills and information needs.
6. Joint management: the project should be jointly managed by IT and business professionals. A committee of people should be involved to provide ideas.
7. Review and adapt: regular reviews of the project are required to ensure that it meets the current needs of the enterprise.
Consider a university spread across 5 countries, with students admitted to courses such as BSc, LLB, MBBS and B.com over 3 years starting from 2010. Construct the two-dimensional view for each year's course entry and the aggregates after three years. Finally, develop a data cube with dimensions country X degree X year.