Data Warehousing and OLAP (Under construction)
Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006.
December 2008 ©GKGupta 2
Data Warehousing and OLAP
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
What is a Data Warehouse?
Data warehousing is a process, not a product, for assembling and managing data from various sources for the purpose of gaining a single detailed view of part or all of a business. The single view is the data warehouse (DW) which provides the enterprise’s information environment that is separate from OLTP. DW is important since information is a powerful asset for every enterprise.
What is a Data Warehouse?
The information in a DW must be complete, timely, accurate and understandable for decision making. This requires data to be cleaned, filtered, and transformed.
On-line Analytical Processing (OLAP) is a technique used for providing management decision support using historical and summarized data that is consolidated in the data warehouse.
What is a Data Warehouse?
Questions that we will deal with:
• Where does the data in the DW come from?
• How does it get into the DW?
• What data is in the DW?
• How is the DW data structured?
Data Warehousing
A DW integrates information from several sources into a global schema and is stored separately from the operational data. It does not represent a snapshot of the operational database.
Moving data from various sources to a DW is a very difficult process involving data cleansing and data integration. It is sometimes called ETL (Extract, Transform and Load).
Most database systems are error-prone. A DW should have as few errors as possible.
Data Warehouse
Most OLTP database systems continue to grow but a data warehouse grows at a slower rate. However, the volume of data in the DW can be very high.
User updates to a data warehouse are usually forbidden; updates must come from the underlying databases to maintain consistency.
Data Warehousing
To speed up OLAP queries, a warehouse contains summarized and consolidated information representing materialized aggregate views of the enterprise data from a number of databases.
DW and OLAP are complementary. A warehouse stores data while OLAP derives strategic information from it.
DW may be used to provide an enterprise memory which operational data does not provide.
Question: Why does an OLTP system not provide enterprise memory?
Data Warehousing
A warehouse usually contains information collected over time, which helps in the analysis of trends.
A DW is repackaging information to support business decision making. The aim in DW may be to generate new revenue by selling the repackaged information.
Question: How can one sell repackaged information?
A definition
According to W. H. Inmon: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision making process.
It is important to note the subject-oriented, integrated, and time-variant properties of a data warehouse.
Subject-oriented
• A DW is organized around major subjects, such as student, degree, country.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations.
• A DW provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Question: Think of an OLTP database system. What are the major subjects?
Integrated
• A DW may be constructed by integrating information from multiple data sources e.g. multiple OLTP databases.
• Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Time Variant
• A DW usually has long time horizon, significantly longer than that of operational systems.
  – Operational database: current value data.
  – DW data: provide information from a historical perspective (e.g. past 5-10 years)
• Every key structure in the DW contains an element of time, explicitly or implicitly
• Operational data may or may not contain time element.
Non-volatile
• A physically separate store of data transformed from the operational environment.
• No update of data
• Does not require transaction processing, recovery, and concurrency control mechanisms
• Requires only two operations in data accessing: initial loading of data and access of data.
Why Separate Data Warehouse?
• High performance for both systems
  – DBMS - tuned for OLTP
  – Warehouse - tuned for OLAP.
• Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats
Data Warehouse Process
• Define the architecture, do capacity planning, select hardware and software
• Design the warehouse schema and the views
• Design the physical data structures
• Design data extraction, cleaning, transformation, load and refresh software
• Populate the repository with data and software
• Design and implement end-user application
(Refer to Chaudhuri and Dayal)
Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube allows data to be modeled and viewed in multiple dimensions
  – Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  – Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Conceptual Modeling
• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
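For comparison, the star schema defined above could be approximated in a relational system roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module, not a DMQL implementation: the `_dim` table-name suffix, the column types and all sample rows are assumptions made for illustration, and only the time and item dimensions are joined in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                           month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                           type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table: one row per sale, with a key into every dimension table.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    sales_in_dollars REAL
);
""")

# Hypothetical sample rows, invented for this sketch.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 5, "Mon", 1, "Q1", 2008), (2, 7, "Mon", 4, "Q2", 2008)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "Laptop", "Acme", "computer", "wholesale"),
                  (2, "Phone", "Acme", "mobile", "retail")])
conn.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'city')")
conn.execute("INSERT INTO location_dim VALUES (1, '1 Main St', 'Melbourne', 'Victoria', 'Australia')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 1000.0), (1, 2, 1, 1, 500.0),
                  (2, 1, 1, 1, 1200.0), (1, 1, 1, 1, 800.0)])

# The three measures of the cube, grouped by two of the dimensions.
rows = conn.execute("""
    SELECT t.quarter, i.item_name,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()

for row in rows:
    print(row)
```

The fact table stays narrow (keys plus one numeric column), while the descriptive attributes live in the dimension tables and are joined in only when a query needs them.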
Measures: Three Categories
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. e.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
Measures: Three Categories
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
e.g., median(), mode(), rank().
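The difference between the three categories can be seen on a small partitioned data set. The sketch below is illustrative only; the partitions are made-up numbers, and the average is used as the example of an algebraic measure because it can be computed from two distributive sub-aggregates (sum and count).

```python
from statistics import median

# Hypothetical data, split into three partitions (e.g. three source tables).
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: combining the per-partition aggregates gives the same answer
# as aggregating all the data at once.
total = sum(sum(part) for part in partitions)
count = sum(len(part) for part in partitions)
largest = max(max(part) for part in partitions)

# Algebraic: computed from a bounded number (here 2) of distributive aggregates.
average = total / count

# Holistic: the per-partition medians are not enough to recover the true median;
# in general the full data is needed.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, count, largest, average)
print(median_of_medians, true_median)
```

Here the median of the partition medians differs from the median of the whole data set, which is why holistic measures are expensive to pre-compute in a cube.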
Data Warehouse Design
The E-R Model approach which consists of entities and relationships is not suitable for designing a schema for a warehouse.
What is the nature of data in data warehouse? Essentially data warehouses are based on multidimensional data model which views data as a data cube.
An Example
Consider the following database:
Student(sid, name1, dob, country, degree, startsem, address1, telephone, address2, email, scholarship, ..)
Enrolment(sid, subject-id, mark, tutegroup, tutor, ..)
Subject(sub-id, name, school-id, whenstarted, lecturer, ..)
School(name, id, ..)
Not all of this data is needed for decision making. Let us extract some data from this database.
Data Cube
yob, country, degree, startsem, nsubjects, scholarship
1965, Thailand, MIT, 991, 5, 25%
1970, Canada, BIT, 992, 4, 0
1967, Australia, LLB, 993, 3, 30%
1966, Australia, LLB, 983, 4, 40%
1972, Australia, Bcom, 973, 5, 10%
1972, India, BIT/Bcom, 991, 5, 10%
1982, Sweden, MSc(IT), 991, 3, 10%
Is this information useful for decision making? Not really!
Example
We could look at the information as
yob X country X degree X startsem X nsubjects X scholarship
In fact it is natural to think of enterprise data as multidimensional.
Example
The university management may be interested in retrieving information like:
• How many students are doing BIT? How many students from Thailand? How many students started in 1998? (queries involving only one variable)
• How many students doing BIT are from Thailand? How many MIT students started in 981? How many students from Thailand started in 993? (queries involving two variables)
Example
• How many students doing MIT from Thailand started in 981? (query involving three variables)
Special types of database systems, called data cube systems, are often used for answering such queries.
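Queries over one, two or three variables like those above can be sketched as simple counts over the extracted records. The following is a minimal illustration, not how a data cube system answers them: the records reuse the country, degree and startsem columns of the sample extract shown earlier, and `count_students` is a hypothetical helper name.

```python
# (country, degree, startsem) taken from the sample extract of student data.
students = [
    {"country": "Thailand",  "degree": "MIT",      "startsem": "991"},
    {"country": "Canada",    "degree": "BIT",      "startsem": "992"},
    {"country": "Australia", "degree": "LLB",      "startsem": "993"},
    {"country": "Australia", "degree": "LLB",      "startsem": "983"},
    {"country": "Australia", "degree": "Bcom",     "startsem": "973"},
    {"country": "India",     "degree": "BIT/Bcom", "startsem": "991"},
    {"country": "Sweden",    "degree": "MSc(IT)",  "startsem": "991"},
]


def count_students(records, **conditions):
    """Count the records that match every attribute=value condition."""
    return sum(
        all(record[attr] == value for attr, value in conditions.items())
        for record in records
    )


# One variable
print(count_students(students, degree="LLB"))
print(count_students(students, country="Australia"))
# Two variables
print(count_students(students, degree="LLB", country="Australia"))
# Three variables
print(count_students(students, degree="MIT", country="Thailand", startsem="991"))
```

Each call scans every record, which is acceptable for seven rows but not for a warehouse; pre-computing such counts is what the data cube is for.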
Data Cube
The example queries discussed earlier may be represented by a three-dimensional data cube with each edge representing one of the variables viz. startsem, country, and degree.
A point inside the cube is an intersection of the coordinates defined by the edges of the cube. The coordinates of the point define the meaning of the data at that point.
Data Cube
Let us look at a simple two-dimensional situation:
country X degree
For decision making this may be useful information. If we had a 2-dimensional matrix then we could find out the number of students for any country (x) and any degree (y).
Data Cube
But in the two-dimensional situation, we don’t just want to find out the number of students for any country (x) and any degree (y). We may have many other queries e.g.
1. How many students are doing MIT?
2. How many students from Thailand?
3. How many Asian students doing Law degrees?
Thus there is a kind of hierarchy that we wish to use, for example, the world, the continents, the regions, the countries etc. In degrees, we may want a hierarchy of university, Schools, UG and PG, individual degrees.
Data Cube
Consider a slightly more complex situation in which we have three dimensions:
country X degree X startsem
for any country (x), any degree (y) and any start semester (z).
We may now look at this information as a 3-dimensional cube as shown on the following slide.
Data Cube (based on a slide from book by J. Han and M. Kamber)
• Number of students as a function of country, degree and semester
[Figure: three-dimensional cube with axes country, degree and semester]
Dimensions: country, degree, semester
Hierarchical summarization paths:
  country: continent, region, country
  degree: school, ug/pg, degree
  semester: year, semester
A Sample Data Cube (based on a slide from book by J. Han and M. Kamber)
[Figure: data cube with dimensions semester (991, 992, 993, 001, sum), degree (LLB, MIT, BCom, sum) and country (U.S.A, Norway, Australia, sum). The highlighted cell holds the total LLB enrolments from U.S.A.]
Data Cube
Each edge of the cube is called a dimension. A user normally has a number of different dimensions from which the given data may be analyzed. A user therefore has a multidimensional conceptual view of the data which is represented by the cube.
The points inside a cube provide aggregations. For example, a point may provide the number of students from Malaysia admitted to BCom in year 1998.
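One way to picture how the points of a cube hold aggregations is to pre-compute a count for every combination of dimension values, including a "sum" coordinate that stands for all values of a dimension. The sketch below is an assumption-laden illustration: it uses the country, degree and startsem values from the sample extract, stores the cube in a plain dictionary, and the name `ALL` is made up for this example.

```python
from collections import Counter
from itertools import product

ALL = "sum"  # marker for "all values of this dimension" (hypothetical name)

# (country, degree, startsem) taken from the sample extract of student data.
facts = [
    ("Thailand", "MIT", "991"),
    ("Canada", "BIT", "992"),
    ("Australia", "LLB", "993"),
    ("Australia", "LLB", "983"),
    ("Australia", "Bcom", "973"),
    ("India", "BIT/Bcom", "991"),
    ("Sweden", "MSc(IT)", "991"),
]

cube = Counter()
for country, degree, sem in facts:
    # Each fact contributes to 2^3 = 8 cells: every coordinate is either
    # the fact's own value or the ALL marker.
    for cell in product((country, ALL), (degree, ALL), (sem, ALL)):
        cube[cell] += 1

print(cube[("Thailand", "MIT", "991")])   # a base cell
print(cube[("Australia", "LLB", ALL)])    # LLB students from Australia, any semester
print(cube[(ALL, ALL, "991")])            # all students who started in 991
print(cube[(ALL, ALL, ALL)])              # grand total
```

Once the cube is built, each of the example queries becomes a single dictionary lookup instead of a scan over the records.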
Multidimensional View
A particular user will have one multidimensional view of the database while another user in the same enterprise may have another view. Therefore many different multidimensional views of the same database are possible and the same data may be consolidated in many different ways.
Multidimensional data model
The cube is not always three-dimensional since an enterprise would often have many more, perhaps eight or even ten, dimensions of interest. Each dimension may be associated with a table that describes the dimension. For example, a dimension table for country would contain the country names and could contain other information, e.g. category. Other dimensions like time do not naturally have such a table of information.
Data Cube
A number of operations may be applied to data cubes. The common ones are:
- roll-up (increasing the level of abstraction)
- drill-down (increasing detail)
- slice and dice (selection and projection)
- pivot (re-orienting the view)
Data Cube Operations
• Roll-up (less detail) - when we wish further abstraction (i.e. less detail). This operation performs further aggregation on the data, for example, from single degree programs to Schools, single countries to Continents or from three dimensions to two dimensions.
• Drill-down (increasing detail) - reverse of roll-up, when we wish to partition more finely or want to focus on some particular values of certain dimensions. Drill-down adds more detail to the data; it may involve adding another dimension.
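Roll-up and drill-down can be sketched on the student counts as follows. This is an illustration under stated assumptions: the facts reuse the sample extract, the `CONTINENT` mapping is a hypothetical dimension hierarchy added for the example, and the function names are invented.

```python
from collections import Counter

# (country, degree, startsem) taken from the sample extract of student data.
facts = [
    ("Thailand", "MIT", "991"), ("Canada", "BIT", "992"),
    ("Australia", "LLB", "993"), ("Australia", "LLB", "983"),
    ("Australia", "Bcom", "973"), ("India", "BIT/Bcom", "991"),
    ("Sweden", "MSc(IT)", "991"),
]

# Hypothetical country -> continent hierarchy (an assumption for this sketch).
CONTINENT = {
    "Thailand": "Asia", "India": "Asia", "Canada": "North America",
    "Australia": "Oceania", "Sweden": "Europe",
}

# Base level: counts by (country, degree); startsem has already been dropped,
# which is itself a roll-up from three dimensions to two.
by_country_degree = Counter((country, degree) for country, degree, _ in facts)


def roll_up_to_continent(counts):
    """Aggregate country-level counts up one level of the hierarchy."""
    rolled = Counter()
    for (country, degree), n in counts.items():
        rolled[(CONTINENT[country], degree)] += n
    return rolled


def drill_down(counts, continent):
    """Return the finer, country-level counts behind one continent."""
    detail = Counter()
    for (country, _degree), n in counts.items():
        if CONTINENT[country] == continent:
            detail[country] += n
    return detail


by_continent_degree = roll_up_to_continent(by_country_degree)
print(by_continent_degree[("Oceania", "LLB")])
print(dict(drill_down(by_country_degree, "Asia")))
```

Note that drill-down needs the finer-grained counts to still be available; it cannot be recovered from the rolled-up totals alone.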
Data Cube Operations
• Slice and dice (selection and projection) - the slice operation performs a selection on one dimension of the cube (e.g. degree = “MIT”). The dice operation performs a selection on two or more dimensions (e.g. degree = “BIT” and country = “Australia” or “India”)
• Pivot (re-orienting the view) - an alternate presentation of the data e.g. rotating the axes in a 3-D cube.
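The slice, dice and pivot operations above can be sketched on a cube held as a dictionary keyed by (country, degree, startsem). This is a minimal illustration, not an OLAP engine: the facts reuse the sample extract, and the function names `slice_cube`, `dice_cube` and `pivot_cube` are hypothetical.

```python
from collections import Counter

# Cube cells keyed by (country, degree, startsem), from the sample extract.
cube = Counter([
    ("Thailand", "MIT", "991"), ("Canada", "BIT", "992"),
    ("Australia", "LLB", "993"), ("Australia", "LLB", "983"),
    ("Australia", "Bcom", "973"), ("India", "BIT/Bcom", "991"),
    ("Sweden", "MSc(IT)", "991"),
])
COUNTRY, DEGREE, SEM = 0, 1, 2  # axis positions within a cell key


def slice_cube(cells, axis, value):
    """Select one value on one dimension and drop that dimension."""
    return {key[:axis] + key[axis + 1:]: n
            for key, n in cells.items() if key[axis] == value}


def dice_cube(cells, allowed):
    """Select on two or more dimensions; `allowed` maps axis -> set of values."""
    return {key: n for key, n in cells.items()
            if all(key[axis] in values for axis, values in allowed.items())}


def pivot_cube(cells, order):
    """Re-orient the view by reordering the axes of every cell key."""
    return {tuple(key[axis] for axis in order): n for key, n in cells.items()}


llb_slice = slice_cube(cube, DEGREE, "LLB")
diced = dice_cube(cube, {COUNTRY: {"Australia", "India"},
                         DEGREE: {"LLB", "BIT/Bcom"}})
pivoted = pivot_cube(cube, (SEM, DEGREE, COUNTRY))

print(llb_slice)
print(diced)
print(pivoted[("991", "MIT", "Thailand")])
```

The slice result has one dimension fewer than the cube, the dice result keeps all three dimensions but fewer cells, and the pivot keeps every cell and only changes how the keys are ordered.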
Benefits of Multidimensional Analysis
• Small high-level database with pre-computed aggregates is created for efficient high-level queries
• Multiple-level views
• Selection by slicing and dicing
However, multidimensional analysis does not provide data mining.
OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  – Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – greater scalability
OLAP Server Architectures
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers
Data Warehouse Design
One approach is to use the star schema to represent the multidimensional data model. The schema in this model consists of a single large fact table containing the bulk of the data, with no redundancy, and a set of smaller tables called dimension tables, one for each dimension.
Other models have been used. These include the snowflake model and the fact constellation model.
Example of Star Schema (based on a slide from book by J. Han and M. Kamber)
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
ETL
• Extraction - data relevant to the tasks are selected and retrieved from a variety of sources.
• Transformation - data is consolidated by performing summarization or aggregation
• Cleansing - since data comes from a number of sources, errors and anomalies are common. There is a need to remove anomalies and errors, and to handle missing and irrelevant data. Some tools are available for doing this.
Data Cleaning
Data Cleaning overcomes problems like the following:
• Inconsistent field lengths of same items
• Inconsistent values for same items
• Inconsistent interpretation of same terms
• Missing entries
• Violation of integrity constraints
Data cleaning can be a very demanding task.
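A few of the problems listed above (inconsistent values for the same item, missing entries, and a violated key constraint) can be illustrated with a small cleaning pass. This is only a sketch under assumptions: the raw rows, the `CANONICAL_DEGREE` lookup and the rule that `sid` must be unique are all invented for the example, and real cleaning tools do far more.

```python
# Hypothetical mapping from lower-cased spellings to one canonical code.
CANONICAL_DEGREE = {"bcom": "BCom", "llb": "LLB", "mit": "MIT"}

# Hypothetical raw rows as they might arrive from different sources.
raw_rows = [
    {"sid": 1, "degree": "Bcom ", "country": "Australia"},  # stray space, odd case
    {"sid": 2, "degree": "LLB", "country": None},           # missing entry
    {"sid": 1, "degree": "llb", "country": "Norway"},       # duplicate key
    {"sid": 3, "degree": "mit", "country": "U.S.A"},
]


def clean(rows):
    """Return (cleaned rows, rejected rows with a reason)."""
    cleaned, rejected, seen_sids = [], [], set()
    for row in rows:
        degree = CANONICAL_DEGREE.get(row["degree"].strip().lower())
        if degree is None or row["country"] is None:
            rejected.append((row, "missing or unknown value"))
            continue
        if row["sid"] in seen_sids:
            # Integrity constraint assumed here: sid must be unique.
            rejected.append((row, "duplicate key"))
            continue
        seen_sids.add(row["sid"])
        cleaned.append({**row, "degree": degree})
    return cleaned, rejected


cleaned_rows, rejected_rows = clean(raw_rows)
print(cleaned_rows)
print([reason for _row, reason in rejected_rows])
```

Rejected rows are kept with a reason rather than silently dropped, so that they can be reviewed or repaired before the next load.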
Data Warehouse Process
• Integration - combining data from many perhaps heterogeneous sources. This is a non-trivial task since different sources will use different formats, field lengths, codes, descriptions, for the same data items.
• Loading - before loading additional processing may be needed e.g. checking integrity constraints, building derived tables, indices, access paths
Data Warehouse Process
• Refresh - warehouse data needs to be periodically updated as the operational data changes. There are several different ways of updating a data warehouse:
  • The data warehouse could be periodically reconstructed from the base sources (perhaps overnight, once a week)
  • The data warehouse could be updated periodically, for example each week or even each month.
The updates need to be logically correct since the warehouse data is derived data.
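The two refresh strategies, and what "logically correct" means for derived data, can be sketched on a tiny summary table. The example below is an assumption-based illustration: the summary is a count of students per degree, the rows are invented, and `rebuild` and `apply_changes` are hypothetical names.

```python
from collections import Counter


def rebuild(source_rows):
    """Full refresh: reconstruct the summary from the base source."""
    return Counter(row["degree"] for row in source_rows)


def apply_changes(summary, inserted, deleted):
    """Incremental refresh: adjust the existing summary by the changes only."""
    updated = Counter(summary)
    updated.update(row["degree"] for row in inserted)
    updated.subtract(row["degree"] for row in deleted)
    return +updated  # unary plus drops degrees whose count fell to zero


# Hypothetical operational data before and after a batch of changes.
old_source = [{"degree": "BIT"}, {"degree": "BIT"}, {"degree": "LLB"}]
inserted = [{"degree": "MIT"}]
deleted = [{"degree": "LLB"}]
new_source = [{"degree": "BIT"}, {"degree": "BIT"}, {"degree": "MIT"}]

summary = rebuild(old_source)
incremental = apply_changes(summary, inserted, deleted)
full = rebuild(new_source)

# The refresh is logically correct if both strategies agree.
print(incremental == full)
```

The incremental update touches only the changed rows, while the full rebuild rescans the source; the check at the end is the property the warehouse must preserve whichever strategy is used.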
Why Separate Data Warehouse?
• High performance for both systems
  – DBMS - tuned for OLTP
  – Warehouse - tuned for OLAP
• Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
OLAP
Codd defines On-line Analytical Processing or OLAP as the dynamic enterprise analysis required to create, manipulate, animate, and synthesize information from exegetical, contemplative, and formulaic data analysis models.
OLAP generally involves highly complex queries involving large amounts of data that use one or more aggregates. OLAP deals only with historical data accurate at a given point in time.
OLAP Characteristics
We discuss the following OLAP characteristics listed by Codd:
• Dynamic data analysis - involving historical data of multiple dimensions manipulated in many different ways with the aim of studying changes occurring in the enterprise
• Common enterprise data - OLAP uses the enterprise data but in a very different way to discover why some particular situations occurred
• Synergistic implementation - data synthesis, analysis, and consolidation
OLAP Characteristics
• Four enterprise data models
  • Categorical - comparison of historical values
  • Exegetical - discovering reasons for what categorical model found
  • Contemplative - “what if” analysis of the data
  • Formulaic - how to reach a desired goal
Codd’s OLAP Evaluation Rules
Codd in his 1993 paper lists the following 12 rules for evaluating OLAP products:
• Multidimensional conceptual view - to make a variety of manipulations (e.g. slice and dice) relatively easy
• Transparency - user should know what data is being used and where from
• Accessibility - able to use data in enterprise database as well as legacy systems
Codd’s OLAP Rules
• Consistent reporting performance - consistent reporting performance as the number of dimensions grows
• Client-server architecture - OLAP often uses mainframe data but users want access from desktop
• Generic dimensionality - different dimensions should not be treated differently
• Dynamic sparse matrix handling - always a lot of missing data, OLAP should be able to adjust to that
• Multi-user support - obviously required
Codd’s OLAP Rules
• Unrestricted cross-dimensional operations - should be able to infer associated calculations
• Intuitive data manipulation - operations should not require the use of a menu or a number of iterations
• Flexible reporting - logical presentation of data
• Unlimited dimensions and aggregation levels - some applications need as many as 15-20 dimensions
Summary
• Data warehouse
– A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting