92
1 Chapter 32 OLAP and Data Mining Transparencies

OLAP and Data Mining

  • Upload
    tommy96

  • View
    3.183

  • Download
    1

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: OLAP and Data Mining

1

Chapter 32

OLAP and Data MiningTransparencies

Page 2: OLAP and Data Mining

2

Chapter 32 - Objectives

Purpose of online analytical processing (OLAP) and how OLAP differs from data warehousing.

Key features of OLAP applications.

Potential benefits associated with successful OLAP applications.

Rules for OLAP tools and main types of tools including: multi-dimensional OLAP (MOLAP), relational OLAP (ROLAP), and managed query environment (MQE).

Page 3: OLAP and Data Mining

3

Chapter 32 - Objectives

OLAP extensions to SQL.

Concepts associated with data mining.

Main data mining operations including predictive modeling, database segmentation, link analysis, and deviation detection.

Relationship between data mining and data warehousing.

Page 4: OLAP and Data Mining

4

Data Warehousing and End-User Access Tools

Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities.

Key developments include:– Online analytical processing (OLAP).– SQL extensions for complex data analysis.– Data mining tools.

Page 5: OLAP and Data Mining

5

Introducing OLAP

The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data, Codd (1993).

Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for purposes of advanced analysis.

Page 6: OLAP and Data Mining

6

Introducing OLAP

Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.

Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise.

Page 7: OLAP and Data Mining

7

Introducing OLAP

Can easily answer ‘who?’ and ‘what?’ questions, however, ability to answer ‘what if?’ and ‘why?’ type questions distinguishes OLAP from general-purpose query tools.

Types of analysis ranges from basic navigation and browsing (slicing and dicing) to calculations, to more complex analyses such as time series and complex modeling.

Page 8: OLAP and Data Mining

8

OLAP Benchmarks

OLAP Council published an analytical processing benchmark referred to as the APB-1 (OLAP Council, 1998).

Aim is to measure a server’s overall OLAP performance rather than the performance of individual tasks.

Page 9: OLAP and Data Mining

9

OLAP Benchmarks

APB-1 assesses most common business operations including:– bulk loading of data from internal/external data sources;– incremental loading of data from operational systems;– aggregation of input/level data along hierarchies;– calculation of new data based on business models;– time series analysis;– queries with a high degree of complexity;– drill-down through hierarchies;– ad hoc queries;– multiple online sessions.

Page 10: OLAP and Data Mining

10

OLAP Benchmarks

OLAP applications are judged on their ability to provide just-in-time (JIT) information, a core requirement of supporting effective decision-making.

Assessing a server’s ability to satisfy this requirement is more than measuring processing performance but includes its abilities to model complex business relationships and to respond to changing business requirements.

Page 11: OLAP and Data Mining

11

OLAP Benchmarks

APB-1 uses a standard benchmark metric called AQM (Analytical Queries per Minute).

AQM represents number of analytical queries processed per minute including data loading and computation time. Thus, AQM incorporates data loading performance, calculation performance, and query performance into a singe metric.

Page 12: OLAP and Data Mining

12

OLAP Benchmarks

Publication of APB-1 benchmark results must include both the database schema and all code required for executing the benchmark.

An essential requirement of all OLAP applications is ability to provide users with JIT information, to make effective decisions about an organization’s strategic directions.

Page 13: OLAP and Data Mining

13

OLAP Applications

JIT information is computed data that usually reflects complex relationships and is often calculated on the fly.

Also, as data relationships may not be known in advance, the data model must be flexible.

Page 14: OLAP and Data Mining

14

Examples of OLAP Applications in Various Functional Areas

Page 15: OLAP and Data Mining

15

OLAP Applications

Although OLAP applications are found in widely divergent functional areas, all have following key features:– multi-dimensional views of data;

– support for complex calculations;

– time intelligence.

Page 16: OLAP and Data Mining

16

OLAP Applications - Multi-Dimensional Views of Data

Core requirement of building a ‘realistic’ business model.

Provides basis for analytical processing through flexible access to corporate data.

The underlying database design that provides the multi-dimensional view of data should treat all dimensions equally.

Page 17: OLAP and Data Mining

17

OLAP Applications - Support for Complex Calculations

Must provide a range of powerful computational methods such as that required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.

Mechanisms for implementing computational methods should be clear and non-procedural.

Page 18: OLAP and Data Mining

18

OLAP Applications – Time Intelligence

Key feature of almost any analytical application as performance is almost always judged over time.

Time hierarchy is not always used in same manner as other hierarchies.

Concepts such as year-to-date and period-over-period comparisons should be easily defined.

Page 19: OLAP and Data Mining

19

OLAP Benefits

Increased productivity of end-users. Reduced backlog of applications development

for IT staff. Retention of organizational control over the

integrity of corporate data. Reduced query drag and network traffic on

OLTP systems or on the data warehouse. Improved potential revenue and profitability.

Page 20: OLAP and Data Mining

20

Representing Multi-Dimensional Data

Example of two-dimensional query.– What is the total revenue generated by property

sales in each city, in each quarter of 1997?’

Choice of representation is based on types of queries end-user may ask.

Compare representation - three-field relational table versus two-dimensional matrix.

Page 21: OLAP and Data Mining

21

Multi-Dimensional Data as Three-Field Table versus Two-Dimensional Matrix

Page 22: OLAP and Data Mining

22

Representing Multi-Dimensional Data

Example of three-dimensional query.– ‘What is the total revenue generated by property

sales for each type of property (Flat or House) in each city, in each quarter of 1997?’

Compare representation - four-field relational table versus three-dimensional cube.

Page 23: OLAP and Data Mining

23

Multi-Dimensional Data as Four-Field Table versus Three-Dimensional Cube

Page 24: OLAP and Data Mining

24

Representing Multi-Dimensional Data

Cube represents data as cells in an array.

Relational table only represents multi-dimensional data in two dimensions.

Page 25: OLAP and Data Mining

25

Multi-Dimensional OLAP Servers

Use multi-dimensional structures to store data and relationships between data.

Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data. Each side of cube is a dimension.

A cube can be expanded to include other dimensions.

Page 26: OLAP and Data Mining

26

Multi-Dimensional OLAP Servers

A cube supports matrix arithmetic.

Multi-dimensional query response time depends on how many cells have to be added ‘on the fly’.

As number of dimensions increases, number of the cube’s cells increases exponentially.

Page 27: OLAP and Data Mining

27

Multi-Dimensional OLAP Servers

However, majority of multi-dimensional queries use summarized, high-level data.

Solution is to pre-aggregate (consolidate) all logical subtotals and totals along all dimensions.

Pre-aggregation is valuable, as typical dimensions are hierarchical in nature.– (e.g. Time dimension hierarchy - years, quarters,

months, weeks, and days)

Page 28: OLAP and Data Mining

28

Multi-Dimensional OLAP Servers

Predefined hierarchy allows logical pre-aggregation and, conversely, allows for a logical ‘drill-down’.

Supports common analytical operations– Consolidation.

– Drill-down.

– Slicing and dicing.

Page 29: OLAP and Data Mining

29

Multi-Dimensional OLAP Servers

Consolidation - aggregation of data such as simple ‘roll-ups’ or complex expressions involving inter-related data.

Drill-Down - is reverse of consolidation and involves displaying the detailed data that comprises the consolidated data.

Slicing and Dicing - (also called pivoting) refers to the ability to look at the data from different viewpoints.

Page 30: OLAP and Data Mining

30

Multi-Dimensional OLAP servers

Can store data in a compressed form by dynamically selecting physical storage organizations and compression techniques that maximize space utilization.

Dense data (i.e., data that exists for high percentage of cells) can be stored separately from sparse data (i.e., significant percentage of cells are empty).

Page 31: OLAP and Data Mining

31

Multi-Dimensional OLAP Servers

Ability to omit empty or repetitive cells can greatly reduce the size of the cube and the amount of processing.

Allows analysis of exceptionally large amounts of data.

Page 32: OLAP and Data Mining

32

Multi-Dimensional OLAP Servers

In summary, pre-aggregation, dimensional hierarchy, and sparse data management can significantly reduce the size of the cube and the need to calculate values ‘on-the-fly’.

Removes need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up execution of multi-dimensional queries.

Page 33: OLAP and Data Mining

33

Codd’s Rules for OLAP Systems

In 1993, E.F. Codd formulated twelve rules as the basis for selecting OLAP tools.– Multi-dimensional conceptual view– Transparency– Accessibility– Consistent reporting performance– Client-server architecture– Generic dimensionality

Page 34: OLAP and Data Mining

34

Codd’s Rules for OLAP

– Dynamic sparse matrix handling– Multi-user support– Unrestricted cross-dimensional operations– Intuitive data manipulation– Flexible reporting– Unlimited dimensions and aggregation levels.

Page 35: OLAP and Data Mining

35

Codd’s Rules for OLAP Systems

There are proposals to re-define or extend the rules. For example, to also include:– Comprehensive database management tools.

– Ability to drill down to detail (source record) level.

– Incremental database refresh.

– SQL interface to the existing enterprise environment.

Page 36: OLAP and Data Mining

36

Categories of OLAP Tools

OLAP tools are categorized according to the architecture of the underlying database.

Three main categories of OLAP tools include – Multi-dimensional OLAP (MOLAP or MD-OLAP)

– Relational OLAP (ROLAP), also called multi-relational OLAP

– Managed query environment (MQE)

Page 37: OLAP and Data Mining

37

Multi-Dimensional OLAP (MOLAP)

Uses specialized data structures and multi-dimensional Database Management Systems (MDDBMSs) to organize, navigate, and analyze data.

Data is typically aggregated and stored according to predicted usage to enhance query performance.

Page 38: OLAP and Data Mining

38

Multi-Dimensional OLAP (MOLAP)

Use array technology and efficient storage techniques that minimize the disk space requirements through sparse data management.

Provides excellent performance when data is used as designed, and the focus is on data for a specific decision-support application.

Page 39: OLAP and Data Mining

39

Multi-Dimensional OLAP (MOLAP)

Traditionally, require a tight coupling with the application layer and presentation layer.

Recent trends segregate the OLAP from the data structures through the use of published application programming interfaces (APIs).

Page 40: OLAP and Data Mining

40

Typical Architecture for MOLAP Tools

Page 41: OLAP and Data Mining

41

MOLAP Tools - Development Issues

Underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.

Navigation and analysis of data is limited because the data is designed according to previously determined requirements.

Page 42: OLAP and Data Mining

42

MOLAP Tools - Development Issues

MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support.

Page 43: OLAP and Data Mining

43

Relational OLAP (ROLAP)

Fastest growing style of OLAP technology.

Supports RDBMS products using a metadata layer - avoids need to create a static multi-dimensional data structure - facilitates the creation of multiple multi-dimensional views of the two-dimensional relation.

Page 44: OLAP and Data Mining

44

Relational OLAP (ROLAP)

To improve performance, some products use SQL engines to support complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalized database designs such as the star schema.

Page 45: OLAP and Data Mining

45

Typical Architecture for ROLAP Tools

Page 46: OLAP and Data Mining

46

ROLAP Tools - Development Issues

Middleware to facilitate the development of multi-dimensional applications. (Software that converts the two-dimensional relation into a multi-dimensional structure).

Development of an option to create persistent, multi-dimensional structures with facilities to assist in the administration of these structures.

Page 47: OLAP and Data Mining

47

Hybrid OLAP (HOLAP)

Can use data from either a RDBMS directly or a multi-dimension server.

Page 48: OLAP and Data Mining

48

Managed Query Environment (MQE)

Relatively new development.

Provide limited analysis capability, either directly against RDBMS products, or by using an intermediate MOLAP server.

Page 49: OLAP and Data Mining

49

Managed Query Environment (MQE)

Deliver selected data directly from DBMS or via a MOLAP server to desktop (or local server) in form of a datacube, where it is stored, analyzed, and maintained locally.

Promoted as being relatively simple to install and administer with reduced cost and maintenance.

Page 50: OLAP and Data Mining

50

Typical Architecture for MQE Tools

Page 51: OLAP and Data Mining

51

MQE Tools - Development Issues

Architecture results in significant data redundancy and may cause problems for networks that support many users.

Ability of each user to build a custom datacube

may cause a lack of data consistency among users.

Only a limited amount of data can be efficiently maintained.

Page 52: OLAP and Data Mining

52

OLAP Extensions to SQL

SQL promoted as easy to learn, non-procedural, free-format, DBMS-independent, and international standard.

However, major disadvantage has been inability to represent many of the questions most commonly asked by business analysts.

IBM and Oracle jointly proposed OLAP extensions to SQL early in 1999, adopted as an amendment to SQL.

Page 53: OLAP and Data Mining

53

OLAP Extensions to SQL

Many database vendors including IBM, Oracle, Informix, and Red Brick Systems have already implemented portions of specifications in their DBMSs.

Red Brick Systems was first to implement many essential OLAP functions (as Red Brick Intelligent SQL (RISQL)), albeit in advance of the standard.

Page 54: OLAP and Data Mining

54

OLAP Extensions to SQL - RISQL

Designed for business analysts.

Set of extensions that augments SQL with a variety of powerful operations appropriate to data analysis and decision-support applications such as ranking, moving averages, comparisons, market share, this year versus last year.

Page 55: OLAP and Data Mining

55

Use of the RISQL CUME Function

Show the quarterly sales for branch office B003, along with the monthly year-to-date figures.

SELECT quarter, quarterlySales, CUME(quarterlySales) AS Year-to-Date

FROM BranchSales

WHERE branchNo = ‘B003’;

Page 56: OLAP and Data Mining

56

Use of the RISQL MOVINGAVG / MOVINGSUM Function

Show the first six monthly sales for branch office B003 without the effect of seasonality.

SELECT month, monthlySales,

MOVINGAVG(monthlySales) AS 3-MonthMovingAvg,

MOVINGSUM(monthlySales) AS 3-MonthMovingSum

FROM BranchSales

WHERE branchNo = ‘B003’;

Page 57: OLAP and Data Mining

57

Data Mining

The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).

Involves analysis of data and use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Page 58: OLAP and Data Mining

58

Data Mining

Reveals information that is hidden and unexpected, as little value in finding patterns and relationships that are already intuitive.

Patterns and relationships are identified by examining the underlying rules and features in the data.

Tends to work from the data up and most accurate results normally require large volumes of data to deliver reliable conclusions.

Page 59: OLAP and Data Mining

59

Data Mining

Starts by developing an optimal representation of structure of sample data, during which time knowledge is acquired and extended to larger sets of data.

Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.

Relatively new technology, however already used in a number of industries.

Page 60: OLAP and Data Mining

60

Examples of Applications of Data Mining

Retail / Marketing– Identifying buying patterns of customers.– Finding associations among customer

demographic characteristics.– Predicting response to mailing campaigns.– Market basket analysis.

Page 61: OLAP and Data Mining

61

Examples of Applications of Data Mining

Banking – Detecting patterns of fraudulent credit card

use.– Identifying loyal customers.– Predicting customers likely to change their

credit card affiliation.– Determining credit card spending by

customer groups.

Page 62: OLAP and Data Mining

62

Examples of Applications of Data Mining

Insurance– Claims analysis.

– Predicting which customers will buy new policies.

Medicine– Characterizing patient behavior to predict surgery

visits.

– Identifying successful medical therapies for different illnesses.

Page 63: OLAP and Data Mining

63

Data Mining Operations

Four main operations include:– Predictive modeling.– Database segmentation.– Link analysis.– Deviation detection.

There are recognized associations between the applications and the corresponding operations. – e.g. Direct marketing strategies use database

segmentation.

Page 64: OLAP and Data Mining

64

Data Mining Techniques

Techniques are specific implementations of the data mining operations.

Each operation has its own strengths and weaknesses.

Data mining tools sometimes offer a choice of operations to implement a technique.

Page 65: OLAP and Data Mining

65

Data Mining Techniques

Criteria for selection of tool includes– Suitability for certain input data types.

– Transparency of the mining output.

– Tolerance of missing variable values.

– Level of accuracy possible.

– Ability to handle large volumes of data.

Page 66: OLAP and Data Mining

66

Data Mining Operations and Associated Techniques

Page 67: OLAP and Data Mining

67

Predictive Modeling

Similar to the human learning experience– uses observations to form a model of the important

characteristics of some phenomenon.

Uses generalizations of ‘real world’ and ability to fit new data into a general framework.

Can analyze a database to determine essential characteristics (model) about the data set.

Page 68: OLAP and Data Mining

68

Predictive Modeling

Model is developed using a supervised learning approach, which has two phases: training and testing. – Training builds a model using a large sample of

historical data called a training set.

– Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.

Page 69: OLAP and Data Mining

69

Predictive Modeling

Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.

Two techniques associated with predictive modeling: classification and value prediction, distinguished by nature of the variable being predicted.

Page 70: OLAP and Data Mining

70

Predictive Modeling - Classification

Used to establish a specific predetermined class for each record in a database from a finite set of possible class values.

Two specializations of classification: tree induction and neural induction.

Page 71: OLAP and Data Mining

71

Example of Classification using Tree Induction

Page 72: OLAP and Data Mining

72

Example of Classification using Neural Induction

Page 73: OLAP and Data Mining

73

Predictive Modeling - Value Prediction

Used to estimate a continuous numeric value that is associated with a database record.

Uses the traditional statistical techniques of linear regression and nonlinear regression.

Relatively easy to use and understand.

Page 74: OLAP and Data Mining

74

Predictive Modeling - Value Prediction

Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.

Problem is that the technique only works well with linear data and is sensitive to the presence of outliers (i.e., data values, which do not conform to the expected norm).

Page 75: OLAP and Data Mining

75

Predictive Modeling - Value Prediction

Although nonlinear regression avoids the main problems of linear regression, still not flexible enough to handle all possible shapes of the data plot.

Statistical measurements are fine for building linear models that describe predictable data points, however, most data is not linear in nature.

Page 76: OLAP and Data Mining

76

Predictive Modeling - Value Prediction

Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.

Applications of value prediction include credit card fraud detection or target mailing list identification.

Page 77: OLAP and Data Mining

77

Database Segmentation

Aim is to partition a database into an unknown number of segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.

Page 78: OLAP and Data Mining

78

Database Segmentation

Less precise than other operations thus less sensitive to redundant and irrelevant features.

Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable.

Applications of database segmentation include customer profiling, direct marketing, and cross selling.

Page 79: OLAP and Data Mining

79

Example of Database Segmentation using a Scatterplot

Page 80: OLAP and Data Mining

80

Database Segmentation

Associated with demographic or neural clustering techniques, distinguished by:– Allowable data inputs.– Methods used to calculate the distance

between records.– Presentation of the resulting segments for

analysis.

Page 81: OLAP and Data Mining

81

Link Analysis

Aims to establish links (associations) between records, or sets of records, in a database.

There are three specializations– Associations discovery.– Sequential pattern discovery.– Similar time sequence discovery.

Applications include product affinity analysis, direct marketing, and stock price movement.

Page 82: OLAP and Data Mining

82

Link Analysis - Associations Discovery

Finds items that imply the presence of other items in the same event.

Affinities between items are represented by association rules. – e.g. ‘When customer rents property for more than 2

years and is more than 25 years old, in 40% of cases, customer will buy a property. Association happens in 35% of all customers who rent properties’.

Page 83: OLAP and Data Mining

83

Link Analysis - Sequential Pattern Discovery

Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. – e.g. Used to understand long-term customer buying

behavior.

Page 84: OLAP and Data Mining

84

Link Analysis - Similar Time Sequence Discovery

Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. – e.g. Within three months of buying property, new

home owners will purchase goods such as cookers, freezers, and washing machines.

Page 85: OLAP and Data Mining

85

Deviation Detection

Relatively new operation in terms of commercially available data mining tools.

Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm.

Page 86: OLAP and Data Mining

86

Deviation Detection

Can be performed using statistics and visualization techniques or as a by-product of data mining.

Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.

Page 87: OLAP and Data Mining

87

Example of Database Segmentation using a Visualization

Page 88: OLAP and Data Mining

88

Data Mining Tools

There are a growing number of commercial data mining tools on the marketplace.

Important characteristics of data mining tools include:– Data preparation facilities.– Selection of data mining operations.– Product scalability and performance.– Facilities for visualization of results.

Page 89: OLAP and Data Mining

89

Data Mining and Data Warehousing

Major challenge to exploit data mining is identifying suitable data to mine.

Data mining requires single, separate, clean, integrated, and self-consistent source of data.

Page 90: OLAP and Data Mining

90

Data Mining and Data Warehousing

A data warehouse is well equipped for providing data for mining.

Data quality and consistency is a prerequisite for mining to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.

Page 91: OLAP and Data Mining

91

Data Mining and Data Warehousing

Advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.

Selecting relevant subsets of records and fields for data mining requires query capabilities of the data warehouse.

Page 92: OLAP and Data Mining

92

Data Mining and Data Warehousing

Results of a data mining study are useful if there is some way to further investigate the uncovered patterns. Data warehouses provide capability to go back to the data source.