DESIGN AND IMPLEMENTATION

OF

DATA ANALYSIS COMPONENTS

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Grace C. Shiao

May, 2006

DESIGN AND IMPLEMENTATION

OF

DATA ANALYSIS COMPONENTS

Grace C. Shiao

Thesis

Approved:

Advisor: Dr. Chien-Chung Chan

Committee Member: Dr. Xuan-Hien Dang

Committee Member: Dr. Zhong-Hui Duan

Department Chair: Dr. Wolfgang Pelz

Accepted:

Dean of the College: Dr. Ronald F. Levant

Dean of the Graduate School: Dr. George R. Newkome

Date: _________________

ABSTRACT

This thesis describes the design and implementation of the data analysis

components. Many features of modern database systems facilitate the decision-making

process. Recently, Online Analytical Processing (OLAP) and data mining have been

increasingly used in a wide range of applications. OLAP allows users to analyze

data from a wide variety of viewpoints. Data mining is the process of selecting,

exploring, and modeling large amounts of data to discover previously unknown patterns

for business advantage. Microsoft® SQL Server™ 2000 Analysis Services provides a rich

set of tools to create and to maintain OLAP and data mining objects. In order to use

these tools, users need to fully understand the underlying architectures and the

specialized technical terms, which are unrelated to the data analysis itself. The

complexity of development prevents data analysts from using these tools

effectively. In this work, we developed several components that can serve as the

foundation of analytical applications. Using these components in software

applications hides the technical complexities and provides tools to build OLAP

and mining models and to access the information in these models. Developers

can also reuse these components without coding from scratch. The reusability of these

components enhances the application’s reliability and reduces the development costs and

time.

DEDICATION

Dedicated to my late parents

Mr. and Mrs. K. C. Chang

Who taught me the value of Education

And

Opened my eyes to the Power of Knowledge

ACKNOWLEDGEMENTS

First of all, I want to thank my adviser Dr. Chien-Chung Chan for his guidance and

support throughout my graduate research. His feedback helped to strengthen my research

skills and contributed greatly to this thesis. I want to thank my thesis committee

members, Dr. Xuan-Hien Dang and Dr. Zhong-Hui Duan, for their guidance and

encouragement. In addition, I want to thank the faculty members of the Department of

Computer Science for building the foundation of my computer knowledge.

I also want to thank my late parents and wish they had been able to see this

finished manuscript. I appreciate both of them for their love, support and encouragement

in my life. I thank my husband S. Y. for his love and support through these years, and

my daughter Ming-Hao and my son Ming-Jay for their love, humor, and understanding.

Lastly, I thank the Mighty God for all His grace and blessing in my life.

CHAPTER I

INTRODUCTION

Data are not only valuable assets, but also the strategic resources in today’s

competitive environment. Organizations around the world are accumulating vast and

growing amounts of data in different database formats. Business companies need to

understand the effectiveness of their marketing efforts and quickly maintain the large

volumes of data created each day. These challenges require a well-defined database

system that can bring together disparate data with different dimensionality and

granularity. Making the data meaningful is no small task, especially given the different

aspects of data analysis. Companies need quality analysis of operational information to

understand their business strengths and weaknesses. Business analysis focuses on the

effective use of data and information to drive positive business actions. With good and

accurate data analysis, business decision makers can make well-informed decisions for

the future of their organizations. Business Intelligence (BI) tools allow companies

to automate their analysis, strategy, and forecasting functions to make better business

decisions. Online Analytical Processing (OLAP) and data mining models are the key

features of BI tools that help companies extract data from an operational system,

summarize data into working totals, find hidden patterns in the data for future

analysis and prediction, and present these results intuitively to end users [1, 2].

1.1 What is Online Analytical Processing (OLAP)?

The standard definition of OLAP provided by the OLAP Council [2] is:

“A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user”.

The functionality of OLAP, according to the definition of the OLAP Council, lets

the users complete the following tasks [2]:

• Calculations and modeling applied across dimensions, through hierarchies and/or across members

• Trend analysis over sequential time periods

• Slicing subsets for on-screen viewing

• Drill-down to deeper levels of consolidation

• Reach-through to underlying detail data

• Rotation to new dimensional comparisons in the viewing area.

Therefore, OLAP performs multidimensional analysis of enterprise data and

provides the capabilities for complex calculations, trend analysis and very sophisticated

data modeling. In addition, OLAP enables end-users to perform ad hoc analysis of data

in multiple dimensions, thereby providing the insight and understanding they need for

better decision making.

An OLAP structure created from the operational data is called an OLAP cube [1, 2].

OLAP cubes are data processing units consisting of the facts and the dimensions from the

database. They provide multidimensional views and analytical querying capacities.

Therefore, OLAP technology can provide fast answers for complex querying on

operational data for decision-making management.
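
The multidimensional operations described in this section can be illustrated with a minimal Python sketch. It does not reflect how Analysis Services stores a cube. The fact rows, the dimension names (year, region, product), and the helper functions roll_up and slice_cube are all hypothetical and exist only to show slicing and consolidation on a tiny in-memory fact list.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales_amount).
DIMENSIONS = ("year", "region", "product")
FACTS = [
    (2004, "East", "Widget", 100.0),
    (2004, "West", "Widget", 150.0),
    (2005, "East", "Widget", 120.0),
    (2005, "East", "Gadget", 80.0),
    (2005, "West", "Gadget", 60.0),
]


def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the last field is the measure
    return dict(totals)


def slice_cube(facts, dimension, member):
    """Keep only the facts whose `dimension` equals `member`."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == member]


# Consolidate sales to the year level.
by_year = roll_up(FACTS, ["year"])

# Slice on region = "East", then view the slice by product.
east_by_product = roll_up(slice_cube(FACTS, "region", "East"), ["product"])

print(by_year)
print(east_by_product)
```

Passing a longer list to roll_up, such as ["year", "region"], corresponds to drilling down to a finer level of consolidation in this toy model.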

1.2 Data Mining

Data Mining is defined as the automated extraction of hidden predictive information

from database systems [3, 4]. Generally, it is the process of analyzing data from different

perspectives and discovering patterns and regularities in sets of data. Specifically, the

hidden patterns and the correlations discovered in the data can provide strategic business

advantages for decision-making in organizations.

1.3 Statement of the Problem

Microsoft® Analysis Services, shipped with SQL Server™ 2000, is the OLAP

database engine and is able to build multidimensional cubes [1, 5]. It also provides the

application programs to browse the cube data and tools to support data mining algorithms

for discovering trends in data and predicting future results. The implementation of

Analysis Services is heavily wizard-oriented for building and managing data cubes and data

mining models. Although many features are also available through the predefined editors,

the wizard-intensive process still requires users to fully understand the cube structure and

associated objects in the definition process. The complexity of cube development makes

it difficult for end-users with little technical experience to gain access to these analysis

tools.

1.4 Motivations and Contributions

In reality, most decision-makers within an enterprise want to be able to use the

insights gained from their data for more tactical decision-making purposes. However,

they are generally not interested in spending time building cubes or mining models to

answer their business questions. Analysis Services provides intensive wizards and editors in

the development of OLAP cubes and the mining models. It has been designed to be

flexible for all levels of users, but users have difficulty learning to use these features

effectively and creating useful models for decision making. The best solution is to design

a specific front-end interface that meets the user’s requirements, lets the user cross-analyze

data with even a single click, and masks the underlying complexities of the

applications from the users.

Analysis applications contain sensitive and confidential information that should be

protected against unauthorized access and available only to appropriate decision

makers. Analysis Services automatically creates an OLAP Administrators group in the

operating system. A member of the OLAP Administrators group has complete access to

the analysis objects. A user that is not a member of the OLAP Administrators group has

read- or write-access to the extent permitted based on dimension-level or cell-level

security but performs no administrative tasks. However, the active user must be a

member of the OLAP Administrators group to use Analysis Manager. Therefore, a

non-Administrator user cannot exploit the cube information through Analysis Manager. One

goal of this thesis is to resolve this conflict by constructing a client-application interface that uses

Multi-dimensional Expressions (MDX) and ActiveX® Data Objects/Multi-dimensional

(ADO/MD) to query OLAP data [1, 6].
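
To give a sense of the kind of MDX statement such a client interface might send, the following Python sketch composes the query text only. It does not connect to an Analysis server. The cube name Sales, the measure names, the [Time].[Year] level, and the helper build_mdx are hypothetical and are not taken from this thesis.

```python
def build_mdx(cube, measures, row_level):
    """Compose a simple two-axis MDX SELECT statement as a string.

    `cube`, `measures` and `row_level` are illustrative placeholders;
    a real query must use names that exist in the target cube.
    """
    columns = ", ".join(f"[Measures].[{name}]" for name in measures)
    return (
        f"SELECT {{ {columns} }} ON COLUMNS, "
        f"{{ {row_level}.Members }} ON ROWS "
        f"FROM [{cube}]"
    )


# Hypothetical cube and member names, used only for illustration.
query = build_mdx("Sales", ["Unit Sales", "Store Sales"], "[Time].[Year]")
print(query)
```

In a real client the resulting string would be handed to ADO MD (for example, as the source of a Cellset object) on a machine where PivotTable Service is installed. That step is omitted here because it depends on the Windows environment.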

The main contributions of this thesis are as follows:

• Development of a component, cubeBuilder, for software developers to design an

application interface that can build the OLAP cube model to meet the user’s

analytical requirements

• Development of a component, DMBuilder, for developers to design a specific

user interface to create data mining models for users to uncover previously

unknown patterns

• Development of a component, cubeBrowser, for developers to design a client

interface to browse the cube data for non-Administrators group users.

In addition, these data analysis components not only help the software developers to

design the specific application without coding from scratch, but also hide the

complexities of development challenges from the less technically-oriented users.

1.5 Organization of the Thesis

This thesis covers the work on the development of the data analysis components,

cubeBuilder, cubeBrowser and DMBuilder for OLAP and mining model solutions. This

thesis is organized as follows:

Chapter II provides an overview of Microsoft SQL Server Analysis Services

including its fundamental operations and architectures in the functionality of OLAP and

Data Mining model. The step-by-step processes used to create an OLAP cube, to browse

the existing cube data and to create a data mining model with Analysis Manager are also

illustrated and described in Chapter II.

Chapter III focuses on the development of the design and the structures of the

analysis components for OLAP and mining model solutions.

Chapter IV describes the implementation of these analysis components in

desktop and web-based application interfaces for OLAP cube and mining model systems.

It also describes a case study with the heart disease dataset to demonstrate the application

of the analysis components.

Chapter V presents a summary of the work that has been done in this thesis. It also

compares the functionality of the analysis components with that of Analysis Manager in

building OLAP cubes and mining models. The directions of future work

and the conclusion of this thesis are also presented in Chapter V.

CHAPTER II

MICROSOFT SQL SERVER 2000 ANALYSIS SERVICES

2.1. Overview

Microsoft® SQL Server™ 2000 Analysis Services provides a fully functional OLAP

environment, which includes both OLAP and data-mining functionality [5]. It is a suite

of decision support engines and tools. It can also function as an intermediate layer that

converts relational warehouse data into a form, also called a cube, which makes it fast

and flexible for creating an analytical report.

2.2. Architecture

The architecture of Analysis Services can be divided into two portions: the server

and the client, as shown in Figure 2.1. The server portion, including the engines,

provides the functionality and power, while the client portion has interfaces for front-end

applications [5].

2.2.1. Server Architecture

The primary component of Analysis Services is the Analysis Server. The Analysis

Server operates as a Microsoft Windows NT or Windows 2000 service and is

specifically designed to create and maintain multidimensional data structures [5, 6]. It

also provides multidimensional data values to client queries and manages connections to

the specified data sources and local access security. Figure 2.1 illustrates the Analysis

Manager, a snap-in console in Analysis Services, which communicates with the server

through the Decision Support Objects (DSO) component tool. The DSO is a set of

programming instructions for applications to work with the Analysis Services [7].

Figure 2.1 Analysis Services architecture

[Diagram labels: Server; Client; Analysis Server; data sources; cubes; mining models; Analysis Manager; Microsoft Management Console (MMC); Decision Support Objects (DSO); PivotTable Service; ADO MD; client applications.]

2.2.2. Client Architecture

The client side of the Analysis Services is primarily used to provide an accessing

interface, the PivotTable Service, between the server and the custom applications, as

shown in Figure 2.1 [6, 7]. PivotTable Service communicates with the Analysis server

and provides interfaces for client applications to access OLAP data and data mining data

on the server [6, 7]. It provides the OLE DB interface through which custom programs or client

tools access data managed by Analysis Services.

2.3 OLAP Cube

The primary form of data representation within the Analysis Services is the OLAP

cube [5-8]. A cube is a logical construct. It is a multidimensional representation of both

detailed and summary data. Cubes are designed according to the client’s analytical

requirements. Each cube represents data values of different business entities. Each side

of the cube presents a different aspect of the data.

Cubes in the Analysis Services are built using one of two types of database schemas:

the star schema and the snowflake schema [9]. Both schemas consist of a fact table and

dimension tables. The Analysis Services aggregates data from these tables to build

cubes. As shown in Figure 2.2, the star schema consists of a fact table and several

dimension tables. Each dimension table corresponds to a column in the fact table. The

data in the dimension tables are used to form the analytical queries against the fact table.

However, in the snowflake schema, several dimension tables are joined before being

linked to the fact table.
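
A star schema of this shape can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for the relational warehouse. The table names (fact_sales, dim_product, dim_store), their columns, and the sample rows are hypothetical. The query joins each dimension table to its key column in the fact table and aggregates the measure, which is the kind of work Analysis Services performs when it builds a cube.

```python
import sqlite3

# In-memory database standing in for the relational warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- The fact table holds only keys and the numeric measure.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_store   VALUES (10, 'Akron'), (20, 'Cleveland');
    INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0);
""")

# Join every dimension directly to the fact table (star schema) and aggregate.
rows = conn.execute("""
    SELECT p.product_name, s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store   AS s ON s.store_id   = f.store_id
    GROUP BY p.product_name, s.city
    ORDER BY p.product_name, s.city
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake variant of this sketch, dim_store would itself reference a further table (for example, a hypothetical dim_region), so an extra join would be needed before reaching the fact table.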

Figure 2.2 The star and snowflake schemas

[Diagram labels. Star schema: Fact Table; Dimension table 1; Dimension table 2; Dimension table 3. Snowflake schema: Fact Table; Dimension Tables; a layer of Dimension tables.]

2.4 Analysis Manager

The Analysis Manager is a tool for the Analysis Server administration in Microsoft

SQL Server 2000 Analysis Services [5-9]. It is a snap-in application within the

Microsoft Management Console (MMC), which is the common framework for hosting

administrative tools. Figure 2.3 shows the hierarchical, tree-view

representation of the server and all its components in the left pane of the console.

Figure 2.3 Screenshot of the Analysis Manager

The major functional features for the Analysis Manager are summarized as follows:

• Administering Analysis server

• Creating database and specifying data sources

• Creating and processing cubes

• Creating dimensions for the specified database

• Specifying storage options and optimizing performance

• Authorizing and managing cube security

• Browsing cube data, shared dimensions and other objects

• Creating data mining model from relational and multidimensional data

• Viewing the Mining Model.

2.4.1 Creating the Basic Cube Model

Analysis Services provides wizards and editors within the Analysis Manager to let

the user create the cube easily [6, 8]. The step-by-step instructions for building a basic

cube model in the Analysis Manager using the Cube Wizard are summarized as follows:

1. Creating an Analysis Server’s database

A database acts like a folder that holds cubes, data sources, shared dimensions,

mining models, and database roles, as illustrated in Figure 2.3. To create a new database on

a server, launch the Analysis Manager, right-click the server name, and then

select New Database from the pop-up menu [1, 2]. The Database dialog box appears for

the user to enter a database name for the new cube model, as shown in Figure 2.4.

Figure 2.4 Screenshot of the database dialog box of Cube Wizard

2. Specifying the data source

After creating a new database, a data source needs to be specified for the cube. The

data source contains the information of the data used in the cube [6, 7]. The purpose of

adding a data source is to let Analysis server establish connections to the source data.

The Data Link dialog box, as illustrated in Figure 2.5, can be opened by right-clicking the

Data Source folder and selecting New Data source from the pop-up menu.

Figure 2.5 Screenshot of the Provider for the Data Link dialog box

In the Data Link dialog box shown in Figure 2.6, the user can specify a provider, the

server name, login information and a database name to connect to the Analysis server.

Figure 2.6 Screenshot of the Connection tab of the Data Link dialog box

3. Selecting the fact table and the measures

The Cube Wizard and the Cube Editor are the tools used in the Analysis

Manager to create the OLAP cube [8]. A fact table contains the measure fields, which

consist of the numeric values for the analysis, and the key fields that are used to join to

dimension tables. The fact table should not contain any descriptive information or any

labels in addition to the measures and the index fields. Each cube must be based on only

one fact table. As shown in Figure 2.7, the panel displays all the tables in the specified

data source. After the user selects the fact table and clicks the “Next” button, the Wizard displays

all of the available numeric data in the selected table, as shown in Figure 2.8.

Figure 2.7 Screenshot of the “Select a fact table” dialog box with a selected fact table

After the user specifies the measures from the list and clicks the “Next” button, the Cube

Wizard asks the user to select dimensions or to create dimensions.

Figure 2.8 Screenshot of the “Defining measures” dialog box

4. Adding dimensions and levels to the cube

Dimensions are the categories for the user to analyze and summarize the data [6-8].

In other words, dimensions are the organized hierarchies that describe the data functions

in the fact table. There are two types of dimensions to be created for use in the cube. A

dimension created for use in an individual cube is called a private dimension. A shared

dimension is the one that multiple cubes can use [8]. A cube must contain at least one

dimension, and the dimension must exist in the database object where a cube will be

created.

In the Analysis Manager, a new dimension can be created either by the Cube Editor

or the Cube Wizard. If the editors are used to build the cube, then a dimension has to be

created before adding to a cube. However, if the Cube Wizard is used to create a cube,

then it will launch the Dimension Wizard to handle the task as part of the processing in

creating a cube [8]. The step-by-step processes of creating a new shared dimension with

the Dimension Wizard are summarized as follows:

a. Selecting the type of dimension schema in the screen of the “Choose how

you want to create the dimension”, as shown in Figure 2.9.

Figure 2.9 Screenshot of the Dimension Wizard

b. Specifying the dimension table from the available table list in the screen of

the “Select the dimension table”, as shown in Figure 2.10.

c. Selecting the level on the screen of the “Select the levels for your

dimension”, as shown in Figure 2.11.

Figure 2.10 Screenshot of the “Select Dimension table” dialog box

Figure 2.11 Screenshot of the “Select levels” dialog box

d. Specifying the new dimension name and previewing the dimension data in

the “Finish” dialog box of the Dimension Wizard, as illustrated in Figure

2.12.

Figure 2.12 Screenshot of the “Dimension Finish” dialog box

5. Setting the storage options and setting up the cube aggregations

The storage mode determines how the data is organized in the server [8, 9]. It

affects the requirements of disk-storage space and the data-retrieval performance. There

are three types of storage options supported by Analysis Services: Multidimensional

OLAP (MOLAP), Relational OLAP (ROLAP), and Hybrid OLAP (HOLAP). The

descriptions and storage locations of each mode are summarized in Table 2.1. The

Storage Design Wizard is used to select the option for the cube in the Analysis Manager,

as shown in Figure 2.13.

Table 2.1 Storage options supported by Analysis Services

ROLAP (Relational OLAP)

Description: 1. Slow processing. 2. Slow query response. 3. Huge storage requirements. 4. Suitable for large databases or legacy data.

Fact data stored in: relational database server

Aggregated values stored in: relational database server

MOLAP (Multidimensional OLAP)

Description: 1. Requires data duplication. 2. Pre-summarizes the data to improve performance in querying and displaying the data. 3. High performance. 4. Good for small to medium-size data sets.

Fact data stored in: cube

Aggregated values stored in: cube

HOLAP (Hybrid OLAP)

Description: A combination of ROLAP and MOLAP. 1. Does not create a copy of data. 2. Provides connectivity to a large number of relational databases. 3. Good when storage space is limited but faster query responses are needed.

Fact data stored in: relational database server

Aggregated values stored in: cube

Figure 2.13 Screenshot of the “Storage Design Wizard” for selecting storage options

After deciding the storage option, the next step is to specify the aggregation options

in the Set Aggregation Options dialog, as illustrated in Figure 2.14 [8, 9]. This option

allows the user to set the level of aggregation for the cube to boost the performance of

queries.

Aggregations are pre-calculated summaries of data that improve query response

time. The higher the cube’s level of aggregation, the faster the queries will be executed,

but a greater amount of disk space will be needed and more time will be required to

process the cube.
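
The trade-off between aggregation level and storage can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the aggregation design algorithm of Analysis Services: the fact rows, the dimension names, and the helpers build_aggregations and stored_cells are hypothetical. The sketch pre-calculates one summary table per combination of dimensions and counts how many summary cells must be stored as more combinations are included.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (year, region, product, sales_amount).
DIMENSIONS = ("year", "region", "product")
FACTS = [
    (2004, "East", "Widget", 100.0),
    (2004, "West", "Widget", 150.0),
    (2005, "East", "Widget", 120.0),
    (2005, "East", "Gadget", 80.0),
    (2005, "West", "Gadget", 60.0),
]


def aggregate(facts, dims):
    """Sum the measure for each distinct combination of the given dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def build_aggregations(facts, max_dims):
    """Pre-calculate a summary for every combination of up to `max_dims` dimensions."""
    store = {}
    for size in range(max_dims + 1):
        for dims in combinations(DIMENSIONS, size):
            store[dims] = aggregate(facts, dims)
    return store


def stored_cells(store):
    """Count the summary cells kept, as a rough proxy for disk space."""
    return sum(len(table) for table in store.values())


light = build_aggregations(FACTS, 1)  # few aggregations: small, but fewer queries answered directly
full = build_aggregations(FACTS, 3)   # every aggregation: larger, but any query is a lookup

print(stored_cells(light), stored_cells(full))
print(full[("year", "region")][(2005, "East")])
```

A query by year and region is a single dictionary lookup against the fully aggregated store, whereas the lightly aggregated store would have to scan the fact rows to answer it.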

In the Analysis Services, there are three aggregation options for selection:

• Estimated storage reaches: specifying the maximum storage size in either megabytes (MB) or gigabytes (GB)

• Performance gain reaches: specifying the percentage amount of performance gain for the queries

• Until I click stop: selecting the manual control of the balance.

Figure 2.14 Screenshot of the “Set aggregation options” dialog box

6. Processing the cube

Processing the cube is required before attempting to browse the cube data, especially

after designing its storage options and aggregations, because the aggregations must

be calculated for the cube before the user can view the cube data [8, 9].

The major activities involved in the cube processing are described in a

“Process” window, as shown in Figure 2.15, and summarized as follows:

a. Reading the dimension tables to populate the levels from the actual data

b. Reading the fact table

c. Calculating specified aggregations

d. Storing the results in the cube.

Figure 2.15 Screenshot of the “Process” window

In the Analysis Manager, there are three options to be used to process a cube

depending on the different circumstances of the data structures. These options,

summarized in Table 2.2, can be selected in the “Process a Cube” dialog box, as shown in

Figure 2.16 [9].

Table 2.2 Summary of cube process options

Process option: circumstance

Full process: modifying the structure of the cube

Incremental update: adding new data to the cube

Refresh data: clearing out and replacing a cube’s source data

Figure 2.16 Screenshot of the “Process a cube” dialog box

2.4.2 Browsing a Cube

In the Analysis Manager, the cube data can be viewed in the Cube Browser

[5-9]. There are two ways to open the Cube Browser and load cube data into it:

a. Right-clicking the cube name in the Analysis Manager tree pane and selecting

“Browse Data” from the pop-up menu

b. Clicking “Browse Sample Data” in the last step of the Cube Wizard

The Cube Browser not only lets users view the multidimensional data in a flattened

two-dimensional grid format, as shown in Figure 2.17, but also makes it possible to drill

up or drill down different dimensions of data. However, the Cube Browser cannot be

used to view unprocessed cube data [6].

Figure 2.17 Screenshot of the “Cube Browser” and sample results

2.4.3 Building the Data Mining Models

Data Mining is the process of extracting hidden knowledge from large volumes of

data [10, 11]. It involves uncovering patterns, trends, and relationships from historical

data and predicting outcomes of future situations. The primary mechanism for data

mining is the data mining model, an abstract object that stores data mining information in

a series of schema rowsets. The mining model serves as the blueprint for how data

should be analyzed or processed. Once the model is processed, information associated

with the mining model not only represents what was learned from the data, but also

allows users to discover the business trends for future decision making [11]. Two data

mining algorithms are built into Microsoft SQL Server 2000 Analysis Services: Microsoft

Decision Trees and Microsoft Clustering [12, 13].

A. Decision Trees Algorithm:

The Microsoft Decision Trees algorithm uses recursive partitioning to divide the data

into a tree structure, and it keeps searching for predictive factors until there is

no more data to split [10-13]. Each node in the tree structure represents a

predictive factor used to classify the data. This method focuses on providing information

paths for rules and patterns within data, and is useful in predicting the exact outcomes of

future problems [12, 13].
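
Recursive partitioning in general can be illustrated with the following minimal Python sketch. It is a generic textbook-style tree builder, not the Microsoft Decision Trees implementation: the split criterion used here (lowest weighted entropy) is an assumption chosen for simplicity, and the attribute names and rows are made-up toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split gives the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)


def build_tree(rows, attributes, target):
    """Recursively partition `rows`; leaves hold the majority label."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {attr: branches}


def predict(tree, row):
    """Follow the branches that match `row` until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[row[attr]]
    return tree


# Made-up training rows, for illustration only.
ROWS = [
    {"owns_home": "yes", "income": "low",  "buys": "yes"},
    {"owns_home": "yes", "income": "high", "buys": "yes"},
    {"owns_home": "yes", "income": "high", "buys": "yes"},
    {"owns_home": "no",  "income": "low",  "buys": "yes"},
    {"owns_home": "no",  "income": "high", "buys": "no"},
]
tree = build_tree(ROWS, ["owns_home", "income"], "buys")
print(tree)
print(predict(tree, {"owns_home": "no", "income": "high"}))
```

Each nested dictionary key in the result plays the role of a node, that is, a predictive factor, and each path from the root to a leaf reads as a rule.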

B. Microsoft Clustering Algorithm:

The Microsoft Clustering algorithm is based on the Expectation-Maximization (EM)

algorithm [11, 12]. It uses iterative refinement techniques to group records into

neighborhoods (clusters) that exhibit similar, predictable characteristics [13]. These are

useful for uncovering a relationship among data items in a large database with hundreds

of evaluated attributes.
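
The iterative refinement performed by EM can be sketched as follows. This is a generic EM fit of a one-dimensional Gaussian mixture, written under simplifying assumptions; it is not the Microsoft Clustering implementation. The function em_cluster, the starting means, and the sample values are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_cluster(values, means, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each cluster for each value.
        resp = []
        for x in values:
            dens = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a cluster from collapsing
    # Assign each value to the cluster with the highest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels


# Made-up values forming two obvious groups.
VALUES = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means, labels = em_cluster(VALUES, means=[0.0, 6.0])
print(means, labels)
```

The E-step and M-step alternate until the cluster parameters stop changing noticeably; a fixed iteration count is used here only to keep the sketch short.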

The following steps describe the process of creating a mining model using the

mining model wizard in the Analysis Manager [13]:

1. Specifying the type of data:

In the window of “select data source type”, as shown in Figure 2.18, users

can select either relational data type or OLAP data to build the target mining

model.

Figure 2.18 Screenshot of the “Select source type” dialog box

2. Selecting the source cube:

In the “select source cube” window, as shown in Figure 2.19, users need to

highlight the target cube from the available cube lists [11, 13].

Figure 2.19 Screenshot of “Select source cube” window

3. Specifying the data mining method:

In the “Select data mining technique” window, as shown in Figure 2.20,

users can select one of two mining algorithms provided with the Analysis

Services: Microsoft Decision Trees and Microsoft Clustering [9, 10].

Figure 2.20 Screenshot of the selecting mining model technique

4. Identifying the case base or unit of analysis

In the “Select case” window, as shown in Figure 2.21, users need to

specify the case base of the analysis for the modeling task. A case is the basic

unit of analysis for mining task.

Figure 2.21 Screenshot of the “Select case” dialog box for specifying a case of analysis

5. Selecting the predicted entity:

In this step users must provide information for prediction used in the

mining model [12], as shown in Figure 2.22. The predicted entity can be

chosen as one of the following items:

• A measure of the source table

• A member property of the case dimension and level

• Members of another dimension in the cube.

This feature provides flexibility in the process of predictive analysis using

OLAP data.

Figure 2.22 Screenshot of “Select predicted entity” window

6. Selecting the training data:

The training data is used to process the OLAP data mining model and to

define the column structure of the data mining model for the case set. As shown in

Figure 2.23, the users should select at least one additional data item from the

training data [12, 13].

Figure 2.23 Screenshot of the “Select training data” window

7. Naming and processing the model:

After the user enters a model name and selects the “Save and process now”

check box, as shown in Figure 2.24, the wizard processes the model and

trains the model with data based on the specified algorithm. Figure 2.25

displays the process of model execution [13]. When the process is complete, the

message “Processing completed successfully” appears at the bottom of the

dialog box.

Figure 2.24 Screenshot of the “Saving the data model” of the Mining Model Wizard

Figure 2.25 Screenshot of the “Model execution diagnostics” window

After the user clicks the “Close” button, the OLAP Mining Model Editor is launched

and the system displays the content details of the created mining model, as shown in Figure

2.26.

Figure 2.26 Screenshot of the content details of a created mining model
