DESIGN AND IMPLEMENTATION
OF
DATA ANALYSIS COMPONENTS
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
Grace C. Shiao
May, 2006
ii
DESIGN AND IMPLEMENTATION
OF
DATA ANALYSIS COMPONENTS
Grace C. Shiao
Thesis
Approved: Accepted: _________________________________ ____________________________________ Advisor Dean of the College Dr. Chien-Chung Chan Dr. Ronald F. Levant
_________________________________ ____________________________________ Committee Member Dean of the Graduate School Dr. Xuan-Hien Dang Dr. George R. Newkome _________________________________ ____________________________________ Committee Member Date Dr. Zhong-Hui Duan _________________________________ Department Chair Dr. Wolfgang Pelz
iii
ABSTRACT
This thesis describes the design and implementation of the data analysis
components. Many features of modern database systems facilitate the decision-making
process. Recently, Online Analytical Processing (OLAP) and data mining are
increasingly being used in a wide range of applications. OLAP allows users to analyze
data from a wide variety of viewpoints. Data mining is the process of selecting,
exploring, and modeling large amounts of data to discover previously unknown patterns
for business advantage. Microsoft® SQL server™ 2000 Analysis Services provides a rich
set of tools to create and to maintain OLAP and data mining objects. In order to use
these tools, users need to fully understand the underlying architectures and the
specialized technological terms, which are not related to the data analysis. The
complexities in the development challenges prevent the data analysts to use these tools
effectively. In this work, we developed several components, which can be used as the
foundation in the analytical applications. Using these components in the software
applications can hide the technical complexities and can provide tools to build the OLAP
and mining model and to access data information from these model systems. Developers
can also reuse these components without coding from scratch. The reusability of these
components enhances the application’s reliability and reduces the development costs and
time.
iv
DEDICATION
Dedicated to my late parents
Mr. and Mrs. K. C. Chang
Who taught me the value of Education
And
Opened my eyes to the Power of Knowledge
v
ACKNOWLEDGEMENTS
First of all, I want to thank my adviser Dr. Chien-Chung Chan for his guidance and
support throughout my graduate research. His feedback helped to strengthen my research
skills and contributed greatly to this thesis. I want to thank my thesis committee
members, Dr. Xuan-Hien Dang and Dr Zhong-Hui Duan, for their guidance and
encouragement. In addition, I want to thank the faculty members of the Department of
Computer Science for building the foundation of my computer knowledge.
I also want to thank my late parents and wish they would have been able to see this
finished manuscript. I appreciate both of them for their love, support and encouragement
in my life. I thank my husband S. Y. for his love and support through these years, and to
my daughter Ming-Hao and my son Ming-Jay for their love, humor, and understanding.
Lastly, I thank the Mighty God for all His grace and blessing in my life.
1
CHAPTER I
INTRODUCTION
Data are not only valuable assets, but also the strategic resources in today’s
competitive environment. Organizations around the world are accumulating vast and
growing amounts of data in different database formats. Business companies need to
understand the effectiveness of their marketing efforts and quickly maintain the large
volumes of data created each day. These challenges require a well-defined database
system that can bring together disparate data with different dimensionality and
granularity. Making the data meaningful is no small task, especially given the different
aspects of data analysis. Companies need quality analysis of operational information to
understand their business strengths and weaknesses. Business analysis focuses on the
effective use of data and information to drive positive business actions. With good and
accurate data analysis, business decision makers can make well-informed decisions for
the future of their organizations. The Business Intelligence (BI) tools allow companies
to automate its functions of analysis, strategy, and forecasting to make better business
decisions. Online Analytical Processing (OLAP) and Data mining model are the key
features of the BI tools that help companies extract data from an operational system, to
summarize data into working totals, to find the hidden patterns from data for future
analysis and prediction, and to intuitively present these results to the end users [1, 2].
2
1.1 What is Online Analytical Processing (OLAP)?
The standard definition of OLAP provided by the OLAP Council [2] is:
“A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user”.
The functionality of OLAP, according to the definition of the OLAP Council, lets
the users complete the following tasks [2]:
• Calculations and modeling applied across dimensions, through hierarchies and/or across members
• Trend analysis over sequential time periods • Slicing subsets for on-screen viewing • Drill-down to deeper levels of consolidation • Reach-through to underlying detail data • Rotation to new dimensional comparisons in the viewing area.
Therefore, OLAP performs multidimensional analysis of enterprise data and
provides the capabilities for complex calculations, trend analysis and very sophisticated
data modeling. In addition, OLAP enables end-users to perform ad hoc analysis of data
in multiple dimensions, thereby providing the insight and understanding they need for
better decision making.
An OLAP structure created from the operational data is called an OLAP cube [1, 2].
OLAP cubes are data processing units consisting of the fact and the dimensions from the
database. They provide multidimensional views and analytical querying capacities.
Therefore, OLAP technology can provide fast answers for complex querying on
operational data for decision-making management.
3
1.2 Data Mining
Data Mining is defined as the automated extraction of hidden predictive information
from database systems [3, 4]. Generally, it is the process of analyzing data from different
perspectives and discovering patterns and regularities in sets of data. Specifically, the
hidden patterns and the correlations discovered in the data can provide strategic business
advantages for decision-making in organizations.
1.3 Statement of the Problem
Microsoft® Analysis Services, shipped with SQL server™ 2000, is the OLAP
database engine and is able to build multidimensional cubes [1, 5]. It also provides the
application programs to browse the cube data and tools to support data mining algorithms
for discovering trends in data and predicting future results. The implementation of
Analysis Services is heavily wizard oriented in building and managing data cube and data
mining model. Although many features are also available through the predefined editors,
the wizard-intensive process still requires users to fully understand the cube structure and
associated objects in the definition process. The complexity of cube development makes
it difficult for end-users with little technical experience to gain access to these analysis
tools.
1.4 Motivations and Contributions
In reality, most decision-makers within an enterprise want to be able to use the
insights gained from their data for more tactical decision-making purposes. However,
they are not generally interested in spending time in building cube or mining model to
4
answer their business issues. Analysis Services provides intensive wizards and editors in
the development of OLAP cubes and the mining models. It has been designed to be
flexible for all levels of users, but users have difficulty learning to use these features
effectively and creating useful models for decision making. The best solution is to design
a specific front-end interface to meet the user’s requirements with the ability to cross-
analyze data even through a single click and to mask the underlying complexities of the
applications from the users.
Analysis applications contain sensitive and confidential information that should be
protected against unauthorized access and only are available to appropriate decision
makers. Analysis Services automatically creates an OLAP Administrators group in the
operating system. A member of the OLAP Administrators group has complete access to
the analysis objects. A user that is not a member of the OLAP Administrators group has
read- or write-access to the extent permitted based on dimension-level or cell-level
security but performs no administrative tasks. However, the active user must be a
member of the OLAP administrators group to use Analysis Manager. Therefore, the non-
Administrator user can not exploit the cube information through Analysis Manager. One
of the scope of this thesis is to construct a client-application interface by using the Multi-
dimensional Expressions (MDX) and ActiveX® Data Objects/Multi-dimensional
(ADO/MD) to query OLAP data to solve this conflict issue [1, 6].
The main contributions of this thesis are as follows:
• Development of a component, cubeBuilder, for software developers to design
application interface which can build the OLAP cube model to meet user’s
analytical requirements
5
• Development of a component, DMBuilder, for developers to design a specific
user-interface to create data mining model for users to uncover previously
unknown patterns
• Development of a component, cubeBrowser, for developers to design a client
interface to browse the cube data for non-Administrators group users.
In addition, these data analysis components not only help the software developers to
design the specific application without coding from scratch, but also hide the
complexities of development challenges from the less technically-oriented users.
1.5 Organization of the Thesis
This thesis covers the work on the development of the data analysis components,
cubeBuilder, cubeBrowser and DMBuilder for OLAP and mining model solutions. This
thesis is organized as follows:
Chapter II provides an overview of Microsoft SQL Server Analysis Services
including its fundamental operations and architectures in the functionality of OLAP and
Data Mining model. The step-by-step processes used to create an OLAP cube, to browse
the existing cube data and to create a data mining model with Analysis Manager are also
illustrated and described in Chapter II.
Chapter III focuses on the development of the design and the structures of the
analysis components for OLAP and mining model solutions.
Chapter IV describes the implementations of these analysis components in the
desktop and web-based applications interface for OLAP cube and mining model system.
6
It also describes a case study with the heart disease dataset to demonstrate the application
of the analysis components.
Chapter V presents a summary of the work that has been done in this thesis. It also
compares the functionalities between the analysis components and Analysis Manager in
the aspects of building of OLAP cube and mining model. The directions of future work
and the conclusion of this thesis are also presented in Chapter V.
7
CHAPTER II
MICROSOFT SQL SERVER 2000 ANALYSIS SERVICES
2.1. Overview
Microsoft® SQL server™ 2000 Analysis Services provides fully-functional OLAP
environment, which includes both OLAP and data-mining functionality [5]. It is a suite
of decision support engines and tools. It can also function as an intermediate layer that
converts relational warehouse data into a form, also called a cube, which makes it fast
and flexible for creating an analytical report.
2.2. Architecture
The architecture of Analysis Services can be divided into two portions: the server
and the client, as shown in Figure 2.1. The server portion, including the engines,
provides the functionality and power, while the client portion has interfaces for front-end
applications [5].
2.2.1. Server Architecture
The primary component of Analysis Services is the Analysis Server. The Analysis
Server operates as a Microsoft Window NT or Windows 2000 service and is
Analysis Manager
Decision Support Objects (DSO)
Data sources
Cubes
Analysis ServerMining models
Client ApplicationClient Application
ADO MD
PivotTable Service
Client
Server
Microsoft Management Console (MMC)
Figure 2.1 Analysis Services architecture
specifically designed to create and maintain multidimensional data structures [5, 6]. It
also provides multi-dimensional data values to client queries and manages connections to
the specified data sources and local access security. Figure 2.1 illustrates the Analysis
Manager, a snap-in console in Analysis Services, which communicates with the server
8
9
through the Decision Support Objects (DSO) component tool. The DSO is a set of
programming instructions for applications to work with the Analysis Services [7].
2.2.2. Client Architecture
The client side of the Analysis Services is primarily used to provide an accessing
interface, the PivotTable Service, between the server and the custom applications, as
shown in Figure 2.1 [6, 7]. PivotTable Service communicates with the Analysis server
and provides interfaces for client applications to access OLAP data and data mining data
on the server [6, 7]. It provides the OLE DB interface for users to access data managed
by Analysis Services, custom programs or client tools.
2.3 OLAP Cube
The primary form of data representation within the Analysis Services is the OLAP
cube [5-8]. A cube is a logical construct. It is a multidimensional representation of both
detailed and summary data. Cubes are designed according to the client’s analytical
requirements. Each cube represents data values of different business entities. Each side
of the cube presents a different aspect of the data.
Cubes in the Analysis Services are built using one of two types of database schemas:
the star schema and the snowflake schema [9]. Both schemas consist of a fact table and
dimension tables. The Analysis Services aggregates data from these tables to build
cubes. As shown in Figure 2.2, the star schema consists of a fact table and several
dimension tables. Each dimension table corresponds to a column in the fact table. The
data in the dimension tables are used to form the analytical queries in the fact table.
However, in the snowflake schema, several dimension tables are joined before being
linked to the fact table.
Star Schema
10
Snowflake Schema
Dimension table 1
Dimension table 2
Fact Table
Dimension Table
Fact Table
A layer of Dimension tables
Dimension Table
Dimension Table
Dimension table 3
Figure 2.2 The star and snowflake schemas
2.4 Analysis Manager
The Analysis Manager is a tool for the Analysis Server administration in Microsoft
SQL Server 2000 Analysis Services [5-9]. It is a snap-in application within the
Microsoft Management Console (MMC), which is the common framework for hosting
administrative tools. Figure 2.3 illustrates the screenshot of the hierarchical, tree-view
representation of the server and all its components in the left pane of the console.
Figure 2.3 Screenshot of the Analysis Manager
11
12
The major functional features for the Analysis Manager are summarized as follows:
• Administering Analysis server
• Creating database and specifying data sources
• Creating and processing cubes
• Creating dimensions for the specified database
• Specifying storage options and optimizing performance
• Authorizing and managing cube security
• Browsing cube data, shared dimensions and other objects
• Creating data mining model from relational and multidimensional data
• Viewing the Mining Model.
2.4.1 Creating the Basic Cube Model
Analysis Services provides wizards and editors within the Analysis Manager to let
the user create the cube easily [6, 8]. The step-by-step instructions for building a basic
cube model in the Analysis Manager using the Cube Wizard are summarized as follows:
1. Creating an Analysis Server’s database
A database acts like a folder that holds cubes, data sources, shared dimensions,
mining model and database roles as illustrated in Figure 2.3. To create a new database on
a server, after launching onto the Analysis Manager, right-click the server name and then
select new database from the pop-up menu [1, 2]. The Database dialog box appears for
user to enter a new database name for the new cube model, as shown in Figure 2.4.
Figure 2.4 Screenshot of the database dialog box of Cube Wizard
2. Specifying the data source
After creating a new database, a data source needs to be specified for the cube. The
data source contains the information of the data used in the cube [6, 7]. The purpose of
adding a data source is to let Analysis server establish connections to the source data.
The Data Link dialog box, as illustrated in Figure 2.5, can be opened by right-clicking the
Data Source folder and selecting New Data source from the pop-up menu.
Figure 2.5 Screenshot of the Provider for the Data Link dialog box
13
In the Data Link dialog box shown in Figure 2.6, the user can specify a provider, the
server name, login information and a database name to connect to the Analysis server.
Figure 2.6 Screenshot of the Connection tab of the Data Link dialog box
3. Selecting the fact table and the measures
The Cube Wizard and the Cube Editor are the tools to be used in the Analysis
Manager to create the OLAP cube [8]. A fact table contains the measure fields, which
consist of the numeric values for the analysis, and the key fields that are used to join to
dimension tables. The fact table should not contain any descriptive information or any
labels in addition to the measures and the index fields. Each cube must be based on only
one fact table. As shown in Figure 2.7, the panel displays all the tables in the specified
data source. After selecting the fact table, click the “Next” button, the Wizard displays
all of the available numeric data in the selected table, as shown in Figure 2.8
14
Figure 2.7 Screenshot of the “Select a fact table” dialog box with a selected fact table
After specifying the measures from the list, click the “Next” button, the Cube
Wizard asks the user to select dimensions or to create dimensions.
Figure 2.8 Screenshot of the “Defining measures” dialog box
4. Adding dimensions and levels to the cube
Dimensions are the categories for the user to analyze and summarize the data [6-8].
In other words, dimensions are the organized hierarchies that describe the data functions
in the fact table. There are two types of dimensions to be created for use in the cube. A
dimension created for use in an individual cube is called a private dimension. A shared
dimension is the one that multiple cubes can use [8]. A cube must contain at least one
dimension, and the dimension must exist in the database object where a cube will be
created.
15
In the Analysis Manager, a new dimension can be created either by the Cube Editor
or the Cube Wizard. If the editors are used to build the cube, then a dimension has to be
created before adding to a cube. However, if the Cube Wizard is used to create a cube,
then it will launch the Dimension Wizard to handle the task as part of the processing in
creating a cube [8]. The step-by-step processes of creating a new shared dimension with
the Dimension Wizard are summarized as follows:
a. Selecting the type of dimension schema in the screen of the “Choose how
you want to create the dimension”, as shown in Figure 2.9.
Figure 2.9 Screenshot of the Dimension Wizard
b. Specifying the dimension table from the available table list in the screen of
the “Select the dimension table”, as shown in Figure 2.10.
c. Selecting the level on the screen of the “Select the levels for your
dimension”, as shown in Figure 2.11.
16
Figure 2.10 Screenshot of the “Select Dimension table” dialog box
Figure 2.11 Screenshot of the “Select levels” dialog box
d. Specifying the new dimension name and previewing the dimension data in
the “Finish” dialog box of the Dimension Wizard, as illustrated in Figure
2.12.
17
Figure 2.12 Screenshot of the “Dimension Finish“ dialog box
5. Setting the storage options and setting up the cube aggregations
The storage mode determines how the data is organized in the server [8, 9]. It
affects the requirements of disk-storage space and the data-retrieval performance. There
are three types of storage options supported by Analysis Services: Multi-dimensional
OLAP (MOLAP), the relational OLAP, and the Hybrid OLAP (HOLAP). The
descriptions and storage locations of each mode are summarized in Table 2.1. The
Storage Design Wizard is used to select the option for the cube in the Analysis Manager,
as shown in Figure 2.13
18
Table 2.1 Storage options supported by Analysis Services
Storage Locations Storage Mode
Description Fact data Aggregated
Values ROLAP Relational OLAP
1. Slow processing, 2. Slow query response and 3. Huge storage requirements 4. Suitable for large databases or
legacy data.
Relational database Server
Relational Database Server
MOLAP Multidimensional OLAP 1. Require data duplication 2. Pre-summarizes the data to improve
performance in querying and displaying the data
3. High performances 4. Good for small to medium size data
sets.
Cube Cube
HOLAP Hybrid OLAP A combination of ROLAP and MOLAP 1. Does not create a copy of data 2. Provides connectivity to a large
number of relational databases. 3. Good for limited storage space but
faster query responses are needed.
Relational database Server
Cube
Figure 2.13 Screenshot of the “Storage Design Wizard” for selecting of storage options 19
After deciding the storage option, the next step is to specify the aggregation options
in the Set Aggregation Options dialog, as illustrated in Figure 2.14 [8, 9]. This option
allows the user to set the level of aggregation for the cube to boost the performance of
queries.
Aggregations are pre-calculated summaries of data that improve query response
time. The larger the level of cube’s aggregation, the faster the queries will be executed,
but a greater amount of disk space will be needed and more time will be required to
process the cube.
In the Analysis Services, there are three aggregation options for selection:
• Estimated storage reaches: specifying the maximum storage size in either megabytes (MB) or gigabytes (GB)
• Performance gain reaches: specifying the percentage amount of performance
gain for the queries • Until I click stop: selecting the manual control of the balance
.
Figure 2.14 Screenshot of the “Set aggregation options” dialog box
20
6. Processing the cube
Processing the cube is required before attempting to browse the cube data, especially
after designing its storage options and aggregations, because the aggregations are needed
to be calculated for the cube before the user to view the cube data [8, 9].
The major activities involved in the cube processing are described in a
“Process” window, as shown in Figure 2.15, and summarized as follows:
a. Reading the dimension tables to populate the levels from the actual data
b. Reading the fact table
c. Calculating specified aggregations
d. Storing the results in the cube.
Figure 2.15 Screenshot of the “Process” window
21
In the Analysis Manager, there are three options to be used to process a cube
depending on the different circumstances of the data structures. These options,
summarized in Table 2.2, can be selected in the “Process a Cube” dialog box, as shown in
Figure 2.16 [9].
Table 2.2 Summary of cube process options
Options of Process Circumstances
Full process Modifying the structure of the cube
Incremental update Adding new data to the cube
Refresh data Clear out and replacing a cube’s source data
Figure 2.16 Screenshot of the “Process a cube” dialog box
22
2.4.2 Browsing a Cube
In the Analysis Manager, using Cube Wizard to view the cube data is one of viewing
methods [5- 9]. There are two ways to open the Cube Browser to load cube data into it:
a. Right-click the cube name in the Analysis Manager Tree pane and selecting
“Browse Data” from the pop-up menu
b. Click the “Browse Sample Data” in the last step of the Cube Wizard
The cube Browser not only let users to view the multidimensional data in a flattened
two-dimensional grid format, as shown in Figure 2.17, but also makes it possible to drill
up or drill down different dimensions of data. However, the Cube Browser can not be
used to view unprocessed cube data [6].
Figure 2.17 Screenshot of the “Cube Browser” and sample results
23
24
2.4.3 Building the Data Mining Models
Data Mining is the process of extracting knowledge hidden from large volumes of
data [10, 11]. It involves uncovering patterns, trends, and relationships from historical
data and predicting outcomes of future situations. The primary mechanism for data
mining is the data mining model, an abstract object that stores data mining information in
a series of schema rowsets. The mining model serves as the blueprint for how data
should be analyzed or processed. Once the model is processed, information associated
with the mining model not only represents what was learned from the data, but also
allows users to discover the business trends for future decision making [11]. Two data
mining algorithms are built into Microsoft SQL server 2000 Analysis Services: Microsoft
Decision Trees and Microsoft Clustering [12, 13]
A. Decision Trees Algorithm:
Microsoft Decision Trees algorithm uses the recursive partitioning to divide the data
in a tree structure, and continually performs this search for predictive factors until there is
no more data to continue with [10-13]. A node in the tree structure represents each
predictive factor used to classify the data. This method focuses on providing information
paths for rules and patterns within data, and is useful in predicting the exact outcomes for
the future problems [12, 13].
B. Microsoft Clustering Algorithm:
Microsoft Clustering algorithm is based on the Expectation and Maximization (EM)
algorithm [11, 12]. It uses iterative refinement techniques to group records into
neighborhoods (clusters) that exhibit similar, predictable characteristics [13]. These are
useful for uncovering a relationship among data items in a large database with hundreds
of evaluated attributes.
The following steps describe the process of creating a mining model using the
mining model wizard in the Analysis Manager [13]:
1. Specifying the type of data:
In the window of “select data source type”, as shown in Figure 2.18, users
can select either relational data type or OLAP data to build the target mining
model.
Figure 2.18 Screenshot of the “Select source type” dialog box
2. Selecting the source cube:
In the “select source cube” window, as shown in Figure 2.19, users need to
highlight the target cube from the available cube lists [11, 13].
25
Figure 2.19 Screenshot of “Select source cube” window
3. Specifying the data mining method;
In the “Select data mining technique” window, as shown in Figure 2.2,
users can select one of two mining algorithms provided with the Analysis
Services: Microsoft Decision Trees and Microsoft Clustering [9, 10].
Figure 2.20 Screenshot of the selecting mining model technique
26
4. Identifying the case base or unit of analysis
In the “Select case” window, as shown in Figure 2.21, users need to
specify the case base of the analysis for the modeling task. A case is the basic
unit of analysis for mining task.
Figure 2.21 Screenshot of the “Select case” dialog box for specifying a case of analysis
5. Selecting the predicted entity:
In this step users must provide information for prediction used in the
mining model [12], as shown in Figure 2.22. The predicted entity can be
chosen as one of the following items:
ü A measure of the source table ü A member property of the case dimension and level ü Members of another dimension in the cube.
This feature provides flexibility in the process of predictive analysis using
OLAP data.
27
Figure 2.22 Screenshot of “Select predicted entity” window
6. Selecting a training data:
The training data is used to process OLAP data mining model and to
define the column structure of a data mining for the case set. As shown in
Figure 2.23, the users should select at least one additional data item from the
data training data [12, 13].
28
Figure 2.23 Screenshot of the “Select training data” window
7. Naming the model and process the model:
After user enters a model name and selects the “Save and process now”
check box, as shown in Figure 2.24, the wizard will process the model and
train the model with data based on the specified algorithm. Figure 2.25
displays the process of model execution [13]. When the process is complete, a
message of “Processing completed successfully” appears in the bottom of
dialog box.
29
Figure 2.24 Screenshot of the “Saving the data model” of the Mining Model Wizard
Figure 2.25 Screenshot of the “Model execution diagnostics” window
30
After clicking the “close” button, the OLAP Mining Model Editor will be launched
and system displays the content details of the proposed mining model, as shown in Figure
2.26.
Figure 2.26 Screenshot of the content details of a created mining model
31