
PRACTICAL: 1

AIM: Study of Weka Tool

Weka: It is a collection of machine learning algorithms for data mining tasks. The algorithms can

either be applied directly to a dataset or called from our own Java code. Weka contains tools for data

pre-processing, classification, regression, clustering, association rules, and visualization. It is also

well-suited for developing new machine learning schemes.

Fig 1.1: Weka tool

A) The Explorer:

An environment for exploring data with Weka. The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software. Each of the major Weka packages (Filters, Classifiers, Clusterers, Associations, and Attribute Selection) is represented in the Explorer, along with a Visualization tool that allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions.

Status Box

The status box appears at the very bottom of the window. It displays messages that keep us informed

about what’s going on.

Log Button

Clicking on this button brings up a separate window containing a scrollable text field. Each line of

text is stamped with the time it was entered into the log. As we perform actions in WEKA, the log

keeps a record of what has happened.

Graphical output

Most graphical displays in WEKA, e.g., the GraphVisualizer or the TreeVisualizer, support saving

the output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click.

Supported formats are currently Windows Bitmap, JPEG, PNG and EPS (Encapsulated PostScript).

The dialog also allows us to specify the dimensions of the generated image.


Section Tabs

1. Preprocess

The preprocess panel is the starting point for knowledge exploration. From this panel we can load datasets in a variety of formats: WEKA’s ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension. We can also browse the characteristics of attributes and apply any combination of Weka’s unsupervised or supervised filters to the data.

Fig 1.2: Weka preprocess tab
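Because Weka’s algorithms can also be called from our own Java code, the same pre-processing can be scripted. Below is a minimal sketch, assuming the Weka library is on the classpath; the file name data.arff and the choice of the Remove filter are only illustrative:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset; the loader is chosen from the file extension
        Instances data = DataSource.read("data.arff");

        // Apply an unsupervised filter: remove the first attribute
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);

        System.out.println(filtered.toSummaryString());
    }
}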

2. Classify

The classifier panel allows us to configure and execute any of the Weka classifiers on the current dataset. Various test options are also available, such as use training set, supplied test set, cross-validation, and percentage split.


Fig 1.3: Weka Classify tab

3. Cluster

From the cluster panel we can perform clustering on the current dataset. Clusters can be visualized in the clusterer output window. From the “Choose” button we can select the type of clustering that we want to perform on the dataset, for example HierarchicalClusterer, MakeDensityBasedClusterer, EM, or FarthestFirst. From the available cluster modes we can select one according to our requirement; the modes available are use training set, supplied test set, percentage split, and classes to clusters evaluation. Another option allows selecting the attributes that we want to ignore. By clicking on Start, we can visualize the output in the clusterer output window.
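The same clustering can be driven from Java code. A small sketch, assuming the Weka library is on the classpath (the file name and the choice of EM are illustrative):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");

        // Build the EM clusterer on the loaded data
        EM em = new EM();
        em.buildClusterer(data);

        // Evaluate and print a summary similar to the clusterer output window
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}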

4. Associate Panel

From the associate panel we can mine the current dataset for association rules using the Weka associators.

5. Select Attributes Panel

This panel allows us to apply any combination of Weka attribute evaluators and search methods to select the most pertinent attributes in the dataset. Results can be visualized by right-clicking in the result list window.
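Attribute selection can likewise be performed from Java code. A minimal sketch, assuming the CfsSubsetEval evaluator with BestFirst search (one combination among many; the file name is illustrative):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new CfsSubsetEval());
        selection.setSearch(new BestFirst());
        selection.SelectAttributes(data);

        // Indices of the attributes judged most pertinent
        int[] selected = selection.selectedAttributes();
        System.out.println(java.util.Arrays.toString(selected));
    }
}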


6. Visualize Panel

This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and

the size of the points they display can be adjusted using the slider controls at the bottom of the

panel. The number of cells in the matrix can be changed by pressing the "Select Attributes" button

and then choosing those attributes to be displayed. When a dataset is large, plotting performance can

be improved by displaying only a subsample of the current dataset. Clicking on a cell in the matrix

pops up a larger plot panel window that displays the view from that cell. This panel allows us to

visualize the current dataset in one and two dimensions. When the coloring attribute is discrete, each value is displayed as a different color; when the coloring attribute is continuous, a spectrum is

used to indicate the value. Attribute "bars" (down the right hand side of the panel) provide a

convenient summary of the discriminating power of the attributes individually. This panel can also

be popped up in a separate window from the classifier panel and the cluster panel to allow us to visualize predictions made by classifiers/clusterers. When the class is discrete, misclassified points are

shown by a box in the color corresponding to the class predicted by the classifier; when the class is

continuous, the size of each plotted point varies in proportion to the magnitude of the error made by

the classifier.

B) The Experimenter:

This interface is designed to facilitate experimental comparison of the predictive performance of

algorithms based on the many different evaluation criteria that are available in WEKA. Experiments

can involve multiple algorithms that are run across multiple datasets; for example, using repeated

cross-validation. Experiments can also be distributed across different compute nodes in a network to

reduce the computational load for individual nodes. Once an experiment has been set up, it can be

saved in either XML or binary form, so that it can be re-visited if necessary. Configured and saved

experiments can also be run from the command-line. Compared to WEKA’s other user interfaces,

the Experimenter is perhaps used less frequently by data mining practitioners. However, once

preliminary experimentation has been performed in the Explorer, it is often much easier to identify a

suitable algorithm for a particular dataset, or collection of datasets, using this alternative interface.

The Experimenter environment is designed to allow the user to create, run, modify and analyze

experiments. It is a more convenient way to compare several learning schemes and compare the

results to determine if a particular scheme is better than the others. The comparisons can be based on

different measures such as accuracy, speed and so on. As such, the Experimenter environment

makes it easy to work with several data sets and several learning schemes at once.

A major limitation of the Experimenter environment is that we cannot edit the data set by using filters. Nor is it possible to choose a different class variable or to ignore any attributes. This environment assumes that the data to be analyzed has already been cleaned and filtered, that all these operations have already been performed on the data set(s), and that the only thing left is to test classification schemes on it.

Advantages:

The main advantage of using the Experimenter interface is that it allows the user to compare multiple learning schemes across one or more datasets. In simple mode, a user can set up simple experiments very quickly and easily. The built-in analyzer is also easy to use and allows the user to perform various kinds of tests easily. Controlling iteration is also possible here.


Though we did not look at the advanced mode, it gives the user the ability to run experiments that

are more complex and even to break the experiment across multiple computers. In this mode, the

user has a finer control of how each run of the experiment will be conducted.

Disadvantages:

The Experimenter’s main weakness is the lack of preprocessing facilities, as none of Weka’s filters can be accessed in this interface. Another disadvantage is the lack of any graphical visualization tools. It would have been nice, for example, to display multiple ROC curves on the same graph. Another problem is that when more than one instance of the same learning scheme is tested with different options, it is not easy to tell which result set belongs to which learning scheme. It would be nice if a user were able to label a given scheme, or had some other way of easily identifying a particular result set, as the names given automatically by the system can be a bit confusing. One more disadvantage is that it is not possible to select the class variable, as the Experimenter will always take the last attribute in the data set to be the class variable.

C) The Knowledge Flow

It offers a compelling alternative to the other Weka graphical interfaces and some capabilities not

available in the Explorer and Experimenter interface.

The idea behind the Knowledge Flow interface is the “data-flow”. The user is presented with a

layout canvas where they will place components. These components are then connected together to

form a “knowledge flow” that will determine how the data will be processed, analyzed and reported.

All of Weka’s clustering and classification algorithms, along with some extra tools, are also available in the Knowledge Flow.

A major benefit of the Knowledge Flow interface is that it can handle data either incrementally or in batches, while the Explorer can only handle data in batches. This is of practical importance when handling extremely large and/or unbounded data. For this to be useful, Weka also provides classifiers that can handle data incrementally and be updated on an instance-by-instance basis.
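To illustrate, Weka’s updateable classifiers (for example, NaiveBayesUpdateable) can be trained one instance at a time. A sketch, assuming an ARFF file that is read incrementally (the file name is illustrative):

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalDemo {
    public static void main(String[] args) throws Exception {
        // Read only the header, then stream the instances one by one
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("data.arff"));
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure); // initialize from the header only
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current); // update on each new instance
        }
        System.out.println(nb);
    }
}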

Advantages:

The Knowledge Flow interface presents an alternative to the Experimenter. It allows a user to construct complex experiments visually, and this can make the process easier. Connections are labeled, making it easier to follow the experiment and understand what is going on. As well, the interface is designed to prevent the creation of invalid connections, as components can only be attached to components that can accept connections from them.

A major advantage of this interface that we did not look at is that it opens up a whole class of updateable learning schemes. Updateable learning schemes are able to take in data incrementally and continuously improve the model as the data streams in. This ability makes it possible to take advantage of extremely large or even unbounded data. This feature is useful in information retrieval applications and in areas such as spam detection in e-mail. In these cases, the data is often unlimited and constantly changing, and an updateable model can always take advantage of new data as it comes in.


Disadvantages:

The main disadvantage of the Knowledge Flow interface is that it is newer and some of Weka’s functionality is not available or not fully implemented. Unlike the Experimenter, this interface has the ability to compare the different models graphically, as we did in our example with the two ROC curves.

Unfortunately, there is no way to compare the two models numerically to determine whether there are statistically significant differences between them, as this functionality has not yet been implemented in this interface.

User Interfaces

Aside from the above-mentioned exposure of capabilities and technical metadata, there has been further refinement and improvement to the GUIs in WEKA since version 3.4. The GUI

Chooser—WEKA’s graphical start point—has undergone a redesign and now provides access to

various supporting user interfaces, system information and logging information, as well as the main

applications in WEKA. The “Tools” menu provides two new supporting GUIs:

• SQL viewer: allows user-entered SQL to be run against a database and the results previewed. This

user interface is also used in the Explorer to extract data from a database when the “Open DB”

button is pressed.

• Bayes network editor: provides a graphical environment for constructing, editing and visualizing

Bayesian network classifiers.


PRACTICAL: 2

AIM: To implement Apriori Algorithm using Weka Tool

Software Required: WEKA

Steps to implement Apriori Algorithm using Weka Tool:

1) Invoke Explorer:

We can implement the Apriori algorithm using the “Explorer” or the “Simple CLI” options given in Weka. Simple CLI is a command-line interface for Weka, while Explorer provides a GUI-based facility for the same.

Fig 2.1: Weka Tool

To invoke Explorer, as shown in fig. 2.1, click the button “Explorer” under the Applications pane. Explorer is an environment for exploring data.

2) Pre-process:

At the very top of the window, just below the title bar, there is a row of tabs. Only the first tab, ‘Preprocess’, is active at the moment because there is no dataset open. The first three buttons at the top of the preprocess section enable us to load data into WEKA. Data can be imported from a file in various formats: ARFF, CSV, C4.5, or binary; it can also be read from a URL or from an SQL database (using JDBC). The easiest and most common way of getting data into WEKA is to store it as Attribute-Relation File Format (ARFF) files. Here, we can use the predefined datasets that come along with Weka.

Click on the ‘Open file…’ button. It brings up a dialog box allowing us to browse for the data file on the local file system; choose the “contact-lenses.arff” file.


Fig 2.2: Select Dataset ‘contact-lenses’

3) Associate:

Fig 2.3: Associate tab of weka explorer


Click the ‘Associate’ tab at the top of the ‘WEKA Explorer’ window. It brings up the interface for the Apriori algorithm. Select the “Apriori” associator.

As shown in fig 2.3, click on the ‘Associator’ box; the ‘GenericObjectEditor’ appears on our screen. In the dialog box, change the value in ‘minMetric’ to 0.8 for confidence = 80%. Make sure that the default value of rules is set to 100. The upper bound for minimum support, ‘upperBoundMinSupport’, should be set to 1.0 (100%) and ‘lowerBoundMinSupport’ to 0.1. Apriori in WEKA starts with the upper bound support and incrementally decreases support (by delta increments, which by default is set to 0.05 or 5%). The algorithm halts when either the specified number of rules is generated or the lower bound for minimum support is reached. The ‘significanceLevel’ testing option is only applicable in the case of confidence and is -1.0 by default (not used).

Fig 2.4: Setting options for Apriori Algorithm

Options:

car -- If enabled, class association rules are mined instead of (general) association rules.

classIndex -- Index of the class attribute. If set to -1, the last attribute is taken as the class attribute.

delta -- Iteratively decrease support by this factor. Reduces support until min support is reached or

required number of rules has been generated.

lowerBoundMinSupport -- Lower bound for minimum support.

metricType -- Set the type of metric by which to rank rules. Confidence is the proportion of the examples covered by the premise that are also covered by the consequence. Lift is confidence divided by the proportion of all examples that are covered by the consequence; this is a measure of the importance of the association that is independent of support. Leverage is the proportion of additional examples covered by both the premise and consequence above those expected if the premise and consequence were independent of each other. The total number of examples that this represents is presented in brackets following the leverage.

minMetric -- Minimum metric score. Consider only rules with scores higher than this value.

numRules -- Number of rules to find.

outputItemSets -- If enabled, the itemsets are output as well.

removeAllMissingCols -- Remove columns with all missing values.

significanceLevel -- Significance level. Significance test (confidence metric only).

upperBoundMinSupport -- Upper bound for minimum support. Start iteratively decreasing minimum

support from this value.

verbose -- If enabled, the algorithm will be run in verbose mode.

Once the options have been specified, we can run the Apriori algorithm by clicking on the ‘Start’ button.
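Equivalently, since Weka’s algorithms can be called from our own Java code, the same configuration can be scripted. The following is a minimal sketch, assuming the Weka library is on the classpath and contact-lenses.arff is in the working directory:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff");

        Apriori apriori = new Apriori();
        apriori.setMinMetric(0.8);            // minimum confidence = 80%
        apriori.setNumRules(100);             // number of rules to find
        apriori.setUpperBoundMinSupport(1.0); // start at 100% support
        apriori.setLowerBoundMinSupport(0.1); // stop at 10% support
        apriori.setDelta(0.05);               // decrease support by 5% per iteration

        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the itemsets and rules
    }
}

Printing the built associator reproduces the same run information that the Explorer shows in the associator output window.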

Fig 2.5: Execution of Apriori using Weka


Analysing result:

Run Information gives us the following information:

• The association learning scheme we used – Apriori

• The relation name – “contact-lenses”

• Number of instances in the relation – 24

• Number of attributes in the relation – 5

Fig 2.6: Output of Apriori using Weka


The results for Apriori algorithm are:

First, the program generated the sets of large itemsets found for each support size considered. In this case, 11 itemsets were found to have the required minimum support.

By default, Apriori tries to generate ten rules. It begins with a minimum support of 100% of the data instances and decreases this in steps of 5% until either there are at least ten rules with the required minimum confidence, or the support has reached the lower bound of 10%, whichever occurs first. The minimum confidence is set to 0.8 (80%). Generation of the required number of rules involved a total of 16 iterations. The last part gives the association rules that are found. The number preceding the ‘==>’ symbol indicates the rule’s support, that is, the number of instances covered by its premise. Following the rule is the number of those instances for which the rule’s consequent holds as well. In parentheses is the confidence of the rule.


PRACTICAL: 3

AIM: Implementation of Decision Tree Classifier in Weka Tool

Software Required: WEKA

Steps to implement Decision Tree Classifier using Weka Tool:

1) Invoke Explorer:

We can implement the Decision Tree classifier using the “Explorer” or the “Simple CLI” options given in Weka. Simple CLI is a command-line interface for Weka, while Explorer provides a GUI-based facility for the same.

Fig 3.1: Weka Tool

To invoke Explorer, as shown in fig. 3.1, click the button “Explorer” under the Applications pane. Explorer is an environment for exploring data.

Pre-process:

At the very top of the window, just below the title bar, there is a row of tabs. Only the first tab, ‘Preprocess’, is active at the moment because there is no dataset open. The first three buttons at the top of the preprocess section enable us to load data into WEKA. Data can be imported from a file in various formats: ARFF, CSV, C4.5, or binary; it can also be read from a URL or from an SQL database (using JDBC). The easiest and most common way of getting data into WEKA is to store it as Attribute-Relation File Format (ARFF) files. Here, we can use the predefined datasets that come along with Weka.

Click on the ‘Open file…’ button. It brings up a dialog box allowing us to browse for the data file on the local file system; choose the “weather.nominal.arff” file.


Fig 3.2: Select Dataset ‘weather.nominal’

Classify:

Fig 3.3: Choosing Classifier ‘ J48’


Once we have our data set loaded, all the tabs are available to us. Click on the ‘Classify’ tab. The

‘Classify’ window comes up on the screen.

Choosing a Classifier

Click on the ‘Choose’ button in the ‘Classifier’ box just below the tabs and select the J48 classifier: WEKA → Classifiers → Trees → J48.

Options:

Fig 3.4: Setting options for ‘J48’ Classifier

binarySplits -- Whether to use binary splits on nominal attributes when building the trees.

confidenceFactor -- The confidence factor used for pruning (smaller values incur more pruning).

debug -- If set to true, classifier may output additional info to the console.

minNumObj -- The minimum number of instances per leaf.

numFolds -- Determines the amount of data used for reduced-error pruning. One fold is used for

pruning, the rest for growing the tree.

reducedErrorPruning -- Whether reduced-error pruning is used instead of C4.5 pruning.

saveInstanceData -- Whether to save the training data for visualization.

seed -- The seed used for randomizing the data when reduced-error pruning is used.


subtreeRaising -- Whether to consider the subtree raising operation when pruning.

unpruned -- Whether pruning is performed.

useLaplace -- Whether counts at leaves are smoothed based on Laplace.

Setting Test Options

Before we run the classification algorithm, we need to set the test options in the ‘Test options’ box. The test options available to us are:

1. Use training set. Evaluates the classifier on how well it predicts the class of the instances it was

trained on.

2. Supplied test set. Evaluates the classifier on how well it predicts the class of a set of instances

loaded from a file. Clicking on the ‘Set…’ button brings up a dialog allowing us to choose the file to

test on.

3. Cross-validation. Evaluates the classifier by cross-validation, using the number of folds that are

entered in the ‘Folds’ text field.

4. Percentage split. Evaluates the classifier on how well it predicts a certain percentage of the data,

which is held out for testing. The amount of data held out depends on the value entered in the ‘%’

field.

Fig 3.5: Setting Evaluation options for the test


In the ‘Classifier evaluation options’ make sure that the following options are checked:

1. Output model. The output is the classification model on the full training set, so that it can be

viewed, visualized, etc.

2. Output per-class stats. The precision/recall and true/false statistics for each class are output.

3. Output confusion matrix. The confusion matrix of the classifier’s predictions is included in the

output.

4. Store predictions for visualization. The classifier’s predictions are remembered so that they can

be visualized.

5. Set ‘Random seed for Xval / % Split’ to 1. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

Once the options have been specified, we can run the classification algorithm. Click on the ‘Start’ button to start the learning process. We can stop the learning process at any time by clicking on the ‘Stop’ button.
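For reference, the same experiment can be reproduced from our own Java code. A sketch, assuming the Weka library is on the classpath and weather.nominal.arff is available locally:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // 'play' is the class

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // confidence factor for pruning
        tree.setMinNumObj(2);            // minimum instances per leaf

        // 14-fold cross-validation with random seed 1,
        // matching the test options chosen above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 14, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());

        // Build the model on the full training set, as the Explorer reports it
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}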

When training is complete, the ‘Classifier output’ area on the right panel of the ‘Classify’ window is filled with text describing the results of training and testing. A new entry appears in the ‘Result list’ box on the left panel of the ‘Classify’ window.

Fig 3.6: Executing Classification


Analyzing Results:

Run Information gives us the following information:

• The algorithm used – J48

• The relation name – “weather.nominal”

• Number of instances in the relation – 14

• Number of attributes in the relation – 5 and the list of the attributes: outlook, temperature,

humidity, windy, play

• The test mode selected: Cross-validation, No. of Folds = 14

Classifier model is a pruned decision tree in textual form that was produced on the full training data.

As we can see, the first split is on the ‘outlook’ attribute; at the second level, the splits are on ‘humidity’ and ‘windy’. In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf. Below the tree structure, the number of leaves (which is 5) and the number of nodes in the tree, i.e. the size of the tree (which is 8), are reported.

Fig 3.7: Output of Classification

Evaluation on Cross-validation. This part of the output gives estimates of the tree’s predictive performance, generated by WEKA’s evaluation module. It outputs a list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode. Under cross-validation, these measurements are derived from the held-out folds rather than from the training data.


In this case, only 50% of the 14 instances have been classified correctly under cross-validation. This indicates that results obtained on the training data alone would be optimistic compared with what might be obtained on an independent test set from the same source. In addition to the classification error, the evaluation output includes measurements derived from the class probabilities assigned by the tree. More specifically, it outputs the root mean squared error (0.5717) of the probability estimates, which is the square root of the quadratic loss, and the mean absolute error (0.3988), calculated in a similar way using the absolute instead of the squared difference. The reason that the errors are not 1 or 0 is that not all training instances are classified correctly.

Detailed Accuracy by Class demonstrates a more detailed per-class breakdown of the classifier’s prediction accuracy. From the confusion matrix we can see that two instances of class ‘yes’ have been assigned to class ‘no’ and five instances of class ‘no’ have been assigned to class ‘yes’.

Visualization of Results:

WEKA lets us see a graphical representation of the classification tree. Right-click on the entry in the ‘Result list’ for which we would like to visualize the tree.

Fig 3.8: Visualization of result – Tree view


PRACTICAL: 4

AIM: Overview of SQL Server 2005 Analysis Services

Software Required: Analysis services- SQL Server-2005.

Knowledge Required: Data Mining Concepts

Theory/Logic:

Data Mining

• The act of excavating data from which patterns can be extracted

• Alternative name: Knowledge Discovery in Databases (KDD)

• Draws on multiple disciplines: databases, statistics, artificial intelligence

• Rapidly maturing technology

• Unlimited applicability

Figure 4.1: Data mining process

Data Mining Tasks - Summary

• Classification

• Regression

• Segmentation

• Association Analysis

• Anomaly Detection

• Sequence Analysis

• Time-series Analysis

• Text Categorization

• Advanced Insights Discovery

• Others


The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive

solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted

mailing analysis.

The most visible components in SQL Server 2005 are the workspaces that we use to create and work

with data mining models. The online analytical processing (OLAP) and data mining tools are

consolidated into two working environments: Business Intelligence Development Studio and SQL

Server Management Studio. Using Business Intelligence Development Studio, we can develop an

Analysis Services project disconnected from the server. When the project is ready, we can deploy it

to the server. We can also work directly against the server. The main function of SQL Server

Management Studio is to manage the server. Each environment is described in more detail later in

this introduction.

All of the data mining tools exist in the data mining editor. Using the editor we can manage mining

models, create new models, view models, compare models, and create predictions based on existing

models.

After we build a mining model, we will want to explore it, looking for interesting patterns and rules.

Each mining model viewer in the editor is customized to explore models built with a specific

algorithm.

Often the project will contain several mining models, so before we can use a model to create predictions, we need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool, we can compare the predictive accuracy of our models and determine the best model.

To create predictions, we will use the Data Mining Extensions (DMX) language. DMX extends

SQL, containing commands to create, modify, and predict against mining models. Because creating

a prediction can be complicated, the data mining editor contains a tool called Prediction Query

Builder, which allows us to build queries using a graphical interface. We can also view the DMX

code that is generated by the query builder.

The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the

data that we pass it, and it translates them into a mining model — it is the engine behind the process.

SQL Server 2005 includes nine algorithms:

1. Microsoft Decision Trees

2. Microsoft Clustering

3. Microsoft Naïve Bayes

4. Microsoft Sequence Clustering

5. Microsoft Time Series

6. Microsoft Association

7. Microsoft Neural Network

8. Microsoft Linear Regression


9. Microsoft Logistic Regression

Using a combination of these nine algorithms, we can create solutions to common business

problems. Some of the most important steps in creating a data mining solution are consolidating,

cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes

the Data Transformation Services (DTS) working environment, which contains tools that we can use

to clean, validate, and prepare our data. The audience for this tutorial is business analysts,

developers, and database administrators who have used data mining tools before and are familiar

with data mining concepts.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an integrated development environment (IDE) in which we can create a complete solution, we can work disconnected from the server. We can change our data mining objects as much as we want, but the changes are not reflected on the server until after we deploy the project.

Working in an IDE is beneficial for the following reasons:

• We have powerful customization tools available to configure Business Intelligence Development

Studio to suit our needs.

• We can integrate our Analysis Services project with a variety of other business intelligence projects, encapsulating our entire solution into a single view.

• Full source control integration enables our entire team to collaborate in creating a complete

business intelligence solution.

The Analysis Services project is the entry point for a business intelligence solution. An Analysis

Services project encapsulates mining models and OLAP cubes, along with supplemental objects that

make up the Analysis Services database. From Business Intelligence Development Studio, we can

create and edit Analysis Services objects within a project and deploy the project to the appropriate

Analysis Services server or servers.

Working with Data Mining

Data mining gives us access to the information that we need to make intelligent decisions about

difficult business problems. Microsoft SQL Server 2005 Analysis Services (SSAS) provides tools

for data mining with which we can identify rules and patterns in our data, so that we can determine

why things happen and predict what will happen in the future. When we create a data mining

solution in Analysis Services, we first create a model that describes our business problem, and then

we run our data through an algorithm that generates a mathematical model of the data, a process that

is known as training the model. We can then either visually explore the mining model or create

prediction queries against it. Analysis Services can use datasets from both relational and OLAP

databases, and includes a variety of algorithms that we can use to investigate that data.

SQL Server 2005 provides different environments and tools that we can use for data mining. The

following sections outline a typical process for creating a data mining solution, and identify the

resources to use for each step.


Creating an Analysis Services Project

To create a data mining solution, we must first create a new Analysis Services project, and then add

and configure a data source and a data source view for the project. The data source defines the

connection string and authentication information with which to connect to the data source on which

to base the mining model. The data source view provides an abstraction of the data source, which we

can use to modify the structure of the data to make it more relevant to our project.

Adding Mining Structures to an Analysis Services Project

After we have created an Analysis Services project, we can add mining structures, and one or more

mining models that are based on each structure. A mining structure, including tables and columns, is

derived from an existing data source view or OLAP cube in the project. Adding a new mining

structure starts the Data Mining Wizard, which we use to define the structure and to specify an

algorithm and training data for use in creating an initial model based on that structure.

We can use the Mining Structure tab of Data Mining Designer to modify existing mining

structures, including adding columns and nested tables.

Working with Data Mining Models

Before we can use the mining models we define, we must process them so that Analysis Services

can pass the training data through the algorithms to fill the models. Analysis Services provides

several options for processing mining model objects, including the ability to control which objects

are processed and how they are processed.

After we have processed the models, we can investigate the results and make decisions about which

models perform the best. Analysis Services provides viewers for each mining model type, within the

Mining Model Viewer tab in Data Mining Designer, which we can use to explore the mining

models. Analysis Services also provides tools,

in the Mining Accuracy Chart tab of the designer, that we can use to directly compare mining

models and to choose the mining model that works best for our purpose. These tools include a lift

chart, a profit chart, and a classification matrix.

Creating Predictions

The main goal of most data mining projects is to use a mining model to create predictions. After we

explore and compare mining models, we can use one of several tools to create predictions. Analysis

Services provides a query language called Data Mining Extensions (DMX) that is the basis for

creating predictions. To help us build DMX prediction queries, SQL Server provides a query

builder, available in SQL Server Management Studio and Business Intelligence Development

Studio, and DMX templates for the query editor in Management Studio. Within BI Development

Studio, we access the query builder from the Mining Model Prediction tab of Data Mining

Designer.

SQL Server Management Studio

After we have used BI Development Studio to build mining models for our data mining project, we

can manage and work with the models and create predictions in Management Studio.


SQL Server Reporting Services

After we create a mining model, we may want to distribute the results to a wider audience. We can

use Report Designer in Microsoft SQL Server 2005 Reporting Services (SSRS) to create reports,

which we can use to present the information that a mining model contains. We can use the result of

any DMX query as the basis of a report, and can take advantage of the parameterization and

formatting features that are available in Reporting Services.

Working Programmatically with Data Mining

Analysis Services provides several tools that we can use to programmatically work with data

mining. The Data Mining Extensions (DMX) language provides statements that we can use to

create, train, and use data mining models. We can also perform these tasks by using a combination

of XML for Analysis (XMLA) and Analysis Services Scripting Language (ASSL), or by using

Analysis Management Objects (AMO).

We can access all the metadata that is associated with data mining by using data mining schema row

sets. For example, we can use schema row sets to determine the data types that an algorithm

supports, or the model names that exist in a database.

Data Mining Concepts

Data mining is frequently described as "the process of extracting valid, authentic, and actionable

information from large databases." In other words, data mining derives patterns and trends that exist

in data. These patterns and trends can be collected together and defined as a mining model. Mining

models can be applied to specific business scenarios, such as:

• Forecasting sales.

• Targeting mailings toward specific customers.

• Determining which products are likely to be sold together.

• Finding sequences in the order that customers add products to a shopping cart.

An important concept is that building a mining model is part of a larger process that includes

everything from defining the basic problem that the model will solve, to deploying the model into a

working environment. This process can be defined by using the following six basic steps:

1. Defining the Problem

2. Preparing Data

3. Exploring Data

4. Building Models

5. Exploring and Validating Models

6. Deploying and Updating Models


The following diagram describes the relationships between each step in the process, and the

technologies in Microsoft SQL Server 2005 that we can use to complete each step.

Figure 4.2: Steps for building a Mining model

Although the process that is illustrated in the diagram is circular, each step does not necessarily lead

directly to the next step. Creating a data mining model is a dynamic and iterative process. After we

explore the data, we may find that the data is insufficient to create the appropriate mining models,

and that we therefore have to look for more data. We may build several models and realize that they

do not answer the problem posed when we defined the problem, and that we therefore must redefine

the problem. We may have to update the models after they have been deployed because more data

has become available. It is therefore important to understand that creating a data mining model is a

process, and that each step in the process may be repeated as many times as needed to create a good

model.

SQL Server 2005 provides an integrated environment for creating and working with data mining

models, called Business Intelligence Development Studio. The environment includes data mining

algorithms and tools that make it easy to build a comprehensive solution for a variety of projects.

Defining the Problem

The first step in the data mining process, as highlighted in the following diagram, is to clearly define

the business problem.

This step includes analyzing business requirements, defining the scope of the problem, defining the

metrics by which the model will be evaluated, and defining the final objective for the data mining

project. These tasks translate into questions such as the following:


Figure 4.3: Step 1 - Defining the Problem

• What are we looking for?

• Which attribute of the dataset do we want to try to predict?

• What types of relationships are we trying to find?

• Do we want to make predictions from the data mining model or just look for interesting patterns

and associations?

• How is the data distributed?

• How are the columns related, or if there are multiple tables, how are the tables related?

To answer these questions, we may have to conduct a data availability study, to investigate the

needs of the business users with regard to the available data. If the data does not support the needs

of the users, we may have to redefine the project.

Preparing Data

The second step in the data mining process, as highlighted in the following diagram, is to

consolidate and clean the data that was identified in the Defining the Problem step.

Microsoft SQL Server 2005 Integration Services (SSIS) contains all the tools that we need to

complete this step, including transforms to automate data cleaning and consolidation.

Data can be scattered across a company and stored in different formats, or may contain

inconsistencies such as flawed or missing entries. For example, the data might show that a customer

bought a product before that customer was actually even born, or that the customer shops regularly

at a store located 2,000 miles from her home. Before we start to build models, we must fix these

problems. Typically, we are working with a very large dataset and cannot look through every

transaction. Therefore, we have to use some form of automation, such as in Integration Services, to

explore the data and find the inconsistencies.


Figure 4.4: Step 2 – Preparing Data

Exploring Data

The third step in the data mining process, as highlighted in the following diagram, is to explore the

prepared data.

Figure 4.5: Step 3 – Exploring Data

We must understand the data in order to make appropriate decisions when we create the models.

Exploration techniques include calculating the minimum and maximum values, calculating mean

and standard deviations, and looking at the distribution of the data. After we explore the data, we

can decide if the dataset contains flawed data, and then we can devise a strategy for fixing the

problems.

Data Source View Designer in BI Development Studio contains several tools that we can use to

explore data.


Building Models

The fourth step in the data mining process, as highlighted in the following diagram, is to build the

mining models.

Figure 4.6: Step 4 – Building Models

Before we build a model, we must randomly separate the prepared data into separate training and

testing datasets. We use the training dataset to build the model, and the testing dataset to test the

accuracy of the model by creating prediction queries. We can use the Percentage Sampling

Transformation in Integration Services to split the dataset.

We will use the knowledge that we gain from the Exploring Data step to help define and create a

mining model. A model typically contains input columns, an identifying column, and a predictable

column. We can then define these columns in a new model by using the Data Mining Extensions

(DMX) language or the Data Mining Wizard in BI Development Studio.

After we define the structure of the mining model, we process it, populating the empty structure

with the patterns that describe the model. This is known as training the model. Patterns are found by

passing the original data through a mathematical algorithm. SQL Server 2005 contains a different

algorithm for each type of model that we can build. We can use parameters to adjust each algorithm.

A mining model is defined by a data mining structure object, a data mining model object, and a data

mining algorithm.

Microsoft SQL Server 2005 Analysis Services (SSAS) includes the following algorithms:

• Microsoft Decision Trees Algorithm

• Microsoft Clustering Algorithm

• Microsoft Naive Bayes Algorithm

• Microsoft Association Algorithm

• Microsoft Sequence Clustering Algorithm


• Microsoft Time Series Algorithm

• Microsoft Neural Network Algorithm (SSAS)

• Microsoft Logistic Regression Algorithm

• Microsoft Linear Regression Algorithm

Exploring and Validating Models

The fifth step in the data mining process, as highlighted in the following diagram, is to explore the

models that we have built and test their effectiveness.

Figure 4.7: Step 5 – Exploring and Validating Models

We do not want to deploy a model into a production environment without first testing how well the

model performs. Also, we may have created several models and will have to decide which model

will perform the best. If none of the models that we created in the Building Models step perform

well, we may have to return to a previous step in the process, either by redefining the problem or by

reinvestigating the data in the original dataset.

We can explore the trends and patterns that the algorithms discover by using the viewers in Data

Mining Designer in BI Development Studio. We can also test how well the models create

predictions by using tools in the designer such as the lift chart and classification matrix. These tools

require the testing data that we separated from the original dataset in the model-building step.

Deploying and Updating Models

The last step in the data mining process, as highlighted in the following diagram, is to deploy to a

production environment the models that performed the best.

After the mining models exist in a production environment, we can perform many tasks, depending

on our needs. Following are some of the tasks we can perform:

• Use the models to create predictions, which we can then use to make business decisions. SQL

Server provides the DMX language that we can use to create prediction queries, and Prediction

Query Builder to help us build the queries.


Figure 4.8: Step 6 – Deploying and Updating Models

• Embed data mining functionality directly into an application. We can include Analysis

Management Objects (AMO) or an assembly that contains a set of objects that our application can

use to create, alter, process, and delete mining structures and mining models. Alternatively, we can

send XML for Analysis (XMLA) messages directly to an instance of Analysis Services.

• Use Integration Services to create a package in which a mining model is used to intelligently

separate incoming data into multiple tables. For example, if a database is continually updated with

potential customers, we could use a mining model together with Integration Services to split the

incoming data into customers who are likely to purchase a product and customers who are likely to

not purchase a product.

• Create a report that lets users directly query against an existing mining model.

Updating the model is part of the deployment strategy. As more data comes into the organization, we must reprocess the models, thereby improving their effectiveness.


PRACTICAL: 5

AIM: Design and create a cube by identifying measures and dimensions for a Star Schema

Software Required: Analysis services- SQL Server-2005.

Knowledge Required: Data cube

Theory/Logic:

Creating a Data Cube

To build a new data cube using BIDS, we need to perform these steps:

• Create a new Analysis Services project

• Define a data source

• Define a data source view

• Invoke the Cube Wizard

We’ll look at each of these steps in turn.

Creating a New Analysis Services Project

To create a new Analysis Services project, we use the New Project dialog box in BIDS. This is very

similar to creating any other type of new project in Visual Studio.

To create a new Analysis Services project, follow these steps:

1. Select Microsoft SQL Server 2005 ⇒ SQL Server Business Intelligence Development Studio from the Programs menu to launch Business Intelligence Development Studio.

2. Select File ⇒ New Project.

3. In the New Project dialog box, select the Business Intelligence Projects project type.

Fig 5.1: Solution Explorer window

4. Select the Analysis Services Project template.


5. Name the new project AdventureWorksCube1 and select a convenient location to save it.

6. Click OK to create the new project.

Figure 5.1 shows the Solution Explorer window of the new project, ready to be populated with

objects.

Defining a Data Source

To define a data source, we’ll use the Data Source Wizard. We can launch this wizard by right-clicking on the Data Sources folder in our new Analysis Services project. The wizard will walk us through the process of defining a data source for our cube, including choosing a connection and specifying security credentials to be used to connect to the data source.

To define a data source for the new cube, follow these steps:

1. Right-click on the Data Sources folder in Solution Explorer and select New Data Source.

2. Read the first page of the Data Source Wizard and click Next.

Fig 5.2: Connection Manager Window

3. We can base a data source on a new or an existing connection. Because we don’t have any

existing connections, click New.


4. In the Connection Manager Dialog box, select the server containing our analysis services sample

database from the Server Name combo box.

5. Fill in our authentication information.

6. Select the Native OLE DB\SQL Native Client provider (this is the default provider).

7. Select the AdventureWorksDW database. Figure 5.2 shows the filled-in Connection Manager dialog box.

8. Click OK to dismiss the Connection Manager Dialog box.

9. Click Next.

10. Select Default impersonation information to use the credentials we just supplied for the

connection and click Next.

11. Accept the default data source name and click Finish.

Defining a Data Source View

A data source view is a persistent set of tables from a data source that supply the data for a particular cube. BIDS also includes a wizard for creating data source views, which we can invoke by right-clicking on the Data Source Views folder in Solution Explorer.

To create a new data source view, follow these steps:

1. Right-click on the Data Source Views folder in Solution Explorer and select New Data Source

View.

2. Read the first page of the Data Source View Wizard and click Next.

3. Select the Adventure Works DW data source and click Next. Note that we could also launch the Data Source Wizard from here by clicking New Data Source.

4. Select the dbo.FactFinance table in the Available Objects list and click the ⇒ button to move it to

the Included Object list. This will be the fact table in the new cube.

5. Click the Add Related Tables button to automatically add all of the tables that are directly related to the dbo.FactFinance table. These will be the dimension tables for the new cube. The wizard now shows all of the tables selected.

6. Click Next.

7. Name the new view Finance and click Finish. BIDS will automatically display the schema of the

new data source view, as shown in Figure 5.3.


Fig 5.3: Data Source View Wizard


Invoking the Cube Wizard

As we can probably guess at this point, we invoke the Cube Wizard by right-clicking on the Cubes folder in Solution Explorer. The Cube Wizard interactively explores the structure of our data source view to identify the dimensions, levels, and measures in our cube.

To create the new cube, follow these steps:

1. Right-click on the Cubes folder in Solution Explorer and select New Cube.

2. Read the first page of the Cube Wizard and click Next.

3. Select the option to build the cube using a data source.

4. Check the Auto Build checkbox.

5. Select the option to create attributes and hierarchies.

6. Click Next.

7. Select the Finance data source view and click Next.


Fig 5.4: Invoking the Cube Wizard

8. Wait for the Cube Wizard to analyze the data and then click Next.

9. The Wizard will get most of the analysis right, but we can fine-tune it a bit. Select DimTime in

the Time Dimension combo box. Uncheck the Fact checkbox on the line for the dbo.DimTime table.

This will allow us to analyze this dimension using standard time periods.

10. Click Next.

11. Accept the default measures and click Next.

12. Wait for the Cube Wizard to detect hierarchies and then click Next.

13. Accept the default dimension structure and click Next.

14. Name the new cube FinanceCube and click Finish.


Deploying and Processing a Cube

At this point, we’ve defined the structure of the new cube - but there’s still more work to be done.

We still need to deploy this structure to an Analysis Services server and then process the cube to

create the aggregates that make querying fast and easy.

To deploy the cube we just created, select Build ⇒ Deploy AdventureWorksCube1. This will deploy the cube to our local Analysis Server and also process the cube, building the aggregates for us. BIDS will open the Deployment Progress window, as shown in Figure 5.5, to keep us informed during deployment and processing.

Fig 5.5: Deployment Progress Window

Exploring a Data Cube

At last we’re ready to see what all the work was for. BIDS includes a built-in Cube Browser that lets us interactively explore the data in any cube that has been deployed and processed. To open the Cube Browser, right-click on the cube in Solution Explorer and select Browse.

The Cube Browser is a drag-and-drop environment. If we’ve worked with pivot tables in Microsoft Excel, we should have no trouble using the Cube Browser. The pane to the left includes all of the measures and dimensions in our cube, and the pane to the right gives us drop targets for these measures and dimensions. Among other operations, we can:


• Drop a measure in the Totals/Detail area to see the aggregated data for that measure.

• Drop a dimension or level in the Row Fields area to summarize by that level or dimension on rows.

• Drop a dimension or level in the Column Fields area to summarize by that level or dimension on columns.

• Drop a dimension or level in the Filter Fields area to enable filtering by members of that dimension

or level.

• Use the controls at the top of the report area to select additional filtering expressions.

To see the data in the cube we just created, follow these steps:

1. Right-click on the cube in Solution Explorer and select Browse.

2. Expand the Measures node in the metadata panel (the area at the left of the user interface).

3. Expand the Fact Finance node.

4. Drag the Amount measure and drop it on the Totals/Detail area.

5. Expand the Dim Account node in the metadata panel.

6. Drag the Account Description property and drop it on the Row Fields area.

7. Expand the Dim Time node in the metadata panel.

8. Drag the Calendar Year-Calendar Quarter-Month Number of Year hierarchy and drop it on

the Column Fields area.

9. Click the + sign next to year 2001 and then the + sign next to quarter 3.

10. Expand the Dim Scenario node in the metadata panel.

11. Drag the Scenario Name property and drop it on the Filter Fields area.

12. Click the dropdown arrow next to scenario name. Uncheck all of the checkboxes except for

the one next to the Budget name.

Figure 5.6 shows the result. The Cube Browser displays month-by-month budgets by account for the

third quarter of 2001. Although we could have written queries to extract this information from the

original source data, it’s much easier to let Analysis Services do the heavy lifting for us.


Fig 5.6: Viewing data in a Data Cube