130370702015 ADVANCED DATABASE
PIET/ME/CE Page 1
PRACTICAL: 1
AIM: Study of Weka Tool
Weka: It is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from our own Java code. Weka contains tools for data
pre-processing, classification, regression, clustering, association rules, and visualization. It is also
well-suited for developing new machine learning schemes.
Fig 1.1: Weka tool
A) The Explorer:
An environment for exploring data with Weka. The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software. Each of the major Weka packages (Filters, Classifiers, Clusterers, Associations, and Attribute Selection) is represented in the Explorer, along with a Visualization tool that allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions.
Status Box
The status box appears at the very bottom of the window. It displays messages that keep us informed
about what’s going on.
Log Button
Clicking on this button brings up a separate window containing a scrollable text field. Each line of
text is stamped with the time it was entered into the log. As we perform actions in WEKA, the log
keeps a record of what has happened.
Graphical output
Most graphical displays in WEKA, e.g., the GraphVisualizer or the TreeVisualizer, support saving
the output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click.
Supported formats are currently Windows Bitmap, JPEG, PNG and EPS (encapsulated Postscript).
The dialog also allows us to specify the dimensions of the generated image.
Section Tabs
1. Preprocess
The Preprocess panel is the starting point for knowledge exploration. From this panel we can load datasets in any of several formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension. We can also browse the characteristics of attributes and apply any combination of WEKA's unsupervised or supervised filters to the data.
Fig 1.2: Weka preprocess tab
2. Classify
The Classify panel allows us to configure and execute any of the WEKA classifiers on the current dataset. Various test options are also available: use training set, supplied test set, cross-validation, and percentage split.
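The percentage split option, for instance, simply holds out part of the data for testing. A minimal Python sketch of the idea (not Weka's implementation; the 66% default and seed value are illustrative):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle reproducibly, then hold out the tail of the data for testing."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = round(len(data) * train_pct / 100)
    return data[:cut], data[cut:]  # (training set, held-out test set)

train, test = percentage_split(range(14))
print(len(train), len(test))  # 9 training and 5 test instances
```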
Fig 1.3: Weka Classify tab
3. Cluster
From the Cluster panel we can perform clustering on the current dataset, and the clusters can be inspected in the clusterer output window. The "Choose" button selects the type of clusterer we want to run on the dataset, for example HierarchicalClusterer, MakeDensityBasedClusterer, EM, or FarthestFirst. From the available cluster modes we can select according to our requirement: use training set, supplied test set, percentage split, or classes to clusters evaluation. Another option allows us to select attributes to ignore. By clicking Start, we can view the output in the clusterer output window.
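The "classes to clusters evaluation" mode assigns each cluster the majority class among its members and reports how many instances disagree. A minimal sketch of that idea (illustrative Python, not Weka's code):

```python
from collections import Counter

def classes_to_clusters_error(cluster_ids, class_labels):
    """Map each cluster to its majority class; count instances that disagree."""
    members = {}
    for c, y in zip(cluster_ids, class_labels):
        members.setdefault(c, []).append(y)
    errors = 0
    for ys in members.values():
        majority_count = Counter(ys).most_common(1)[0][1]
        errors += len(ys) - majority_count
    return errors

# cluster 0 holds two 'yes' and one 'no': one instance disagrees with the majority
print(classes_to_clusters_error([0, 0, 0, 1, 1],
                                ["yes", "yes", "no", "no", "no"]))  # 1
```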
4. Associate Panel
From the associate panel we can mine the current dataset for association rules using the weka
associators.
5. Select Attributes Panel
This panel allows us to apply any combination of WEKA attribute evaluators and search methods to select the most pertinent attributes in the dataset. Results can be visualized by right-clicking in the result list window.
6. Visualize Panel
This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and
the size of the points they display can be adjusted using the slider controls at the bottom of the
panel. The number of cells in the matrix can be changed by pressing the "Select Attributes" button
and then choosing those attributes to be displayed. When a dataset is large, plotting performance can
be improved by displaying only a subsample of the current dataset. Clicking on a cell in the matrix
pops up a larger plot panel window that displays the view from that cell. This panel allows us to
visualize the current dataset in one and two dimensions. When the coloring attribute is discrete, each value is displayed as a different color; when the coloring attribute is continuous, a spectrum is used to indicate the value. Attribute "bars" (down the right-hand side of the panel) provide a
convenient summary of the discriminating power of the attributes individually. This panel can also
be popped up in a separate window from the classifier panel and the cluster panel to allow us to
visualize predictions made by classifiers/clusters. When the class is discrete, misclassified points are
shown by a box in the color corresponding to the class predicted by the classifier; when the class is
continuous, the size of each plotted point varies in proportion to the magnitude of the error made by
the classifier.
B) The Experimenter:
This interface is designed to facilitate experimental comparison of the predictive performance of
algorithms based on the many different evaluation criteria that are available in WEKA. Experiments
can involve multiple algorithms that are run across multiple datasets; for example, using repeated
cross-validation. Experiments can also be distributed across different compute nodes in a network to
reduce the computational load for individual nodes. Once an experiment has been set up, it can be
saved in either XML or binary form, so that it can be re-visited if necessary. Configured and saved
experiments can also be run from the command-line. Compared to WEKA’s other user interfaces,
the Experimenter is perhaps used less frequently by data mining practitioners. However, once
preliminary experimentation has been performed in the Explorer, it is often much easier to identify a
suitable algorithm for a particular dataset, or collection of datasets, using this alternative interface.
The Experimenter environment is designed to allow the user to create, run, modify, and analyze experiments. It provides a convenient way to run several learning schemes and compare the results to determine whether a particular scheme is better than the others. The comparisons can be based on different measures such as accuracy, speed, and so on. As such, the Experimenter environment makes it easy to work with several datasets and several learning schemes at once.
A major limitation of the Experimenter environment is that we cannot edit the dataset using filters, choose a different class variable, or ignore attributes. This environment assumes that the data to be analyzed has already been cleaned and filtered, that all such operations have already been performed on the dataset(s), and that the only thing left is to test classification schemes on it.
Advantages:
The main advantage of the Experimenter interface is that it allows the user to compare multiple learning schemes across one or more datasets. In simple mode, a user can set up simple experiments very quickly and easily. The built-in analyzer is also easy to use and allows the user to perform various kinds of tests. Controlling iteration is also possible here.
Though we did not look at the advanced mode, it gives the user the ability to run more complex experiments and even to distribute an experiment across multiple computers. In this mode, the user has finer control over how each run of the experiment is conducted.
Disadvantages:
The Experimenter's main weakness is its lack of preprocessing facilities, as none of Weka's filters can be accessed from this interface. Another disadvantage is the lack of any graphical visualization tools; it would have been nice, for example, to display multiple ROC curves on the same graph. A further problem is that when several copies of the same learning scheme are tested with different options, it is not easy to tell which result set belongs to which configuration. The ability to label a scheme, or some other way of easily identifying a particular result set, would help, as the names generated automatically by the system can be confusing. One more disadvantage is that it is not possible to select the class variable: the Experimenter always takes the last attribute in the dataset as the class.
C) The Knowledge Flow
It offers a compelling alternative to the other Weka graphical interfaces and provides some capabilities not available in the Explorer and Experimenter interfaces.
The idea behind the Knowledge Flow interface is "data flow". The user is presented with a layout canvas on which to place components; these components are then connected together to form a "knowledge flow" that determines how the data will be processed, analyzed, and reported. All of Weka's clustering and classification algorithms, along with some extra tools, are also available in the Knowledge Flow.
A major benefit of the Knowledge Flow interface is that it can handle data either incrementally or in
batches while the Explorer can only handle data in batches. This is of practical importance when
handling extremely large and/or unlimited data. For this to be useful, Weka also provides classifiers
that can handle data incrementally and be updated on an instance by instance basis.
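The incremental case can be pictured with a toy updateable scheme that keeps only running counts and refines its model one instance at a time (an illustrative Python sketch; the class name is made up and this is not a Weka API):

```python
from collections import Counter

class MajorityClassUpdateable:
    """Trivially updateable 'classifier': its model is just class counts,
    so each arriving instance can be folded in without retraining."""
    def __init__(self):
        self.counts = Counter()

    def update(self, instance, label):
        # called once per instance as the data streams in
        self.counts[label] += 1

    def predict(self, instance):
        return self.counts.most_common(1)[0][0]

model = MajorityClassUpdateable()
for label in ["spam", "ham", "spam"]:   # e.g. an email stream
    model.update(None, label)
print(model.predict(None))  # spam
```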
Advantages:
The Knowledge Flow interface presents an alternative to the Experimenter. It allows a user to construct complex experiments visually, which can make the process easier. Connections are labeled, making it easier to follow the experiment and understand what is going on. As well, the interface is designed to prevent the creation of invalid connections: a component can only be attached to components that can accept connections from it.
A major advantage of this interface that we did not look at is that it opens up a whole class of updateable learning schemes. Updateable learning schemes are able to take in data incrementally, continuously improving the model as the data streams in. This makes it possible to take advantage of extremely large or even unbounded data. The feature is useful in information retrieval applications and in areas such as email spam detection; in these cases, the data is often unlimited and constantly changing, and an updatable model can always take advantage of new data as it comes in.
Disadvantages:
The main disadvantage of the Knowledge Flow interface is that it is newer, and some of Weka's functionality is not yet available or not fully implemented. Unlike the Experimenter, this interface has the ability to compare different models graphically, as we did in our example with the two ROC curves. Unfortunately, there is no way to compare the two models numerically to determine whether there are statistically significant differences between them, as this functionality has not yet been implemented in this interface.
User Interfaces
Aside from the above-mentioned exposure of capabilities and technical metadata, there has been further refinement and improvement of the GUIs in WEKA since version 3.4. The GUI Chooser, WEKA's graphical start point, has undergone a redesign and now provides access to various supporting user interfaces, system information, and logging information, as well as the main applications in WEKA. The "Tools" menu provides two new supporting GUIs:
• SQL viewer: allows user-entered SQL to be run against a database and the results previewed. This
user interface is also used in the Explorer to extract data from a database when the “Open DB”
button is pressed.
• Bayes network editor: provides a graphical environment for constructing, editing and visualizing
Bayesian network classifiers.
PRACTICAL: 2
AIM: To implement Apriori Algorithm using Weka Tool
Software Required: WEKA
Steps to implement Apriori Algorithm using Weka Tool:
1) Invoke Explorer:
We can implement Apriori algorithm using “Explorer” or the “Simple CLI” options given in weka.
Simple CLI is a command line interface for weka, while Explorer provides GUI based facility for
the same.
Fig 2.1: Weka Tool
To invoke the Explorer, as shown in fig. 2.1, click the "Explorer" button under the Applications pane. The Explorer is an environment for exploring data.
2) Pre-process:
At the very top of the window, just below the title bar there is a row of tabs. Only the first tab,
‘Preprocess’ is active at the moment because there is no dataset open. The first three buttons at the
top of the preprocess section enable us to load data into WEKA. Data can be imported from a file in
various formats: ARFF, CSV, C4.5, binary; it can also be read from a URL or from an SQL database
(using JDBC). The easiest and the most common way of getting the data into WEKA is to store it as
Attribute-Relation File Format (ARFF) files. Here, we can use the predefined datasets that come
along with weka.
Click on ‘Open file…’ button.
It brings up a dialog box allowing us to browse for the data file on the local file system, choose
“contact-lenses.arff” file.
Fig 2.2: Select Dataset ‘contact-lenses’
3) Associate:
Fig 2.3: Associate tab of weka explorer
Click the ‘Associate’ tab at the top of the ‘WEKA Explorer’ window. It brings up the interface for association rule mining. Select the “Apriori” associator.
As shown in fig 2.3, click on the ‘Associator’ box, ‘GenericObjectEditor’ appears on our screen. In
the dialog box, change the value in ‘minMetric’ to 0.8 for confidence = 80%. Make sure that the
default value of rules is set to 100. The upper bound for minimum support ‘upperBoundMinSupport’
should be set to 1.0 (100%) and ‘lowerBoundMinSupport’ to 0.1. Apriori in WEKA starts with the
upper bound support and incrementally decreases support (by delta increments, which by default is
set to 0.05 or 5%). The algorithm halts when either the specified number of rules is generated, or the
lower bound for minimum support is reached. The ‘significanceLevel’ testing option is only
applicable in the case of confidence and is (-1.0) by default (not used).
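The support-descent behaviour just described can be sketched as a simple loop (illustrative Python; `mine_rules` stands in for a full Apriori pass at a fixed support level):

```python
def apriori_support_descent(mine_rules, num_rules=10, upper=1.0, lower=0.1, delta=0.05):
    """Mine at decreasing support levels, mirroring Weka's Apriori: halt once
    enough rules are found or the lower support bound is reached."""
    support = upper
    while True:
        rules = mine_rules(support)
        if len(rules) >= num_rules or support <= lower + 1e-9:
            return support, rules
        support = round(support - delta, 10)  # delta decrement, default 5%

# toy stand-in miner: pretend every 5% drop in support yields two more rules
toy = lambda s: ["rule"] * (2 * int(round((1.0 - s) / 0.05)))
support, rules = apriori_support_descent(toy)
print(support, len(rules))  # stops at support 0.75 with 10 rules
```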
Fig 2.4: Setting options for Apriori Algorithm
Options:
car -- If enabled class association rules are mined instead of (general) association rules.
classIndex -- Index of the class attribute. If set to -1, the last attribute is taken as class attribute.
delta -- Iteratively decrease support by this factor. Reduces support until min support is reached or
required number of rules has been generated.
lowerBoundMinSupport -- Lower bound for minimum support.
metricType -- Set the type of metric by which to rank rules. Confidence is the proportion of the
examples covered by the premise that are also covered by the consequence. Lift is confidence
divided by the proportion of all examples that are covered by the consequence. This is a measure of
the importance of the association that is independent of support. Leverage is the proportion of
additional examples covered by both the premise and consequence above those expected if the
premise and consequence were independent of each other. The total number of examples that this
represents is presented in brackets following the leverage.
minMetric -- Minimum metric score. Consider only rules with scores higher than this value.
numRules -- Number of rules to find.
outputItemSets -- If enabled the itemsets are output as well.
removeAllMissingCols -- Remove columns with all missing values.
significanceLevel -- Significance level. Significance test (confidence metric only).
upperBoundMinSupport -- Upper bound for minimum support. Start iteratively decreasing minimum
support from this value.
verbose -- If enabled the algorithm will be run in verbose mode.
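The ranking metrics described under `metricType` reduce to simple ratios of coverage counts. A sketch (illustrative Python with made-up counts, not Weka's code):

```python
def rule_metrics(n, premise, consequent, both):
    """n: total examples; premise/consequent/both: how many examples each covers."""
    confidence = both / premise                    # premise examples also covered by consequent
    lift = confidence / (consequent / n)           # confidence vs. consequent base rate
    leverage = both / n - (premise / n) * (consequent / n)  # above-independence coverage
    return confidence, lift, leverage

# made-up coverage counts over a 24-instance dataset (like contact-lenses)
conf, lift, lev = rule_metrics(n=24, premise=12, consequent=15, both=12)
print(conf, lift, lev)  # 1.0 1.6 0.1875
```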
Once the options have been specified, we can run Apriori algorithm by clicking on the ‘Start’
button.
Fig 2.5: Execution of Apriori using Weka
Analysing result:
Run Information gives us the following information:
• The scheme for learning association we used - Apriori
• The relation name – “contact-lenses”
• Number of instances in the relation – 24
• Number of attributes in the relation – 5
Fig 2.6: Output of Apriori using Weka
The results of the Apriori algorithm are as follows.
First, the program generated the sets of large itemsets found for each support size considered; in this case, 11 itemsets were found to have the required minimum support.
By default, Apriori tries to generate ten rules. It begins with a minimum support of 100% of the data items and decreases this in steps of 5% until there are at least ten rules with the required minimum confidence, or until the support reaches the lower bound of 10%, whichever occurs first. The minimum confidence is set to 0.8 (80%). Generating the required number of rules involved a total of 16 iterations. The last part lists the association rules that were found. The number preceding the ==> symbol indicates the rule's support, that is, the number of instances covered by its premise. Following the rule is the number of those instances for which the rule's consequent holds as well, and the confidence of the rule appears in parentheses.
PRACTICAL: 3
AIM: Implementation of Decision Tree Classifier in Weka Tool
Software Required: WEKA
Steps to implement Decision Tree Classifier using Weka Tool:
1) Invoke Explorer:
We can implement Decision Tree Classifier using “Explorer” or the “Simple CLI” options given in
weka. Simple CLI is a command line interface for weka, while Explorer provides GUI based facility
for the same.
Fig 3.1: Weka Tool
To invoke the Explorer, as shown in fig. 3.1, click the "Explorer" button under the Applications pane. The Explorer is an environment for exploring data.
Pre-process:
At the very top of the window, just below the title bar there is a row of tabs. Only the first tab,
‘Preprocess’, is active at the moment because there is no dataset open. The first three buttons at the
top of the preprocess section enable us to load data into WEKA. Data can be imported from a file in
various formats: ARFF, CSV, C4.5, binary; it can also be read from a URL or from an SQL database
(using JDBC). The easiest and the most common way of getting the data into WEKA is to store it as
Attribute-Relation File Format (ARFF) files. Here, we can use the predefined datasets that come
along with weka.
Click on ‘Open file…’ button.
It brings up a dialog box allowing us to browse for the data file on the local file system, choose
“weather.nominal.arff” file.
Fig 3.2: Select Dataset ‘weather.nominal’
Classify:
Fig 3.3: Choosing Classifier ‘ J48’
Once we have our data set loaded, all the tabs are available to us. Click on the ‘Classify’ tab. The
‘Classify’ window comes up on the screen.
Choosing a Classifier
Click on ‘Choose’ button in the ‘Classifier’ box just below the tabs and select J48 classifier
WEKA → Classifiers → Trees → J48.
Options:
Fig 3.4: Setting options for ‘J48’ Classifier
binarySplits -- Whether to use binary splits on nominal attributes when building the trees.
confidenceFactor -- The confidence factor used for pruning (smaller values incur more pruning).
debug -- If set to true, classifier may output additional info to the console.
minNumObj -- The minimum number of instances per leaf.
numFolds -- Determines the amount of data used for reduced-error pruning. One fold is used for
pruning, the rest for growing the tree.
reducedErrorPruning -- Whether reduced-error pruning is used instead of C4.5 pruning.
saveInstanceData -- Whether to save the training data for visualization.
seed -- The seed used for randomizing the data when reduced-error pruning is used.
subtreeRaising -- Whether to consider the subtree raising operation when pruning.
unpruned -- Whether pruning is performed.
useLaplace -- Whether counts at leaves are smoothed based on Laplace.
Setting Test Options
Before we run the classification algorithm, we need to set the test options in the ‘Test options’ box. The test options available to us are:
1. Use training set. Evaluates the classifier on how well it predicts the class of the instances it was
trained on.
2. Supplied test set. Evaluates the classifier on how well it predicts the class of a set of instances
loaded from a file. Clicking on the ‘Set…’ button brings up a dialog allowing us to choose the file to
test on.
3. Cross-validation. Evaluates the classifier by cross-validation, using the number of folds that are
entered in the ‘Folds’ text field.
4. Percentage split. Evaluates the classifier on how well it predicts a certain percentage of the data,
which is held out for testing. The amount of data held out depends on the value entered in the ‘%’
field.
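Cross-validation (option 3) partitions the data into folds so that each instance is tested exactly once. A minimal index-level sketch in Python (illustrative, not Weka's stratified implementation):

```python
def cross_validation_splits(n, folds):
    """Yield (train_indices, test_indices); each instance is tested exactly once."""
    idx = list(range(n))
    for f in range(folds):
        test = idx[f::folds]                       # every folds-th instance
        train = [i for i in idx if i not in test]  # the rest are for training
        yield train, test

# 14 folds on 14 instances: leave-one-out, as used in Practical 3 below
splits = list(cross_validation_splits(n=14, folds=14))
print(len(splits), len(splits[0][1]))  # 14 splits, 1 test instance each
```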
Fig 3.5: Setting Evaluation options for the test
In the ‘Classifier evaluation options’ make sure that the following options are checked:
1. Output model. The output is the classification model on the full training set, so that it can be
viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics for each class output.
3. Output confusion matrix. The confusion matrix of the classifier’s predictions is included in the
output.
4. Store predictions for visualization. The classifier’s predictions are remembered so that they can
be visualized.
5. Set ‘Random seed for Xval / % Split’ to 1. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.
Once the options have been specified, we can run the classification algorithm. Click on ‘Start’
button to start the learning process. We can stop learning process at any time by clicking on ‘Stop’
button.
When training is complete, the ‘Classifier’ output area on the right panel of the ‘Classify’ window is
filled with text describing the results of training and testing. A new entry appears in the ‘Result list’
box on the left panel of ‘Classify’ window.
Fig 3.6: Executing Classification
Analyzing Results:
Run Information gives us the following information:
• The algorithm used – J48
• The relation name – “weather.nominal”
• Number of instances in the relation – 14
• Number of attributes in the relation – 5 and the list of the attributes: outlook, temperature,
humidity, windy, play
• The test mode selected: Cross-validation, number of folds = 14
The classifier model is a pruned decision tree in textual form, produced on the full training data. As we can see, the first split is on the ‘outlook’ attribute; at the second level, the splits are on ‘humidity’ and ‘windy’. In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf. Below the tree structure, the number of leaves (5) and the number of nodes, i.e. the size of the tree (8), are reported.
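Why the first split lands on ‘outlook’ can be checked with an information-gain computation. The sketch below uses the standard class counts for the 14-instance weather data (illustrative Python, not J48 itself, which additionally uses gain ratio and pruning):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(class_counts, partitions):
    """Entropy of the class distribution minus the weighted entropy after a split."""
    n = sum(class_counts)
    remainder = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(class_counts) - remainder

# weather data: 9 yes / 5 no overall; outlook splits it into sunny, overcast, rainy
gain_outlook = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
gain_windy = info_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain_outlook, 3), round(gain_windy, 3))  # 0.247 0.048 -- outlook wins
```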
Fig 3.7: Output of Classification
Evaluation on Cross-validation. This part of the output gives estimates of the tree’s predictive
performance, generated by WEKA’s evaluation module. It outputs the list of statistics summarizing
how accurately the classifier was able to predict the true class of the instances under the chosen test mode. The set of measurements is derived from the training data.
In this case, only 50% of the 14 training instances have been classified correctly. This indicates that the results obtained from the training data are not optimistic compared with what might be obtained from an independent test set from the same source. In addition to the classification error, the evaluation output includes measurements derived from the class probabilities assigned by the tree. More specifically, it outputs the mean output error (0.3988) of the probability estimates; the root mean squared error (0.5717), which is the square root of the quadratic loss; and the mean absolute error, calculated in a similar way using the absolute instead of the squared difference. The reason the errors are not 1 or 0 is that not all training instances are classified correctly.
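These probability-based error measures follow directly from their definitions. A sketch with made-up per-instance errors (not the values from the run above):

```python
from math import sqrt

def mae(errors):
    """Mean absolute error of the per-instance probability errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: the square root of the mean quadratic loss."""
    return sqrt(sum(e * e for e in errors) / len(errors))

# made-up differences between predicted class probabilities and the 0/1 truth
errors = [0.0, 0.5, 1.0, 0.5]
print(mae(errors))             # 0.5
print(round(rmse(errors), 4))  # 0.6124
```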
Detailed Accuracy by Class gives a more detailed per-class breakdown of the classifier’s prediction accuracy. From the confusion matrix we can see that two instances of class ‘yes’ have been assigned to class ‘no’, and five instances of class ‘no’ have been assigned to class ‘yes’.
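The summary statistics follow from the confusion matrix. Using the counts reported above (9 ‘yes’ / 5 ‘no’ instances, with 2 ‘yes’ and 5 ‘no’ misclassified), a sketch:

```python
def accuracy_from_confusion(matrix):
    """matrix[i][j] = number of instances of true class i predicted as class j."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# rows: true yes, true no; columns: predicted yes, predicted no
confusion = [[7, 2],
             [5, 0]]
print(accuracy_from_confusion(confusion))  # 0.5 -- matches the 50% reported
```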
Visualization of Results:
WEKA lets us see a graphical representation of the classification tree. Right-click on the entry in the ‘Result list’ for which we would like to visualize a tree.
Fig 3.8: Visualization of result – Tree view
PRACTICAL: 4
AIM: Overview of SQL Server 2005 Analysis Services
Software Required: Analysis services- SQL Server-2005.
Knowledge Required: Data Mining Concepts
Theory/Logic:
Data Mining
The process of extracting patterns from data
Alternative name: knowledge discovery in databases (KDD)
Draws on multiple disciplines: databases, statistics, artificial intelligence
Rapidly maturing technology
Broad applicability
Figure 4.1: Data mining process
Data Mining Tasks - Summary
Classification
Regression
Segmentation
Association Analysis
Anomaly detection
Sequence Analysis
Time-series Analysis
Text categorization
Advanced insights discovery
Others
The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive
solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted
mailing analysis.
The most visible components in SQL Server 2005 are the workspaces that we use to create and work
with data mining models. The online analytical processing (OLAP) and data mining tools are
consolidated into two working environments: Business Intelligence Development Studio and SQL
Server Management Studio. Using Business Intelligence Development Studio, we can develop an
Analysis Services project disconnected from the server. When the project is ready, we can deploy it
to the server. We can also work directly against the server. The main function of SQL Server
Management Studio is to manage the server. Each environment is described in more detail later in
this introduction.
All of the data mining tools exist in the data mining editor. Using the editor we can manage mining
models, create new models, view models, compare models, and create predictions based on existing
models.
After we build a mining model, we will want to explore it, looking for interesting patterns and rules.
Each mining model viewer in the editor is customized to explore models built with a specific
algorithm.
Often the project will contain several mining models, so before we can use a model to create predictions, we need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab; using this tool we can compare the predictive accuracy of our models and determine the best model.
To create predictions, we will use the Data Mining Extensions (DMX) language. DMX extends
SQL, containing commands to create, modify, and predict against mining models. Because creating
a prediction can be complicated, the data mining editor contains a tool called Prediction Query
Builder, which allows us to build queries using a graphical interface. We can also view the DMX
code that is generated by the query builder.
The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the
data that we pass it, and it translates them into a mining model — it is the engine behind the process.
SQL Server 2005 includes nine algorithms:
1. Microsoft Decision Trees
2. Microsoft Clustering
3. Microsoft Naïve Bayes
4. Microsoft Sequence Clustering
5. Microsoft Time Series
6. Microsoft Association
7. Microsoft Neural Network
8. Microsoft Linear Regression
9. Microsoft Logistic Regression
Using a combination of these nine algorithms, we can create solutions to common business
problems. Some of the most important steps in creating a data mining solution are consolidating,
cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes
the Data Transformation Services (DTS) working environment, which contains tools that we can use
to clean, validate, and prepare our data. The audience for this tutorial is business analysts,
developers, and database administrators who have used data mining tools before and are familiar
with data mining concepts.
Business Intelligence Development Studio
Business Intelligence Development Studio is a set of tools designed for creating business
intelligence projects. Because Business Intelligence Development Studio was created as an integrated development environment in which we can create a complete solution, we can work disconnected from the server.
We can change our data mining objects as much as we want, but the changes are not reflected on the
server until after we deploy the project.
Working in an IDE is beneficial for the following reasons:
• We have powerful customization tools available to configure Business Intelligence Development
Studio to suit our needs.
• We can integrate our Analysis Services project with a variety of other business intelligence projects, encapsulating our entire solution into a single view.
• Full source control integration enables our entire team to collaborate in creating a complete
business intelligence solution.
The Analysis Services project is the entry point for a business intelligence solution. An Analysis
Services project encapsulates mining models and OLAP cubes, along with supplemental objects that
make up the Analysis Services database. From Business Intelligence Development Studio, we can
create and edit Analysis Services objects within a project and deploy the project to the appropriate
Analysis Services server or servers.
Working with Data Mining
Data mining gives us access to the information that we need to make intelligent decisions about
difficult business problems. Microsoft SQL Server 2005 Analysis Services (SSAS) provides tools
for data mining with which we can identify rules and patterns in our data, so that we can determine
why things happen and predict what will happen in the future. When we create a data mining
solution in Analysis Services, we first create a model that describes our business problem, and then
we run our data through an algorithm that generates a mathematical model of the data, a process that
is known as training the model. We can then either visually explore the mining model or create
prediction queries against it. Analysis Services can use datasets from both relational and OLAP
databases, and includes a variety of algorithms that we can use to investigate that data.
SQL Server 2005 provides different environments and tools that we can use for data mining. The
following sections outline a typical process for creating a data mining solution, and identify the
resources to use for each step.
Creating an Analysis Services Project
To create a data mining solution, we must first create a new Analysis Services project, and then add
and configure a data source and a data source view for the project. The data source defines the
connection string and authentication information with which to connect to the data source on which
to base the mining model. The data source view provides an abstraction of the data source, which we
can use to modify the structure of the data to make it more relevant to our project.
Adding Mining Structures to an Analysis Services Project
After we have created an Analysis Services project, we can add mining structures, and one or more
mining models that are based on each structure. A mining structure, including tables and columns, is
derived from an existing data source view or OLAP cube in the project. Adding a new mining
structure starts the Data Mining Wizard, which we use to define the structure and to specify an
algorithm and training data for use in creating an initial model based on that structure.
We can use the Mining Structure tab of Data Mining Designer to modify existing mining
structures, including adding columns and nested tables.
Working with Data Mining Models
Before we can use the mining models we define, we must process them so that Analysis Services
can pass the training data through the algorithms to fill the models. Analysis Services provides
several options for processing mining model objects, including the ability to control which objects
are processed and how they are processed.
After we have processed the models, we can investigate the results and make decisions about which
models perform the best. Analysis Services provides viewers for each mining model type, within the
Mining Model Viewer tab in Data Mining Designer, which we can use to explore the mining
models. Analysis Services also provides tools,
in the Mining Accuracy Chart tab of the designer, that we can use to directly compare mining
models and to choose the mining model that works best for our purpose. These tools include a lift
chart, a profit chart, and a classification matrix.
Creating Predictions
The main goal of most data mining projects is to use a mining model to create predictions. After we
explore and compare mining models, we can use one of several tools to create predictions. Analysis
Services provides a query language called Data Mining Extensions (DMX) that is the basis for
creating predictions. To help us build DMX prediction queries, SQL Server provides a query
builder, available in SQL Server Management Studio and Business Intelligence Development
Studio, and DMX templates for the query editor in Management Studio. Within BI Development
Studio, we access the query builder from the Mining Model Prediction tab of Data Mining
Designer.
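A DMX prediction query of the kind the query builder generates can be sketched as follows; the model name, data source name, and columns below are illustrative, not part of any shipped sample:

```sql
-- Hypothetical DMX prediction query: the model [BikeBuyerModel], the data
-- source [Adventure Works DW], and all column names are illustrative only.
SELECT
  t.[CustomerKey],
  [BikeBuyerModel].[Bike Buyer],                          -- predicted value
  PredictProbability([BikeBuyerModel].[Bike Buyer]) AS [Probability]
FROM
  [BikeBuyerModel]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, Gender FROM dbo.ProspectiveBuyers') AS t
ON
  [BikeBuyerModel].[Age]    = t.[Age] AND
  [BikeBuyerModel].[Gender] = t.[Gender]
```

The PREDICTION JOIN maps each input row onto the model's input columns, and the model returns the predicted column plus any prediction functions we ask for.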
SQL Server Management Studio
After we have used BI Development Studio to build mining models for our data mining project, we
can manage and work with the models and create predictions in Management Studio.
SQL Server Reporting Services
After we create a mining model, we may want to distribute the results to a wider audience. We can
use Report Designer in Microsoft SQL Server 2005 Reporting Services (SSRS) to create reports,
which we can use to present the information that a mining model contains. We can use the result of
any DMX query as the basis of a report, and can take advantage of the parameterization and
formatting features that are available in Reporting Services.
Working Programmatically with Data Mining
Analysis Services provides several tools that we can use to programmatically work with data
mining. The Data Mining Extensions (DMX) language provides statements that we can use to
create, train, and use data mining models. We can also perform these tasks by using a combination
of XML for Analysis (XMLA) and Analysis Services Scripting Language (ASSL), or by using
Analysis Management Objects (AMO).
We can access all the metadata that is associated with data mining by using data mining schema row
sets. For example, we can use schema row sets to determine the data types that an algorithm
supports, or the model names that exist in a database.
Data Mining Concepts
Data mining is frequently described as "the process of extracting valid, authentic, and actionable
information from large databases." In other words, data mining derives patterns and trends that exist
in data. These patterns and trends can be collected together and defined as a mining model. Mining
models can be applied to specific business scenarios, such as:
• Forecasting sales.
• Targeting mailings toward specific customers.
• Determining which products are likely to be sold together.
• Finding sequences in the order that customers add products to a shopping cart.
An important concept is that building a mining model is part of a larger process that includes
everything from defining the basic problem that the model will solve, to deploying the model into a
working environment. This process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the
technologies in Microsoft SQL Server 2005 that we can use to complete each step.
Figure 4.2: Steps for building a Mining model
Although the process that is illustrated in the diagram is circular, each step does not necessarily lead
directly to the next step. Creating a data mining model is a dynamic and iterative process. After we
explore the data, we may find that the data is insufficient to create the appropriate mining models,
and that we therefore have to look for more data. We may build several models and realize that they
do not answer the problem posed when we defined the problem, and that we therefore must redefine
the problem. We may have to update the models after they have been deployed because more data
has become available. It is therefore important to understand that creating a data mining model is a
process, and that each step in the process may be repeated as many times as needed to create a good
model.
SQL Server 2005 provides an integrated environment for creating and working with data mining
models, called Business Intelligence Development Studio. The environment includes data mining
algorithms and tools that make it easy to build a comprehensive solution for a variety of projects.
Defining the Problem
The first step in the data mining process, as highlighted in the following diagram, is to clearly define
the business problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the
metrics by which the model will be evaluated, and defining the final objective for the data mining
project. These tasks translate into questions such as the following:
Figure 4.3: Step 1 - Defining the Problem
• What are we looking for?
• Which attribute of the dataset do we want to try to predict?
• What types of relationships are we trying to find?
• Do we want to make predictions from the data mining model or just look for interesting patterns
and associations?
• How is the data distributed?
• How are the columns related, or if there are multiple tables, how are the tables related?
To answer these questions, we may have to conduct a data availability study, to investigate the
needs of the business users with regard to the available data. If the data does not support the needs
of the users, we may have to redefine the project.
Preparing Data
The second step in the data mining process, as highlighted in the following diagram, is to
consolidate and clean the data that was identified in the Defining the Problem step.
Microsoft SQL Server 2005 Integration Services (SSIS) contains all the tools that we need to
complete this step, including transforms to automate data cleaning and consolidation.
Data can be scattered across a company and stored in different formats, or may contain
inconsistencies such as flawed or missing entries. For example, the data might show that a customer
bought a product before that customer was actually even born, or that the customer shops regularly
at a store located 2,000 miles from her home. Before we start to build models, we must fix these
problems. Typically, we are working with a very large dataset and cannot look through every
transaction. Therefore, we have to use some form of automation, such as in Integration Services, to
explore the data and find the inconsistencies.
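A consistency check like the "purchased before born" example above can be sketched as a T-SQL query; the table and column names here are hypothetical:

```sql
-- Hypothetical T-SQL consistency check (table and column names are
-- illustrative): flag sales recorded before the customer's birth date.
SELECT s.SalesOrderID, c.CustomerKey, c.BirthDate, s.OrderDate
FROM dbo.FactSales AS s
JOIN dbo.DimCustomer AS c
  ON c.CustomerKey = s.CustomerKey
WHERE s.OrderDate < c.BirthDate;
```

In practice such checks would be wrapped into Integration Services tasks so they run automatically as part of the data preparation flow.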
Figure 4.4: Step 2 – Preparing Data
Exploring Data
The third step in the data mining process, as highlighted in the following diagram, is to explore the
prepared data.
Figure 4.5: Step 3 – Exploring Data
We must understand the data in order to make appropriate decisions when we create the models.
Exploration techniques include calculating the minimum and maximum values, calculating mean
and standard deviations, and looking at the distribution of the data. After we explore the data, we
can decide if the dataset contains flawed data, and then we can devise a strategy for fixing the
problems.
Data Source View Designer in BI Development Studio contains several tools that we can use to
explore data.
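The exploration statistics mentioned above can also be computed directly with T-SQL aggregates; the table and column names below are illustrative:

```sql
-- Hypothetical exploration query (names are illustrative): basic
-- distribution statistics for a numeric column before modeling.
SELECT
  MIN(Amount)   AS MinAmount,
  MAX(Amount)   AS MaxAmount,
  AVG(Amount)   AS MeanAmount,
  STDEV(Amount) AS StdDevAmount
FROM dbo.FactFinance;
```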
Building Models
The fourth step in the data mining process, as highlighted in the following diagram, is to build the
mining models.
Figure 4.6: Step 4 – Building Models
Before we build a model, we must randomly separate the prepared data into separate training and
testing datasets. We use the training dataset to build the model, and the testing dataset to test the
accuracy of the model by creating prediction queries. We can use the Percentage Sampling
Transformation in Integration Services to split the dataset.
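The Percentage Sampling Transformation performs the split inside an SSIS package; the same 70/30 random split can be sketched in plain T-SQL (all names below are illustrative):

```sql
-- Hypothetical 70/30 random split in T-SQL, as an alternative sketch to the
-- SSIS Percentage Sampling Transformation (table names are illustrative).
SELECT *,
  CASE WHEN ABS(CHECKSUM(NEWID())) % 100 < 70
       THEN 'Training' ELSE 'Testing'
  END AS DataSet
INTO dbo.SplitData
FROM dbo.PreparedData;
```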
We will use the knowledge that we gain from the Exploring Data step to help define and create a
mining model. A model typically contains input columns, an identifying column, and a predictable
column. We can then define these columns in a new model by using the Data Mining Extensions
(DMX) language or the Data Mining Wizard in BI Development Studio.
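A minimal DMX model definition illustrating these three kinds of columns might look as follows; the model and column names are hypothetical:

```sql
-- Minimal DMX model definition (all names are illustrative): an identifying
-- key column, input columns, and one predictable column.
CREATE MINING MODEL [BikeBuyerModel]
(
  [CustomerKey] LONG KEY,
  [Age]         LONG CONTINUOUS,
  [Gender]      TEXT DISCRETE,
  [Bike Buyer]  LONG DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
```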
After we define the structure of the mining model, we process it, populating the empty structure
with the patterns that describe the model. This is known as training the model. Patterns are found by
passing the original data through a mathematical algorithm. SQL Server 2005 contains a different
algorithm for each type of model that we can build. We can use parameters to adjust each algorithm.
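In DMX, training the model is expressed as an INSERT INTO statement that passes the source rows through the algorithm; the data source and view names here are illustrative:

```sql
-- Training ("processing") a model in DMX; the data source name and the
-- training view are illustrative only.
INSERT INTO [BikeBuyerModel]
  ([CustomerKey], [Age], [Gender], [Bike Buyer])
OPENQUERY([Adventure Works DW],
  'SELECT CustomerKey, Age, Gender, BikeBuyer FROM dbo.vTrainingData')
```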
A mining model is defined by a data mining structure object, a data mining model object, and a data
mining algorithm.
Microsoft SQL Server 2005 Analysis Services (SSAS) includes the following algorithms:
• Microsoft Decision Trees Algorithm
• Microsoft Clustering Algorithm
• Microsoft Naive Bayes Algorithm
• Microsoft Association Algorithm
• Microsoft Sequence Clustering Algorithm
• Microsoft Time Series Algorithm
• Microsoft Neural Network Algorithm (SSAS)
• Microsoft Logistic Regression Algorithm
• Microsoft Linear Regression Algorithm
Exploring and Validating Models
The fifth step in the data mining process, as highlighted in the following diagram, is to explore the
models that we have built and test their effectiveness.
Figure 4.7: Step 5 – Exploring and Validating Models
We do not want to deploy a model into a production environment without first testing how well the
model performs. Also, we may have created several models and will have to decide which model
will perform the best. If none of the models that we created in the Building Models step perform
well, we may have to return to a previous step in the process, either by redefining the problem or by
reinvestigating the data in the original dataset.
We can explore the trends and patterns that the algorithms discover by using the viewers in Data
Mining Designer in BI Development Studio. We can also test how well the models create
predictions by using tools in the designer such as the lift chart and classification matrix. These tools
require the testing data that we separated from the original dataset in the model-building step.
Deploying and Updating Models
The last step in the data mining process, as highlighted in the following diagram, is to deploy to a
production environment the models that performed the best.
After the mining models exist in a production environment, we can perform many tasks, depending
on our needs. Following are some of the tasks we can perform:
• Use the models to create predictions, which we can then use to make business decisions. SQL
Server provides the DMX language that we can use to create prediction queries, and Prediction
Query Builder to help us build the queries.
Figure 4.8: Step 6 – Deploying and Updating Models
• Embed data mining functionality directly into an application. We can include Analysis
Management Objects (AMO) or an assembly that contains a set of objects that our application can
use to create, alter, process, and delete mining structures and mining models. Alternatively, we can
send XML for Analysis (XMLA) messages directly to an instance of Analysis Services.
• Use Integration Services to create a package in which a mining model is used to intelligently
separate incoming data into multiple tables. For example, if a database is continually updated with
potential customers, we could use a mining model together with Integration Services to split the
incoming data into customers who are likely to purchase a product and customers who are likely to
not purchase a product.
• Create a report that lets users directly query against an existing mining model.
Updating the model is part of the deployment strategy. As more data comes into the organization, we
must reprocess the models, thereby improving their effectiveness.
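An application embedding data mining functionality might issue a singleton DMX query, supplying one case inline rather than joining to a source table; all names below are illustrative:

```sql
-- Hypothetical singleton DMX prediction: one input case supplied inline
-- (model and column names are illustrative only).
SELECT
  Predict([Bike Buyer])            AS PredictedValue,
  PredictProbability([Bike Buyer]) AS Probability
FROM [BikeBuyerModel]
NATURAL PREDICTION JOIN
  (SELECT 35 AS [Age], 'M' AS [Gender]) AS t
```

NATURAL PREDICTION JOIN matches the inline columns to model columns by name, which keeps embedded queries short.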
PRACTICAL: 5
AIM: Design and Create cube by identifying measures and dimensions for Star
Schema
Software Required: Analysis services- SQL Server-2005.
Knowledge Required: Data cube
Theory/Logic:
Creating a Data Cube
To build a new data cube using BIDS, we need to perform these steps:
• Create a new Analysis Services project
• Define a data source
• Define a data source view
• Invoke the Cube Wizard
We’ll look at each of these steps in turn.
Creating a New Analysis Services Project
To create a new Analysis Services project, we use the New Project dialog box in BIDS. This is very
similar to creating any other type of new project in Visual Studio.
To create a new Analysis Services project, follow these steps:
1. Select Microsoft SQL Server 2005 ⇒ SQL Server Business Intelligence Development Studio
from the Programs menu to launch Business Intelligence Development Studio.
2. Select File ⇒ New Project.
3. In the New Project dialog box, select the Business Intelligence Projects project type.
Fig 5.1: Solution Explorer window
4. Select the Analysis Services Project template.
5. Name the new project AdventureWorksCube1 and select a convenient location to save it.
6. Click OK to create the new project.
Figure 5.1 shows the Solution Explorer window of the new project, ready to be populated with
objects.
Defining a Data Source
To define a data source, we’ll use the Data Source Wizard. We can launch this wizard by right-
clicking on the Data Sources folder in our new Analysis Services project. The wizard will walk us
through the process of defining a data source for our cube, including choosing a connection and
specifying security credentials to be used to connect to the data source.
To define a data source for the new cube, follow these steps:
1. Right-click on the Data Sources folder in Solution Explorer and select New Data Source.
2. Read the first page of the Data Source Wizard and click Next.
Fig 5.2: Connection Manager Window
3. We can base a data source on a new or an existing connection. Because we don’t have any
existing connections, click New.
4. In the Connection Manager Dialog box, select the server containing our analysis services sample
database from the Server Name combo box.
5. Fill in our authentication information.
6. Select the Native OLE DB\SQL Native Client provider (this is the default provider).
7. Select the AdventureWorksDW database. Figure 5.2 shows the filled-in Connection Manager
dialog box.
8. Click OK to dismiss the Connection Manager Dialog box.
9. Click Next.
10. Select Default impersonation information to use the credentials we just supplied for the
connection and click Next.
11. Accept the default data source name and click Finish.
Defining a Data Source View
A data source view is a persistent set of tables from a data source that supply the data for a particular
cube. BIDS also includes a wizard for creating data source views, which we can invoke by right-
clicking on the Data Source Views folder in Solution Explorer.
To create a new data source view, follow these steps:
1. Right-click on the Data Source Views folder in Solution Explorer and select New Data Source
View.
2. Read the first page of the Data Source View Wizard and click Next.
3. Select the Adventure Works DW data source and click Next. Note that we could also launch the
Data Source Wizard from here by clicking New Data Source.
4. Select the dbo.FactFinance table in the Available Objects list and click the ⇒ button to move it to
the Included Object list. This will be the fact table in the new cube.
5. Click the Add Related Tables button to automatically add all of the tables that are directly related
to the dbo.FactFinance table. These will be the dimension tables for the new cube. Figure 5.3 shows
the wizard with all of the tables selected.
6. Click Next.
7. Name the new view Finance and click Finish. BIDS will automatically display the schema of the
new data source view, as shown in Figure 5.3.
Fig 5.3: Data Source View Wizard
Analysis Services
Invoking the Cube Wizard
As we can probably guess at this point, we invoke the Cube Wizard by right-clicking on the Cubes
folder in Solution Explorer. The Cube Wizard interactively explores the structure of our data source
view to identify the dimensions, levels, and measures in our cube.
To create the new cube, follow these steps:
1. Right-click on the Cubes folder in Solution Explorer and select New Cube.
2. Read the first page of the Cube Wizard and click Next.
3. Select the option to build the cube using a data source.
4. Check the Auto Build checkbox.
5. Select the option to create attributes and hierarchies.
6. Click Next.
7. Select the Finance data source view and click Next.
Fig 5.4: Invoking the Cube Wizard
8. Wait for the Cube Wizard to analyze the data and then click Next.
9. The Wizard will get most of the analysis right, but we can fine-tune it a bit. Select DimTime in
the Time Dimension combo box. Uncheck the Fact checkbox on the line for the dbo.DimTime table.
This will allow us to analyze this dimension using standard time periods.
10. Click Next.
11. Accept the default measures and click Next.
12. Wait for the Cube Wizard to detect hierarchies and then click Next.
13. Accept the default dimension structure and click Next.
14. Name the new cube FinanceCube and click Finish.
Deploying and Processing a Cube
At this point, we’ve defined the structure of the new cube, but there’s still more work to be done.
We still need to deploy this structure to an Analysis Services server and then process the cube to
create the aggregates that make querying fast and easy.
To deploy the cube we just created, select Build ⇒ Deploy AdventureWorksCube1. This will deploy
the cube to our local Analysis Server, and also process the cube, building the aggregates for us.
BIDS will open the Deployment Progress window, as shown in Figure 5.5, to keep us informed
during deployment and processing.
Fig 5.5: Deployment Process Window
Exploring a Data Cube
At last we’re ready to see what all the work was for. BIDS includes a built-in Cube Browser that lets
us interactively explore the data in any cube that has been deployed and processed. To open the
Cube Browser, right click on the cube in Solution Explorer and select Browse.
The Cube Browser is a drag-and-drop environment. If we’ve worked with pivot tables in Microsoft
Excel, we should have no trouble using the Cube browser. The pane to the left includes all of the
measures and dimensions in our cube, and the pane to the right gives us drop targets for these
measures and dimensions. Among other operations, we can:
• Drop a measure in the Totals/Detail area to see the aggregated data for that measure.
• Drop a dimension or level in the Row Fields area to summarize by that level or dimension on rows.
• Drop a dimension or level in the Column Fields area to summarize by that level or dimension on
columns.
• Drop a dimension or level in the Filter Fields area to enable filtering by members of that dimension
or level.
• Use the controls at the top of the report area to select additional filtering expressions.
To see the data in the cube we just created, follow these steps:
1. Right-click on the cube in Solution Explorer and select Browse.
2. Expand the Measures node in the metadata panel (the area at the left of the user interface).
3. Expand the Fact Finance node.
4. Drag the Amount measure and drop it on the Totals/Detail area.
5. Expand the Dim Account node in the metadata panel.
6. Drag the Account Description property and drop it on the Row Fields area.
7. Expand the Dim Time node in the metadata panel.
8. Drag the Calendar Year-Calendar Quarter-Month Number of Year hierarchy and drop it on
the Column Fields area.
9. Click the + sign next to year 2001 and then the + sign next to quarter 3.
10. Expand the Dim Scenario node in the metadata panel.
11. Drag the Scenario Name property and drop it on the Filter Fields area.
12. Click the dropdown arrow next to scenario name. Uncheck all of the checkboxes except for
the one next to the Budget name.
Figure 5.6 shows the result. The Cube Browser displays month-by-month budgets by account for the
third quarter of 2001. Although we could have written queries to extract this information from the
original source data, it’s much easier to let Analysis Services do the heavy lifting for us.
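For comparison, an MDX query against the deployed cube could return a similar slice. The exact hierarchy and member names depend on what the Cube Wizard generated, so the names below are guesses and would need to be adjusted to the actual cube:

```sql
-- Hypothetical MDX against FinanceCube (hierarchy and member names are
-- guesses at the Cube Wizard defaults; adjust to the actual cube metadata).
SELECT
  { [Measures].[Amount] } ON COLUMNS,
  NON EMPTY [Dim Account].[Account Description].Members ON ROWS
FROM [FinanceCube]
WHERE ( [Dim Scenario].[Scenario Name].[Budget] )
```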
Fig 5.6: Viewing data in a Data Cube