Czech Technical University in Prague

Faculty of Information Technology

Department of Theoretical Computer Science

MI-ADM – Algorithms of data mining (2010/2011)

Seminar: Rapid Miner – beginner’s guide

Jan Černý, FIT, Czech Technical University in Prague

European Social Fund, Prague & EU: We invest in your future

Installing Rapid Miner

Application download

• Download and install from the Rapid-I home page http://rapid-i.com/

Sources download

• Check out the sources from the repository https://rapidminer.svn.sourceforge.net/svnroot/rapidminer/Vega/

• Add tools.jar (from your local JDK) as a project dependency.

• Run the ant build script.

• Run Rapid Miner using the class RapidMinerGUI.java in the package com.rapidminer.gui.

Creating an experiment

• Rapid Miner uses nested graphs to describe the knowledge flow process.

• This process can contain loading data, preprocessing, modeling with different types of algorithms, performance measurement, report generation, and so on.

• In this example we will learn, step by step, how to create a knowledge flow that reads data and performs cross-validation to test the quality of our model.

Operators

• A knowledge flow consists of Operators, each of which has a given number of inputs and outputs with type checking.

• Each Operator also has attributes, which can be set when you select the operator.
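The idea of operators with type-checked inputs and outputs can be illustrated with a small Python sketch. This is not Rapid Miner code: the `Operator` class, the port names, the `can_connect` helper and the file name `data.arff` are all hypothetical and serve only to show why a connection between incompatible ports is flagged as an error.

```python
from dataclasses import dataclass, field


class ExampleSet: ...   # stands for a table of data
class Model: ...        # stands for a learned model


@dataclass
class Operator:
    """Hypothetical operator: typed input/output ports plus settable attributes."""
    name: str
    inputs: dict = field(default_factory=dict)      # port name -> expected type
    outputs: dict = field(default_factory=dict)     # port name -> produced type
    attributes: dict = field(default_factory=dict)  # values set when the operator is selected


def can_connect(src, out_port, dst, in_port):
    """A connection is valid only if the produced type matches the expected type."""
    return issubclass(src.outputs[out_port], dst.inputs[in_port])


# Port names and the file name below are assumptions made for illustration.
read_arff = Operator("Read ARFF", outputs={"out": ExampleSet},
                     attributes={"data_file": "data.arff"})
svm = Operator("SVM", inputs={"training set": ExampleSet}, outputs={"model": Model})
apply_model = Operator("Apply Model",
                       inputs={"model": Model, "data": ExampleSet},
                       outputs={"labelled data": ExampleSet})

print(can_connect(read_arff, "out", svm, "training set"))   # True
print(can_connect(read_arff, "out", apply_model, "model"))  # False: data is not a model
```

In the sketch, wiring the data output into a port that expects a model is rejected before anything runs, which is the kind of check behind the red ports in the editor.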

Learning simple model

• Let’s construct a simple knowledge flow that learns our model on all the data and returns it as output.

• Notice the red SVM input and the error messages in the problems dialog.

• This construct reads data from an ARFF file and passes it to the Support Vector Machine (SVM) model. The model is then sent to the output (the right side), where we can view it in the report view.


• Most of the time you can simply use the fixes that Rapid Miner suggests and they will work fine.

• The first error tells us that SVM cannot handle polynominal output attributes and offers us three fixes:

• 1) Convert them to numerical, which is useful if the attribute values have a defined distance from each other (for example a student’s mark, a finite set from A to F: we know that A is closer to B than to C, and so on).

• 2) Classification by regression, which uses one regression SVM model for each output (with this option we can use a regression model to solve a classification task).

• 3) Polynominal by binominal classification, which uses one binominal SVM classifier for each class (each classifies into two classes: my class and the others).

• We add the label from the available fixes (the label identifies the output attribute; in our case the output attribute is named class in the ARFF file), select Polynominal by binominal classification and see what happens to the knowledge flow.
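To make fixes 1 and 3 more concrete, here is a minimal Python sketch of the underlying data transformations. It does not reproduce what Rapid Miner does internally; the function names and the example class values are assumptions chosen for illustration.

```python
def ordinal_encode(values, order):
    """Fix 1: map ordered nominal values to integers that preserve their distance."""
    rank = {value: position for position, value in enumerate(order)}
    return [rank[v] for v in values]


def binominal_labels(labels, positive):
    """Fix 3: relabel a multi-class target as 'my class' (1) versus 'others' (0)."""
    return [1 if y == positive else 0 for y in labels]


# Marks from A to F: after encoding, A (0) is closer to B (1) than to C (2).
marks = ["A", "C", "B", "F"]
print(ordinal_encode(marks, order="ABCDEF"))   # [0, 2, 1, 5]

# One binominal target per class; a separate binary classifier would be
# trained on each of these label vectors.
classes = ["red", "green", "blue", "red"]      # hypothetical class values
for c in sorted(set(classes)):
    print(c, binominal_labels(classes, c))
```

With three classes the second loop produces three label vectors, so three binary classifiers would be trained, one per class.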


• The knowledge flow has changed: one operator was added to set the role of the attribute class to label, and one nested operator was added to perform polynominal by binominal classification. Inside that operator is the logic that creates the binominal classifiers, in our case the SVM operator (you can view it by double-clicking).

• Nested Operators are identified by a symbol in the bottom right corner.

• Note: you may see the same errors that are shown here, but this is a bug in Rapid Miner and the knowledge flow will work normally.

Results - model

• Now we can switch to the results view and look at the model that was created.

• Here we can see the model description, but we also want to know its quality (i.e. its error). For that we need to modify the knowledge flow further.

Performance of model

• Let’s measure the performance of our model using 10-fold cross-validation. Add the X-validation operator and plug it in in place of the polynominal by binominal classification (PBC) operator. Then cut the PBC operator and paste it inside the learning part of the validation.

• As you can see, the validation operator has two parts inside and divides the data automatically. The first part is executed to learn the model; the training data are passed to its input.

• After the model is learned, the testing part is executed and the model is tested on the test data.
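The way cross-validation divides the data can be sketched in a few lines of Python. This is a simplified illustration, not the X-validation operator itself: it assigns examples to folds in a fixed interleaved order and does no shuffling or stratification, and the function name `k_fold_indices` is an assumption.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train, test) index lists; every example is tested exactly once."""
    indices = list(range(n_examples))
    for fold in range(k):
        test = indices[fold::k]          # every k-th example, offset by the fold number
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


# For each fold: learn the model on `train`, then test it on `test`.
for train, test in k_fold_indices(20, k=10):
    print(len(train), "training examples, tested on", test)
```

Each of the 10 iterations corresponds to one pass through the learning part followed by one pass through the testing part.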

Performance of model

• The testing part uses the Apply Model operator, which gets the output of the given model on the given data, followed by the Performance operator, which computes various statistics on the model’s output. We are mainly interested in classification accuracy, but you can select any other available measure.
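As an illustration of the statistics computed from a model’s output, the following Python sketch calculates classification accuracy and a confusion matrix from true and predicted labels. The labels are made-up toy values and the function names are assumptions; the Performance operator offers many more measures.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels, for illustration only.
true_labels = ["a", "a", "b", "b", "c"]
predicted = ["a", "b", "b", "b", "c"]

print(accuracy(true_labels, predicted))    # 0.8
matrix = confusion_matrix(true_labels, predicted)
print(matrix[("a", "b")])                  # 1: one "a" was misclassified as "b"
```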

Results - accuracy

• Now we see the accuracy of our model, including the confusion matrix.

…

Loop

• Now we are going to modify our knowledge flow to make the experiment statistically significant: we need to repeat it a large number of times.

• Go to the top level of the knowledge flow, insert a Loop operator in place of the X-validation operator, and cut and paste the validation operator inside the loop.

• You can see that the last line coming out of the loop is doubled. That indicates that it carries multiple values, so we can average the accuracy values from the different runs using the Average operator.
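The loop-and-average pattern can be sketched in Python as follows. The function `run_validation` is a hypothetical stand-in that returns a simulated accuracy, not a real measurement; in the actual knowledge flow each iteration would run the whole cross-validation. The number of repetitions (30) is also an arbitrary choice for the sketch.

```python
import random
from statistics import mean, stdev


def run_validation(seed):
    """Hypothetical stand-in for one cross-validation run; returns a simulated accuracy."""
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.05, 0.05)


# The loop produces a collection of values (the "doubled" line) ...
accuracies = [run_validation(seed) for seed in range(30)]

# ... which is then reduced to a single number by averaging.
print(f"mean accuracy = {mean(accuracies):.3f}, std = {stdev(accuracies):.3f}")
```

Reporting the spread alongside the mean shows how much the result varies from run to run.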

Knowledge flow overview
