

REPORT

ON

SVM IMPLEMENTATION

By: Abhishek Kumar Agrawal (Abby)

Research Assistant,

University of Guelph, ON, Canada.

Page 2: Svm implementation for Health Data

2

Contents

1. Introduction
2. Data Set Description
   2.1 Transformed Data
   2.2 Raw Data
3. SVM Tool - LIBSVM
   3.1 LIBSVM Features
4. LIBSVM Usage
   4.1 Installation and Data Format
   4.2 svm-train Usage
   4.3 svm-predict Usage
   4.4 svm-scale Usage
   4.5 Tips on Practical Use
5. Data Set Distribution, Tests and Observations
   5.1 Transformed Data Analysis with LIBSVM
       5.1.1 Random Distribution
       5.1.2 Proportional Distribution
       5.1.3 Equal Distribution
   5.2 Raw Data Analysis with LIBSVM
       5.2.1 Random and Proportional Distribution
       5.2.2 Customized Distribution
       5.2.3 ANN Sample Data with LIBSVM
       5.2.4 Binary Classification with LIBSVM
       5.2.5 Tests with All Claims
6. Rapid Miner Interface of LIBSVM
7. LIBLINEAR: Usage and Implementation
8. Comparison of One-against-one vs. One-against-rest Multi-label Classification
9. References


1. Introduction: This document describes the implementation of Support Vector Machines (SVM) on the Health Provider Network dataset provided by Kaggle for its data-mining competition. It also gives brief information about the tools used and the outcomes of running different subsets of the above-mentioned data through SVM, along with the possible reasons behind those outcomes.

2. Data Set Description: The dataset used here is medical data from HPN, California. It comprises a members table, a claims table, Days In Hospital for Y2 and Y3, etc. The goal of the project is to predict Days In Hospital for Y4 using the tables mentioned above.

After pre-processing, the data statistics became clearly visible. The whole dataset was mapped to two different datasets:

2.1 Transformed Data: This dataset consists of a table of unique members with their MemberID, Age, Sex, Total No. of Discrete Claims, Days In Hospital YX, Specialty Group, Primary Procedure Group, Charlson Index, etc. As per the HPN data dictionary, the claims of year X-1 are needed to predict the Days In Hospital for year X. Hence Days In Hospital is the class label for the classification.

Observing this data, we draw the following class distribution statistics:

Class (*DIH):  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
Percentage:    84.5  6.36  3.11  1.91  1.28  0.74  0.45  0.34  0.23  0.19  0.14  0.11  0.10  0.08  0.06  0.34

*DIH here stands for Days In Hospital.

We can see from the above table that most of the data, i.e. 97.19%, lies under the class labels 0-4. Hence the SVM classifier has to deal with this biased data in order to predict the true classes.

2.2 Raw Data: This dataset consists of spreadsheets for each distinct claim count, from Claim 1 to Claim 43, for Y1, Y2 and Y3. Since the distributions were based on claims, the data columns regarding each claim were included explicitly.

E.g.: the Y1 C3 raw data sheet includes all the medical descriptions of members who filed 3 claims in Year 1.

Here the data size of a single sheet is comparatively low compared with the Transformed Data, as it includes only the data for its specific claim count. But again, the class distribution follows the same trend: more than 90% of the data belongs to class labels 0-4.

3. SVM Tool - LIBSVM: There are many open-source classification tools available; among them, LIBSVM is one of the best-known and most-tested classifiers based on Support Vector Machines. It is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.


3.1 LIBSVM Features: LIBSVM provides a simple interface where users can easily link it with their own programs. Main features of LIBSVM include:

- Different SVM formulations
- Efficient multi-class classification
- Cross validation for model selection
- Probability estimates
- Various kernels (including a pre-computed kernel matrix)
- Weighted SVM for unbalanced data
- Both C++ and Java sources
- A GUI demonstrating SVM classification and regression
- Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, LabVIEW and PHP interfaces; C# .NET code and a CUDA extension are available; it is also included in some data-mining environments such as RapidMiner and PCP
- Automatic model selection, which can generate a contour of cross-validation accuracy

LIBSVM is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

4. LIBSVM Usage: LIBSVM has a very simple, user-friendly interface. The following steps describe its usage.

4.1 Installation and Data Format:

On Unix systems (e.g. Ubuntu):

- First download the LIBSVM package from http://www.csie.ntu.edu.tw/~cjlin/libsvm
- Make sure the g++ compiler is installed; if not, install it from System -> Administration -> Synaptic Package Manager.
- Type `make' to build the `svm-train' and `svm-predict' programs. Run them without arguments to show their usage.
- Alternatively, install the tools by typing: sudo apt-get install libsvm-tools

On other systems, consult `Makefile' to build them (e.g., see 'Building Windows binaries' in the package) or use the pre-built binaries (Windows binaries are in the directory `windows').

The format of a training or testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...

Each line contains one instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label (multi-class is supported). For regression, <label> is the target value, which can be any real number. For one-class SVM, it is not used, so it can be any number. Each pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number. The only exception is the pre-computed kernel, where <index> starts from 0; see the section on pre-computed kernels. Indices must be in ASCENDING order. Labels in the testing file are only used to calculate accuracy or errors; if they are unknown, just fill the first column with any numbers.


Example:

1 1:0 2:0 3:8 4:4 5:162 6:-1 7:0 8:26 9:0 10:2
0 1:1 2:1 3:8 4:4 5:42 6:-1 7:0 8:41 9:0 10:1
0 1:3 2:0 3:4 4:4 5:32 6:-1 7:0 8:23 9:0 10:1
2 1:0 2:0 3:8 4:4 5:49 6:-1 7:0 8:26 9:0 10:2

Some constraints of the SVM data format:

1. The values must be real numbers; text is not supported so far.
2. The class label must be at the beginning of the line.
3. There should not be any heading at the top of the file.
4. The LIBSVM tools include a Python script named checkdata.py; it can be used to check the data format before modeling the data.

Any numerical CSV file can be converted to LIBSVM format using convert.c. Example:

cc convert.c
./a.out <input_filename> <output_filename>
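The CSV-to-LIBSVM conversion done by convert.c can also be sketched in a few lines of Python. This is a minimal sketch, assuming the class label sits in the first CSV column and all remaining columns are numeric; the function name is our own, not part of LIBSVM:

```python
import csv
import io

def csv_to_libsvm(csv_text):
    """Convert numeric CSV rows to LIBSVM sparse lines.

    Assumes the class label is the first column; remaining columns
    become 1-based <index>:<value> pairs. Zero values are kept here
    for simplicity, though a truly sparse encoding could drop them.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        label, features = row[0], row[1:]
        pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
        lines.append(f"{label} {pairs}")
    return "\n".join(lines)

# Two instances with 3 features each
print(csv_to_libsvm("1,0.5,0,8\n0,1,1,4"))
# 1 1:0.5 2:0 3:8
# 0 1:1 2:1 3:4
```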

4.2 'svm-train' Usage:

Usage: svm-train [options] training_set_file [model_file]

options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC
    1 -- nu-SVC
    2 -- one-class SVM
    3 -- epsilon-SVR
    4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u'*v
    1 -- polynomial: (gamma*u'*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
    3 -- sigmoid: tanh(gamma*u'*v + coef0)
    4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n : n-fold cross validation mode
-q : quiet mode (no outputs)
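The default RBF kernel (-t 2) can be written out directly; a minimal sketch of the kernel value between two feature vectors:

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2),
    matching LIBSVM's default -t 2 kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Identical vectors give 1.0; distant vectors decay towards 0.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5))  # 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.5))
```

Gamma controls how fast similarity decays with distance, which is why the grid search over gamma matters so much in the tests below.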


4.3 'svm-predict' Usage:

Usage: svm-predict [options] test_file model_file output_file

options:
-b probability_estimates : whether to predict probability estimates, 0 or 1 (default 0)

model_file is the model file generated by svm-train; test_file is the test data you want to predict. svm-predict will produce its output in output_file.

4.4 'svm-scale' Usage:

Usage: svm-scale [options] data_filename

options:
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename

4.5 Tips on Practical Use

- Scale your data. For example, scale each attribute to [0,1] or [-1,+1].
- For C-SVC, consider using the model selection tool in the tools directory.
- nu in nu-SVC/one-class SVM/nu-SVR approximates the fraction of training errors and support vectors.
- If data for classification are unbalanced (e.g. many positive and few negative), try different penalty parameters C via -wi (see the examples below).
- Specify a larger cache size (i.e., a larger -m) for huge problems.

Stepwise Approach: Open the terminal.

Step 1: Convert the data file to LIBSVM format.

cc convert.c
./a.out file1 file2

Here file1 is the input file and file2 is the output file in LIBSVM format.

Step 2: Normalize the data using the svm-scale function.

svm-scale -l -1 -u 1 -s range1 file2 > file2.scale

Here file2.scale is the final normalized file, in the range [-1,1].
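The per-attribute scaling that svm-scale performs can be sketched as follows. This is a minimal sketch of min-max scaling to [-1, 1]; in practice svm-scale also saves the per-feature ranges (-s range1 above) so that test data can be scaled with the same parameters:

```python
def scale_column(values, lower=-1.0, upper=1.0):
    """Min-max scale one attribute column to [lower, upper],
    as `svm-scale -l -1 -u 1` does per feature."""
    lo, hi = min(values), max(values)
    if lo == hi:  # constant column: nothing to scale
        return [lower for _ in values]
    span = (upper - lower) / (hi - lo)
    return [lower + (v - lo) * span for v in values]

print(scale_column([0, 5, 10]))  # [-1.0, 0.0, 1.0]
```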

Step 3: If a subset of the data is required for the process, it can be extracted using the Python script subset.py, available in the tools directory of LIBSVM.

python subset.py file2.scale 1000 train_file.scale test_file.scale
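subset.py's behaviour here, selecting n random rows into one file and writing the rest into another, can be sketched as follows (a minimal sketch; the stratified selection mode that subset.py also offers is omitted, and the function name is our own):

```python
import random

def random_subset(lines, n, seed=0):
    """Split lines into (n randomly chosen lines, remaining lines),
    mimicking `python subset.py data n subset_file rest_file`."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(lines)), n))
    subset = [line for i, line in enumerate(lines) if i in picked]
    rest = [line for i, line in enumerate(lines) if i not in picked]
    return subset, rest

data = [f"row{i}" for i in range(10)]
train, test = random_subset(data, 4)
print(len(train), len(test))  # 4 6
```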


Here file2.scale is the input file, and 1000 is the number of rows that will be selected randomly and written to train_file.scale; the remaining rows will be written to test_file.scale.

Step 4: Find the ideal parameters for modeling, i.e. cost and gamma. We can use the grid search tool from the LIBSVM tool-set.

python grid.py train_file.scale
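Under the hood, grid.py simply loops over exponentially spaced (C, gamma) pairs and keeps the pair with the best cross-validation accuracy. A minimal sketch of that loop, assuming grid.py's usual default ranges (log2(C) from -5 to 15 in steps of 2, log2(gamma) from 3 down to -15 in steps of 2); cv_accuracy is a hypothetical stand-in for running `svm-train -v 5 -c c -g g`:

```python
import itertools

def grid_search(cv_accuracy):
    """Sketch of grid.py's parameter search: try exponentially
    spaced (C, gamma) pairs and keep the best cross-validation
    accuracy reported by cv_accuracy(c, g)."""
    c_values = [2.0 ** e for e in range(-5, 16, 2)]   # 2^-5 .. 2^15
    g_values = [2.0 ** e for e in range(3, -16, -2)]  # 2^3 .. 2^-15
    best = max(itertools.product(c_values, g_values),
               key=lambda cg: cv_accuracy(*cg))
    return best

# Toy scorer that peaks at C=2^-5, gamma=2^-7 (illustration only) --
# the same pair this report's grid searches repeatedly returned.
best_c, best_g = grid_search(lambda c, g: -abs(c - 2**-5) - abs(g - 2**-7))
print(best_c, best_g)  # 0.03125 0.0078125
```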

A plot will appear, and the tool will perform a grid-based search for the maximum accuracy. The grid search rate is nothing but the cross-validation accuracy. This process takes significantly longer than the others, so be patient. The best parameters are shown along with the grid search graph.

Step 5: Now it is time to train the data. We can select the appropriate options from the option set

mentioned above. svm-train -c 0.03125 –g 0.0078125 -e 0.01 train_file.scale

The SVM algorithm will read the options and the parameters supplied above and results the total nSv i.e. total no. of support vectors and with all the distinct support vectors and parameters of multi classification. All the modeling is stored in the file train_file.scale.model here. LIBSVM employs one against one multi label classification. If we want to perform k-fold cross validation, then we should follow the following syntax:

svm-train -s 0 -c 100 -g 0.1 -v 5 train_file.scale

In order to provide the penalty to certain class labels, we can weight them as follows:

svm-train -c 10 -w1 1 -w2 5 -w4 2 train_file.scale

Train a classifier with penalty 10 = 1 * 10 for class 1, penalty 50 =5 * 10 for class 2, and penalty 20 = 2 * 10 for class 4. Along with these we can also chose the kernel type, svm-type and different supporting parameters by supplying option values referring from option column mentioned above.
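The per-class penalties produced by -wi are just weight * C; a quick check of the arithmetic in the example above (the function name is our own):

```python
def class_penalties(c, weights):
    """Effective per-class C values under LIBSVM's -wi option:
    class i is trained with penalty weights[i] * C."""
    return {label: w * c for label, w in weights.items()}

# svm-train -c 10 -w1 1 -w2 5 -w4 2  ->  C=10, w1=1, w2=5, w4=2
print(class_penalties(10, {1: 1, 2: 5, 4: 2}))  # {1: 10, 2: 50, 4: 20}
```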

Step 6: This is the final step, called the prediction step. Here the testing file is used to predict the unknown class labels.

svm-predict test_file.scale train_file.scale.model output.predict

Here test_file.scale is the testing file, train_file.scale.model is the model file, and output.predict is the output file with the predictions of the unknown class labels. We also get the percentage accuracy.

Here, % Accuracy = (# correctly predicted data / # total testing data) * 100
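The accuracy figure reported by svm-predict follows directly from this formula; a minimal sketch:

```python
def percent_accuracy(predicted, actual):
    """% Accuracy = (# correctly predicted / # total testing data) * 100,
    as reported by svm-predict."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# 3 of 4 predictions match the true labels
print(percent_accuracy([0, 0, 1, 0], [0, 1, 1, 0]))  # 75.0
```

Note that on data as biased as this report's, a classifier that always predicts DIH 0 already scores around 85-90% by this measure, which is exactly the pattern observed in the tests below.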

We can also obtain a model with probability information and predict test data with probability estimates:

svm-predict -b 1 test_file.scale train_file.scale.model output.predict

For more information about the usage of LIBSVM, we can refer to the README file of the LIBSVM package.


5. Data Set Distribution, Tests and Observations:

As described above, SVM has been tested with two main datasets, i.e. Transformed Data and Raw Data. Below is a description of the tests and results for each, with the possible reasons behind them.

5.1 Transformed Data Analysis with LIBSVM:

Challenges: There were a few challenges with this set of data:

1. The dataset is highly biased; 97.19% of the whole data carries the class labels 0, 1, 2, 3 and 4.
2. It has a minimum of 92 columns, and this increases from Y1 to Y2 to Y3; hence the data complexity increases.
3. The data size is huge. For example, Memberclaims_Y1, the Year 1 claims file, has 76,037 rows, so training and grid-searching such huge data with LIBSVM is very time-consuming.

Observation: Small data samples of the Transformed Data were taken for training and testing purposes. Hence, sampling code was written to randomly select small data samples according to random, customized, equal and proportional distributions.
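The proportional sampling used below can be sketched as follows (a minimal sketch; the report's actual sampling code is not shown, so the function and variable names here are hypothetical):

```python
import random
from collections import defaultdict

def proportional_sample(rows, labels, n, seed=0):
    """Draw about n rows so that each class label keeps roughly the
    same share it has in the full dataset (proportional distribution)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, group in by_class.items():
        take = round(n * len(group) / len(rows))  # class share of n
        sample.extend(rng.sample(group, min(take, len(group))))
    return sample

rows = list(range(100))
labels = [0] * 90 + [1] * 10          # 90% class 0, 10% class 1
picked = proportional_sample(rows, labels, 20)
print(len(picked))  # 20
```

The random distribution is the degenerate case of simply sampling n rows without regard to labels.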

5.1.1 Random Distribution: The sample data for training and testing is taken randomly. The results are shown below:

*RBF is a type of kernel function and stands for Radial Basis Function.
**Avg. Grid Search stands for the average of the cross-validation rates of the tests performed.

Inference: The sampled data contains more than 90% DIH 0; hence SVM is unable to learn the other DIHs. The avg. grid search rate is the default here because it matches the percentage of DIH 0 in the original data.

[Figure: prediction accuracy (78%-87%) vs. training data size (600, 700, 1400).]

Parameters           Value
SVM Type             c-svc
Kernel Type          *RBF
Cost                 0.03125
Gamma                0.0078125
No. of classes       16
No. of features      91
**Avg. Grid Search   83.37%
True Prediction      Class 0


5.1.2 Proportional Distribution: The sample for training and testing data is taken in the exact proportions of the original data. Sampling here is again done by the sample-distribution-generator code. The results are shown below:

*Cost and Gamma for training data size #700 are 8 here.

Inference: (i) The avg. grid search rate is better than for the random distribution, so the proportional distribution has better cross-validation accuracy than the random distribution for the Transformed Data. (ii) Even with proportional data, the true predictions are nothing but DIH 0. It means that even for the 84.5% zero-DIH data, SVM is learning the default classification; hence the proportion of the other DIHs must be increased.

5.1.3 Equal Distribution: In this type of distribution, an equal number of rows is selected randomly from each DIH, again by the sample-distribution-generator code, starting from 1 row per class up to 40 rows per class. This distribution was tested against two main kernel functions: 1. RBF and 2. Polynomial. We tested LIBSVM with both kernel functions while increasing the training data size. Four testing samples of sizes 50, 100, 200 and 500 were selected randomly. All these tests used disjoint training and testing sets; the other SVM parameters were kept constant so that the effect of data size could be observed.

SVM Type   No. of classes   No. of features   Cost      Gamma       True Predictions
c-svc      16               91                0.03125   0.0078125   0,1,3,6,7,14…
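The equal distribution described in 5.1.3 can be sketched as follows (a minimal sketch, under the same hypothetical-naming caveat as the proportional sampler above):

```python
import random
from collections import defaultdict

def equal_sample(rows, labels, per_class, seed=0):
    """Select the same number of random rows from every class label
    (equal distribution); classes with fewer rows than per_class
    contribute all of their rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for group in by_class.values():
        k = min(per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample

rows = list(range(50))
labels = [i % 5 for i in range(50)]   # 5 classes, 10 rows each
print(len(equal_sample(rows, labels, 4)))  # 20
```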

[Figure: prediction accuracy (79%-86%) vs. training data size (600, 700, 1400).]

Parameters          Value
SVM Type            c-svc
Kernel Type         RBF
*Cost               0.03125
*Gamma              0.0078125
No. of classes      16
No. of features     91
Avg. Grid Search    84.17%
True Prediction     Class 0


Testing Data Size 50 Testing Data Size 100

Testing Data Size 200 Testing Data Size 500

Inference: We can infer the following from the result graphs:
1. The RBF kernel function is better than the Polynomial kernel function in terms of prediction accuracy.
2. Training sets with a smaller data size have lower accuracy for both kernel functions.
3. With equal sampling, we obtained true prediction class labels other than 0 as well.
4. Accuracy in none of the above cases has reached near or beyond 90%.
5. As the testing data size increases, the true prediction rate decreases.
But again, in all of the above cases the maximum true prediction is for DIH 0; hence the classifier is learning the pattern of DIH 0 correctly.

[Four plots, one per testing data size above: accuracy (0%-90%) vs. training data size (16 to 640), comparing the RBF and Polynomial kernels.]


5.2 Raw Data Analysis with LIBSVM:

Challenges: There were a few challenges with this set of data as well:
1. This dataset is again very biased towards lower DIH; more than 90% of the rows belong to DIH 0 in every claim sheet.
2. Since the data is sub-divided into many claim sheets, the data size is comparatively small compared with the Transformed Data sheets. So here we do not have plenty of cases for every DIH; hence an equal distribution of the data is not a good option.
3. Since there are fewer columns, pattern learning and prediction may be challenging.

Observation: Here too, the analysis is based on the different types of distribution discussed above.

5.2.1 Random and Proportional Distribution: In most of the tests here, the Year 1 Claim 1 file was taken as a sample to observe the result pattern.

Inference: We can conclude a few things from the above results:
1. The true prediction here is again DIH 0 (the default), but the notable thing is that the cost and gamma are again the same as we had with the Transformed Data. So whenever there are more DIH-0 rows in the training dataset, the grid search will always yield the same cost and gamma, and these parameters tend to predict DIH 0 only.
2. The identical results for the random and proportional distributions indicate that the classifier is learning nothing but DIH 0 from either distribution.
3. We can conclude that some sort of customization is needed in training the data.

[Figure: accuracy (89%-93%) for the random vs. proportional distributions at training data sizes 100, 200 and 500.]

Parameters          Value
SVM Type            c-svc
Kernel Type         RBF
Cost                0.03125
Gamma               0.0078125
No. of classes      16
No. of features     11
Avg. Grid Search    91.3%
True Prediction     Class 0


5.2.2 Customized Distribution: For this type of distribution, Year 1 Claim 1 has again been taken as a sample. The results and analysis are as follows.

The customization of the sample data is as follows:

Class (DIH)   0    1    2    3    4    5    6    7   8   9   10   11  12  13  14  15
Sample2       15%  15%  15%  15%  10%  5%   4%   4%  4%  4%  4%   1%  1%  1%  1%  1%
Sample3       0%   20%  20%  15%  15%  5%   4%   4%  4%  4%  4%   1%  1%  1%  1%  1%
Sample4       0%   0%   20%  20%  20%  10%  5%   5%  5%  5%  5%   1%  1%  1%  1%  1%
Sample5       0%   0%   0%   20%  20%  20%  10%  10% 5%  5%  5%   1%  1%  1%  1%  1%

Parameters              Customized Sample2   Customized Sample3    Customized Sample4   Customized Sample5
Type of SVM             nu-svc, c-svc        c-svc                 c-svc                c-svc
Cost                    8                    2048                  32                   2048
Gamma                   0.5                  0.000488281           0.03125              0.00012207
No. of classes          16                   15                    14                   13
No. of features         11                   11                    11                   11
Best Grid Search Rate   22.34%               23.94%                24.86%               31.25%
Best Kernel Function    RBF                  Sigmoid               Polynomial           Polynomial
True Prediction         0,1,2,3,4            1,2,3,4,8,11,12,15    2,3,4                3,4

Inference: (i) The first thing we observe here is that accuracy has decreased steeply, meaning fewer DIH-0 predictions. (ii) In Customized Sample 2, we see that the most frequent true predictions are DIH 0, 1, 2, 3 and 4. Hence the classifier is not learning the pattern; it is only learning to predict the labels with the maximum number of rows in the training data. (iii) The Sigmoid and Polynomial kernel functions also showed better accuracy, yet it is still below 35%.

[Figure: accuracy (0%-60%) for Customized Samples 2-5 with the RBF, Polynomial, Linear and Sigmoid kernels.]


5.2.3 ANN Sample Data with LIBSVM: To compare the results with an ANN (Artificial Neural Network), a few common sample datasets were tested. The accuracy results are shown below.

Inference: The accuracy is comparatively higher than for the datasets tested so far, but the notable thing is that the avg. grid search rate is still very low, i.e. 56.89%. This is because the testing data taken here was highly biased: more than 97% of the rows had DIH 0 or 1. So the classifier has only predicted DIH 0 and 1 with greater accuracy.

5.2.4 Binary Classification with LIBSVM: As SVM is basically a binary classifier, it was tested with binary-class versions of the HPN Raw Data. The transformation from multi-class to binary class was done in the following two ways:
1. One against rest: e.g. class 0 vs. the rest of the classes, class 1 vs. the rest of the classes, and so on.
2. One against rest with omission: the samples of the lower DIHs were omitted here; e.g. the '1 vs. rest' binary dataset does not contain DIH 0.
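The one-against-rest relabeling described above can be sketched as follows (a minimal sketch; the optional `omit` set, our own naming, implements the 'with omission' variant):

```python
def one_vs_rest(labels, positive, omit=()):
    """Relabel multi-class labels for a binary 'positive vs. rest'
    problem: the positive class -> 1, everything else -> 0. Labels
    in `omit` are dropped entirely (the 'with omission' variant)."""
    return [1 if label == positive else 0
            for label in labels if label not in omit]

dih = [0, 0, 1, 2, 0, 1, 3]
print(one_vs_rest(dih, positive=1))            # [0, 0, 1, 0, 0, 1, 0]
print(one_vs_rest(dih, positive=1, omit={0}))  # [1, 0, 1, 0]
```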

[Figure (for 5.2.3): accuracy (91.5%-96.5%) for ANN Samples 1, 2 and 3.]

[Figure (for 5.2.4): accuracy (0%-100%) for the binary datasets 'class 0 & rest' through 'class 3 & rest', with and without omitting lower DIHs.]

Parameters (for 5.2.3)   Value
SVM Type                 c-svc
Kernel Type              RBF
No. of classes           16
No. of features          11
Avg. Grid Search         56.89%
True Prediction          Class 0, 1


Inference: Here the accuracy of the datasets without omitted class labels is higher than that of the omitted ones. But then again, the binary data without omitted classes consists of more than 90% 'rest'-class data rather than the respective class label. E.g. the 'class 3 vs. rest' dataset without omission has more than 95% data belonging to the rest of the classes; hence the prediction accuracy is higher, but it is again the default prediction. With 'class 0 vs. rest', all the true predictions were 0, as 90% of the data belongs to DIH 0.

5.2.5 Tests with All Claims: In order to compare and analyze the results, all the claims (excluding some because of the same type of data dispersion) were tested with LIBSVM in Rapid Miner. The results are shown below.

Inference:
1. All the true predictions here were DIH 0, as most claims resulted in the same cost and gamma pair after grid search (c = 0.03125, g = 0.0078125). And as we know, these parameters come into the picture when more than 80-85% of the DIH values are zero.
2. Manually tuning the SVM parameters does give predictions other than DIH 0, but then the prediction accuracy goes down and the true predictions are also not satisfactory.
3. The omitted data predicts DIH with more accuracy because it has slightly less diversion, owing to the small sets of omitted DIHs.
4. Sampled 3X is the dataset in which DIH 0 has three times as many rows as the next most frequent DIH, i.e. DIH 1. But its accuracy is lower because it predicts only DIH 0, since the data again has more DIH-0 rows.

[Figure: accuracy (0%-100%) for Whole Data, Omitted Data and Sampled 3X.]


6. Rapid Miner Interface of LIBSVM: There is a very user-friendly interface for LIBSVM in Rapid Miner. We need to follow these steps to run LIBSVM in Rapid Miner.

Step 1: Import the CSV file of the data to be tested.
Repository -> Import CSV File -> Select Comma Separator -> Select Name as Annotation -> Mark Days In Hospital as label and its type as text -> Save the file -> Finish.

Step 2: Drag and drop the saved data file from the repository, and Cross Validation from:
Operators -> Evaluation -> Validation -> X-Validation
Connect the circuits as shown, and input the number of cross validations, as shown in the above screenshot.

Step 3: Double-click on the blue squares in the Validation box dragged into the Main Process canvas. The canvas will be divided into two parts:
1: Training: In this part of the process, we need to drop in the classifier. Find the LIBSVM classifier as:
Operators -> Modeling -> Classification and Regression -> Support Vector Modeling -> Support Vector Machine (LibSVM).


2: Testing: In this part of the process, we add the Apply Model and Performance boxes. Apply Model can be found as:
Operators -> Modeling -> Model Application -> Confidences -> Apply Model.
Performance can be found as:
Operators -> Evaluation -> Performance Measurement -> % Performance.
Input the SVM parameters in the column box on the right side, as shown, and connect the connections as shown.

Step 4: Run the process by clicking the upper blue triangle, save the process in the repository, wait for the results and then save the results.

For more information regarding the usage of Rapid Miner, refer to the video tutorials at http://rapid-i.com/content/view/189/198/

Note: The results of the LIBSVM tool and of LIBSVM in Rapid Miner are the same, since they implement the same code and functions. We can refer to the description section of LIBSVM in Rapid Miner for clarification.


7. LIBLINEAR: Usage and Implementation

LIBLINEAR is a simple package for solving large-scale regularized linear classification. It currently supports L2-regularized logistic regression / L2-loss support vector classification / L1-loss support vector classification, and L1-regularized L2-loss support vector classification / logistic regression. This section explains the usage of LIBLINEAR. It implements one-against-rest multi-label classification.

When to use LIBLINEAR instead of LIBSVM:

For some large datasets, linear and nonlinear mappings give similar performance. Without using kernels, one can efficiently train a much larger set via a linear classifier. Such data usually have a large number of features; document classification is an example.

Warning: While LIBLINEAR is generally very fast, its default solver may be slow in certain situations (e.g., data not scaled, or C is large). See Appendix B of the LIBSVM guide on how to handle such cases: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Warning: If you are a beginner and your datasets are not large, you should consider LIBSVM first.

Installation: On Unix systems, type `make' to build the `train' and `predict' programs. Run ./train and ./predict without arguments to show their usage. On other systems, consult `Makefile' to build them (e.g., see 'Building Windows binaries' in that file) or use the pre-built binaries (Windows binaries are in the directory `windows').

`train' Usage:

Usage: ./train [options] training_set_file [model_file]

options:
-s type : set type of solver (default 1)
    0 -- L2-regularized logistic regression (primal)
    1 -- L2-regularized L2-loss support vector classification (dual)
    2 -- L2-regularized L2-loss support vector classification (primal)
    3 -- L2-regularized L1-loss support vector classification (dual)
    4 -- multi-class support vector classification by Crammer and Singer
    5 -- L1-regularized L2-loss support vector classification
    6 -- L1-regularized logistic regression
    7 -- L2-regularized logistic regression (dual)
-c cost : set the parameter C (default 1)
-e epsilon : set tolerance of termination criterion
    -s 0 and 2: |f'(w)|_2 <= eps*min(pos,neg)/l*|f'(w0)|_2,


    where f is the primal function and pos/neg are the numbers of positive/negative data (default 0.01)
    -s 1, 3, 4 and 7: dual maximal violation <= eps; similar to LIBSVM (default 0.1)
    -s 5 and 6: |f'(w)|_inf <= eps*min(pos,neg)/l*|f'(w0)|_inf, where f is the primal function (default 0.01)
-B bias : if bias >= 0, instance x becomes [x; bias]; if < 0, no bias term added (default -1)
-wi weight : weights adjust the parameter C of different classes (see README for details)
-v n : n-fold cross validation mode
-q : quiet mode (no outputs)

Option -v randomly splits the data into n parts and calculates cross-validation accuracy on them.

Formulations (where Q is a matrix with Q_ij = y_i y_j x_i^T x_j):

For L2-regularized logistic regression (-s 0), we solve
    min_w w^T w/2 + C \sum log(1 + exp(-y_i w^T x_i))

For L2-regularized L2-loss SVC dual (-s 1), we solve
    min_alpha 0.5 (alpha^T (Q + I/2/C) alpha) - e^T alpha
    s.t. 0 <= alpha_i

For L2-regularized L2-loss SVC primal (-s 2), we solve
    min_w w^T w/2 + C \sum max(0, 1 - y_i w^T x_i)^2

For L2-regularized L1-loss SVC dual (-s 3), we solve
    min_alpha 0.5 (alpha^T Q alpha) - e^T alpha
    s.t. 0 <= alpha_i <= C

For L1-regularized L2-loss SVC (-s 5), we solve
    min_w \sum |w_j| + C \sum max(0, 1 - y_i w^T x_i)^2

For L1-regularized logistic regression (-s 6), we solve
    min_w \sum |w_j| + C \sum log(1 + exp(-y_i w^T x_i))

For L2-regularized logistic regression dual (-s 7), we solve
    min_alpha 0.5 (alpha^T Q alpha) + \sum alpha_i log(alpha_i) + \sum (C - alpha_i) log(C - alpha_i) - a constant
    s.t. 0 <= alpha_i <= C


`predict' Usage:

Usage: ./predict [options] test_file model_file output_file

options:
-b probability_estimates : whether to predict probability estimates, 0 or 1 (default 0)

Examples:

> ./train data_file
Train a linear SVM with the L2-loss function.

> ./train -s 0 data_file
Train a logistic regression model.

> ./train -v 5 -e 0.001 data_file
Do five-fold cross-validation using the L2-loss SVM. Use a smaller stopping tolerance (0.001) than the default (0.1) if you want more accurate solutions.

> ./train -c 10 -w1 2 -w2 5 -w3 2 four_class_data_file
Train four classifiers:
positive   negative      Cp   Cn
class 1    class 2,3,4   20   10
class 2    class 1,3,4   50   10
class 3    class 1,2,4   20   10
class 4    class 1,2,3   10   10

> ./train -c 10 -w3 1 -w2 5 two_class_data_file
If there are only two classes, we train ONE model. The C values for the two classes are 10 and 50.

> ./predict -b 1 test_file data_file.model output_file
Output probability estimates (for logistic regression only).


8. Comparison of One-against-one and One-against-rest on HPN Data:

In order to compare the different multi-class classification approaches on the HPN Raw Data, one-against-one multi-label classification in LIBSVM was compared against one-against-rest multi-label classification in LIBLINEAR. The results are shown in the table below.

Inference: There is not much difference when implementing the one-against-rest approach for multi-label classification. But again, more tests could be run with all the different types of multi-class classification available in order to compare the results.
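The two schemes differ mainly in how many binary classifiers they train; for the 16 DIH classes used here, the counts can be checked quickly:

```python
def num_binary_classifiers(k):
    """Number of binary models trained for k classes:
    one-against-one (LIBSVM) pairs every two classes;
    one-against-rest (LIBLINEAR) trains one model per class."""
    return {"one_vs_one": k * (k - 1) // 2, "one_vs_rest": k}

print(num_binary_classifiers(16))  # {'one_vs_one': 120, 'one_vs_rest': 16}
```

So for this problem one-against-one trains 120 small pairwise models, while one-against-rest trains 16 models that each see the full (heavily imbalanced) dataset, which may partly explain why the accuracies below are so close.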

LIBSVM (one against one) vs. LIBLINEAR (one against rest):

Result Code          Sample Size   Cost      Gamma       Epsilon   Cross-fold Validation   Accuracy (1-vs-1)   Accuracy (1-vs-rest)
RD-Y1-1C-WholeData   7011          0.03125   0.0078125   0.01      10                      90.75%              90.68%
RD-Y1-2C-WholeData   6328          0.03125   0.0078125   0.01      10                      90.33%              90.33%
RD-Y1-3C-WholeData   5918          0.03125   0.0078125   0.01      10                      90.11%              90.10%
RD-Y1-4C-WholeData   5428          0.03125   0.0078125   0.01      10                      88.47%              88.49%


9. References

- C.-W. Hsu, C.-C. Chang and C.-J. Lin. A Practical Guide to Support Vector Classification.
- R.-E. Fan, P.-H. Chen and C.-J. Lin. Working Set Selection Using Second Order Information for Training SVM. Journal of Machine Learning Research 6, 1889-1918, 2005.
- A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, 13 (2002), 415-425.
- Christopher J.C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Bell Laboratories.
- Richard A. Wasniowski. Using Support Vector Machines in Data Mining. Computer Science Department, California State University Dominguez Hills.
- Dan Ventura. SVM Example. March 12, 2009.
- http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html - LIBSVM FAQ
- http://www.csie.ntu.edu.tw/~cjlin/liblinear/ - LIBLINEAR
- http://rapid-i.com/content/view/189/198/ - Rapid Miner tutorials
- http://www.dtreg.com/svm.htm - Online SVM tutorial
- http://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html - Online SVM guide by Stanford University