DATA MINING WITH WEKA

Term paper on Data Mining: How to Use Weka for Data Analysis

Submitted by: Shubham Gupta (10BM60085)

Vinod Gupta School of Management

The first technique we apply in Weka is classification. The data below describe the financial situation in Japan and were collected for the years 1970 to 2009. The columns represent:

1) BROAD: Broad money supplied in the economy

2) DOMC: Domestic consumption

3) PSC: Payment securities

4) CLAIMS: Represents the claims on the government.

5) TOTRES: Total Reserves

6) GDP: Gross domestic product

7) LIQLB: Liquid Liability

We want a decision tree that shows which values of the independent variables lead to which outcome. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP greater than, say, 3 trillion yen, that rule would help us make better decisions. We perform this analysis to generate a decision tree from which such rules can be read.
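
As a minimal sketch of what checking such a rule involves, the Python snippet below tests the illustrative rule from the paragraph above against five rows copied from the data table. The helper name check_rule and the selection of rows are assumptions made for illustration; Weka does not derive its tree this way.

```python
# Rows copied from the data table: (year, DOMC, PSC, GDP).
ROWS = [
    (1970, 134.25, 111.75, 205_995_000_000),
    (1986, 230.23, 156.30, 2_020_890_000_000),
    (1990, 259.15, 194.81, 3_058_040_000_000),
    (1995, 287.13, 203.90, 5_264_380_000_000),
    (2009, 320.53, 171.00, 5_032_980_000_000),
]


def check_rule(rows, domc_min=140.0, psc_min=150.3, gdp_min=3e12):
    """Test the rule 'DOMC > domc_min and PSC > psc_min => GDP > gdp_min'.

    Returns the number of rows the rule covers and the years in which a
    covered row does not reach gdp_min. (Hypothetical helper.)
    """
    covered = [r for r in rows if r[1] > domc_min and r[2] > psc_min]
    exceptions = [r[0] for r in covered if r[3] <= gdp_min]
    return len(covered), exceptions


covered, exceptions = check_rule(ROWS)
print(f"rule covers {covered} rows; exceptions in years: {exceptions}")
```

On these five rows the rule covers four and fails for 1986, which shows why a candidate rule has to be validated against the data before it is relied on.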

YEAR BROAD CLAIMS DOMC PSC TOTRES LIQLB GDP

1970 83.65 61.88 134.25 111.75 4876114550 104.73 205,995,000,000

1971 106.70 21.37 147.59 123.72 15469150615 118.21 232,681,000,000

1972 116.14 23.17 160.29 133.47 18932675966 129.03 308,137,000,000

1973 116.02 19.84 157.87 132.20 13723930639 126.07 418,640,000,000

1974 113.08 13.72 154.00 126.49 16551248298 120.50 464,705,000,000

1975 118.31 13.02 164.40 129.96 14910849997 127.56 505,317,000,000

1976 122.40 12.09 169.96 130.63 18590784646 131.20 567,926,000,000

1977 125.82 8.76 172.45 128.49 25907710023 133.90 698,968,000,000

1978 130.36 8.56 178.29 127.71 37824744320 139.12 982,078,000,000

1979 135.51 8.19 183.05 129.23 31926244737 142.67 1,022,190,000,000

1980 137.95 8.09 188.44 131.29 38918848626 144.30 1,071,000,000,000

1981 142.13 8.04 194.09 134.10 37839039769 150.03 1,183,790,000,000

1982 149.54 7.67 203.99 139.59 34403732201 156.18 1,100,410,000,000

1983 156.55 6.72 213.12 145.03 33844549531 162.92 1,200,190,000,000

1984 159.31 6.69 217.77 147.43 33898638541 165.34 1,275,560,000,000

1985 160.68 7.66 220.09 149.90 34641202378 167.41 1,364,160,000,000

1986 167.30 7.67 230.23 156.30 51727320082 174.65 2,020,890,000,000

1987 175.85 12.27 243.85 173.48 92701641597 183.77 2,448,670,000,000

1988 178.70 10.66 251.68 182.52 1.06668E+11 186.47 2,971,030,000,000

1989 182.62 10.13 258.13 190.28 93672771034 192.14 2,972,670,000,000

1990 184.06 8.46 259.15 194.81 87828362969 190.16 3,058,040,000,000

1991 184.35 5.20 257.54 195.40 80625855126 189.32 3,484,770,000,000

1992 187.89 4.16 265.33 199.63 79696644593 190.93 3,796,110,000,000

1993 193.97 1.33 274.00 202.14 1.07989E+11 198.16 4,350,010,000,000

1994 200.35 1.88 281.02 204.58 1.35146E+11 204.45 4,778,990,000,000

1995 205.79 1.26 287.13 203.90 1.9262E+11 209.90 5,264,380,000,000

1996 209.72 1.81 292.42 205.21 2.25594E+11 213.63 4,642,540,000,000

1997 215.31 6.47 276.47 217.76 2.26679E+11 221.38 4,261,840,000,000

1998 229.64 1.80 298.40 228.01 2.22443E+11 233.17 3,857,030,000,000

1999 239.91 -1.20 309.92 231.08 2.93948E+11 243.22 4,368,730,000,000

2000 242.24 -1.58 308.91 222.28 3.61639E+11 243.84 4,667,450,000,000

2001 225.31 -33.25 299.43 193.01 4.01958E+11 187.41 4,095,480,000,000

2002 207.79 -4.32 299.16 182.40 4.69618E+11 190.79 3,918,340,000,000

2003 209.70 -1.99 307.26 180.71 6.73554E+11 191.84 4,229,100,000,000

2004 207.51 -1.10 303.48 174.12 8.44667E+11 189.79 4,605,920,000,000

2005 207.24 1.79 312.85 182.87 8.46896E+11 189.30 4,552,200,000,000

2006 204.73 -0.14 304.96 179.99 8.95321E+11 186.06 4,362,590,000,000

2007 201.50 0.16 294.31 172.56 9.73297E+11 184.17 4,377,940,000,000

2008 207.14 0.76 295.42 165.48 1.03076E+12 189.52 4,879,860,000,000

2009 223.76 -1.12 320.53 171.00 1.04899E+12 206.13 5,032,980,000,000

Loading data in Weka is straightforward: in the Explorer, click "Open file..." on the Preprocess tab and select the data file.

Figure 1 Shows how to load data in Weka
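
Before the file can be opened, the table has to be saved in a format Weka reads, such as CSV. The sketch below, in Python, converts whitespace-separated rows like those in the table into CSV text and removes the thousands separators from the GDP column so that every field is a plain number. The function name to_csv is hypothetical and only two rows are included; it is one possible preparation step, not part of Weka.

```python
import csv
import io

HEADER = "YEAR BROAD CLAIMS DOMC PSC TOTRES LIQLB GDP".split()

# Two rows copied from the table, exactly as they appear in the text.
RAW_ROWS = [
    "1970 83.65 61.88 134.25 111.75 4876114550 104.73 205,995,000,000",
    "1988 178.70 10.66 251.68 182.52 1.06668E+11 186.47 2,971,030,000,000",
]


def to_csv(header, raw_rows):
    """Return CSV text for whitespace-separated rows.

    Commas used as thousands separators are removed first, otherwise
    they would be read as extra CSV fields.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for line in raw_rows:
        fields = [f.replace(",", "") for f in line.split()]
        if len(fields) != len(header):
            raise ValueError(f"unexpected field count in: {line!r}")
        for f in fields:
            float(f)  # raises ValueError if a field is not numeric
        writer.writerow(fields)
    return buf.getvalue()


print(to_csv(HEADER, RAW_ROWS))
```

The returned text can be written to a .csv file and opened from the Preprocess tab.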

Weka is used to classify the above data in order to find out how these economic factors could be adjusted or fixed so as to achieve 11% growth over the previous year's GDP.
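
The 11% target refers to growth over the previous year's GDP. As a small sketch under that reading, the Python code below computes year-on-year growth from GDP values copied from the table (1982 to 1987) and lists the years that reach the target. The function names and the 0.11 default are our assumptions for illustration.

```python
# GDP values copied from the table for 1982-1987.
GDP = {
    1982: 1_100_410_000_000,
    1983: 1_200_190_000_000,
    1984: 1_275_560_000_000,
    1985: 1_364_160_000_000,
    1986: 2_020_890_000_000,
    1987: 2_448_670_000_000,
}


def growth_rates(gdp_by_year):
    """Map each year (except the first) to its growth over the previous year."""
    years = sorted(gdp_by_year)
    return {
        year: gdp_by_year[year] / gdp_by_year[prev] - 1.0
        for prev, year in zip(years, years[1:])
    }


def years_meeting_target(gdp_by_year, target=0.11):
    """Return the years whose year-on-year growth is at least `target`."""
    return [y for y, g in growth_rates(gdp_by_year).items() if g >= target]


for year, rate in growth_rates(GDP).items():
    print(f"{year}: {rate:+.1%}")
print("target reached in:", years_meeting_target(GDP))
```

For these six years, only 1986 and 1987 reach the 11% threshold.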

Figure 2 Shows where to find the tree classifiers in Weka

The following shows the output obtained by running the above data in Weka. The classifier used to create the required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with some enhancements. It builds on M5Base, which implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later improved it. M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992). The output of the analysis is shown below:

=== Run information ===

Scheme: weka.classifiers.trees.M5P -M 4.0

Relation: Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1

Instances: 945

Attributes: 6

BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

M5 pruned model tree:

(Using smoothed linear models)

BROAD <= 153.045 : LM1 (13/5.644%)

BROAD > 153.045 :

| PSC <= 203.02 :

| | BROAD <= 177.275 : LM2 (5/0.653%)

| | BROAD > 177.275 :

| | | TOTRES <= 871108500000 : LM3 (11/8.309%)

| | | TOTRES > 871108500000 : LM4 (4/1.446%)

| PSC > 203.02 : LM5 (7/2.741%)

LM num: 1

LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

LM num: 2


LM num: 3


LM num: 4


LM num: 5


Number of Rules: 5

Time taken to build model: 0.08 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.9882

Mean absolute error 3.412

Root mean squared error 5.4145

Relative absolute error 11.529 %

Root relative squared error 15.1993 %

Total Number of Instances 40

Ignored Class Unknown Instances 905

Interpretation of the Results:

Based on the data above, the M5 algorithm generates a model tree made up of 5 linear models (LM). The first split is on Broad money in the economy: if BROAD is less than or equal to 153.045, we follow linear model 1 (LM 1) to estimate Liquidity in the economy. If BROAD > 153.045, we check PSC and move down the tree, choosing the corresponding model at each leaf to get the Liquidity, as shown in the output above. GDP is not among the six attributes of this run, so the tree itself estimates LIQLB rather than GDP.
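
The way an instance moves down the tree can be sketched in a few lines of Python. The thresholds and the LM 1 coefficients are copied from the output above; the function names are ours, and only LM 1 is evaluated because the coefficients of LM 2 to LM 5 are not reproduced in this paper. The TOTRES coefficient of LM 1 is printed as 0, so it is left out here as an approximation.

```python
def m5p_leaf(broad, psc, totres):
    """Return the linear model (leaf) that the printed tree selects."""
    if broad <= 153.045:
        return "LM1"
    if psc <= 203.02:
        if broad <= 177.275:
            return "LM2"
        return "LM3" if totres <= 871108500000 else "LM4"
    return "LM5"


def lm1(broad, psc):
    """Estimate LIQLB with LM 1 (TOTRES term omitted, printed as 0)."""
    return 0.7447 * broad + 0.1474 * psc + 22.3168


# 1970 row from the table: BROAD 83.65, PSC 111.75, TOTRES 4876114550.
leaf = m5p_leaf(83.65, 111.75, 4876114550)
print(leaf, round(lm1(83.65, 111.75), 2))  # the table lists LIQLB 104.73

# 2009 row: BROAD 223.76, PSC 171.00, TOTRES 1.04899E+12.
print(m5p_leaf(223.76, 171.00, 1.04899e12))
```

For the 1970 row the sketch selects LM 1 and estimates a LIQLB of about 101.08 against the recorded 104.73; the 2009 row is routed to LM 4.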

Linear Regression with Weka

The second technique is linear regression, run in Weka on the same data. When the outcome, or class, is numeric and all the attributes are numeric, linear regression is a natural technique to consider. The idea is to express the class as a linear combination of the attributes, with weights estimated from the training data. In the previous technique M5P fitted five linear models to the same data, one per leaf of the tree; a single linear regression fits one equation to all instances, and in the cross-validation results reported in this paper it performs somewhat worse than M5P (root mean squared error 8.0404 against 5.4145). From the same data we can also find a linear regression equation between the various parameters determining GDP. To run the regression, go to the Classify tab in Weka and choose LinearRegression under functions, as shown.

Figure 3 Shows where to find LR in Weka
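
To illustrate what it means to express the class as a linear combination of attributes, the Python sketch below fits a line with a single attribute by ordinary least squares, using BROAD and LIQLB values copied from five rows of the table. This is a simplified stand-in and not Weka's LinearRegression, which works on several attributes at once and applies its own options; the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# BROAD and LIQLB for 1970, 1980, 1990, 2000 and 2009 (from the table).
broad = [83.65, 137.95, 184.06, 242.24, 223.76]
liqlb = [104.73, 144.30, 190.16, 243.84, 206.13]

slope, intercept = fit_line(broad, liqlb)
print(f"LIQLB ~ {slope:.4f} * BROAD + {intercept:.4f}")
```

The fitted weights depend on which rows and attributes are used, so they will not match the equation Weka reports for the full data.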

The following output is generated by the above analysis:


Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: Copy of Data_Rudra

Instances: 945

Attributes: 7

YEAR

BROAD

CLAIMS

DOMC

PSC

TOTRES

LIQLB

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

LIQLB = 1.2523 * BROAD + 0.6062 * CLAIMS + -0.1407 * DOMC + 0 * TOTRES + -6.9705

Time taken to build model: 0.2 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.9738

Mean absolute error 4.8731

Root mean squared error 8.0404

Relative absolute error 16.4661 %

Root relative squared error 22.5707 %

Total Number of Instances 40

Ignored Class Unknown Instances 905

The above analysis gives us a linear mathematical relationship between the variables. The value of the dependent variable (LIQLB) can be computed once the values of the independent variables are known. The equation also tells us how the variables are related: a negative coefficient indicates an inverse relationship, in which the dependent variable falls as that attribute rises, and a positive coefficient indicates a direct one. To see the same relations in pictorial form, go to the Visualize tab in the Weka Explorer. The same is shown in the figure below.
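
As a worked example of using the regression equation reported in the output above, the Python sketch below evaluates it for the 1990 row of the table. The coefficients are copied from the Weka output; the function name is ours, and the TOTRES term is dropped because its coefficient is printed as 0.

```python
def predict_liqlb(broad, claims, domc):
    """Evaluate the reported regression equation (TOTRES term omitted)."""
    return 1.2523 * broad + 0.6062 * claims - 0.1407 * domc - 6.9705


# 1990 row: BROAD 184.06, CLAIMS 8.46, DOMC 259.15; the table lists LIQLB 190.16.
estimate = predict_liqlb(184.06, 8.46, 259.15)
print(round(estimate, 2))

# The negative DOMC coefficient: one extra unit of DOMC lowers the estimate.
print(round(predict_liqlb(184.06, 8.46, 260.15) - estimate, 4))
```

The estimate for 1990 is about 192.19, roughly two units above the recorded value, and raising DOMC by one unit lowers the estimate by 0.1407.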

CLUSTERING IN WEKA

Clustering is a technique that groups similar instances (rows), here in terms of Euclidean distance. We used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. SimpleKMeans clusters data using k-means; the number of clusters is specified by the user (2 in this run, through the -N option). The output of SimpleKMeans for the above data is shown below. The result is shown as a table whose rows are attribute names and whose columns correspond to cluster centroids; an additional column at the beginning shows the entire data set. The number of instances in each cluster appears in parentheses at the top of its column. Each table entry is either the mean or the mode of the corresponding attribute for the cluster in that column (the mean for numeric attributes, the mode for nominal ones). The bottom of the output shows the result of applying the learned cluster model. In this case, it assigned each training instance to one of the clusters, showing the same result as the parenthetical numbers at the top of each column. An alternative is to use a separate test set or a percentage split of the training data, in which case the figures would be different. This technique could also be applied to data from other countries in addition to the present data for Japan.
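
The idea behind k-means can be sketched in plain Python. The code below is a minimal version of the standard algorithm with Euclidean distance, run on BROAD and LIQLB values copied from six rows of the table. It is not Weka's SimpleKMeans: the starting centroids (the first and last row) are our assumption, and no attribute normalization or missing-value handling is done.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means starting from the given centroids.

    Returns (labels, centroids) once the assignment stops changing.
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# (BROAD, LIQLB) for 1970, 1975, 1980, 1995, 2000 and 2009.
points = [
    (83.65, 104.73), (118.31, 127.56), (137.95, 144.30),
    (205.79, 209.90), (242.24, 243.84), (223.76, 206.13),
]
labels, centroids = kmeans(points, [points[0], points[-1]])
print(labels)
print([tuple(round(v, 2) for v in c) for c in centroids])
```

On these six rows the three earlier years fall into one cluster and the three later years into the other, and each centroid is the mean of its members, which is how the centroid table in the Weka output is to be read.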


Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

Relation: Copy of Data_Rudra

Instances: 945

Attributes: 7

YEAR

BROAD

CLAIMS

DOMC

PSC

TOTRES

LIQLB

Test mode:evaluate on training data

=== Model and evaluation on training set ===

kMeans

======

Number of iterations: 5

Within cluster sum of squared errors: 12.988387913678944

Missing values globally replaced with mean/mode

Cluster centroids:

                                        Cluster#
Attribute            Full Data                   0               1
                         (945)               (929)            (16)
==================================================================
YEAR                    1989.5           1989.2933          2001.5
BROAD                 174.1633            173.4625        214.8525
CLAIMS                  6.6645              6.8103         -1.7981
DOMC                  242.2808            241.2956        299.4794
PSC                   168.2627            167.8077         194.685
TOTRES       248907476505.9463   243675387834.3592    552695625000
LIQLB                 175.2342            174.7166        205.2875

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 929 (98%)

1 16 (2%)

We can also visualize the clusters formed. Right-click the entry in the result list and select "Visualize cluster assignments". We get the following output:
