Term paper on Data Mining: How to Use Weka for Data Analysis
Submitted by: Shubham Gupta (10BM60085)
Vinod Gupta School of Management
The first technique we apply in Weka is classification. The data below describes the financial
situation in Japan and was collected for the years 1970-2009. The columns represent:
1) BROAD: Broad money supplied in the economy
2) DOMC: Domestic consumption
3) PSC: Payment securities
4) CLAIMS: Claims on the government
5) TOTRES: Total Reserves
6) GDP: Gross domestic product
7) LIQLB: Liquid Liability
We want a decision tree that tells us which values of the independent variables lead to which
outcome. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP greater than,
say, 3 trillion yen, that rule would help us make better decisions. We perform this analysis to
generate a decision tree that yields such rules.
YEAR BROAD CLAIMS DOMC PSC TOTRES LIQLB GDP
1970 83.65 61.88 134.25 111.75 4876114550 104.73 205,995,000,000
1971 106.70 21.37 147.59 123.72 15469150615 118.21 232,681,000,000
1972 116.14 23.17 160.29 133.47 18932675966 129.03 308,137,000,000
1973 116.02 19.84 157.87 132.20 13723930639 126.07 418,640,000,000
1974 113.08 13.72 154.00 126.49 16551248298 120.50 464,705,000,000
1975 118.31 13.02 164.40 129.96 14910849997 127.56 505,317,000,000
1976 122.40 12.09 169.96 130.63 18590784646 131.20 567,926,000,000
1977 125.82 8.76 172.45 128.49 25907710023 133.90 698,968,000,000
1978 130.36 8.56 178.29 127.71 37824744320 139.12 982,078,000,000
1979 135.51 8.19 183.05 129.23 31926244737 142.67 1,022,190,000,000
1980 137.95 8.09 188.44 131.29 38918848626 144.30 1,071,000,000,000
1981 142.13 8.04 194.09 134.10 37839039769 150.03 1,183,790,000,000
1982 149.54 7.67 203.99 139.59 34403732201 156.18 1,100,410,000,000
1983 156.55 6.72 213.12 145.03 33844549531 162.92 1,200,190,000,000
1984 159.31 6.69 217.77 147.43 33898638541 165.34 1,275,560,000,000
1985 160.68 7.66 220.09 149.90 34641202378 167.41 1,364,160,000,000
1986 167.30 7.67 230.23 156.30 51727320082 174.65 2,020,890,000,000
1987 175.85 12.27 243.85 173.48 92701641597 183.77 2,448,670,000,000
1988 178.70 10.66 251.68 182.52 1.06668E+11 186.47 2,971,030,000,000
1989 182.62 10.13 258.13 190.28 93672771034 192.14 2,972,670,000,000
1990 184.06 8.46 259.15 194.81 87828362969 190.16 3,058,040,000,000
1991 184.35 5.20 257.54 195.40 80625855126 189.32 3,484,770,000,000
1992 187.89 4.16 265.33 199.63 79696644593 190.93 3,796,110,000,000
1993 193.97 1.33 274.00 202.14 1.07989E+11 198.16 4,350,010,000,000
1994 200.35 1.88 281.02 204.58 1.35146E+11 204.45 4,778,990,000,000
1995 205.79 1.26 287.13 203.90 1.9262E+11 209.90 5,264,380,000,000
1996 209.72 1.81 292.42 205.21 2.25594E+11 213.63 4,642,540,000,000
1997 215.31 6.47 276.47 217.76 2.26679E+11 221.38 4,261,840,000,000
1998 229.64 1.80 298.40 228.01 2.22443E+11 233.17 3,857,030,000,000
1999 239.91 -1.20 309.92 231.08 2.93948E+11 243.22 4,368,730,000,000
2000 242.24 -1.58 308.91 222.28 3.61639E+11 243.84 4,667,450,000,000
2001 225.31 -33.25 299.43 193.01 4.01958E+11 187.41 4,095,480,000,000
2002 207.79 -4.32 299.16 182.40 4.69618E+11 190.79 3,918,340,000,000
2003 209.70 -1.99 307.26 180.71 6.73554E+11 191.84 4,229,100,000,000
2004 207.51 -1.10 303.48 174.12 8.44667E+11 189.79 4,605,920,000,000
2005 207.24 1.79 312.85 182.87 8.46896E+11 189.30 4,552,200,000,000
2006 204.73 -0.14 304.96 179.99 8.95321E+11 186.06 4,362,590,000,000
2007 201.50 0.16 294.31 172.56 9.73297E+11 184.17 4,377,940,000,000
2008 207.14 0.76 295.42 165.48 1.03076E+12 189.52 4,879,860,000,000
2009 223.76 -1.12 320.53 171.00 1.04899E+12 206.13 5,032,980,000,000
Loading data into Weka is straightforward: click the Open file option and give the location of the file.
Figure 1 Shows how to load data in Weka
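Before a file can be opened it has to be in a format Weka reads, such as CSV or Weka's own ARFF format. The following is a minimal sketch of writing a few rows of the table as ARFF text. The relation name japan_finance and the helper to_arff are hypothetical names chosen for illustration, and only the first three yearly rows are included.

```python
import io

# Column names follow the table above; YEAR and GDP are left out for brevity.
COLUMNS = ["BROAD", "CLAIMS", "DOMC", "PSC", "TOTRES", "LIQLB"]

# First three yearly rows (1970-1972) copied from the table.
ROWS = [
    [83.65, 61.88, 134.25, 111.75, 4876114550, 104.73],
    [106.70, 21.37, 147.59, 123.72, 15469150615, 118.21],
    [116.14, 23.17, 160.29, 133.47, 18932675966, 129.03],
]


def to_arff(relation, columns, rows):
    """Return the rows as ARFF text with every attribute declared numeric."""
    out = io.StringIO()
    out.write(f"@relation {relation}\n\n")
    for name in columns:
        out.write(f"@attribute {name} numeric\n")
    out.write("\n@data\n")
    for row in rows:
        out.write(",".join(str(value) for value in row) + "\n")
    return out.getvalue()


# "japan_finance" is a hypothetical relation name, not taken from the paper.
arff_text = to_arff("japan_finance", COLUMNS, ROWS)
print(arff_text)
```

Saving arff_text to a file with the .arff extension should give something the Open file dialog can load. The complete table would be written the same way.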
Weka is used to classify the above data in order to find out how these economic factors could be
modified or fixed so as to achieve 11% growth over the previous year's GDP.
Figure 2 Shows where to select the tree technique in Weka
The following shows the output from running the above data in Weka. The classifier used to create the
required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with some
enhancements. It builds on M5Base, which implements the base routines for generating M5 model trees
and rules. M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm,
which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992).
The output of the analysis is shown below:
=== Run information ===
Scheme: weka.classifiers.trees.M5P -M 4.0
Relation: Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1
Instances: 945
Attributes: 6
BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
M5 pruned model tree:
(Using smoothed linear models)
BROAD <= 153.045 : LM1 (13/5.644%)
BROAD > 153.045 :
| PSC <= 203.02 :
| | BROAD <= 177.275 : LM2 (5/0.653%)
| | BROAD > 177.275 :
| | | TOTRES <= 871108500000 : LM3 (11/8.309%)
| | | TOTRES > 871108500000 : LM4 (4/1.446%)
| PSC > 203.02 : LM5 (7/2.741%)
LM num: 1
LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168
LM num: 2
LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097
LM num: 3
LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87
LM num: 4
LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563
LM num: 5
LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517
Number of Rules: 5
Time taken to build model: 0.08 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.9882
Mean absolute error 3.412
Root mean squared error 5.4145
Relative absolute error 11.529 %
Root relative squared error 15.1993 %
Total Number of Instances 40
Ignored Class Unknown Instances 905
Interpretation of the Results:
Based on the data above, the M5 algorithm generates a model tree made up of 5 linear models (LM).
The tree first splits on the value of broad money in the economy: if BROAD is less than or equal to
153.045, we follow linear model 1 (LM 1) to estimate liquidity (LIQLB) in the economy. If
BROAD > 153.045, we check PSC and move down the tree, choosing the corresponding model at each
branch, to obtain the liquidity and, finally, GDP values as shown in the figure above.
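The traversal described above can be sketched as a small function. The thresholds and coefficients are copied from the printed tree and linear models. The function name predict_liqlb is hypothetical. The TOTRES term is dropped because its coefficient is printed as 0, which is an assumption of this sketch, so results are approximate.

```python
def predict_liqlb(broad, psc, totres):
    """Walk the pruned M5P tree printed above and apply the leaf's linear model.

    Returns (model_name, estimate). The TOTRES coefficient is printed as 0
    (rounded), so it is omitted here; this is an assumption of the sketch.
    """
    if broad <= 153.045:
        name, (a, b, c) = "LM1", (0.7447, 0.1474, 22.3168)
    elif psc > 203.02:
        name, (a, b, c) = "LM5", (0.7016, 0.2497, 15.2517)
    elif broad <= 177.275:
        name, (a, b, c) = "LM2", (0.586, 0.2788, 30.3097)
    elif totres <= 871108500000:
        name, (a, b, c) = "LM3", (0.4606, 0.2504, 58.87)
    else:
        name, (a, b, c) = "LM4", (0.4996, 0.2504, 50.7563)
    return name, a * broad + b * psc + c


# Rows taken from the table: (YEAR, BROAD, PSC, TOTRES, observed LIQLB).
samples = [
    (1970, 83.65, 111.75, 4876114550, 104.73),
    (1985, 160.68, 149.90, 34641202378, 167.41),
    (2001, 225.31, 193.01, 4.01958e11, 187.41),
    (2009, 223.76, 171.00, 1.04899e12, 206.13),
    (1995, 205.79, 203.90, 1.9262e11, 209.90),
]

for year, broad, psc, totres, observed in samples:
    name, estimate = predict_liqlb(broad, psc, totres)
    print(f"{year}: {name} estimate={estimate:.2f} observed={observed:.2f}")
```

Each sample row lands in a different leaf, so the loop exercises all five linear models once.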
Linear Regression with Weka
The second technique is linear regression, run in Weka on the same data. When the outcome, or class,
is numeric and all the attributes are numeric, linear regression is a natural technique to consider.
The idea is to express the class as a linear combination of the attributes, with weights estimated
from the training data. In the previous technique M5P fitted five linear models to different regions
of the same data, whereas linear regression fits a single equation to all of it; on this data the
single equation performs somewhat worse under cross-validation (root mean squared error 8.04 versus
5.41 for M5P). From the previous data, we can also find a linear regression equation relating the
various parameters that determine GDP. To run the regression, go to the Classify tab in Weka and
choose LinearRegression from functions, as shown.
Figure 3 Shows where to find LR in Weka
Following output is generated by the above analysis:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Copy of Data_Rudra
Instances: 945
Attributes: 7
YEAR
BROAD
CLAIMS
DOMC
PSC
TOTRES
LIQLB
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model
LIQLB = 1.2523 * BROAD + 0.6062 * CLAIMS + -0.1407 * DOMC + 0 * TOTRES + -6.9705
Time taken to build model: 0.2 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.9738
Mean absolute error 4.8731
Root mean squared error 8.0404
Relative absolute error 16.4661 %
Root relative squared error 22.5707 %
Total Number of Instances 40
Ignored Class Unknown Instances 905
The above analysis gives us a linear mathematical relationship between the variables. The value of
the dependent variable (LIQLB) can be computed once the values of the independent variables are
known. The equation also tells us how the variables are related: a negative coefficient indicates an
inverse relationship and a positive coefficient a direct one. To see the same relation in pictorial
form, simply go to the Visualize tab in the Weka Explorer. The same is shown in the figure below.
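As a rough illustration of how the fitted equation is used, the sketch below plugs one row of the table into the coefficients printed in the regression output and then shows the effect of the negative DOMC coefficient. The function name lr_predict is hypothetical, and the TOTRES term is omitted because its coefficient is printed as 0, which is an assumption of this sketch.

```python
def lr_predict(broad, claims, domc):
    """Apply the linear regression equation printed in the Weka output.

    The TOTRES term is left out because its coefficient is printed as 0
    (rounded); that is an assumption of this sketch.
    """
    return 1.2523 * broad + 0.6062 * claims - 0.1407 * domc - 6.9705


# 1990 row from the table: BROAD=184.06, CLAIMS=8.46, DOMC=259.15, LIQLB=190.16.
estimate_1990 = lr_predict(184.06, 8.46, 259.15)
print(f"1990 estimate: {estimate_1990:.2f} (observed 190.16)")

# Holding BROAD and CLAIMS fixed, raising DOMC by one unit lowers the estimate
# by the size of the DOMC coefficient, which is what the negative sign means.
delta = lr_predict(184.06, 8.46, 260.15) - estimate_1990
print(f"Change in estimate for +1 DOMC: {delta:.4f}")
```

The same check can be repeated for BROAD or CLAIMS, where the positive coefficients raise the estimate instead.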
Clustering in Weka
Clustering is a technique for grouping similar instances (rows), here in terms of Euclidean distance.
We used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. SimpleKMeans
clusters data using k-means, with the number of clusters set by a parameter (-N 2 in the scheme line
of the output). Some other Weka clusterers, such as EM, can instead decide the number of clusters
using cross-validation, in which case the number of folds is fixed at 10. The figure below shows the
output of SimpleKMeans for the above data. The result is shown as a table whose rows are attribute
names and whose columns correspond to cluster centroids; an additional column at the beginning shows
the entire data set. The number of instances in each cluster appears in parentheses at the top of its
column. Each table entry is either the mean or the mode of the corresponding attribute for the
cluster in that column. The bottom of the output shows the result of applying the learned cluster
model. In this case, it assigned each training instance to one of the clusters, showing the same
result as the parenthetical numbers at the top of each column. An alternative is to use a separate
test set or a percentage split of the training data, in which case the figures would be different.
Note that the run reports 945 instances although the table has 40 yearly rows; the classifier runs
above ignored 905 instances with an unknown class, which suggests blank rows in the input file. Since
missing values are replaced with the mean, these rows probably end up in cluster 0, so the cluster
sizes should be read with caution. This technique could also be used with data from other countries
in addition to the present data for Japan.
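To make the k-means procedure concrete, here is a minimal pure-Python sketch on eight rows of the table, using only BROAD and DOMC. It is not Weka's implementation: the initial centroids are fixed to the first and last rows for reproducibility (Weka picks them using a random seed), and the min-max scaling is an assumption intended to keep one attribute from dominating the Euclidean distance. The names min_max_scale and k_means are hypothetical.

```python
import math

# (YEAR, BROAD, DOMC) for eight years copied from the table.
DATA = [
    (1970, 83.65, 134.25),
    (1975, 118.31, 164.40),
    (1980, 137.95, 188.44),
    (1982, 149.54, 203.99),
    (1995, 205.79, 287.13),
    (2000, 242.24, 308.91),
    (2005, 207.24, 312.85),
    (2009, 223.76, 320.53),
]


def min_max_scale(points):
    """Rescale every coordinate to the range [0, 1]."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


points = min_max_scale([(broad, domc) for _, broad, domc in DATA])
# Simplification: start from the first and last rows instead of a random choice.
assignment, centroids = k_means(points, [points[0], points[-1]])

for (year, _, _), cluster in zip(DATA, assignment):
    print(year, "-> cluster", cluster)
```

With these eight rows the earlier years and the later years separate into two groups, which mirrors the way the cluster centroids in the Weka output differ in YEAR, BROAD and DOMC.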
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: Copy of Data_Rudra
Instances: 945
Attributes: 7
YEAR
BROAD
CLAIMS
DOMC
PSC
TOTRES
LIQLB
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 12.988387913678944
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(945) (929) (16)
=================================================================
YEAR 1989.5 1989.2933 2001.5
BROAD 174.1633 173.4625 214.8525
CLAIMS 6.6645 6.8103 -1.7981
DOMC 242.2808 241.2956 299.4794
PSC 168.2627 167.8077 194.685
TOTRES 248907476505.9463 243675387834.3592 552695625000
LIQLB 175.2342 174.7166 205.2875
Time taken to build model (full training data) : 0.14 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 929 (98%)
1 16 (2%)
We can also visualize the clusters formed. Right-click the entry in the result list and select
Visualize cluster assignments. We get the following output: