
COMPARISON OF CLUSTERING BASED ON SEARCH ENGINE DATASET USING TANAGRA AND WEKA

PREETI BANSAL & MD. EZAZ AHMED

Department of Computer Science, ITM University, Gurgaon, Haryana, India

ABSTRACT

This paper applies clustering to a dataset of factors that influence search engine optimization, using the Tanagra and Weka data mining tools, and finds the values of the factors that affect relevance. As the number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. To a search engine, relevance means more than simply finding a page with the right words. In the early days of the web, search engines did not go much further than this simplistic step, and their results suffered as a consequence. Over time, engineers at the search engines devised better ways to find valuable results that searchers would appreciate and enjoy.

Today, hundreds of factors influence relevance, some of which we discuss in this paper. Clustering is the classification of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait according to some defined distance measure. It can enable users to find relevant documents more easily and also help them form an understanding of the different facets of the query submitted to a web search engine. We use the K-means and EM clustering algorithms, compare the clustering results obtained in Weka and Tanagra, and find the values of factors such as title length, number of backlinks, Domain length and keywords in title that affect search engine optimization.

KEYWORDS: Weka, Tanagra, K-Means, EM, Search Engine, Dataset

INTRODUCTION

With the rapid growth of network information resources, the result set returned by a search engine is very large. Users have to filter the results list one by one to get the results they want [1]. According to surveys, users generally read no more than five pages of results. How to extract valuable information from massive network information quickly and efficiently, and how to organize the display of query results, have become objectives that the information industry competes to research and develop [17], [18]. Data mining tools predict future trends and behavior, allowing businesses to make proactive, knowledge-driven decisions [6]. They can answer business questions that traditionally were too time-consuming to resolve.

They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Clustering is one of the key technologies [2], [8]. We used the K-means and EM clustering algorithms in Weka and Tanagra. K-means seeks centers that minimize the sum of squared distances from each point to its nearest center. Closely related problems include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. The EM algorithm is used to find the maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either there are missing values among the data, or the model can be formulated more simply by assuming the existence of additional unobserved data points [20].

International Journal of Computer Science Engineering

and Information Technology Research (IJCSEITR)

ISSN 2249-6831

Vol. 3, Issue 2, Mar 2013, 113-122

© TJPRC Pvt. Ltd.


THE K-MEANS ALGORITHM

From a practical point of view, cluster analysis is one of the main tasks of data mining. It is now used in many areas, such as knowledge discovery and pattern recognition. Among the many cluster analysis algorithms, the best known is the K-means algorithm.

K-means clustering is a very popular algorithm that finds the clusters in a dataset by iterative computation. Its advantages are that it is simple to implement and that it finds at least a locally optimal clustering. The algorithm [3] is composed of the following steps:

1. Initialize k cluster centers as seed points. (These centers can be produced randomly or generated in other ways.)

2. For each sample, find the nearest cluster center, put the sample in that cluster and recompute the center of the altered cluster (repeat n times).

3. Examine all samples again and put each one in the cluster identified with the nearest center (do not recompute any cluster centers). If the members of each cluster have not changed, stop. If they have changed, go to step 2.
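
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by Weka or Tanagra: it uses the common batch simplification in which centers are recomputed once per pass rather than after every single sample, and the function names, the default seed and the `max_rounds` limit are assumptions made for the example.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(samples, k, seed=10, max_rounds=100):
    """Batch K-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # step 1: k samples act as seed points
    labels = [None] * len(samples)
    for _ in range(max_rounds):
        # Step 3: put every sample in the cluster of its nearest center.
        new_labels = [nearest(p, centers) for p in samples]
        if new_labels == labels:  # membership unchanged: stop
            break
        labels = new_labels
        # Step 2 (batch form): recompute the center of every non-empty cluster.
        for j in range(k):
            members = [p for p, lab in zip(samples, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return centers, labels
```

Calling `k_means(samples, k=2)` on a list of attribute tuples returns the final centers and one cluster label per sample; different `seed` values may give different local optima.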

EM ALGORITHM

The EM (Expectation-Maximization) algorithm is a general method for computing maximum likelihood estimates of model parameters from "incomplete data". Incomplete data arise in two situations. In the first, limitations or errors in the observation process leave part of the data missing. In the second, directly optimizing the likelihood function of the parameters is very difficult, so additional (hidden or missing) variables are introduced to make the optimization easier; the original observations together with the additional data then form the "complete data", and the original observations alone become the "incomplete data".

The basic principle of EM can be expressed as follows:

Let Y be the observed data, Z the missing data, X = (Y, Z) the complete data, and θ the model parameters. The posterior distribution p(θ | Y) of θ given Y is complicated, which makes many statistical calculations difficult. If the missing data Z were known, we would obtain a simpler augmented posterior distribution p(θ | Y, Z), whose simplicity we can exploit for statistical calculation. We can then return to examine and improve the assumption about Z, and in this way transform a complex maximization or sampling problem into a series of simple ones. The greatest advantages of the EM algorithm are simplicity and stability, and its main purpose is to provide a simple iterative algorithm for calculating the posterior mode [14].
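
To make the alternation between guessing Z and re-estimating θ concrete, here is a minimal sketch for a hypothetical two-component, one-dimensional Gaussian mixture, in which Z is the unobserved component label of each observation. It is the maximum likelihood form of EM; the starting values, the iteration count and the function names are assumptions made for the example and do not reproduce the Weka or Tanagra implementations.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(y, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weight, mu, sigma)."""
    # Crude starting guess (an assumption): means at the extremes of the data.
    mu = [min(y), max(y)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each observation belongs to component 1,
        # given the current parameters (the current "assumption about Z").
        resp = []
        for x in y:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate the parameters as responsibility-weighted averages.
        for j in (0, 1):
            r = [rj if j == 1 else 1.0 - rj for rj in resp]
            total = sum(r)
            weight[j] = total / len(y)
            mu[j] = sum(ri * x for ri, x in zip(r, y)) / total
            var = sum(ri * (x - mu[j]) ** 2 for ri, x in zip(r, y)) / total
            sigma[j] = max(math.sqrt(var), 1e-3)  # floor avoids a zero variance
    return weight, mu, sigma
```

Unlike K-means, each observation contributes to both components in proportion to its responsibility, so cluster membership is soft rather than hard.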

ORGANIZATION OF DATA

It is important for a search engine to maintain high-quality websites, as this improves optimization. According to a 2011 search engine ranking report, the factors that affect ranking are those shown in Figure 2 [15]. We built a dataset of 30 different websites [12], [13] (Figure 1) with the following factors: length of title and keywords in title [18], taken from page-level keyword usage; Domain length, taken from domain-level keyword usage; number of backlinks, taken from page-level keyword-agnostic features; and one discrete factor, top-rank website [21], [11].

It is true that other tags, namely the title tag and the meta description tag, are of critical importance to SEO best practices, but many other factors, such as relevant content [19] and paid advertisements, also affect relevance. Here we discuss some of the on-page factors.


Figure 1: Dataset

Figure 2: Search Ranking Factors

Working with Weka on Dataset Using K-Means

Open Weka and click the Explorer option on the right side, then open the data file, in csv or arff format, under the Preprocess tab. The Explorer window appears as in the screenshot in Figure 3 [5], which shows the Open file option. We click Open file and choose the dataset. Although Weka provides filters for preprocessing tasks, they are not necessary for clustering in Weka, because the Weka SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes and automatically normalizes numerical attributes when computing distances. The Preprocess tab lists all attributes present in the dataset; we can select the ones we want to include, or select all.
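
The effect of normalizing numerical attributes during distance computation can be illustrated as follows. The sketch assumes min-max scaling of each attribute to the range [0, 1], which is our reading of how such normalization is commonly done; the function names and the sample rows are hypothetical.

```python
import math


def min_max_ranges(rows):
    """(min, max) of every attribute (column) in a list of numeric tuples."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalized_distance(a, b, ranges):
    """Euclidean distance after scaling every attribute to [0, 1]."""
    total = 0.0
    for x, y, (lo, hi) in zip(a, b, ranges):
        span = hi - lo
        if span == 0:  # constant attribute: contributes nothing
            continue
        total += ((x - lo) / span - (y - lo) / span) ** 2
    return math.sqrt(total)
```

Without such scaling, an attribute measured in thousands (such as the number of backlinks) would dominate an attribute measured in tens (such as the length of title).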

Figure 3: Opening Page


Next, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We select SimpleKMeans; the screen appears as in Figure 4.

Figure 4: Select Algorithm

Next, click the text box to the right of the "Choose" button to open the pop-up window shown in Figure 5 for editing the clustering parameters. In the pop-up window we enter the number of clusters and leave the value of "seed" as is. The seed value is used to generate a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned: when we change the seed, the result may differ and the number of iterations may differ as well. With seed values such as 100, 300, 500 and 1000 we observed no difference in the result, although the number of iterations differed; with values such as 150 and 175 the result may also differ.
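
The role of the seed can be illustrated with a small sketch. It assumes that the initial centers are drawn at random from the samples, which is one common initialization; the function name and the data are hypothetical, and the sketch does not reproduce Weka's own random number generator.

```python
import random


def initial_centers(samples, k, seed):
    """Draw k distinct samples as starting centers, reproducibly for a given seed."""
    return random.Random(seed).sample(samples, k)
```

The same seed always yields the same starting centers, so a run is repeatable; a different seed may yield different starting centers and therefore a different number of iterations or a different final clustering.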

Figure 5: Choose Parameters

Once the options have been specified, we can run the clustering algorithm. We make sure that the "Use training set" option is selected in the "Cluster Mode" panel, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window.

The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the clusters. The result shows that cluster 0 contains 13 websites that are not top rank, with a title length of 59, 6638 backlinks, 5.3 keywords in title and a Domain length of 22.53, while cluster 1 contains 16 websites that are top rank, with a title length of 36.37, 19163 backlinks, 2.9 keywords in title and a Domain length of 29.81, as shown in Figure 6.
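
The statistics in such a result window (cluster sizes, percentages and centroids) can be computed from the cluster labels as in the following sketch. The function name, the layout of the returned dictionary and the data used to call it are assumptions made for the example.

```python
from collections import defaultdict


def summarize(rows, labels):
    """Per-cluster count, percentage of instances and centroid (mean vector)."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in groups.items():
        summary[label] = {
            "count": len(members),
            "percent": 100.0 * len(members) / len(rows),
            # Each centroid component is the mean of that attribute in the cluster.
            "centroid": [sum(col) / len(members) for col in zip(*members)],
        }
    return summary
```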


Figure 6: Result of Clustering in Weka

Working with Tanagra on Dataset Using K-Means

Open Tanagra, and then open the data file, which may be in txt, xls or arff format. Once the file is opened, the dataset appears as in the screenshot in Figure 7, which shows the name of the opened file, research.xls. We then right-click View dataset and choose View from the pop-up menu [7].

Figure 7: Opening Page

Now we select View dataset from the Data visualization tab and drag and drop it onto Dataset. We then select Define status from the Feature selection tab, drag and drop it onto Dataset, right-click Define status and select Parameters from the pop-up menu, as in Figure 8.

Figure 8: Selection of Input Parameters


Now select page rank, backlinks, length of title, keywords in title and Domain length as input attributes and press the OK button. From the Statistics tab we choose Univariate continuous stat and drag and drop it onto Define status 1. Using the View command from the pop-up menu, we obtain results such as the Min and Max values. We want to standardize the variables before performing the k-means approach; the aim is to eliminate the discrepancy of scales between the variables. We add the Standardize component (Feature construction tab) to the diagram and click on the View menu. In fact, this operation is not strictly necessary with Tanagra, which can automatically standardize the variables within the K-Means component [9].
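
Standardization of a variable can be sketched as follows. The example assumes the usual z-score form (subtract the mean, divide by the standard deviation) and uses the population standard deviation; whether Tanagra divides by the population or the sample standard deviation is not stated here, so that choice is an assumption, as is the function name.

```python
import statistics


def standardize(values):
    """Z-scores: (value - mean) / standard deviation, for one variable."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]
```

After this transformation every variable has mean 0 and standard deviation 1, so the number of backlinks no longer outweighs the length of title in the distance computation.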

We again select Define status from the Feature selection tab, drag and drop it onto Dataset, right-click Define status, select Parameters from the pop-up menu and select all the standardized variables, such as std_length of title_1. We insert the K-Means component from the Clustering tab and click on the Parameters contextual menu. We set the parameters as in Figure 9.

Figure 9: Define the Number of Clusters

We ask for a partitioning into two groups. It is not necessary to normalize the distance because we already use standardized variables. We validate and click on the View menu. This gives the TSS (total sum of squares), the WSS (within-cluster sum of squares) and the centroids of the clusters. Tanagra automatically computes and adds a new column to the current dataset, which we can visualize with the View Dataset component (Data visualization tab). We again insert the Define Status component under the K-Means component to distinguish the clusters. We set the computed column cluster k means_1 as Target and select the input attributes, as in Figure 10. Then we add the Group Characterization component from the Statistics tab to get the final result. The result shows that in cluster 1 the websites are top rank, with a title length of 41.54, 14908.42 backlinks, 3.42 keywords in title and a Domain length of 27.46, while in cluster 2 the websites are not top rank, with a title length of 105, 2511 backlinks, 13 keywords in title and a Domain length of 18, as shown in Figure 11.
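
The two sums of squares reported with a K-means result can be computed as in the sketch below, assuming the usual definitions: TSS is the sum of squared distances of all points to the overall mean, and WSS is the sum of squared distances of each point to the centroid of its own cluster. The function names and the data are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def squared_distance(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tss_wss(points, labels):
    """Total and within-cluster sums of squares for a given partition."""
    grand_mean = mean(points)
    tss = sum(squared_distance(p, grand_mean) for p in points)
    wss = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        centroid = mean(members)
        wss += sum(squared_distance(p, centroid) for p in members)
    return tss, wss
```

The ratio 1 - WSS/TSS is the share of the total variation explained by the partition; values close to 1 indicate compact, well-separated clusters.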

Figure 10: Selection of Discrete Attribute


Figure 11: Final Result

Working with Weka on Dataset Using EM

Open Weka and click the Explorer option on the right side, then open the data file, in csv or arff format, under the Preprocess tab [4], [5]. These starting steps are the same as those described above for K-means. Next, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We select EM; the screen appears as in Figure 12.

Figure 12: Selection of Algorithm

Once the options have been specified, we can run the clustering algorithm. We make sure that the "Use training set" option is selected in the "Cluster Mode" panel, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. The result shows that there are 11 websites in cluster 0, 14 websites in cluster 1, 1 website in cluster 2 and 3 websites in cluster 3. The websites in cluster 1 are top rank, with a title length of 33, 3941 backlinks, 3.44 keywords in title and a Domain length of 18.61, as shown in Figure 13.

Figure 13: Final Result


Working with Tanagra on Dataset Using EM

Open Tanagra, and then open the data file, which may be in txt, xls or arff format. Once the file is opened, the dataset appears as in the screenshot in Figure 7 above. We select Define status from the Feature selection tab, drag and drop it onto Dataset, right-click Define status and select Parameters from the pop-up menu, as in Figure 14. We then select the attributes backlinks, length of title, keywords in title and Domain length as input and press the OK button [16]. After this, we drag EM-Clustering from the Clustering tab and drop it under Define status. We right-click EM-Clustering, set the number of clusters to 2 and press OK, as in Figure 15. We then drag and drop EM-Selection under EM-Clustering and select the start and end values; since we want to make 2 clusters, the start value is 1 and the end value is 2. To see the output, double-click EM-Clustering and check the values of the factors in the clusters, as shown in Figure 16.

Figure 14: Selection of Input Parameters

Figure 15: Specify Number of Clusters

Figure 16: Final Result


The result shows that in cluster 1 the websites are top rank, with a title length of 41, 14908 backlinks, 3.42 keywords in title and a Domain length of 27.46, while in cluster 2 the websites are not top rank, with a title length of 105, 2511 backlinks, 13 keywords in title and a Domain length of 18.

Comparison of K Means and EM in Tanagra and Weka

Weka

In Weka, the top-rank websites have a title length of 36.37 using K-means and 33.62 using EM; the number of backlinks is 19143 using K-means and 3941.4 using EM; the number of keywords in title is 2.93 using K-means and 3.42 using EM; and the Domain length is 29.81 using K-means and 18.16 using EM. The graphical comparison is shown in Figure 17 and Figure 18.

Tanagra

In Tanagra, the top-rank websites have a title length of 41.54 using K-means and 41.5 using EM; the number of backlinks is 14908.42 using K-means and 14908.9 using EM; the number of keywords in title is 3.42 using both K-means and EM; and the Domain length is 27.46 using K-means and 27.41 using EM. The graphical comparison is shown in Figure 19 and Figure 20. The values are nearly the same, which is why the lines in Figure 19 overlap and appear in one colour.
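
The agreement between the two algorithms in each tool can be quantified with a simple relative gap. The sketch below uses the values reported in this comparison for the top-rank cluster; for the Tanagra K-means title length it uses 41.54, as reported in the Tanagra K-means results. The measure itself (absolute difference divided by the larger value) and the function names are assumptions chosen for illustration.

```python
# (K-means value, EM value) for the top-rank cluster, as reported in the text.
weka = {
    "length_of_title": (36.37, 33.62),
    "backlinks": (19143, 3941.4),
    "keywords_in_title": (2.93, 3.42),
    "domain_length": (29.81, 18.16),
}
tanagra = {
    "length_of_title": (41.54, 41.5),  # 41.54 from the Tanagra K-means results
    "backlinks": (14908.42, 14908.9),
    "keywords_in_title": (3.42, 3.42),
    "domain_length": (27.46, 27.41),
}


def relative_gap(a, b):
    """Absolute difference relative to the larger of the two magnitudes."""
    return abs(a - b) / max(abs(a), abs(b))


def largest_gap(results):
    """Largest relative gap between K-means and EM over all factors."""
    return max(relative_gap(a, b) for a, b in results.values())
```

Evaluating `largest_gap(tanagra)` and `largest_gap(weka)` shows that the two algorithms differ by well under one percent on every factor in Tanagra, while in Weka the backlinks value differs by more than half.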

CONCLUSIONS

As we continue to struggle with the huge volume of data on the web, many more influencing factors should be identified in order to optimize results. From this comparison we conclude that Tanagra is the better tool in comparison with Weka, because its results for the two algorithms, K-means and EM, are nearly the same [10]. In future work we will study social signals, as this feature has great importance in optimization.

Figure 17: Comparison of Optimized Factors in Weka

Figure 18: Comparison of Backlinks in Weka

Figure 19: Comparison of Optimized Factors in Tanagra

Figure 20: Comparison of Backlinks in Tanagra


ACKNOWLEDGEMENTS

We thank Mrs. Latika Singh for her invaluable comments and suggestions to improve the manuscript.

REFERENCES

1. Hongwei Yang, A Document Clustering Algorithm for Web Search Engine Retrieval System, School of Software, Yunnan University, Kunming 650021, China.

2. S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D. Thesis, Tufts University, Medford MA, September 2001.

3. Wang Jun, OuYang Zheng-Zheng, "The Research of K-Means Clustering Algorithm Based on Association Rules".

4. http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html.

5. http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf.

6. http://thesai.org/Downloads/Volume3No4/Paper_20Knowledge_Discovery_in_Health_Care_Datasets

Using_Data_Mining_Tools.pdf.

7. Tanagra: An Evaluation. Jessica Enright Jonathan Klippenstein.

8. C. Romero, S. Ventura, "Educational Data Mining: A Survey from 1995 to 2005", Expert Systems with Applications (33), pp. 135-146, 2007.

9. http://eric.univ-lyon2.fr/~ricco/tanagra/fichier/ tanagra_etles autres_KMeans.pdf.

10. R. Kannan, S. Vempala, and Adrian Vetta, "On Clusterings: Good, Bad, and Spectral", Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.5.

11. http://klageswebdesign.com/seo-blog/2011/06/search-ranking-factors-released-what-you-need-to-know/.

12. http://www.backlinkswatch.com.

13. http://www.submitexpress.com/cgi-bin/analyzer/meta.pl.

14. http://cptra.ln.edu.hk/~mlwong/conference/isda2002.pdf

15. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04381759

16. http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/ en- Tanagra_EM_Clustering.pdf

17. http://www.seomoz.org/beginners-guide-to-seo.

18. http://www.ieee.org/about/webteam/resources/search_optimization.html.

19. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35291

20. http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm.

21. http://www.seomoz.org/article/search-ranking-factors