COMPARISON OF CLUSTERING BASED ON SEARCH ENGINE DATASET USING
TANAGRA AND WEKA
PREETI BANSAL & MD. EZAZ AHMED
Department of Computer Science, ITM University, Gurgaon, Haryana, India
ABSTRACT
This paper applies clustering to a dataset of factors that influence search engine optimization, using the
Tanagra and Weka data mining tools, and finds the values of the factors that affect relevance. As the
number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. To
a search engine, relevance means more than simply finding a page with the right words. In the early days of the web,
search engines did not go much further than this simplistic step, and their results suffered as a consequence. Over time,
engineers at the search engines devised better ways to find valuable results that searchers would appreciate.
Today, hundreds of factors influence relevance, some of which we discuss in this paper. Clustering is the
partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait
according to some defined distance measure. It can enable users to find relevant documents more easily and also help
them understand the different facets of a query submitted to a web search engine. We use the K-means and EM
clustering algorithms, compare the clustering results obtained in Weka and Tanagra, and find the values of
factors such as title length, number of backlinks, domain length, and keywords in title that affect search engine optimization.
KEYWORDS: Weka, Tanagra, K-Means, EM, Search Engine, Dataset
INTRODUCTION
With the rapid growth of network information resources, the result set returned by a search engine is very
large. Users have to filter the result list one by one to find the results they want [1]. According to surveys, users
generally read no more than five pages of results. How to extract valuable information quickly and efficiently
from massive network information, and how to organize the display of query results, have become
objectives that the information industry competes to research and develop [17], [18]. Data mining tools predict future trends
and behavior, allowing businesses to make proactive, knowledge-driven decisions [6]. Data mining tools can answer business
questions that traditionally were too time-consuming to resolve.
They scour databases for hidden patterns, finding predictive information that experts may miss because it lies
outside their expectations. Clustering is one of the key technologies [2], [8]. We used the K-means and EM clustering algorithms
in Weka and Tanagra. K-means seeks the centers that minimize the sum of squared distances from each point to its
nearest center. Related formulations include the Euclidean k-medians problem, in which the objective is to minimize the sum of
distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum
distance from every point to its closest center. The EM algorithm is used to find the maximum likelihood parameters of a
statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in
addition to unknown parameters and known data observations. That is, either there are missing values among the data, or
the model can be formulated more simply by assuming the existence of additional unobserved data points [20].
International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 2, Mar 2013, 113-122
© TJPRC Pvt. Ltd.
THE K-MEANS ALGORITHM
From a practical point of view, cluster analysis is one of the main tasks of data mining. It is now used in many
areas, such as knowledge discovery and pattern recognition. There are many clustering
algorithms, of which the most well-known is the K-means algorithm.
K-means clustering is a very popular algorithm that finds clusters in a dataset by iterative computation. It has
the advantages of being simple to implement and of finding at least a locally optimal clustering. The algorithm [3]
is composed of the following steps:
1. Initialize k cluster centers to be seed points. (These centers can be produced randomly or generated in other
ways.)
2. For each sample, find the nearest cluster center, put the sample in that cluster, and recompute the center of the
altered cluster (repeat n times, once per sample).
3. Examine all samples again and put each one in the cluster identified with the nearest center (do not recompute any
cluster centers). If the members of each cluster have not changed, stop. If they have changed, go to step 2.
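The iteration can be illustrated with a minimal Python sketch. It is not the Weka or Tanagra implementation: it uses the common batch (Lloyd-style) variant, which recomputes all centers after a full assignment pass rather than after each sample, and the function names, the `seed` parameter and the sample points are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Batch k-means: returns (centers, labels) for a list of tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Stop when no sample changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


if __name__ == "__main__":
    # Hypothetical two-dimensional samples forming two well-separated groups.
    sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(kmeans(sample, k=2))
```

Like the procedure described above, this sketch only guarantees a locally optimal clustering, so the outcome can depend on the initial centers.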
EM ALGORITHM
The EM (Expectation Maximization) algorithm is a general method for maximum likelihood estimation of model
parameters from "incomplete data". Incomplete data generally arise in two situations. In the first, limitations or
errors in the observation process leave the observed data incomplete. In the second, direct optimization of the
likelihood function of the parameters is very difficult, and introducing additional (hidden or missing) variables
makes the optimization easier; the original observations together with the additional data then compose the
"complete data", and the original observations naturally become the "incomplete data".
The basic principle of EM can be expressed as follows:
Y is the observed data, X = (Y, Z) is the complete data, Z is the missing data, and θ is the model parameter. The
posterior distribution p(θ|Y) of θ given Y is very complicated, which makes many statistical calculations difficult. If the
missing data Z were known, we would obtain a simpler augmented posterior distribution p(θ|Y, Z) of θ. We can use the
simplicity of p(θ|Y, Z) for statistical calculation and then return to examine and improve the assumption about Z, so a
complex maximization or sampling problem is transformed into a sequence of simple ones. The greatest advantages of the EM
algorithm are simplicity and stability, and its main purpose is to provide a simple iterative algorithm to calculate the
posterior mode [14].
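As a concrete illustration of alternating between the missing data Z and the parameters θ, the following Python sketch fits a mixture of two one-dimensional Gaussians, where Z is the unobserved cluster membership of each value. The model choice, the crude initialization, the variance floor and all names are assumptions made for illustration; this is not the EM component of Weka or Tanagra.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, variances).
    """
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Crude starting point (an assumption): extremes as means, shared variance.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing component
    return weights, means, variances


if __name__ == "__main__":
    # Hypothetical values drawn around two different levels.
    values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
    print(em_two_gaussians(values))
```

Unlike K-means, each value here belongs to every cluster with some probability; a hard clustering can be obtained by assigning each value to its most probable component.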
ORGANIZATION OF DATA
It is important for a search engine to maintain high-quality websites in its results, since this improves optimization.
According to a 2011 search engine report, the factors that affect ranking are as shown in Figure 1 [15]. We built a
database of 30 different websites [12], [13] with the following factors: length of title and keywords in title [18] from
page-level keyword usage, domain length from domain-level keyword usage, number of backlinks from page-level
keyword-agnostic features, and one discrete factor, top-rank website [21], [11].
Other tags, namely the title tag and the meta description tag, are of critical importance to SEO best
practices, but many other factors, such as relevant content [19] and paid advertisements, also affect relevance.
Here we discuss some of the on-page factors.
Figure 1: Dataset
Figure 2: Search Ranking Factors
Working with Weka on Dataset Using K-Means
Open Weka and click the Explorer option on the right side, then open the data file, which is in csv
or arff format, under the Preprocess tab. After we choose the Explorer option, the window appears as in the screenshot
in Figure 3 [5], which clearly indicates the Open file option. We click Open file and choose the data set. Although Weka
provides filters to accomplish preprocessing tasks, they are not necessary for clustering in Weka, because the Weka
SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes and automatically normalizes
numerical attributes when doing distance computations. The Preprocess tab lists all attributes that are present in the
dataset. We can select any that we want to include, or select all.
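To see why normalization inside the distance computation matters for attributes on very different scales, such as title length and number of backlinks, the following Python sketch rescales each attribute to the range [0, 1] before computing a Euclidean distance. The min-max form of the normalization, the function names and the sample rows are assumptions made for illustration, not a description of Weka's internals.

```python
import math


def attribute_ranges(rows):
    """Return a (min, max) pair for every column of a list of numeric rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalized_distance(a, b, ranges):
    """Euclidean distance after min-max scaling each attribute to [0, 1]."""
    total = 0.0
    for x, y, (low, high) in zip(a, b, ranges):
        span = high - low
        if span == 0:
            continue  # a constant attribute cannot separate instances
        total += ((x - low) / span - (y - low) / span) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical rows: (length of title, number of backlinks).
    rows = [(40, 20000), (60, 5000), (50, 12500)]
    ranges = attribute_ranges(rows)
    print(normalized_distance(rows[0], rows[1], ranges))
```

Without the rescaling, the backlink counts would dominate the distance and the title length would have almost no influence on the clusters.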
Figure 3: Opening Page
After this, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to
apply. We select SimpleKMeans; the screen appears as in Figure 4.
Figure 4: Select Algorithm
Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in Figure 5, for
editing the clustering parameters. In the pop-up window we enter the number of clusters and leave the value of
"seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment
of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned: when we
change the value of the seed, the result and the number of iterations may differ. In our runs, seed values such as
100, 1000, 500, and 300 gave the same result with different numbers of iterations, whereas values such as 150 and 175
could also change the result.
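The role of the seed can be sketched in a few lines of Python. The snippet only shows that a fixed seed makes the random choice of starting centers reproducible, while another seed may select different starting centers; it uses Python's own random number generator, so the seed values do not correspond to those of Weka, and the function name and sample rows are assumptions made for illustration.

```python
import random


def initial_centers(points, k, seed):
    """Pick k distinct starting centers, reproducibly for a given seed."""
    return random.Random(seed).sample(points, k)


if __name__ == "__main__":
    # Hypothetical rows: (length of title, keywords in title).
    rows = [(36, 3), (59, 5), (41, 3), (105, 13), (33, 4), (48, 2)]
    for seed in (100, 150, 175):
        print(seed, initial_centers(rows, 2, seed))
```

Because K-means converges to a local optimum that depends on these starting centers, recording the seed along with the result is what makes a run repeatable.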
Figure 5: Choose Parameters
Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster
Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the result set in the "Result
list" panel and view the results of clustering in a separate window.
The result window shows the centroid of each cluster as well as statistics on the number and percentage of
instances assigned to different clusters. Cluster centroids are the mean vectors for each cluster (so, each dimension value in
the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the
clusters. The result shows that cluster 0 contains 13 websites that are not top rank, with a title length of 59,
6638 backlinks, 5.3 keywords in title, and a domain length of 22.53; cluster 1 contains 16 websites that are top
rank, with a title length of 36.37, 19163 backlinks, 2.9 keywords in title, and a domain length of
29.81, as shown in Figure 6.
Figure 6: Result of Clustering in Weka
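The centroid values reported in the result window are per-cluster means of each attribute. A minimal Python sketch of that computation, using hypothetical rows and cluster labels rather than the paper's dataset, could look as follows.

```python
def centroids(rows, labels):
    """Mean vector of the rows in each cluster, keyed by cluster label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical rows: (length of title, keywords in title).
    rows = [(30, 2), (40, 4), (100, 12)]
    labels = [0, 0, 1]
    print(centroids(rows, labels))
```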
Working with Tanagra on Dataset Using K Means
Open Tanagra and then open the data file, which may be in txt, xls or arff format. After we open the file, the dataset
appears as in the screenshot in Figure 7, which clearly indicates the open file name research.xls. We then right-click
View dataset and choose View from the pop-up menu [7].
Figure 7: Opening Page
Now we select View dataset from the Data visualization tab and drag and drop it onto Dataset. Then we select
Define status from the Feature selection tab, drag and drop it onto Dataset, right-click Define status, and select
Parameters from the pop-up menu, as in Figure 8.
Figure 8: Selection of Input Parameters
Now select the attributes page rank, backlinks, length of title, keywords in title, and domain length as input and press
the OK button. From the Statistics tab we choose Univariate continuous stat and drag and drop it onto Define status 1. Using
the View command from the pop-up menu, we get results such as the Min and Max values. We want to standardize the variables
before performing the k‐means approach. The aim is to eliminate the discrepancy of scales between the variables. We add the
Standardize component (Feature construction tab) into the diagram. Then, we click on the View menu. In fact, this operation
is not necessary with Tanagra, which can automatically standardize the variables within the K‐Means component [9].
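Standardization replaces each value by its distance from the attribute mean, measured in standard deviations, so that all attributes end up with mean 0 and unit variance. The Python sketch below shows the idea; the use of the population standard deviation (division by n), the function name and the sample values are assumptions made for illustration and may differ from what the Standardize component does internally.

```python
import math


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]  # a constant attribute carries no scale
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    # Hypothetical backlink counts on a much larger scale than title lengths.
    print(standardize([2511, 6638, 14908, 19163]))
```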
We again select Define status from the Feature selection tab, drag and drop it onto Dataset, right-click
Define status, select Parameters from the pop-up menu, and select all the standardized variables, such as
std_length of title_1. We insert the K‐Means component under the Clustering tab. We click on the Parameters contextual
menu and set the parameters as in Figure 9.
Figure 9: Define the Number of Clusters
We ask for a partitioning into two groups. It is not necessary to normalize the distance because we already use
standardized variables. We validate and click on the View menu. This gives the TSS, WSS and centroids of the clusters.
Tanagra automatically computes and adds a new column to the current dataset. We can visualize it with the View Dataset
component (Data visualization tab). We again insert the Define Status component after the K-Means component to distinguish
the clusters. We set the computed column cluster k means_1 as Target and the variables as Input, as in Figure 10. Then we
add the Group Characterization component under the Statistics tab to get the final result. The result shows that in
cluster 1 the websites are top rank, with a title length of 41.54, 14908.42 backlinks, 3.42 keywords in title, and a
domain length of 27.46; in cluster 2 the websites are not top rank, with a title length of 105, 2511 backlinks,
13 keywords in title, and a domain length of 18, as shown in Figure 11.
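The TSS and WSS figures mentioned above are the total and within-cluster sums of squares: TSS measures the spread of all rows around the overall mean, and WSS measures the spread of the rows around their own cluster centroid. A small WSS relative to TSS indicates compact clusters. The Python sketch below computes both under these standard definitions; the function names and sample data are assumptions made for illustration, not Tanagra output.

```python
def mean_vector(rows):
    """Column-wise mean of a list of numeric tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def sum_of_squares(rows, center):
    """Sum of squared Euclidean distances from each row to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(row, center)) for row in rows)


def tss_wss(rows, labels):
    """Return (TSS, WSS) for the rows under the given cluster labels."""
    tss = sum_of_squares(rows, mean_vector(rows))
    wss = 0.0
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        wss += sum_of_squares(members, mean_vector(members))
    return tss, wss


if __name__ == "__main__":
    # Hypothetical one-attribute rows in two clusters.
    rows = [(0.0,), (2.0,), (10.0,), (12.0,)]
    labels = [1, 1, 2, 2]
    tss, wss = tss_wss(rows, labels)
    print(tss, wss, 1 - wss / tss)  # the last value is the share of variation explained
```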
Figure 10: Selection of Discrete Attribute
Figure 11: Final Result
Working with Weka on Dataset Using EM
Open Weka and click the Explorer option on the right side, then open the data file, which is in csv
or arff format [4], [5], under the Preprocess tab. The starting steps are the same as described above for K-means. After
this, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We
select EM; the screen appears as in Figure 12.
Figure 12: Selection of Algorithm
Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster
Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the result set in the "Result
list" panel and view the results of clustering in a separate window. The result shows that there are 11 websites in
cluster 0, 14 websites in cluster 1, 1 website in cluster 2, and 3 websites in cluster 3. The websites in cluster 1
are top rank, with a title length of 33, 3941 backlinks, 3.44 keywords in title, and a domain length of
18.61, as shown in Figure 13.
Figure 13: Final Result
Working with Tanagra on Dataset Using EM
Open Tanagra and then open the data file, which may be in txt, xls or arff format. After we open the file, the dataset
appears as shown above in the screenshot in Figure 7. We select Define status from the Feature selection tab, drag and
drop it onto Dataset, right-click Define status, and select Parameters from the pop-up menu, as in Figure 14. We select
the attributes backlinks, length of title, keywords in title, and domain length as input and press the OK button [16].
After this, drag and drop EM-Clustering from the Clustering tab under Define status. Then right-click EM-Clustering, set
the number of clusters to 2, and press OK, as in Figure 15. Then drag and drop EM-Selection under EM-Clustering and set
the start and end values; since we want 2 clusters, the start value is 1 and the end value is 2. To see the output,
double-click EM-Clustering and check the values of the factors in the clusters, as shown in Figure 16.
Figure 14: Selection of Input Parameters
Figure 15: Specify Number of Clusters
Figure 16: Final Result
The result shows that in cluster 1 the websites are top rank, with a title length of 41, 14908 backlinks,
3.42 keywords in title, and a domain length of 27.46; in cluster 2 the websites are not top rank, with a
title length of 105, 2511 backlinks, 13 keywords in title, and a domain length of 18.
Comparison of K Means and EM in Tanagra and Weka
Weka
In Weka, the top-rank websites have a title length of 36.37 using K-means and 33.62 using EM; the number of
backlinks is 19143 using K-means and 3941.4 using EM; the number of keywords in title is 2.93 using K-means and
3.42 using EM; and the domain length is 29.81 using K-means and 18.16 using EM. The graphical comparison is shown
in Figure 17 and Figure 18.
Tanagra
In Tanagra, the top-rank websites have a title length of 41.54 using K-means and 41.5 using EM; the number of
backlinks is 14908.42 using K-means and 14908.9 using EM; the number of keywords in title is 3.42 using K-means and
3.42 using EM; and the domain length is 27.46 using K-means and 27.41 using EM. The graphical comparison is shown in
Figure 19 and Figure 20. Because the values are nearly the same, the lines in Figure 19 overlap and appear as a
single colour.
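The agreement between the two algorithms in each tool can be quantified with a relative difference per factor. The Python sketch below uses the K-means and EM values for the top-rank cluster quoted in this comparison; the Tanagra K-means title length is taken as 41.54, the value reported in the Tanagra K-means results. The choice of relative difference as the measure and all names are assumptions made for illustration.

```python
# (K-means value, EM value) for the top-rank cluster, as reported in the text.
WEKA = {
    "length of title": (36.37, 33.62),
    "backlinks": (19143, 3941.4),
    "keywords in title": (2.93, 3.42),
    "domain length": (29.81, 18.16),
}
TANAGRA = {
    "length of title": (41.54, 41.5),
    "backlinks": (14908.42, 14908.9),
    "keywords in title": (3.42, 3.42),
    "domain length": (27.46, 27.41),
}


def relative_difference(a, b):
    """Absolute difference scaled by the larger magnitude (0 means identical)."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def largest_gap(results):
    """Largest relative K-means/EM difference over all factors."""
    return max(relative_difference(k, e) for k, e in results.values())


if __name__ == "__main__":
    for name, results in (("Weka", WEKA), ("Tanagra", TANAGRA)):
        for factor, (k_value, em_value) in results.items():
            print(name, factor, round(relative_difference(k_value, em_value), 4))
```

On these figures every Tanagra factor differs by less than 1 percent between the two algorithms, while the Weka backlink counts differ by more than half.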
CONCLUSIONS
As we continue to contend with the huge volume of data on the web, we should find many more influencing
factors in order to optimize results. From this comparison we conclude that Tanagra is a better tool than Weka for this
dataset, as its results for the two algorithms, K-means and EM, are nearly the same [10]. In future we will work on
social signals, as this feature has great importance in optimization.
Figure 17: Comparison of Optimized Factors in Weka
Figure 18: Comparison of Backlinks in Weka
Figure 19: Comparison of Optimized Factors in Tanagra
Figure 20: Comparison of Backlinks in Tanagra
ACKNOWLEDGEMENTS
We thank Mrs. Latika Singh for her invaluable comments and suggestions to improve the manuscript.
REFERENCES
1. A Document Clustering Algorithm for Web Search Engine Retrieval System, Hongwei Yang School of Software
Yunnan University, Kunming 650021, China;
2. S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D. Thesis, Tufts University,
Medford MA, September 2001.
3. Wang Jun, OuYang Zheng-Zheng, "The Research of K-Means Clustering Algorithm Based on Association Rules".
4. http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html.
5. http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf.
6. http://thesai.org/Downloads/Volume3No4/Paper_20Knowledge_Discovery_in_Health_Care_Datasets
Using_Data_Mining_Tools.pdf.
7. Tanagra: An Evaluation. Jessica Enright Jonathan Klippenstein.
8. C. Romero, S. Ventura "Educational data Mining: A Survey from 1995 to 2005", Expert System with
Applications (33), pp. 135-146, 2007.
9. http://eric.univ-lyon2.fr/~ricco/tanagra/fichier/ tanagra_etles autres_KMeans.pdf.
10. R. Kannan, S. Vempala, and Adrian Vetta, "On Clusterings Good, Bad, and Spectral", Proc. of the 41st Foundations
of Computer Science, Redondo Beach, 2000.5.
11. http://klageswebdesign.com/seo-blog/2011/06/search-ranking-factors-released-what-you-need-to-know/.
12. http://www.backlinkswatch.com.
13. http://www.submitexpress.com/cgi-bin/analyzer/meta.pl.
14. http://cptra.ln.edu.hk/~mlwong/conference/isda2002.pdf
15. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04381759
16. http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/ en- Tanagra_EM_Clustering.pdf
17. http://www.seomoz.org/beginners-guide-to-seo.
18. http://www.ieee.org/about/webteam/resources/search_optimization.html.
19. http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35291
20. http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm.
21. http://www.seomoz.org/article/search-ranking-factors