Cluster Based K-NN Model for Information Retrieval of Text Documents
B. Nalayini1, P. Prabhu2
1M.Phil Research Scholar, Department of Computer Applications, Alagappa University, Karaikudi, Tamilnadu, India.
2Assistant Professor - DDE, Department of Computer Applications, Alagappa University, Karaikudi, Tamilnadu, India.
Mail id: [email protected]
Abstract—Text mining optimization of search engine keyword queries is an important process that increases both search engine performance and the relevance of search results. A major issue in text mining is feature extraction in large documents. Various algorithms have been proposed in the literature, but traditional methods still face challenges, issues and limitations and need improvement. Hence, in this research work a cluster based k-NN model for information retrieval is proposed. This approach first discovers similar documents among text documents to identify the most informative contents of the documents, and then utilizes the identified contents to extract useful features for text mining. This work implements a hierarchical clustering algorithm and the K-nearest neighbor algorithm over the k closest training samples to identify the best matching documents. The algorithm outperforms traditional methods when tested on real-world datasets.
Keywords: Information retrieval, clustering, classification, data mining, neural networks, frequent patterns.
I. INTRODUCTION
The past twenty years have seen our economy make a transition into the information age. Nowadays, computers, data and information in many industries have become the basis for decision making [2,3,9]. Companies are collecting very large amounts of information about their customers, products, employees, manufacturing processes, distribution processes, and marketing processes. This information can be used to build predictive models to guide future decision making.
The machine learning field has continued to evolve in the
academic communities. New concepts, new algorithms,
computer structures and systems have been devised and
applied to real-world scientific and business problems.
These ideas are transitioned to industry in the form of new
products. They have also become the basis for developing
entire new businesses. Through experience, an understanding
has developed that data mining is a step in a larger knowledge
discovery process.
A systematic approach has evolved to take raw data and convert it into information, and to take information and transform it into knowledge that helps solve real business problems. It is now understood that the larger process requires the careful use of different computer technologies at different stages. Raw operational data [12,17,19] can be transformed into predictive models that support meeting foremost business objectives. Data mining plays a critical role in the overall process. As the field of machine learning has developed, there has been a constant evolution of early Artificial Intelligence (AI) technologies from the 1970s and 1980s, such as expert systems and neural computing, into mature products. These products, correctly used, can be effectively deployed in the business environment.
Text mining is the discovery of interesting knowledge in text documents. A challenging issue is to find accurate knowledge in text documents that helps users find what is needed. Many applications, such as market analysis and business management, can benefit from the use of the information and knowledge extracted from a large amount of data. The knowledge discovery process [11,12] can successfully use and update discovered patterns and apply them to the field of text mining. Data mining is therefore an essential step in the process of knowledge discovery in databases: it comprises the methods of the knowledge discovery process and represents the modeling phase, that is, the application of methods and algorithms for the calculation of search patterns [10] or models.

International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 16149-16154. ISSN: 1314-3395 (on-line version). URL: http://www.ijpam.eu. Special Issue.
II. RELATED WORKS
Research in any field needs literature related to the research field. This section discusses various concepts related to information retrieval.
Prabhu et al. [1] discussed work on access and retrieval of information. Current research areas within the field of IR include searching and querying, ranking of search results, navigating and browsing information, optimizing information representation and storage, and document classification and clustering. In information retrieval, the process of manually categorizing the pages of an electronic/website document is often tedious and expensive. Document clustering has thus often been used to automatically categorize a search result into clusters. In that paper, two partition-based clustering algorithms, k-means and spherical k-means, are applied to the document collections taken from the Reuters-21578 and 20 Newsgroups datasets, and their performance is tested. Over various runs of clustering, the spherical k-means algorithm performs better than k-means for document clustering.
Ning Zhong et al. [2] presented an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. The proposed technique uses these two processes to refine the discovered patterns in text documents. Experiments are carried out on the RCV1 data collection and TREC topics to demonstrate that the proposed solution achieves encouraging performance. Related techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining. In conclusion, an effective pattern discovery technique has been proposed to overcome the low-frequency and misinterpretation problems in text mining.
Baeza-Yates et al. [3] presented an interaction cycle for retrieving information consisting of steps such as query specification, receipt and examination of retrieval results, and then either stopping or reformulating the query and repeating the process until a satisfactory result set is found. The simplest interaction model contains an underlying assumption that the user's information need is static and that the information seeking process is one of successively refining a query until it retrieves all and only those documents relevant to the original information need.
Pascal Soucy et al. [4] proposed a simple KNN algorithm for text categorization. This method performs an aggressive feature selection, where feature selection means selecting a subset of all available features. The KNN algorithm provides a solid ground for text categorization in large document sets with diverse vocabulary. In this method, each text document is called an instance. In order to categorize texts using KNN, each example document X is represented as a vector of length |F|, the size of the vocabulary. In this approach, distance is treated as a basis to weight the contribution of each of the k neighbors in the class assignment process. With text documents, text categorization may involve thousands of features, most of them irrelevant. It is very difficult to identify the relevant features for text categorization because of the close interaction between the features. The feature selection method used here aggressively reduces the vocabulary size using feature interaction. This simple KNN algorithm can reach impressive results using only a few features.
Songbo Tan [5] analyzed a neighbor-weighted K-nearest neighbor (NWKNN) classifier for unbalanced text corpora. In unbalanced text corpora, the majority class tends to have more examples in the K-neighbor set for each test document. If the conventional KNN technique is used to classify a test document, it tends to assign the label of the majority class to the document. Hence, the large category tends to have higher classification accuracy, while the minority class tends to have low classification accuracy, so the overall performance of KNN is degraded. The NWKNN algorithm assigns a large weight to neighbors from small classes and a small weight to neighbors from large categories. For each test document d, K neighbors are first selected among the training documents contained in K* categories. The test document d is then assigned the class with the highest resulting weighted sum, as in traditional KNN. NWKNN yields much better performance than KNN and is an effective algorithm for unbalanced text classification problems.
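The neighbour-weighting idea can be illustrated with a short sketch. Note this is only inspired by NWKNN: the exact weight formula in Tan's paper differs, and the per-class weight used here, one over the square root of the class size, is an assumption for illustration.

```python
from collections import defaultdict

def nwknn_score(neighbours, class_sizes, exponent=0.5):
    """Neighbour-weighted voting sketch: each neighbour's vote is divided
    by its class size raised to `exponent`, so neighbours from small
    classes contribute more. `neighbours` is the list of class labels of
    the k nearest training documents."""
    scores = defaultdict(float)
    for label in neighbours:
        scores[label] += 1.0 / (class_sizes[label] ** exponent)
    return max(scores, key=scores.get)

# Two "big" neighbours are outvoted by one "small" neighbour because the
# small class carries a larger per-vote weight.
class_sizes = {"big": 900, "small": 100}
print(nwknn_score(["big", "big", "small"], class_sizes))  # small
```

This shows how down-weighting the majority class counteracts its dominance in the K-neighbor set.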
Zhou Yong et al. [6] proposed an improved KNN text classification algorithm based on clustering, which does not use all training samples as the traditional KNN algorithm does and which can overcome the defect of unevenly distributed training samples that may cause a multi-peak effect. This improved algorithm first uses a sample austerity technique to remove border samples. Afterwards, it processes all training categories with the k-means clustering algorithm to obtain cluster centers, which are used as the new training samples, and then introduces a weight value for each cluster center to indicate its relative importance. Finally, the revised weighted samples are used in the algorithm.
Leena et al. [7] proposed a multistage feature selection model for document classification using information gain and rough sets. In the proposed model, features of less importance are ignored, so the dimensionality of the feature space is reduced, and the computational time and complexity of the method are also reduced. At each stage, classifier performance is evaluated in terms of precision, recall and F-measure. To analyze the effectiveness and accuracy of the proposed model, experiments are performed using the KNN and Naive Bayes classifiers.
Femi Joseph et al. [8] described a text categorization technique using the K-Nearest Neighbor algorithm. The text categorization technique based on an improved K Nearest Neighbor algorithm is the most appropriate one due to its minimal time and computational requirements. In this technique, clustering is a great tool to discover the complex distribution of the training texts. It uses a constrained one-pass clustering algorithm to obtain the category relationship under the constrained condition that each cluster contains only one label. The resulting clusters reflect the complex distributions of the training texts better than the original text samples, integrating the advantages of constrained one-pass clustering and the Weight Adjusted KNN approach.
III. PREPROCESSING
Data in the real world are large, ugly, unrelated and noisy. To clean such a dataset there is a need to pre-process it. The steps of data selection, transformation, stop word removal, word stemming, building of an inverted index and TF*IDF calculation are performed during preprocessing of text documents. The steps are explained as follows.
A. Data Selection
In the selection step the important data are selected or created. Henceforward the KDD [8,9,10] process is carried out on the gathered target data. Only applicable information is selected, along with metadata or data that represent environment knowledge.
B. Data Transformation
Data transformation is preprocessing of the document. It consists of removing irrelevant data from documents. The transformation phase may produce a number of different formats, since different data mining tools may require different formats. The data are also reduced, manually or automatically. The reduction can be made via lossless aggregation [18,20] or a lossy selection of only the most important elements. A representative selection can be used to draw conclusions about the entire data.
C. Stopword Removal
Stop words are words which are filtered out before or after processing of natural language data (text). Though stop words generally refer to the most common words in a language, there is no single common list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing stop words in order to support phrase search.
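A minimal sketch of this step is shown below; since the paper notes there is no single standard stop list, the list here is a small illustrative sample, not any particular tool's list.

```python
# Small illustrative stop list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the quick brown fox is in the garden".split()
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'garden']
```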
D. Word Stemming
Stemming is the process of treating different words as variations of a root or stem. For example, the terms addicted, addicting, addiction, addictions, addictive, and addicts might be conflated to their stem, addict.
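The idea can be sketched with a crude suffix-stripping stemmer; this is illustration only, and a real system would use the full Porter algorithm rather than this assumed suffix list.

```python
# Assumed suffix list for illustration; the Porter stemmer uses a far
# richer set of rules.
SUFFIXES = ("ions", "ing", "ive", "ion", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 4 chars."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

for w in ["addicted", "addicting", "addiction", "addictions", "addictive", "addicts"]:
    print(crude_stem(w))  # each prints "addict"
```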
E. Building Inverted Index
For every term in the lexicon, the inverted index file contains an inverted file entry that stores a list of pointers to all occurrences of that term in the main text, where each pointer is, in effect, the number of a document in which the term appears.
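The structure just described can be built in a few lines; the tiny corpus below is illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["data mining", "text mining", "information retrieval"]
index = build_inverted_index(docs)
print(index["mining"])  # [0, 1]
```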
F. TF*IDF Calculation
Term frequency and inverse document frequency (TF*IDF) is a weight often used in text mining and information retrieval. It is a measure of how significant a word is to a document in a collection. Term frequency is the number of times a word occurs in a document. Inverse document frequency is the logarithm of the total number of documents divided by the number of documents in which the word occurs, so words that appear in many documents receive a low weight.
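A minimal sketch of the calculation, using raw term frequency and the common logarithmic IDF (many weighting variants exist; this is one standard choice, and the tiny corpus is illustrative):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """tf*idf with raw term frequency and log inverse document frequency."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["data", "mining"], ["text", "mining"], ["information", "retrieval"]]
# "mining" appears in 2 of 3 documents: tf = 1, idf = log(3/2) ≈ 0.405
print(round(tf_idf("mining", corpus[0], corpus), 3))  # 0.405
```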
IV. METHODOLOGY
In this phase a model is constructed using k-NN classification to perform information retrieval. The first step is to take the text document and split it into paragraphs. The document is then preprocessed, its support counts are calculated and compared with the minimum support value, and the frequent patterns which satisfy the minimum support count are computed. The Euclidean distance of each document is calculated and compared with the distances of the other documents. Finally, the documents are classified using the k-NN classifier to identify the best matches among the documents.
A. Frequent Patterns:
Once pre-processing is completed, the data set is ready for processing. Using the pre-processed dataset, the frequent patterns and their support counts are calculated. The frequent patterns which satisfy the minimum support count are taken for classification.
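The paper does not fix the exact form of a pattern; as an illustrative assumption, the sketch below counts co-occurring term pairs per paragraph and keeps those whose absolute support meets the minimum threshold.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(paragraphs, min_sup):
    """Count term-pair patterns across paragraphs and keep those whose
    absolute support (number of paragraphs containing the pair) is at
    least min_sup."""
    counts = Counter()
    for para in paragraphs:
        terms = sorted(set(para.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_sup}

paras = ["data mining methods", "data mining tools", "text retrieval"]
print(frequent_patterns(paras, min_sup=2))  # {('data', 'mining'): 2}
```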
Fig. 1. Architecture of k-NN-IR.
B. k-NN Classification
The k-Nearest Neighbors algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training samples in the feature space.
Algorithm: k-NN-IR
Input: Text document dataset (TDD), search query q
Output: Set of frequent patterns fp; best matching document cluster
Begin
  Split the documents (TDD) into paragraphs p
  for each document
    Preprocess each paragraph p
    Calculate the absolute and relative support
    Get the minimum support value min_sup
    Compute the frequent patterns fp
    if (support count ≥ min_sup)
      Calculate the Euclidean distance
      Classify using the k-NN classifier
      Compare the distances between the documents and q
      Identify the best matching document
    End if
  End for
End
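The distance and voting steps of the algorithm can be sketched as follows; the feature vectors and labels are purely illustrative, and the majority vote stands in for the best-match identification step above.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(query_vec, training, k=3):
    """training: list of (feature_vector, label) pairs. Returns the
    majority label among the k nearest neighbours of query_vec."""
    neighbours = sorted(training, key=lambda tv: euclidean(query_vec, tv[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative 2-feature document vectors with class labels.
training = [([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"), ([0.0, 1.0], "politics")]
print(knn_classify([0.8, 0.2], training, k=3))  # sports
```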
This algorithm has been experimentally implemented and
evaluated using real-world datasets.
V. RESULTS AND DISCUSSION
An extensive experimental evaluation comparing the effectiveness of the proposed model with other schemes is performed. In the following, we describe the organization of these experiments in detail.
A. Benchmark Datasets and Evaluation Metrics
In this research work, the Reuters-21578 and 20 Newsgroups datasets collected from the UCI machine learning repository are used. Reuters consists of 21,578 articles. 20 Newsgroups is a set of 18,828 Usenet posts segmented across 20 discussion groups. For each dataset, all documents were pre-processed by removing special characters, numbers and stop words using a predefined list. The common Porter stemmer is then applied to the remaining words. The model is implemented to extract patterns and evaluated using precision, recall and the F1 measure.
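These metrics can be computed from raw counts as follows; the counts shown are illustrative, not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard IR metrics from true positive, false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 relevant documents retrieved, 2 irrelevant retrieved,
# 2 relevant missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```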
B. Experimental Results
The effectiveness of classification is tested by varying the term weighting scheme on several real-world datasets. Classification with a varying number of features p selected per category is tested on each dataset. Figure 2 represents the pattern mining of text documents.
Figure 2. Pattern mining of text documents
Figure 3 represents the k-NN classification of the text documents.
Figure 3. k-nn classification of text documents
The performance of different term weighting methods in terms of precision, recall and F1 measure on the previously described datasets is tested. The conventional PDM model [15] achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.57, 0.48, 0.29, 0.50 and 0.44 respectively. The conventional IPM model [15] achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.81, 0.46, 0.61, 0.45 and 0.58 respectively. The proposed k-NN-IR model achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.85, 0.51, 0.69, 0.58 and 0.63 respectively. In general, the proposed k-NN-IR model achieved the top results on all datasets.
VI. CONCLUSION
This research presents a cluster based k-NN model for classification of documents for information retrieval. First the dataset is preprocessed and then modeled. The results show a close competition with our model; in particular, the best results were obtained across the different datasets and feature selection algorithms. The empirical results show that the proposed technique is effective. As future work, this idea can be applied to larger datasets with different classifiers to further improve performance.
REFERENCES
[1] Prabhu, P., Jeyshankar, R., "Evaluating Performance of Partition Based Document Clustering Algorithms for Information Retrieval", International Journal of Mathematical Archive, 2011.
[2] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu, "Effective Pattern Discovery for Text Mining", IEEE Transactions, vol. 24, no. 1, Jan. 2012.
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[4] Pascal Soucy, Guy W. Mineau, "A Simple KNN Algorithm for Text Categorization", Proceedings of the International Conference on Data Mining, 2001, pp. 647-648.
[5] Tan, Songbo, "An Effective Refinement Strategy for KNN Text Classifier", Expert Systems with Applications, 30.2 (2006): 290-298.
[6] Zhou, Yong, Youwen Li, and Shixiong Xia, "An Improved KNN Text Classification Algorithm Based on Clustering", Journal of Computers, 4.3 (2009): 230-237.
[7] Leena H. Patil, Mohammed Atique, "A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set", (IJARAI) International Journal of Advanced Research in Artificial Intelligence, vol. 3, no. 11, 2014.
[8] P. Keerthana, B.G. Geetha, P. Kanmani, "Crustose Using Shape Features and Color Histogram with K-Nearest Neighbour Classifiers", International Journal of Innovations in Scientific and Engineering Research (IJISER), vol. 4, no. 9, pp. 199-203, 2017.
[9] Femi Joseph, Nithin and Ramakrishnan, "Text Categorization Using Improved K Nearest Neighbor Algorithm", International Journal for Trends in Engineering & Technology, vol. 4, issue 2, April 2015, ISSN: 2349-9303.
[10] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders, "Word-Sequence Kernels", J. Machine Learning Research, vol. 3, pp. 1059-1082, 2003.
[11] M.F. Caropreso, S. Matwin, and F. Sebastiani, "Statistical Phrases in Automated Text Categorization", Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell'Informazione, 2000.
[12] L.P. Jing, H.K. Huang, and H.B. Shi, "Improved Feature Selection Approach TF*IDF in Text Mining", International Conference on Machine Learning and Cybernetics, 2002.
[13] H. Ahonen-Myka, "Discovery of Frequent Word Sequences in Text", Proceedings of Pattern Detection and Discovery, pp. 180-189, 2002.
[14] Asmeeta Mali, "Spam Detection Using Bayesian with Pattern Discovery", International Journal of Recent Technology and Engineering (IJRTE), 2:139-143, 2013.
[15] M. Popovic and P. Willett, "The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data", JASIS, 43(5):191-203, 1993.
[16] Minakshi R. Shinde, S. A. Kinariwala, "Information Retrieval in Text Mining Using Pattern Based Approach", International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, vol. 4, issue 10, October 2015.