Cluster Based K-NN Model for Information Retrieval of Text Documents
B. Nalayini1, P. Prabhu2
1M.Phil Research Scholar, Department of Computer Applications, Alagappa University, Karaikudi, Tamilnadu, India.
2Assistant Professor - DDE, Department of Computer Applications, Alagappa University, Karaikudi, Tamilnadu, India.
Mail id: [email protected]
Abstract—Text mining optimization of search engine keyword queries is an important process that increases both search engine performance and the relevance of search results. A major issue in text mining is feature extraction in large documents. Various algorithms have been proposed in the literature, but traditional methods still face challenges, issues and limitations and need improvement. Hence, in this research work a cluster based k-NN model for information retrieval is proposed. This approach first discovers similar documents among text documents to identify the most informative contents of the documents, and then utilizes the identified contents to extract useful features for text mining. This work implements a hierarchical clustering algorithm and the K-nearest neighbor algorithm over the k closest training samples to identify the best matching documents. The algorithm outperforms traditional methods when tested on real-world datasets.
Keywords: Information retrieval, clustering, classification, data mining, neural networks, frequent patterns.
I. INTRODUCTION
The past twenty years have seen our economy make a transition into the information age. Nowadays, computers, data and information in many industries have become the basis for decision making [2,3,9]. Companies are collecting very large amounts of information about their customers, products, employees, manufacturing processes, distribution processes, and marketing processes. This information can be used to build predictive models to guide future decision making.
The machine learning field has continued to evolve in the
academic communities. New concepts, new algorithms,
computer structures and systems have been devised and
applied to real-world scientific and business problems.
These ideas are transitioned to industry in the form of new
products. They have also become the basis for developing
entire new businesses. Through experience, an understanding
has developed that data mining is a step in a larger knowledge
discovery process.
A systematic approach has evolved to take raw data and convert it into information, and to take information and transform it into knowledge that helps solve real business problems. It is now understood that the larger process requires the careful use of different computer technologies at different stages. Raw operational data [12,17,19] can be transformed into predictive models that support meeting foremost business objectives. Data mining plays a critical role in the overall process. As the field of machine learning has developed, there has been a constant evolution of early Artificial Intelligence (AI) technologies from the 1970s and 1980s, such as expert systems and neural computing, into mature products. These products, correctly used, can be effectively deployed in the business environment.
Text mining is the discovery of interesting knowledge in text documents. A challenging issue is to find accurate knowledge in text documents that helps users find what is needed. Many applications, such as market analysis and business management, can benefit from the use of the information and knowledge extracted from a large amount of data. The knowledge discovery process [11,12] can successfully use and update discovered patterns and apply them to the field of text mining. Data mining is therefore an essential step in the process of knowledge discovery in databases: it comprises the methods of the knowledge discovery process and represents the modeling phase, that is, the application of methods and algorithms for the calculation of search patterns [10] or models.

International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 16149-16154. ISSN: 1314-3395 (on-line version). URL: http://www.ijpam.eu. Special Issue.
II. RELATED WORKS
Research in any field needs literature related to the research field. This section discusses various concepts related to information retrieval.
Prabhu et al. [1] discussed work on access and retrieval of information. Current research areas within the field of IR include searching and querying, ranking of search results, navigating and browsing information, optimizing information representation and storage, and document classification and clustering. In information retrieval, the process of manually categorizing the pages of an electronic/website document is often tedious and expensive. Document clustering has thus often been used to automatically categorize a search result into clusters. In that paper, two partition-based clustering algorithms, k-means and spherical k-means, are applied to the document collections taken from the Reuters-21578 and 20 Newsgroups datasets, and their performance is tested. Over various runs of clustering, the spherical k-means algorithm performs better than k-means for document clustering.
Ning Zhong et al. [2] presented an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. The proposed technique uses these two processes to refine the discovered patterns in text documents. Experiments are carried out on the RCV1 data collection and TREC topics to demonstrate that the proposed solution achieves encouraging performance. Related techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining. In conclusion, an effective pattern discovery technique has been proposed to overcome the low-frequency and misinterpretation problems in text mining.
Baeza-Yates et al. [3] presented an interaction cycle for retrieving information consisting of steps such as query specification, receipt and examination of retrieval results, and then either stopping or reformulating the query and repeating the process until a satisfactory result set is found. The simplest interaction model contains an underlying assumption that the user's information need is static and that the information seeking process is one of successively refining a query until it retrieves all and only those documents relevant to the original information need.
Pascal Soucy et al. [4] proposed a simple KNN algorithm for text categorization. This method performs an aggressive feature selection, where feature selection means selecting a subset of all available features. The KNN algorithm provides a solid ground for text categorization in large document sets with diverse vocabulary. In this method, each text document is called an instance. In order to categorize texts using KNN, each example document X is represented as a vector of length |F|, the size of the vocabulary. In this approach, distance is treated as a basis to weight the contribution of each of the k neighbors in the class assignment process. With text documents, text categorization may involve thousands of features, most of them irrelevant. It is very difficult to identify the relevant features for text categorization because of the close interaction between the features. The feature selection method used here aggressively reduces the vocabulary size using feature interaction. This simple KNN algorithm can reach impressive results using only a few features.
Songbo Tan [5] analyzed a neighbor-weighted K-nearest neighbor (NWKNN) classifier for unbalanced text corpora. In unbalanced text corpora, the majority class tends to have more examples in the K-neighbor set for each test document. If the conventional KNN technique is used to classify a test document, it tends to assign the label of the majority class to the document. Hence, the large category tends to have higher classification accuracy, while the minority class tends to have low classification accuracy, so the overall performance of KNN is degraded. The NWKNN algorithm assigns a large weight to neighbors from small classes and a small weight to neighbors from large categories. For each test document d, K neighbors are first selected among the training documents contained in K* categories. The test document d is then assigned the class with the highest resulting weighted sum, as in traditional KNN. NWKNN yields much better performance than KNN and is an effective algorithm for unbalanced text classification problems.
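The neighbour-weighting idea can be illustrated with a short sketch. Note this is only inspired by NWKNN: the exact weight formula in Tan's paper differs, and the per-class weight used here, one over the square root of the class size, is an assumption for illustration.

```python
from collections import defaultdict

def nwknn_score(neighbours, class_sizes, exponent=0.5):
    """Neighbour-weighted voting sketch: each neighbour's vote is divided
    by its class size raised to `exponent`, so neighbours from small
    classes contribute more. `neighbours` is the list of class labels of
    the k nearest training documents."""
    scores = defaultdict(float)
    for label in neighbours:
        scores[label] += 1.0 / (class_sizes[label] ** exponent)
    return max(scores, key=scores.get)

# Two "big" neighbours are outvoted by one "small" neighbour because the
# small class carries a larger per-vote weight.
class_sizes = {"big": 900, "small": 100}
print(nwknn_score(["big", "big", "small"], class_sizes))  # small
```

This shows how down-weighting the majority class counteracts its dominance in the K-neighbor set.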
Zhou Yong et al. [6] proposed an improved KNN text classification algorithm based on clustering, which does not use all training samples as the traditional KNN algorithm does and which can overcome the defect of unevenly distributed training samples that may cause a multi-peak effect. This improved algorithm first uses a sample austerity technique to remove border samples. Afterwards, it processes all training categories with the k-means clustering algorithm to obtain cluster centers, which are used as the new training samples, and then introduces a weight value for each cluster center to indicate its relative importance. Finally, the revised weighted samples are used in the algorithm.
Leena et al. [7] proposed a multistage feature selection model for document classification using information gain and rough sets. In the proposed model, features of less importance are ignored, so the dimensionality of the feature space is reduced, and the computational time and complexity of the method are also reduced. At each stage, classifier performance is evaluated in terms of precision, recall and F-measure. To analyze the effectiveness and accuracy of the proposed model, experiments are performed using the KNN and Naive Bayes classifiers.
Femi Joseph et al. [8] described a text categorization technique using the K-Nearest Neighbor algorithm. The text categorization technique based on an improved K Nearest Neighbor algorithm is the most appropriate one due to its minimal time and computational requirements. In this technique, clustering is a great tool to discover the complex distribution of the training texts. It uses a constrained one-pass clustering algorithm to obtain the category relationship under the constrained condition that each cluster contains only one label. The resulting clusters reflect the complex distributions of the training texts better than the original text samples, integrating the advantages of constrained one-pass clustering and the Weight Adjusted KNN approach.
III. PREPROCESSING
Data in the real world are large, ugly, unrelated and noisy. To clean such a dataset there is a need to pre-process it. The steps of data selection, transformation, stop word removal, word stemming, building of an inverted index and TF*IDF calculation are performed during preprocessing of text documents. The steps are explained as follows.
A. Data Selection
In the selection step the important data are selected or created. Henceforward the KDD [8,9,10] process is carried out on the gathered target data. Only applicable information is selected, along with metadata or data that represent environment knowledge.
B. Data Transformation
Data transformation is preprocessing of the document. It consists of removing irrelevant data from documents. The transformation phase may produce a number of different formats, since different data mining tools may require different formats. The data are also reduced, manually or automatically. The reduction can be made via lossless aggregation [18,20] or a lossy selection of only the most important elements. A representative selection can be used to draw conclusions about the entire data.
C. Stopword Removal
Stop words are words which are filtered out before or after processing of natural language data (text). Though stop words generally refer to the most common words in a language, there is no single common list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing stop words in order to support phrase search.
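A minimal sketch of this step is shown below; since the paper notes there is no single standard stop list, the list here is a small illustrative sample, not any particular tool's list.

```python
# Small illustrative stop list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the quick brown fox is in the garden".split()
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'garden']
```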
D. Word Stemming
Stemming is the process of treating different words as variations of a root or stem. For example, the terms addicted, addicting, addiction, addictions, addictive, and addicts might be conflated to their stem, addict.
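The idea can be sketched with a crude suffix-stripping stemmer; this is illustration only, and a real system would use the full Porter algorithm rather than this assumed suffix list.

```python
# Assumed suffix list for illustration; the Porter stemmer uses a far
# richer set of rules.
SUFFIXES = ("ions", "ing", "ive", "ion", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 4 chars."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

for w in ["addicted", "addicting", "addiction", "addictions", "addictive", "addicts"]:
    print(crude_stem(w))  # each prints "addict"
```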
E. Building Inverted Index
For every term in the lexicon, the inverted index file contains an inverted file entry that stores a list of pointers to all occurrences of that term in the main text, where each pointer is, in effect, the number of a document in which the term appears.
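The structure just described can be built in a few lines; the tiny corpus below is illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["data mining", "text mining", "information retrieval"]
index = build_inverted_index(docs)
print(index["mining"])  # [0, 1]
```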
F. TF*IDF Calculation
Term frequency and inverse document frequency (TF*IDF) is a weight often used in text mining and information retrieval. It is a measure of how significant a word is to a document in a collection. Term frequency is the number of times a word occurs in a document. Inverse document frequency is the logarithm of the total number of documents divided by the number of documents in which the word occurs, so words that appear in many documents receive a low weight.
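A minimal sketch of the calculation, using raw term frequency and the common logarithmic IDF (many weighting variants exist; this is one standard choice, and the tiny corpus is illustrative):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """tf*idf with raw term frequency and log inverse document frequency."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["data", "mining"], ["text", "mining"], ["information", "retrieval"]]
# "mining" appears in 2 of 3 documents: tf = 1, idf = log(3/2) ≈ 0.405
print(round(tf_idf("mining", corpus[0], corpus), 3))  # 0.405
```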
IV. METHODOLOGY
In this phase a model is constructed using k-NN classification to perform information retrieval. The first step is to take the text document and split it into paragraphs. The document is then preprocessed, its support counts are calculated and compared with the minimum support value, and the frequent patterns which satisfy the minimum support count are computed. The Euclidean distance of each document is calculated and compared with the distances of the other documents. Finally, the documents are classified using the k-NN classifier to identify the best matches among the documents.
A. Frequent Patterns:
Once pre-processing is completed, the data set is ready for processing. Using the pre-processed dataset, the frequent patterns and their support counts are calculated. The frequent patterns which satisfy the minimum support count are taken for classification.
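The paper does not fix the exact form of a pattern; as an illustrative assumption, the sketch below counts co-occurring term pairs per paragraph and keeps those whose absolute support meets the minimum threshold.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(paragraphs, min_sup):
    """Count term-pair patterns across paragraphs and keep those whose
    absolute support (number of paragraphs containing the pair) is at
    least min_sup."""
    counts = Counter()
    for para in paragraphs:
        terms = sorted(set(para.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_sup}

paras = ["data mining methods", "data mining tools", "text retrieval"]
print(frequent_patterns(paras, min_sup=2))  # {('data', 'mining'): 2}
```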
Fig. 1. Architecture of k-NN-IR.
B. k-NN Classification
The k-Nearest Neighbors algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training samples in the feature space.
Algorithm: k-NN-IR
Input: Text document dataset (TDD), search query q
Output: Set of frequent patterns fp; best matching document cluster
Begin
  Split the documents (TDD) into paragraphs p
  for each document
    Preprocess each paragraph p
    Calculate the absolute and relative support
    Get the minimum support value min_sup
    Compute the frequent patterns fp
    if (support count ≥ min_sup)
      Calculate the Euclidean distance
      Classify using the k-NN classifier
      Compare the distances between the documents and q
      Identify the best matching document
    End if
  End for
End
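The distance and voting steps of the algorithm can be sketched as follows; the feature vectors and labels are purely illustrative, and the majority vote stands in for the best-match identification step above.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(query_vec, training, k=3):
    """training: list of (feature_vector, label) pairs. Returns the
    majority label among the k nearest neighbours of query_vec."""
    neighbours = sorted(training, key=lambda tv: euclidean(query_vec, tv[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative 2-feature document vectors with class labels.
training = [([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"), ([0.0, 1.0], "politics")]
print(knn_classify([0.8, 0.2], training, k=3))  # sports
```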
This algorithm has been experimentally implemented and
evaluated using real-world datasets.
V. RESULTS AND DISCUSSION
An extensive experimental evaluation comparing the effectiveness of the proposed model with other schemes is performed. In the following, we describe the organization of these experiments in detail.
A. Benchmark Datasets and Evaluation Metrics
In this research work, the Reuters-21578 and 20 Newsgroups datasets collected from the UCI machine learning repository are used. Reuters consists of 21,578 articles. 20 Newsgroups is a set of 18,828 Usenet posts segmented across 20 discussion groups. For each dataset, all documents were pre-processed by removing special characters, numbers and stop words using a predefined list. The common Porter stemmer is then applied to the remaining words. The model is implemented to extract patterns and evaluated using precision, recall and the F1 measure.
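These metrics can be computed from raw counts as follows; the counts shown are illustrative, not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard IR metrics from true positive, false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 relevant documents retrieved, 2 irrelevant retrieved,
# 2 relevant missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```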
B. Experimental Results
The effectiveness of classification is tested by varying the term weighting scheme on several real-world datasets. Classification with a varying number of features p selected per category is tested on each dataset. Figure 2 represents the pattern mining of text documents.
Figure 2. Pattern mining of text documents
Figure 3 represents the k-NN classification of the text documents.
Figure 3. k-nn classification of text documents
The performance of different term weighting methods in terms of precision, recall and F1 measure on the previously described datasets is tested. The conventional PDM model [15] achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.57, 0.48, 0.29, 0.50 and 0.44 respectively. The conventional IPM model [15] achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.81, 0.46, 0.61, 0.45 and 0.58 respectively. The proposed k-NN-IR model achieves F1 measure values for Acq, Crude, Earn, Money and Wheat of 0.85, 0.51, 0.69, 0.58 and 0.63 respectively. In general, the proposed k-NN-IR model achieved the top results on all datasets.
VI. CONCLUSION
This research presents a cluster based k-NN model for classification of documents for information retrieval. First the dataset is preprocessed and then modeled. The results show a close competition with our model; in particular, the best results were obtained across the different datasets and feature selection algorithms. The empirical results show that the proposed technique is effective. As future work, this idea can be applied to larger datasets with different classifiers to further improve performance.
REFERENCES
[1] Prabhu, P., Jeyshankar, R., "Evaluating Performance of Partition Based Document Clustering Algorithms for Information Retrieval", International Journal of Mathematical Archive, 2011.
[2] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu, "Effective Pattern Discovery for Text Mining", IEEE Transactions, vol. 24, no. 1, Jan. 2012.
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[4] Pascal Soucy, Guy W. Mineau, "A Simple KNN Algorithm for Text Categorization", Proceedings of the International Conference on Data Mining, 2001, pp. 647-648.
[5] Tan, Songbo, "An Effective Refinement Strategy for KNN Text Classifier", Expert Systems with Applications, 30.2 (2006): 290-298.
[6] Zhou, Yong, Youwen Li, and Shixiong Xia, "An Improved KNN Text Classification Algorithm Based on Clustering", Journal of Computers, 4.3 (2009): 230-237.
[7] Leena H. Patil, Mohammed Atique, "A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set", (IJARAI) International Journal of Advanced Research in Artificial Intelligence, vol. 3, no. 11, 2014.
[8] P. Keerthana, B.G. Geetha, P. Kanmani, "Crustose Using Shape Features and Color Histogram with K-Nearest Neighbour Classifiers", International Journal of Innovations in Scientific and Engineering Research (IJISER), vol. 4, no. 9, pp. 199-203, 2017.
[9] Femi Joseph, Nithin and Ramakrishnan, "Text Categorization Using Improved K Nearest Neighbor Algorithm", International Journal for Trends in Engineering & Technology, vol. 4, issue 2, April 2015, ISSN: 2349-9303.
[10] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders, "Word-Sequence Kernels", J. Machine Learning Research, vol. 3, pp. 1059-1082, 2003.
[11] M.F. Caropreso, S. Matwin, and F. Sebastiani, "Statistical Phrases in Automated Text Categorization", Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell'Informazione, 2000.
[12] L.P. Jing, H.K. Huang, and H.B. Shi, "Improved Feature Selection Approach TF*IDF in Text Mining", International Conference on Machine Learning and Cybernetics, 2002.
[13] H. Ahonen-Myka, "Discovery of Frequent Word Sequences in Text", Proceedings of Pattern Detection and Discovery, pp. 180-189, 2002.
[14] Asmeeta Mali, "Spam Detection Using Bayesian with Pattern Discovery", International Journal of Recent Technology and Engineering (IJRTE), 2:139-143, 2013.
[15] M. Popovic and P. Willett, "The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data", JASIS, 43(5):191-203, 1993.
[16] Minakshi R. Shinde, S. A. Kinariwala, "Information Retrieval in Text Mining Using Pattern Based Approach", International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, vol. 4, issue 10, October 2015.