A Survey on Web Usage Mining: Theory and Applications

P. Nithya¹, Dr. P. Sumathi²

¹Doctoral student, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India

²Asst. Professor, Chikkanna Govt. Arts College, Tirupur, Tamil Nadu, India

E-mail:[email protected]

Abstract--- The World Wide Web (WWW) continues to grow at an incredible pace. The information available on the WWW serves as a gateway and an intermediary for carrying out business. Web mining is the extraction of interesting and useful facts and implicit information from artifacts or activities related to the WWW. Web usage mining (WUM) attempts to discover valuable information from the secondary data obtained from the interactions of users with the Web. WUM has become extremely significant for effective Web site organization, creating adaptive Web sites, business and support services, personalization, network traffic flow analysis and so on. WUM comprises three steps, namely preprocessing, pattern discovery, and pattern analysis. Because of its importance, WUM has become an active area of research in the field of data mining. This paper provides a comprehensive discussion of all the phases of WUM and of related work in this field.

Keywords--- Web Usage Mining (WUM), Customer Relationship Management (CRM), Preprocessing, Pattern Discovery, Pattern Analysis

I. INTRODUCTION

Data for Web Usage Mining can be obtained from server logs, browser logs, proxy logs, or an organization's database. These data collections differ in the location of the data source, the kinds of data available, the segment of the population from which the data was obtained, and the techniques of implementation [1].

WUM is a branch of Web Mining, which in turn is a branch of Data Mining. Data Mining is the process of extracting significant and valuable information from large databases [2]. WUM mines the usage characteristics of the users of Web applications. The extracted data can then be applied in various ways, for example to detect fraudulent elements.

WUM is considered a component of Business Intelligence in an organization [3]. It is applied to decide business strategies through the efficient use of Web applications. It is also vital for Customer Relationship Management (CRM), since it can help ensure customer satisfaction as far as the interaction between the customer and the organization is concerned [4].

Several kinds of data can be used in Web Mining:

1. Content: The visible data in Web pages, that is, the data that was intended to be presented to the users. This mostly consists of text and graphics (images).

2. Structure: Data that describes the organization of the Web site. It falls into two categories. Intra-page structure data consists of the arrangement of the various HyperText Markup Language (HTML) or Extensible Markup Language (XML) tags within a given page. The main type of inter-page structure data is the hyperlinks used for site navigation.

3. Usage: Data that describes the usage patterns of Web pages, such as IP addresses, page references, the date and time of accesses, and other information that depends on the log format.
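As an illustration of the usage data described above, the following minimal sketch extracts the IP address, page reference and access time from one log line. The survey does not name a log format, so the NCSA Common Log Format, the function name `parse_log_line` and the sample line are assumptions made for this example.

```python
import re
from datetime import datetime

# Pattern for the NCSA Common Log Format. The survey does not name a log
# format, so this choice is an assumption made for illustration.
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the usage fields of one log line as a dict, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means that no response body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line, not taken from any real server.
sample = '192.0.2.10 - - [10/Oct/2011:13:55:36 +0530] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["time"].isoformat())
```

Other log formats carry further fields, such as the referrer and the user agent, which would need an extended pattern.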

The main processes in WUM are:

Preprocessing: Data preprocessing refers to any processing performed on raw data to prepare it for a subsequent processing step [5]. It transforms the data into a format that can be processed more efficiently and more conveniently for the user. The preprocessing steps used in WUM are [6]:

1. Usage Preprocessing: preprocessing of the usage patterns of users.

2. Content Preprocessing: preprocessing of the content accessed.

3. Structure Preprocessing: preprocessing of the structure of the Web site.

Pattern Discovery: WUM can be used to uncover patterns in server logs, but it is frequently carried out only on samples of the data. The mining process will be ineffective if the samples are not representative of the larger body of data [7]. The pattern discovery methods are:

1. Statistical Analysis

2. Association Rules

3. Clustering

4. Classification

5. Sequential Patterns

6. Dependency Modeling
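To make one of the listed methods concrete, the sketch below computes the support and confidence of an association rule over user sessions. The session data, page names and function names are hypothetical; the code only illustrates the two standard measures and is not taken from any of the surveyed papers.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    hits = sum(1 for session in sessions if pages <= set(session))
    return hits / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the sessions."""
    both = support(sessions, set(antecedent) | set(consequent))
    base = support(sessions, antecedent)
    return both / base if base else 0.0

# Hypothetical sessions: each is the set of pages one visitor requested.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

# Rule under test: visitors who view /products also view /cart.
rule_support = support(sessions, {"/products", "/cart"})
rule_confidence = confidence(sessions, {"/products"}, {"/cart"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst.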

P Nithya et al., Int. J. Computer Technology & Applications, Vol 3 (4), 1625-1629

IJCTA | July-August 2012 Available [email protected]

ISSN:2229-6093

Pattern Analysis: This is the final step in the WUM process. After preprocessing and pattern discovery are complete, the collected usage patterns are examined to filter out insignificant information and retain the valuable information [8]. Techniques such as Structured Query Language (SQL) processing and Online Analytical Processing (OLAP) can be used.
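The use of SQL for pattern analysis can be sketched with an in-memory SQLite database. The table `visits`, its columns and the threshold of two sessions are assumptions made for this example; the query keeps only pages that occur in enough distinct sessions to be considered significant.

```python
import sqlite3

# In-memory database holding a hypothetical table of page visits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (session_id INTEGER, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "/home"), (1, "/products"), (2, "/home"), (3, "/home"), (3, "/cart")],
)

# Keep only pages seen in at least two distinct sessions.
query = """
    SELECT url, COUNT(DISTINCT session_id) AS sessions
    FROM visits
    GROUP BY url
    HAVING COUNT(DISTINCT session_id) >= 2
    ORDER BY sessions DESC
"""
frequent_pages = conn.execute(query).fetchall()
conn.close()
print(frequent_pages)
```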

Several approaches to WUM are present in the literature. This paper provides a detailed study of the important techniques available for WUM.

II. LITERATURE SURVEY

This section provides a detailed discussion of several WUM techniques. The work of various authors is discussed under the following phases:

Preprocessing

Pattern Discovery

Pattern Analysis

2.1. Preprocessing

Web applications continue to grow, and the number of their users is increasing rapidly. Advances in technology make it possible to capture users' behavior and interactions with Web applications in the Web server log file. The Web log file is saved as a text (.txt) file. Because of the large quantity of “unrelated information” in the Web log, the original log file cannot be used directly in the WUM process, so preprocessing of the Web log file becomes essential. Proper analysis of the Web log file is needed to manage Web sites effectively from both the administrators' and the users' perspectives. Web log preprocessing is the first step, and it determines the quality and efficiency of the later steps of WUM. A number of approaches exist at the preprocessing level of WUM, including data cleaning, data filtering and data integration. Hussain et al. [9] examined these preprocessing approaches to identify their problems and to show how WUM preprocessing can be improved for pattern mining and analysis.
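The data cleaning step mentioned above can be sketched as a filter over parsed log records. The record layout, the list of ignored file extensions and the robot markers are assumptions chosen for illustration; they are not rules prescribed by the surveyed papers.

```python
from urllib.parse import urlparse

# Assumed cleaning rules; real sites tune these lists to their own content.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")

def is_relevant(record):
    """Keep successful page requests made by what looks like a human visitor."""
    path = urlparse(record["url"]).path.lower()
    if path.endswith(IGNORED_EXTENSIONS):
        return False  # embedded resource, not a page view
    if not 200 <= record["status"] < 300:
        return False  # failed or redirected request
    agent = record.get("agent", "").lower()
    if path == "/robots.txt" or any(marker in agent for marker in ROBOT_MARKERS):
        return False  # likely automated traffic
    return True

def clean_log(records):
    """Return only the records that pass every cleaning rule."""
    return [record for record in records if is_relevant(record)]

# Hypothetical parsed records.
records = [
    {"url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/logo.png", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
    {"url": "/index.html", "status": 200, "agent": "ExampleBot/1.0"},
]
cleaned = clean_log(records)
print(cleaned)
```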

WUM applies data mining techniques to analyze user access to Web sites. As with any Knowledge Discovery and Data mining (KDD) process, WUM consists of three major steps: preprocessing, knowledge extraction, and results analysis. Tanasa et al. [10] investigated data preprocessing, which is a meticulous and complex process. Analysts aim to determine the exact list of users who accessed the Web site and to reconstitute user sessions, that is, the sequence of actions each user performed on the Web site. Intersites WUM deals with Web server logs from several Web sites, usually belonging to the same organization. Analysts must therefore reconstruct the path of the user through all the Web servers that were visited. The proposed solution is to join all the log files and reconstitute the visit. Classical data preprocessing consists of three steps: data fusion, data cleaning and data structuration. The proposed solution adds advanced data preprocessing in the form of a data summarization step, which allows the analyst to select only the information of interest. The solution was tested in an experiment with log files from INRIA Web sites.
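The idea of joining log files from several servers can be sketched as a merge of time-ordered streams. This is only an illustration of data fusion under simple assumptions (each log is already sorted by time, and clocks are synchronized); it is not the procedure of Tanasa et al., and the server names and records are hypothetical.

```python
import heapq
from datetime import datetime

def fuse_logs(logs_by_server):
    """Merge time-sorted logs from several servers into one time-ordered list.

    `logs_by_server` maps a server name to a list of (timestamp, ip, url)
    tuples that is already sorted by timestamp.
    """
    streams = [
        [(timestamp, ip, url, server) for timestamp, ip, url in log]
        for server, log in logs_by_server.items()
    ]
    return list(heapq.merge(*streams))

# Hypothetical logs from two Web servers of the same organization.
logs = {
    "www": [
        (datetime(2011, 10, 10, 10, 0), "192.0.2.10", "/index.html"),
        (datetime(2011, 10, 10, 10, 5), "192.0.2.10", "/news.html"),
    ],
    "shop": [
        (datetime(2011, 10, 10, 10, 2), "192.0.2.10", "/catalog.html"),
    ],
}
fused = fuse_logs(logs)

# The visitor's path across both servers, in the order it was followed.
visit = [(url, server) for _, _, url, server in fused]
print(visit)
```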

Current data mining tools are used to build knowledge from large volumes of historical data. This knowledge should be updated frequently to maintain its quality and accuracy, so that the decision-making process can be improved. Data mining is important for extracting interesting knowledge from large databases. However, existing data mining techniques and tools are costly, and most of the techniques are too complex when handling large databases. Recently, agents have shown great potential in computing because they are autonomous, flexible and intelligent. Embedding agents in current data mining processes and tools is believed to be a way of overcoming this obstacle. Data preprocessing is one of the key processes in data mining; it has been observed that 60% of the effort in a data mining project goes into preprocessing. Data preprocessing comprises the integration, selection, cleaning and transformation of a data set so that it can be used for mining. Othman et al. [11] investigated an agent-based preprocessing framework. Its main goal is to preprocess a new data set automatically on behalf of a novice data mining user. The proposed framework comprises seven agents: user interface agent, coordinator agent, identify agent, CleanMiss agent, CleanNoisy agent, transformation agent and discretization agent.

User identification and session identification are the two most important steps in preprocessing Web log data for WUM. Khasawneh et al. [12] developed a fast active user-based user identification algorithm with time complexity O(n). This approach uses both the IP address and a bounded user inactivity time to identify distinct users in the Web log. A Web site ontology is helpful for identifying the Web site structure and the break points of browsing activities. To identify sessions, an ontology-based technique is developed that uses the Web site structure and functionalities to distinguish different sessions.
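A common simplification of user and session identification is a time-out heuristic: requests from the same IP address belong to one session until the gap between two requests exceeds a limit. The sketch below implements that heuristic only; it is not the algorithm of Khasawneh et al., it does not use a site ontology, and the 30-minute limit and the sample records are assumptions.

```python
from datetime import datetime, timedelta

def identify_sessions(records, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp, url) records into sessions.

    Assumption: one IP address is treated as one user, and a gap longer than
    `timeout` between two requests from that IP starts a new session.
    """
    sessions = []
    open_session = {}   # ip -> the session list currently open for that IP
    last_seen = {}      # ip -> timestamp of that IP's previous request
    for ip, timestamp, url in sorted(records, key=lambda r: r[1]):
        if ip not in open_session or timestamp - last_seen[ip] > timeout:
            open_session[ip] = []
            sessions.append((ip, open_session[ip]))
        open_session[ip].append(url)
        last_seen[ip] = timestamp
    return sessions

# Hypothetical records from a cleaned log.
records = [
    ("192.0.2.10", datetime(2011, 10, 10, 9, 0), "/home"),
    ("192.0.2.10", datetime(2011, 10, 10, 9, 10), "/products"),
    ("192.0.2.20", datetime(2011, 10, 10, 9, 12), "/home"),
    ("192.0.2.10", datetime(2011, 10, 10, 10, 30), "/home"),
]
sessions = identify_sessions(records)
for ip, pages in sessions:
    print(ip, pages)
```

Treating one IP address as one user is known to break down behind proxies and shared connections, which is one reason the surveyed work combines the IP address with further information.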

Tanasa et al. [13] concentrate on data preprocessing for WUM. WUM applies data mining techniques to analyze user access to Web sites. As with any Knowledge Discovery and Data mining (KDD) process, WUM comprises three main phases: preprocessing, knowledge extraction and results analysis. The first phase attempts to establish the exact list of users who browsed the Web site and to reconstitute user sessions, that is, the series of actions each user performed on the Web site. The preprocessing phase uses the Web server log files together with the Web site map; for confidentiality reasons, the log files are anonymized before the integrated log files are used. This phase performs data fusion, data cleaning, data structuration and data summarization. It not only reduces the size of the log file but also improves the quality of the available data through the new data structures.

2.2. Pattern Discovery

Xidong Wang et al. [14] developed a technique that discovers users' common access patterns on the basis of their Web browsing behavior. The technique first introduces the notion of an access pattern based on a user's access path, and then proposes a modified algorithm (FAP-Mining), derived from the FP-tree algorithm, to obtain frequent access patterns [25]. The technique first builds a frequent access pattern tree and then extracts users' frequent access patterns from the tree. It is reported to be more accurate and scalable for mining frequent access patterns of different lengths.
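The notion of a frequent access pattern can be illustrated with a brute-force count of contiguous page sequences. This sketch is deliberately not FAP-Mining and builds no FP-tree; it enumerates sub-paths directly, which is practical only for small inputs. The sessions, the support threshold and the length limit are hypothetical.

```python
from collections import Counter

def frequent_access_patterns(sessions, min_support, max_length=3):
    """Count contiguous page sequences and keep those in >= min_support sessions."""
    counts = Counter()
    for path in sessions:
        seen = set()
        for length in range(1, max_length + 1):
            for start in range(len(path) - length + 1):
                seen.add(tuple(path[start:start + length]))
        counts.update(seen)  # each session counts a pattern at most once
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

# Hypothetical access paths, one per session.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/about"],
    ["/home", "/about"],
]
patterns = frequent_access_patterns(sessions, min_support=2)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

Tree-based algorithms avoid this exhaustive enumeration by sharing common prefixes of access paths.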

Qianhui Althea LIANG et al. [15] proposed and refined the concept of Web service usage patterns and of pattern discovery through service mining. The authors described three levels of service usage data: (a) user request level, (b) template level and (c) instance level. At each level, they examined the patterns of service usage data and the discovery of these patterns. A technique for service pattern discovery at the template level has been developed.

Large amounts of data are collected automatically by Web servers and stored in access log files. Analysis of server access data can provide significant and valuable information. WUM is the process of applying data mining techniques to the discovery of usage patterns from Web data, and it is oriented towards applications. It mines the secondary data obtained from the interactions of users during Web sessions over a period of time. Because of its application value, WUM has seen a rapid rise in interest from both the research and application communities. Etminani et al. [16] applied Kohonen's SOM (Self Organizing Map) to the preprocessed Web logs of a leading university's Web server (http://www.um.ac.ir/) and mined frequent patterns.

Nina et al. [17] presented a complete scheme for pattern discovery in WUM. Web site developers are expected to have a clear understanding of the users' profiles and the site objectives, as well as knowledge of the way users will browse the Web pages. Developers can study visitors' behavior by means of Web analysis and discover patterns in the visitors' activities. This Web analysis involves the transformation and interpretation of the Web log data to uncover hidden information or predictive patterns using data mining and knowledge discovery methods.

Cooley et al. [18] developed information and pattern discovery techniques for the WWW. The application of data mining techniques to the WWW, known as Web mining, has been the focus of numerous recent research projects. The term Web mining has been used in two distinct ways. The first, called Web content mining, is the process of information discovery from sources across the WWW. The second, called WUM, is the process of mining for user browsing behavior and access patterns. Cooley et al. also briefly described WEBMINER, a system for WUM.

2.3. Pattern Analysis

Klos et al. [19] reported experiments with a technique for the analysis and evaluation of Web pages. The technique is built on a tacit agreement between Web developers and Web users. The main elements of this agreement are Web patterns, which Web developers use in their Web page designs. With this technique it is easy to determine, with a good level of significance, whether a pattern is present on a page.

Web applications are subject to continuous and rapid development. It often happens that developers duplicate Web pages in an ad hoc manner, without regard for systematic development and maintenance practices. This practice produces code clones that make Web applications difficult to maintain and reuse. De Lucia et al. [20] proposed a technique for reengineering Web applications based on clone analysis, which aims to identify and generalize static and dynamic pages and navigational patterns of a Web application. Clone analysis also helps to identify literals that can be generated from a database. The authors present a case study that demonstrates how the technique can be used to restructure the navigational pattern of a Web application by removing redundant code.

Kudelka et al. [21] proposed a new technique for the semantic analysis of Web pages. The analysis is based on the accepted and empirically verified agreement between users and Web developers in the form of Web patterns [26]. The technique is designed to extract patterns that are characteristic of a particular domain. Patterns formalize the agreement and allow semantics to be assigned to segments of Web pages. Experimental observations confirm the effectiveness of this technique.

Most of the approaches used for pattern discovery from Web Usage Data (WUD) are clustering techniques. In e-commerce applications, clustering techniques can be used to formulate marketing strategies, product support, personalization and Web site redesign. Raju et al. [22] developed a new partition-based technique for dynamically grouping Web users according to their Web access patterns, using the Adaptive Resonance Theory 1 Neural Network (ART1 NN) clustering approach. Experimental results indicate that the ART1 NN clustering technique performs better in terms of intra-cluster and inter-cluster distances than the K-Means and SOM clustering approaches.
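The two evaluation measures named above can be computed for any clustering result. The survey does not give their exact definitions, so the sketch below assumes one common choice: the mean distance of points to their own cluster centroid (intra-cluster) and the mean distance between cluster centroids (inter-cluster). The feature vectors are hypothetical.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def intra_cluster_distance(clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    total = sum(dist(p, centroid(c)) for c in clusters for p in c)
    return total / sum(len(c) for c in clusters)

def inter_cluster_distance(clusters):
    """Mean distance between the centroids of every pair of clusters."""
    centers = [centroid(c) for c in clusters]
    pairs = list(combinations(centers, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clustering of four users described by two usage features.
clusters = [
    [(0, 0), (0, 2)],
    [(4, 0), (4, 2)],
]
intra = intra_cluster_distance(clusters)
inter = inter_cluster_distance(clusters)
print(intra, inter)
```

Under these definitions a good clustering has a small intra-cluster distance and a large inter-cluster distance.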

Because of the inherent correlation between Web objects and the lack of a standardized representation of Web documents, Web community mining and analysis has become an important area of Web data management and analysis. The analysis of Web communities spans a number of research fields such as Web mining, clustering, Web search and text retrieval. Yanchun Zhang et al. [23] present some recent studies in this area, which cover finding relevant Web pages on the basis of linkage information, discovering user access patterns by analyzing Web log files, co-clustering Web objects and investigating social networks from Web data.

Design patterns are among the artifacts that are important for reuse. Kudelka et al. [24] focus on the use of Web design patterns in analyzing the structure and contents of Web pages, and have developed a technique, called Pattrio, for pattern discovery on Web pages. The patterns identified on Web pages describe the Web page structure from the external point of view of the user. Knowledge of this structure can be used in different contexts. Experiments using automatically discovered knowledge of the composition of Web pages have been presented at several conferences. The experimental evaluation compares this technique with other selected techniques, and the results indicate that Web design patterns can play a key role in the analysis of Web page composition and contents.

III. PROBLEMS AND DIRECTIONS

The main difficulty with Web Mining in general, and with WUM in particular, is the nature of the data involved. Apart from its sheer volume, the data is not fully structured. It is in a semi-structured form, so it needs several preprocessing steps before the required information can be extracted. Further research is needed on preprocessing the data and on the following problems.

Reducing the Paths of High Visit Pages: Pages that are frequently visited by users can be seen to follow a particular path. These pages can be placed in an easily accessible branch of the Web site, which reduces the navigation path length.

Eliminating or Merging Low Visit Pages: Pages that are rarely visited by users can either be removed, or their content can be merged into frequently accessed pages.

Redesigning Pages to Facilitate User Navigation: To help users browse the Web site in the best possible way, the information obtained can be used to redesign the structure of the Web site.
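A first step towards the directions above is to separate high visit pages from low visit pages. The sketch below does this by the share of sessions in which each page appears. The thresholds (50% and 20%), the function name and the sessions are assumptions for illustration; a real site would choose its thresholds from its own traffic.

```python
from collections import Counter

def classify_pages(sessions, high_share=0.5, low_share=0.2):
    """Split pages into high-visit and low-visit groups by share of sessions."""
    # Count each page once per session, however often it was requested.
    visits = Counter(page for session in sessions for page in set(session))
    total = len(sessions)
    high = sorted(p for p, n in visits.items() if n / total >= high_share)
    low = sorted(p for p, n in visits.items() if n / total <= low_share)
    return high, low

# Hypothetical sessions from a preprocessed log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/faq"],
    ["/home", "/products"],
    ["/products"],
]
high_visit, low_visit = classify_pages(sessions)
print("high:", high_visit)
print("low:", low_visit)
```

Pages in the first group are candidates for shorter navigation paths, and pages in the second group are candidates for removal or merging.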

IV. CONCLUSION

The increasing popularity of the Web has drawn considerable attention to Web mining technology. An important research area in Web mining is Web usage mining, which focuses on discovering patterns in the browsing and navigation data of Web users. WUM is a promising technology for understanding the behavior of users on the Web.

Several techniques for Web usage mining have been proposed by different researchers. This paper discussed various techniques available for Web usage mining, organized around its three vital steps: preprocessing, pattern discovery and pattern analysis. Better cluster recovery gives a more accurate prediction of a Web user's future visits, provided that the user's cluster can be determined exactly.

REFERENCES

[1] Dr. G. K. Gupta, “Introduction to Data Mining with Case

Studies”, PHI Publication, 2005.

[2] Jaideep Srivastava, Robert Cooley, Mukund Deshpande,

Pang-Ning Tan, “Web Usage Mining: Discovery and

Applications of Usage Patterns from Web Data”, SIGKDD

Explorations, Vol. 1, No. 2, Pp. 12-23, 2000.

[3] Adel T. Rahmani and B. Hoda Helmi, “EIN-WUM an AIS-

based Algorithm for Web Usage Mining”, Proceedings of

GECCO’08, Atlanta, Georgia, USA, ACM978-1-60558-130-

9/08/07, Pp. 291-292, 2008.

[4] Shailey Minocha, Nicola Millard, Lisa Dawson, “Integrating

Customer Relationship Management Strategies in (B2C) E-

Commerce Environments”, IFIP Conference on Human-

Computer Interaction- INTERACT, 2003.

[5] C. Ramya, G. Kavitha, K. S. Shreedhara, “Preprocessing: A

Prerequisite for Discovering Patterns in Web Usage Mining

Process”, Computing Research Repository - CORR, vol.

abs/1105.0, 2011.

[6] V. Chitraa, Antony Selvdoss Davamani, “A Survey on

Preprocessing Methods for Web Usage Data”, Computing

Research Repository-CORR, Vol. abs/1004.1, 2010.

[7] Nizar R. Mabroukeh, Christie I. Ezeife, “A taxonomy of

sequential pattern mining algorithms”, ACM Computing

Surveys - CSUR, Vol. 43, No. 1, Pp. 1-41, 2010.

[8] Francesco Moscato, Nicola Mazzocca, Valeria Vittorini,

Giusy Di Lorenzo, Paola Mosca, Massimo Magaldi,

“Workflow Pattern Analysis in Web Services”, High

Performance Computing and Communications - HPCC, Pp.

395-400, 2005.

[9] Hussain, T.; Asghar, S.; Masood, N.; “Web usage mining: A

survey on preprocessing of web log file”, International

Conference on Information and Emerging Technologies

(ICIET), Pp. 1 – 6, 2010.

[10] Tanasa, D.; Trousse, B.; “Advanced data preprocessing for

intersites Web usage mining”, IEEE Intelligent Systems, Vol.

19, No. 2, Pp. 59 – 65, 2004.

[11] Othman, Z.A.; Abu Bakar, A.; Hamdan, A.R.; Omar, K.;

Shuib, N.L.M.; “Agent based preprocessing”, International

Conference on Intelligent and Advanced Systems (ICIAS),

Pp. 219 – 223, 2007.

[12] Khasawneh, N.; Chien-Chung Chan; “Active User-Based and

Ontology-Based Web Log Data Preprocessing for Web Usage

Mining”, IEEE/WIC/ACM International Conference on Web

Intelligence, Pp. 325 – 328, 2006.

[13] Tanasa, D.; Trousse, B.; “Data preprocessing for WUM”,

IEEE Potentials, Vol. 23, No. 3, Pp. 22 – 25, 2004.

[14] Xidong Wang; Yiming Ouyang; Xuegang Hu; Yan Zhang;

“Discovery of user frequent access patterns on Web usage

mining”, The Proceedings 8th International Conference on

Computer Supported Cooperative Work in Design, Vol. 1, Pp.

765 – 769, 2004.

[15] Qianhui Althea LIANG; Jen-Yao CHUNG; Steven MILLER;

Yang OUYANG; “Service Pattern Discovery of Web Service

Mining in Web Service Registry-Repository”, IEEE

International Conference on e-Business Engineering (ICEBE

'06), Pp. 286 – 293, 2006.

[16] Etminani, K.; Delui, A.R.; Yanehsari, N.R.; Rouhani, M.;

“Web usage mining: Discovery of the users' navigational

patterns using SOM”, First International Conference on

Networked Digital Technologies (NDT '09), Pp. 224 – 249,

2009.

[17] Nina, S.P.; Rahman, M.; Bhuiyan, K.I.; Ahmed, K.; “Pattern

Discovery of Web Usage Mining”, International Conference

on Computer Technology and Development (ICCTD '09),

Vol. 1, Pp. 499 – 503, 2009.

[18] Cooley, R.; Mobasher, B.; Srivastava, J.; “Web mining:

information and pattern discovery on the World Wide Web”,

Proceedings Ninth IEEE International Conference on Tools

with Artificial Intelligence, Pp. 558 – 567, 1997.

[19] Klos, K.; Kocibova, J.; Lehecka, O.; Kudelka, M.; Snasel,

V.; “Web Page Analysis: Experiments Based on Web

Patterns”, 4th International Conference on Innovations in

Information Technology (IIT '07), Pp. 16 – 20, 2007.




[20] De Lucia, A.; Francese, R.; Scanniello, G.; Tortora, G.;

“Reengineering Web applications based on cloned pattern

analysis”, Proceedings 12th IEEE International Workshop on

Program Comprehension, Pp. 132 – 141, 2004.

[21] Kudelka, M.; Snasel, V.; Lehecka, O.; El-Qawasmeh, E.;

“Semantic Analysis of Web Pages Using Web Patterns”,

IEEE/WIC/ACM International Conference on Web

Intelligence, Pp. 329 – 333, 2006.

[22] Raju, G.T.; Sudhamani, M.V.; “A novel approach for

extraction of cluster patterns from Web Usage Data and its

performance analysis”, International Conference on Emerging

Trends in Electrical and Computer Technology (ICETECT),

Pp. 718 – 723, 2011.

[23] Yanchun Zhang; Guandong Xu; “Using Web Clustering for

Web Communities Mining and Analysis”, IEEE/WIC/ACM

International Conference on Web Intelligence and Intelligent

Agent Technology (WI-IAT '08), Vol. 1, Pp. 20 – 31, 2008.

[24] Kudelka, Milos; Snasel, Vaclav; Lehecka, Ondrej; El-

Qawasmeh, Eyas; “Web content mining using web design

patterns”, IEEE International Conference on Information

Reuse and Integration (IRI), Pp. 232 – 237, 2008.

[25] Kudelka, Milos; Lehecka, Ondrej; Snasel, Vaclav; El-

Qawasmeh, Eyas; “Web pages clustering based on web

patterns”, 2nd International Conference on Digital

Information Management (ICDIM '07), Vol. 2, Pp. 657 – 664,

2007.

[26] Rui Wu; “Clustering Web Access Patterns Based on Hybrid

Approach”, Fifth International Conference on Fuzzy Systems

and Knowledge Discovery (FSKD '08), Vol. 1, Pp. 52 – 56,

2008.


