P. Nithya¹, Dr. P. Sumathi²
¹Doctoral student, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India
²Asst. Professor, Chikkanna Govt. Arts College, Tirupur, Tamil Nadu, India
E-mail: [email protected]
Abstract--- The World Wide Web continues to develop at an
incredible pace. The information available on the WWW is a
gateway and a medium for carrying out business. Web mining
is the extraction of interesting and useful facts and implicit
information from artifacts or activities related to the WWW.
Web usage mining (WUM) seeks to discover valuable
information from the secondary data obtained from the
interactions of users with the Web. WUM has become
extremely significant for effective Web site organization,
generating adaptive Web sites, business and support
services, personalization, network traffic flow analysis and so
on. WUM comprises three steps, namely preprocessing,
pattern discovery, and pattern analysis. Because of its
importance, WUM has become an active area of research in
the field of data mining. This paper provides a comprehensive
discussion of all the phases in WUM and of related work in
this field.
Keywords--- Web Usage Mining (WUM), Customer
Relationship Management (CRM), Preprocessing, Pattern
Discovery, Pattern Analysis
I. INTRODUCTION
Data for Web Usage Mining (WUM) can be obtained from server
logs, browser logs, proxy logs, or an organization's database.
These data collections differ in the location of the data
source, the kinds of data available, the segment of the
population from which the data was obtained, and the
techniques of implementation [1].
WUM is a branch of Web Mining, which, in turn, is a
component of Data Mining. The process of mining significant
and valuable information from vast databases is called Data
Mining [2]. WUM mines the usage characteristics of the users
of Web applications. The resulting data can then be applied in
various ways, such as checking for fake elements.
WUM is considered a component of Business Intelligence in
an organization [3]. It is applied to decide business strategies
through the efficient use of Web applications. It is vital for
Customer Relationship Management (CRM), since it can help
ensure customer satisfaction as far as the interaction between
the customer and the organization is concerned [4].
Several kinds of data can be used in Web Mining.
1. Content: The visible data in the Web pages, that is, the
data that was intended to be provided to the users. This
mostly consists of text and graphics (images).
2. Structure: Data that describes the organization of the
Web site. It falls into two categories. Intra-page
structure data consists of the arrangement of the
various HyperText Markup Language (HTML) or
Extensible Markup Language (XML) tags within a
given page. The main kind of inter-page structure
information is the hyperlinks used for site navigation.
3. Usage: Data that describes the usage patterns of Web
pages, such as IP addresses, page references, the date
and time of accesses, and other information depending
on the log format.
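As an illustration of the usage data described above, the following minimal sketch extracts the IP address, access time and page reference from one log line. It assumes the server writes the Common Log Format; the paper does not name a specific log format, and the sample line, the pattern name CLF_PATTERN and the function parse_log_line are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident authuser [timestamp] "method page protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the usage fields of one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "page": match.group("page"),
        "status": int(match.group("status")),
    }

# Hypothetical sample line, not taken from any real log.
sample = '192.0.2.7 - - [10/Oct/2012:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)
```

Lines that do not follow the assumed layout return None, so a caller can count or inspect them instead of silently losing data.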
The main processes in WUM are:
Preprocessing: Data preprocessing refers to any kind of
processing performed on raw data to prepare it for a further
processing step [5]. Data preprocessing converts the data into
a format that can be processed more efficiently and more
conveniently for the user. The preprocessing steps used in
WUM are [6]:
1. Usage Preprocessing: Preprocessing of the usage
patterns of users.
2. Content Preprocessing: Preprocessing of the content
accessed.
3. Structure Preprocessing: Preprocessing of the
structure of the Web site.
Pattern Discovery: WUM can be used to uncover patterns in
server logs, but it is frequently performed only on samples of
the data. The mining process will be unproductive if the
samples are not representative of the larger body of data [7].
The pattern discovery methods are the following.
1. Statistical Analysis
2. Association Rules
3. Clustering
4. Classification
5. Sequential Patterns
6. Dependency Modeling
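To make one of the listed methods concrete, the sketch below computes the two standard measures behind association rules, support and confidence, over a handful of sessions. The session data and the function names are hypothetical, and the code only evaluates a given rule; it is not a full rule-mining algorithm.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    hits = sum(1 for s in sessions if pages <= set(s))
    return hits / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the sessions."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base

# Hypothetical sessions: each is the list of pages one visit touched.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]

rule_support = support(sessions, ["/products", "/cart"])
rule_confidence = confidence(sessions, ["/products"], ["/cart"])
```

Here the rule "/products implies /cart" holds in two of the three sessions that contain /products, so its confidence is two thirds.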
A Survey on Web Usage Mining: Theory and Applications
P Nithya et al ,Int.J.Computer Technology & Applications,Vol 3 (4), 1625-1629
IJCTA | July-August 2012 Available [email protected]
ISSN:2229-6093
Pattern Analysis: This is the final step in the WUM process.
After preprocessing and pattern discovery are complete, the
discovered usage patterns are examined to filter out
insignificant information and retain the valuable information
[8]. Techniques such as Structured Query Language (SQL)
processing and Online Analytical Processing (OLAP) can be
used.
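The SQL processing mentioned for pattern analysis can be sketched with Python's built-in sqlite3 module. The table name visits, its columns and its rows are hypothetical; the two GROUP BY queries only imitate simple OLAP-style roll-ups along a page dimension and a time dimension and are not a full OLAP engine.

```python
import sqlite3

# In-memory database holding a hypothetical table of preprocessed visits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, page TEXT, visit_day TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("u1", "/home", "2012-07-01"),
        ("u1", "/products", "2012-07-01"),
        ("u2", "/home", "2012-07-01"),
        ("u2", "/home", "2012-07-02"),
        ("u3", "/products", "2012-07-02"),
    ],
)

# Roll up along the page dimension: hits and distinct visitors per page.
per_page = conn.execute(
    "SELECT page, COUNT(*) AS hits, COUNT(DISTINCT user_id) AS visitors "
    "FROM visits GROUP BY page ORDER BY hits DESC, page"
).fetchall()

# Roll up along the time dimension: hits per day.
per_day = conn.execute(
    "SELECT visit_day, COUNT(*) FROM visits GROUP BY visit_day ORDER BY visit_day"
).fetchall()
```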
Several approaches to WUM are present in the literature.
This paper provides a detailed study of the important
techniques available for WUM.
II. LITERATURE SURVEY
This section discusses several WUM techniques. The work of
various authors is reviewed under the following phases.
1. Preprocessing
2. Pattern Discovery
3. Pattern Analysis
2.1. Preprocessing
Web applications are growing steadily and the number of
their users is increasing rapidly. Rapid changes in
technology help capture users' behavior and interactions with
Web applications through the Web server log file. The Web log
file is saved as a text (.txt) file. Because of the huge quantity
of “unrelated information” in the Web log, the original log file
cannot be used directly in the WUM process. Thus, the
preprocessing of the Web log file becomes vital. Proper
analysis of the Web log file is essential for managing Web
sites effectively, for both administrators and users. Web log
preprocessing is a vital first step that improves the quality and
efficiency of the later steps of WUM. A number of approaches
exist at the preprocessing level of WUM, including data
cleaning, data filtering and data integration. Hussain et al., [9]
examined the preprocessing approaches to identify the
problems and to show how WUM preprocessing can be
improved for pattern mining and analysis.
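A minimal sketch of the data cleaning step follows. The filtering rules (keep only successful GET requests, drop embedded resources such as images and style sheets) are common heuristics assumed here for illustration; they are not taken from [9], and the suffix list, field names and sample records are hypothetical.

```python
# Assumed list of embedded-resource suffixes that count as "unrelated information".
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean_log(records):
    """Keep successful GET requests for pages, dropping embedded resources."""
    cleaned = []
    for rec in records:
        # Ignore any query string when checking the file suffix.
        path = rec["page"].split("?", 1)[0].lower()
        if rec["method"] != "GET":
            continue
        if not 200 <= rec["status"] < 300:
            continue
        if path.endswith(IGNORED_SUFFIXES):
            continue
        cleaned.append(rec)
    return cleaned

# Hypothetical parsed log records.
raw = [
    {"method": "GET", "page": "/index.html", "status": 200},
    {"method": "GET", "page": "/logo.png", "status": 200},
    {"method": "GET", "page": "/missing.html", "status": 404},
    {"method": "POST", "page": "/login", "status": 200},
    {"method": "GET", "page": "/news.html?id=3", "status": 200},
]
pages = [rec["page"] for rec in clean_log(raw)]
```

Which requests count as noise depends on the analysis goal, so the suffix list and status range would normally be configurable.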
WUM applies data mining techniques to analyze user access
to Web sites. As with any Knowledge Discovery and Data
mining (KDD) process, WUM consists of three major steps,
namely preprocessing, knowledge extraction, and results
analysis. Tanasa et al., [10] investigated data preprocessing,
which is a tedious and complex process. Analysts aim to
determine the correct list of users who accessed the Web site
and to reconstitute user sessions, that is, the sequence of
actions each user carried out on the Web site. Intersite WUM
handles Web server logs from several Web sites, usually
belonging to the same organization. Analysts therefore have to
rebuild the path of the user through all the Web servers that
were visited. The proposed solution is to join all the log files
and reconstitute the visit. Classical data preprocessing
consists of three steps, namely data fusion, data cleaning and
data structuration. The proposed solution for WUM adds
advanced data preprocessing. This consists of a data
summarization step, which allows the analyst to select only
the information of interest. The proposed solution was tested
in an experiment with log files from INRIA Web sites.
Present data mining tools are used to construct knowledge
from large volumes of historical data. Knowledge should be
updated frequently to guarantee its quality and precision, so
that the decision making process can be improved. Data
mining is very important for extracting interesting knowledge
from large databases. However, existing data mining
techniques and tools are very costly, and most of the
techniques are too complex in their processes when handling
large databases. Recently, agents have shown great potential
in computing, as they are autonomous, flexible and provide
intelligence. Embedding agents in current data mining
processes and tools is believed to be able to overcome this
obstacle. Data preprocessing is one of the key processes in
data mining. It is observed that 60% of the effort in a data
mining project goes into preprocessing. Data preprocessing
comprises the integration, selection, cleaning and
transformation of a data set so that it can be used for mining.
Othman et al., [11] investigated an agent-based preprocessing
framework. The main goal is to offer automatic preprocessing
of a new data set, aimed at novice data mining users. The
proposed agent-based preprocessing framework comprises
seven agents, namely the user interface agent, coordinator
agent, identify agent, CleanMiss agent, CleanNoisy agent,
transformation agent and discretization agent.
User identification and session identification are the two
most important steps in preprocessing Web log data for
WUM. Khasawneh et al., [12] developed a fast active
user-based user identification algorithm with time complexity
O(n). This approach uses both the IP address and a bounded
user inactive time to identify different users in the Web log.
A Web site ontology is helpful for identifying the Web site
structure and the break points in browsing activities. To
identify sessions, an ontology-based technique is developed
that uses the Web site structure and functionalities to
distinguish different sessions.
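The idea of combining an IP address with a bound on inactive time can be illustrated with the widely used timeout heuristic below. This is a simplified sketch and not the active user-based or ontology-based algorithm of [12]: it assumes one IP address corresponds to one user, uses a hypothetical 30-minute timeout, and sorts the requests first, so it does not run in O(n).

```python
from collections import defaultdict

def build_sessions(requests, timeout=1800):
    """Group (ip, timestamp_seconds, page) tuples into per-user sessions.

    Assumption: one IP address is one user, and a gap longer than
    `timeout` seconds between two requests starts a new session.
    """
    by_user = defaultdict(list)
    for ip, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[ip].append((ts, page))

    sessions = []
    for ip, hits in by_user.items():
        current = [hits[0][1]]
        # Compare each request with the one before it.
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(page)
        sessions.append((ip, current))
    return sessions

# Hypothetical requests: (ip, seconds since start of log, page).
requests = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.2", 30, "/home"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.1", 4000, "/home"),
]
result = build_sessions(requests)
```

In the sample data the first address produces two sessions, because its third request arrives more than 1800 seconds after the second.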
Tanasa et al., [13] concentrate on data preprocessing for
WUM. WUM applies data mining techniques to analyze user
access to Web sites. As with any knowledge discovery and
data mining (KDD) process, WUM comprises three main
phases: preprocessing, knowledge extraction and results
analysis. The first phase attempts to establish the accurate list
of users who browsed the Web site and to reconstitute user
sessions, that is, the series of actions each user carried out on
the Web site. The preprocessing phase takes the log files from
the Web servers together with the Web site map; for
confidentiality reasons, the log files are anonymized before
the integrated log files are used. This phase performs data
fusion, data cleaning, data structuration and data
summarization. It not only reduces the size of the log file but
also improves the quality of the available data through the
new data structures.
2.2. Pattern Discovery
Xidong Wang et al., [14] developed a technique that can
discover users' frequent access patterns underlying their Web
browsing behavior. The technique first introduces the concept
of an access pattern based on a user's access path, and then
proposes a modified algorithm (FAP-Mining), derived from
the FP-tree algorithm, to obtain frequent access patterns [25].
The new technique first builds a frequent access pattern tree
and then extracts users' frequent access patterns from the tree.
This technique is more precise and scalable for mining
frequent access patterns of different lengths.
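What a frequent access pattern looks like can be shown with a brute-force count of contiguous page sequences. This sketch is neither FAP-Mining nor the FP-tree algorithm, since it builds no tree and enumerates every subpath directly; the sessions, the function name frequent_paths and the max_len limit are hypothetical.

```python
from collections import Counter

def frequent_paths(sessions, min_support, max_len=3):
    """Return contiguous page sequences that occur in at least
    `min_support` sessions, mapped to their session counts."""
    counts = Counter()
    for pages in sessions:
        seen = set()
        for length in range(1, max_len + 1):
            for start in range(len(pages) - length + 1):
                seen.add(tuple(pages[start:start + length]))
        counts.update(seen)  # count each path once per session
    return {path: n for path, n in counts.items() if n >= min_support}

# Hypothetical access paths, one list per session.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/about"],
    ["/products", "/cart"],
]
patterns = frequent_paths(sessions, min_support=2)
```

Enumerating all subpaths grows quickly with session length, which is the cost that tree-based methods are designed to avoid.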
Qianhui Althea LIANG et al., [15] proposed and refined the
concept of Web service usage patterns and of pattern
discovery through service mining. The authors described three
levels of service usage data: (a) user demand level, (b)
template level and (c) instance level. At each level, they
examined the patterns of service usage data and the discovery
of these patterns. A technique for service pattern discovery at
the template level has been developed.
Huge amounts of data are collected automatically by Web
servers and accumulated in access log files. Analysis of
server access data can provide significant and valuable
information. WUM is the process of applying data mining
techniques to the discovery of usage patterns from Web data,
and it is aimed at applications. It mines the secondary data
derived from the interactions of users during a certain period
of Web sessions. Because of the value of its applications,
WUM has seen a rapid rise in attention from both the research
and application communities. Etminani et al., [16] applied
Kohonen's SOM (Self Organizing Map) to the preprocessed
Web server logs of a leading university
(http://www.um.ac.ir/) and mined frequent patterns.
Nina et al., [17] presented a complete scheme for the pattern
discovery stage of WUM. Web site developers are expected to
have a clear understanding of the users' profiles and the site
objectives, as well as knowledge of the way users will browse
the Web pages. The developers can study visitors' behavior by
means of Web analysis and discover patterns in visitors'
activities. This Web analysis involves the transformation and
interpretation of the Web log data to reveal hidden
information or predictive patterns through data mining and
knowledge discovery methods.
Cooley et al., [18] developed information and pattern
discovery techniques for the WWW. The application of data
mining techniques to the WWW, known as Web mining, has
been the focus of numerous recent research projects. The term
Web mining has been used in two distinct ways. The first,
called Web content mining, is the process of information
discovery from sources across the WWW. The second, called
WUM, is the process of mining for user browsing behavior
and access patterns. Cooley et al. also briefly describe
WEBMINER, a system for WUM.
2.3. Pattern Analysis
Klos et al., [19] conducted experiments with a technique for
the analysis and evaluation of Web pages. This technique is
built on a tacit agreement between Web developers and Web
users. The main elements of this agreement are Web patterns,
which Web developers use in their Web page designs. With
this technique it is easy to determine, with a good level of
significance, whether a pattern is present on the page.
Web applications are subject to continuous and rapid
development. It often happens that developers haphazardly
duplicate Web pages without considering systematic
development and maintenance methods. This practice
produces code clones that make Web applications hard to
maintain and reuse. De Lucia et al., [20] proposed a technique
for reengineering Web applications based on clone analysis
that aims to identify and simplify static and dynamic pages
and navigational patterns of a Web application. Clone analysis
is also helpful for identifying literals that can be generated
from a database. The authors present a case study that shows
how this technique can be used to restructure the navigational
pattern of a Web application by removing redundant code.
Kudelka et al., [21] proposed an innovative technique for the
semantic analysis of Web pages. The analysis is based on the
accepted and empirically confirmed agreement between users
and Web developers that is expressed through Web patterns
[26]. The technique is designed for the extraction of patterns
that are characteristic of a particular domain. Patterns provide
a formalization of the agreement and allow semantics to be
assigned to segments of Web pages. Experimental results
confirm the effectiveness of this technique.
Most of the approaches that have been used for pattern
discovery from Web Usage Data (WUD) are clustering
techniques. In e-commerce applications, clustering techniques
can be used for formulating marketing strategies, product
support, personalization and Web site adaptation. Raju et al.,
[22] developed a novel partition-based technique for
dynamically grouping Web users according to their Web
access patterns, using the Adaptive Resonance Theory 1
Neural Network (ART1 NN) clustering approach.
Experimental results show that this ART1 NN clustering
technique performs better in terms of intra-cluster and
inter-cluster distances than the K-Means and SOM clustering
approaches.
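The two quality measures named in that comparison can be sketched as follows. The exact definitions used in [22] are not given in this survey, so the code assumes a common choice: intra-cluster distance as the mean Euclidean distance of points to their centroid, and inter-cluster distance as the Euclidean distance between centroids. The binary user vectors are hypothetical, and math.dist requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def intra_cluster_distance(points):
    """Mean Euclidean distance from each point to its cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def inter_cluster_distance(cluster_a, cluster_b):
    """Euclidean distance between two cluster centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Each hypothetical vector marks which of three pages a user visited (1) or not (0).
cluster_a = [[1, 1, 0], [1, 0, 0]]
cluster_b = [[0, 0, 1], [0, 1, 1]]

compactness = intra_cluster_distance(cluster_a)
separation = inter_cluster_distance(cluster_a, cluster_b)
```

Under these definitions a better clustering has smaller intra-cluster distances and larger inter-cluster distances.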
Owing to the inherent correlation between Web objects and
the lack of a standardized representation of Web documents,
Web community mining and analysis has become an
important area of Web data management and analysis. The
study of Web communities spans a number of research fields
such as Web mining, clustering, Web search and text
retrieval. Yanchun Zhang et al., [23] present some recent
studies in this area, which cover finding relevant Web pages
based on linkage information, discovering user access
patterns by analyzing Web log files, co-clustering Web
objects and investigating social networks from Web data.
Design patterns are among the artifacts that are important
for reuse. This technique focuses on the use of Web design
patterns in analyzing the structure and content of Web pages.
Kudelka et al., [24] created the Pattrio technique for
discovering patterns on Web pages. The patterns identified
on Web pages describe the Web page structure from the
external point of view of the user. Information about this
structure can be used in different contexts. Experiments using
the automatically detected structure of Web pages have been
presented at several conferences. The experimental evaluation
compares this technique with other selected techniques. The
evaluation results confirm that Web design patterns can play a
key role in the analysis of Web page structure and content.
III. PROBLEMS AND DIRECTIONS
The most important difficulty with Web Mining in general
and WUM in particular is the nature of the data they deal
with. Apart from the sheer quantity of the data, the data is not
fully structured. It is in a semi-structured form, so it needs
several preprocessing steps before the essential information
can be extracted. Further research is needed on preprocessing
the data and on the following problems.
1. Reducing the Paths of High Visit Pages: The pages that
are frequently visited by users can be seen to follow a
particular path. These pages can be placed in an easily
accessible branch of the Web site, which reduces the
navigation path length.
2. Eliminating or Integrating Low Visit Pages: The pages
that are not regularly visited by users can either be
eliminated, or their content can be merged into
frequently accessed pages.
3. Redesigning Pages to Facilitate User Navigation: To
help the user browse the Web site in the best possible
way, the information acquired can be used to redesign
the structure of the Web site.
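A first step toward the directions listed above is to separate high-visit pages from low-visit pages. The sketch below does this with a simple hit count; the threshold value, the sessions and the function name rank_pages are hypothetical, and choosing a meaningful threshold is left to the site analyst.

```python
from collections import Counter

def rank_pages(sessions, low_visit_threshold):
    """Split pages into high-visit and low-visit lists by hit count."""
    hits = Counter(page for pages in sessions for page in pages)
    # High-visit pages, most visited first.
    high = [p for p, n in hits.most_common() if n >= low_visit_threshold]
    # Low-visit pages, in alphabetical order.
    low = sorted(p for p, n in hits.items() if n < low_visit_threshold)
    return high, low

# Hypothetical sessions reconstructed from a log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/legal"],
]
high, low = rank_pages(sessions, low_visit_threshold=2)
```

The high list suggests pages to move closer to the entry point, and the low list suggests candidates for removal or merging.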
IV. CONCLUSION
The increasing popularity of the Web has drawn great
attention to Web mining technology. A vital research area in
Web mining is Web usage mining, which focuses mainly on
the discovery of patterns in the browsing and navigation data
of Web users. WUM has proved to be a promising technology
for understanding the behavior of users on the Web.
Several techniques for Web usage mining have been proposed
by different researchers. This paper discussed the various
techniques available, organized around the three vital steps in
WUM: preprocessing, pattern discovery and pattern analysis.
Improved cluster recovery can provide more accurate
prediction of a Web user’s future visits, provided the user’s
cluster can be determined exactly.
REFERENCES
[1] Dr. G. K. Gupta, “Introduction to Data Mining with Case
Studies”, PHI Publication, 2005.
[2] Jaideep Srivastava, Robert Cooley, Mukund Deshpande,
Pang-Ning Tan, “Web Usage Mining: Discovery and
Applications of Usage Patterns from Web Data”, SIGKDD
Explorations, Vol. 1, No. 2, Pp. 12-23, 2000.
[3] Adel T. Rahmani and B. Hoda Helmi, “EIN-WUM an AIS-
based Algorithm for Web Usage Mining”, Proceedings of
GECCO’08, Atlanta, Georgia, USA, ACM978-1-60558-130-
9/08/07, Pp. 291-292, 2008.
[4] Shailey Minocha, Nicola Millard, Lisa Dawson, “Integrating
Customer Relationship Management Strategies in (B2C) E-
Commerce Environments”, IFIP Conference on Human-
Computer Interaction- INTERACT, 2003.
[5] C. Ramya, G. Kavitha, K. S. Shreedhara, “Preprocessing: A
Prerequisite for Discovering Patterns in Web Usage Mining
Process”, Computing Research Repository - CORR, vol.
abs/1105.0, 2011.
[6] V. Chitraa, Antony Selvdoss Davamani, “A Survey on
Preprocessing Methods for Web Usage Data”, Computing
Research Repository-CORR, Vol. abs/1004.1, 2010.
[7] Nizar R. Mabroukeh, Christie I. Ezeife, “A taxonomy of
sequential pattern mining algorithms”, ACM Computing
Surveys - CSUR, Vol. 43, No. 1, Pp. 1-41, 2010.
[8] Francesco Moscato, Nicola Mazzocca, Valeria Vittorini,
Giusy Di Lorenzo, Paola Mosca, Massimo Magaldi,
“Workflow Pattern Analysis in Web Services”, High
Performance Computing and Communications - HPCC, Pp.
395-400, 2005.
[9] Hussain, T.; Asghar, S.; Masood, N.; “Web usage mining: A
survey on preprocessing of web log file”, International
Conference on Information and Emerging Technologies
(ICIET), Pp. 1 – 6, 2010.
[10] Tanasa, D.; Trousse, B.; “Advanced data preprocessing for
intersites Web usage mining”, IEEE Intelligent Systems, Vol.
19, No. 2, Pp. 59 – 65, 2004.
[11] Othman, Z.A.; Abu Bakar, A.; Hamdan, A.R.; Omar, K.;
Shuib, N.L.M.; “Agent based preprocessing”, International
Conference on Intelligent and Advanced Systems (ICIAS),
Pp. 219 – 223, 2007.
[12] Khasawneh, N.; Chien-Chung Chan; “Active User-Based and
Ontology-Based Web Log Data Preprocessing for Web Usage
Mining”, IEEE/WIC/ACM International Conference on Web
Intelligence, Pp. 325 – 328, 2006.
[13] Tanasa, D.; Trousse, B.; “Data preprocessing for WUM”,
IEEE Potentials, Vol. 23, No. 3, Pp. 22 – 25, 2004.
[14] Xidong Wang; Yiming Ouyang; Xuegang Hu; Yan Zhang;
“Discovery of user frequent access patterns on Web usage
mining”, The Proceedings 8th International Conference on
Computer Supported Cooperative Work in Design, Vol. 1, Pp.
765 – 769, 2004.
[15] Qianhui Althea LIANG; Jen-Yao CHUNG; Steven MILLER;
Yang OUYANG; “Service Pattern Discovery of Web Service
Mining in Web Service Registry-Repository”, IEEE
International Conference on e-Business Engineering (ICEBE
'06), Pp. 286 – 293, 2006.
[16] Etminani, K.; Delui, A.R.; Yanehsari, N.R.; Rouhani, M.;
“Web usage mining: Discovery of the users' navigational
patterns using SOM”, First International Conference on
Networked Digital Technologies (NDT '09), Pp. 224 – 249,
2009.
[17] Nina, S.P.; Rahman, M.; Bhuiyan, K.I.; Ahmed, K.; “Pattern
Discovery of Web Usage Mining”, International Conference
on Computer Technology and Development (ICCTD '09),
Vol. 1, Pp. 499 – 503, 2009.
[18] Cooley, R.; Mobasher, B.; Srivastava, J.; “Web mining:
information and pattern discovery on the World Wide Web”,
Proceedings Ninth IEEE International Conference on Tools
with Artificial Intelligence, Pp. 558 – 567, 1997.
[19] Klos, K.; Kocibova, J.; Lehecka, O.; Kudelka, M.; Snasel,
V.; “Web Page Analysis: Experiments Based on Web
Patterns”, 4th International Conference on Innovations in
Information Technology (IIT '07), Pp. 16 – 20, 2007.
[20] De Lucia, A.; Francese, R.; Scanniello, G.; Tortora, G.;
“Reengineering Web applications based on cloned pattern
analysis”, Proceedings 12th IEEE International Workshop on
Program Comprehension, Pp. 132 – 141, 2004.
[21] Kudelka, M.; Snasel, V.; Lehecka, O.; El-Qawasmeh, E.;
“Semantic Analysis of Web Pages Using Web Patterns”,
IEEE/WIC/ACM International Conference on Web
Intelligence, Pp. 329 – 333, 2006.
[22] Raju, G.T.; Sudhamani, M.V.; “A novel approach for
extraction of cluster patterns from Web Usage Data and its
performance analysis”, International Conference on Emerging
Trends in Electrical and Computer Technology (ICETECT),
Pp. 718 – 723, 2011.
[23] Yanchun Zhang; Guandong Xu; “Using Web Clustering for
Web Communities Mining and Analysis”, IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent
Agent Technology (WI-IAT '08), Vol. 1, Pp. 20 – 31, 2008.
[24] Kudelka, Milos; Snasel, Vaclav; Lehecka, Ondrej; El-
Qawasmeh, Eyas; “Web content mining using web design
patterns”, IEEE International Conference on Information
Reuse and Integration (IRI), Pp. 232 – 237, 2008.
[25] Kudelka, Milos; Lehecka, Ondrej; Snasel, Vaclav; El-
Qawasmeh, Eyas; “Web pages clustering based on web
patterns”, 2nd International Conference on Digital
Information Management (ICDIM '07), Vol. 2, Pp. 657 – 664,
2007.
[26] Rui Wu; “Clustering Web Access Patterns Based on Hybrid
Approach”, Fifth International Conference on Fuzzy Systems
and Knowledge Discovery (FSKD '08), Vol. 1, Pp. 52 – 56,
2008.