MINING THE E-BUSINESS TO ENHANCE THE MARKET
STRATEGIES OF A COMPANY
ENROLMENT No. - 8103532 8103592
NAME OF THE STUDENT - SHRADDHA SINGH DHRUV GOEL
NAME OF THE SUPERVISOR - Mrs. ARTI GUPTA
May- 2012
Submitted in partial fulfilment of the Degree of
Bachelor of Technology
In
Computer Science Engineering
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING &
INFORMATION TECHNOLOGY
JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, NOIDA
TABLE OF CONTENTS
Chapter No. Topics Page No.
Cover Page
Student Declaration II
Certificate From The Supervisor III
Acknowledgement IV
Abstract V
List Of Figures VI
List Of Tables VII
List Of Symbols And Acronyms VIII
Chapter -1 Introduction 1-8
1.1 General Introduction
1.2 Problem Statement
1.3 Empirical Study
1.4 Current and Open Problems
1.5 Approach To Problem In Terms Of Technology /Platform
To Be Used
1.6 Support For Novelty/ Significance Of Problem
1.7 Solution Approach
Chapter -2 Literature Survey 9-18
2.1 Summary Of Papers
2.2 Diagrammatic Integrated Summary Of The Literature
Studied
Chapter -3 Analysis, Design And Modelling 19-44
3.1 Overall Description Of The Project
3.2 Specific Requirements
3.2.1 External Interfaces
3.2.2 Functions
3.2.3 Performance Requirements
3.2.4 Logical Database Requirements
3.2.5 Design Constraints
3.2.6 Software Attributes (H/W, S/W)
3.3 Design Diagrams
3.3.1 Use Case Diagrams
3.3.2 Class Diagrams / Control Flow Diagrams
3.3.3 Sequence Diagram/Activity Diagrams
3.4 Risk Analysis
3.5 Risk Mitigation Plan
Chapter-4 Implementation And Testing 45-62
4.1 Implementation Details And Issues
4.1.1 Implementation
4.1.2 Debugging
4.1.3 Error And Exception Handling
4.2 Risk Management
4.3 Testing
4.3.1 Testing Plan
4.3.2 Features To Be Tested
4.3.3 Features Not To Be Tested
4.3.4 Approach Taken For Testing
4.3.5 Item Pass/Fail Criteria
4.3.6 Test Cases: For All Features To Be Tested
Chapter -5 Conclusion 63-64
5.1 Conclusion
5.2 Future Work
References 65
Appendices 66-67
Appendix A Work Plan
Appendix B Description of Tool
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another
person nor material which has been accepted for the award of any other degree or diploma of
the university or other institute of higher learning, except where due acknowledgment has
been made in the text.
Place: Signature
Name: Shraddha Singh Dhruv Goel
Date: Enrolment No: 8103532 8103592
CERTIFICATE
This is to certify that the work titled “Mining The E-Business To Enhance The Market
Strategies Of A Company” submitted by Dhruv Goel & Shraddha Singh in partial
fulfilment for the award of Degree of Bachelor of Technology of Jaypee Institute of
Information Technology, Noida has been carried out under my supervision. This work has
not been submitted partially or wholly to any other university or institute for the award of this
or any other degree or diploma.
Signature of Supervisor
Name of Supervisor Mrs. Arti Gupta
Designation Lecturer
Date
ACKNOWLEDGEMENT
A project is an attempt by a student to put his or her best skills to use and conclude with
something productive or useful for understanding the field. This project too has brought us
many ideas and added to our knowledge of the topics covered.
We express our deepest gratitude to our supervisor Mrs. Arti Gupta for her invaluable
guidance and blessings. We are very grateful to her for providing us with an environment to
work on this project successfully. We would like to thank her for her unwavering support
during the entire course of this project work.
Signature of the student :
Name of the student :
Dhruv Goel Shraddha Singh
Enrolment No. : 8103592 8103532
Date :
ABSTRACT
The rapid growth of the Internet is reshaping industries and bringing a massive change to the
business market. Traditional business is undergoing a major transformation into e-business.
Unfortunately, the enormous volume of largely unstructured data on the web, even for a
single commodity, has become a cause of ambiguity for consumers. Extracting valuable
information from such ever-increasing data is an extremely tedious task and is fast becoming
critical to the success of businesses. Data mining is an emerging technology aimed at
discovering patterns in the underlying historical data and identifying trends within data that
go beyond simple analysis. Through the use of sophisticated algorithms, it provides users an
opportunity to identify key attributes of business processes and target opportunities. A new
dimension has been added to data mining by extending this technique to the realm of
e-business, as e-business provides all the right ingredients for successful data mining. Data
mining techniques assist e-businesses to seek and retain the most profitable customers by
analysing customer buying and traversal patterns collected online or offline. Essentially,
e-business companies can improve product quality or sales by anticipating problems before
they occur with the use of data mining techniques. Data mining, in general, is the task of
extracting implicit, previously unknown, valid and potentially useful information from data.
Web mining is the use of data mining techniques to automatically discover and extract useful
information from Web documents and services. Application of web content mining can be
very encouraging in the areas of customer relations modelling, billing records, product
cataloguing and quality management.
Thus, in our project we have worked in the field of WEB TECHNOLOGY AND WEB
MINING: we have developed an efficient e-business process management system,
implemented techniques from the field of web content mining, and studied their impact on
areas specific to business user needs, focusing on the customer as well as the producer. Our
system aims at applying various data mining techniques to the business data extracted from
the web and analysing it, which will in turn help in improving the company's marketing
strategies.
Signature of the students:
Name of the students: Shraddha Singh, Dhruv Goel
Date:
Signature of the supervisor:
Name of the supervisor: Mrs. Arti Gupta
Date:
LIST OF SYMBOLS AND ACRONYMS
HTML – Hyper Text Markup Language
XML – eXtensible Markup Language. XML is a markup language much like HTML; it was
designed to carry data, not to display data.
ARFF – Attribute-Relation File Format. An ARFF file is an ASCII text file that describes a
list of instances sharing a set of attributes. ARFF files have two distinct sections: the first
section is the Header information, which is followed by the Data information.
E.g., the Header of an ARFF file looks like the following:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
INTRODUCTION
GENERAL INTRODUCTION
This project is aimed at developing an efficient business process management system and
applying techniques of web content mining to support the customer's product search and to
gather useful information, so as to analyse the business data and further use it to improve the
market strategies of a company.
Web content mining aims to extract/mine useful information or knowledge from web page
contents. Web content mining is related to but different from data mining and text mining. It
is related to data mining because many data mining techniques can be applied in web content
mining. It is related to text mining because much of the web content is text. However, it is
also quite different from data mining because web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with structured data. Web content mining is
also different from text mining because of the semi-structured nature of the Web, while text
mining focuses on unstructured texts. Web content mining thus requires creative applications
of data mining and/or text mining techniques and also its own unique approaches. The
Internet is probably the world's biggest database. Moreover, its data is available using easily
accessible techniques. Often it is important and detailed data that lets people achieve goals or
use it in various realms. Data is held in various forms: text, multimedia, databases. Web pages
follow the HTML standard, which gives them a degree of structure, but not enough to use
them easily in data mining. A typical website contains, in addition to the main content and
links, various elements such as ads or navigation items. It is also widely known that most of
the data on the Internet is redundant: a lot of information appears on different sites, in more
or less similar form. In the web mining domain, web content mining essentially is an
analogue of data mining techniques for relational databases, since it is possible to find similar
types of knowledge from the unstructured data residing in Web documents. A Web document
usually contains several types of data, such as text, image, audio, video, metadata and
hyperlinks. Some of it is semi-structured, such as HTML documents, or more structured, like
the data in tables or database-generated HTML pages, but most of the data is unstructured
text. The unstructured character of Web data forces web content mining towards a more
complicated approach.
PROBLEM STATEMENT
E-commerce has changed the face of most business functions in competitive enterprises.
Web-Enabled Electronic Business is generating massive amount of data on customer
purchases, browsing patterns, usage times and preferences at an increasing rate.
Unfortunately, the enormous volume of largely unstructured data on the web, even for a
single commodity, has become a cause of ambiguity for consumers. Gathering information
from the web and then extracting valuable information from such ever-increasing data in
order to make proper decisions is an extremely tedious task and is fast becoming critical to
the success of businesses.
EMPIRICAL STUDY
Various tools and software packages are available for the purpose of data mining and web content mining:
WEKA TOOL
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New
Zealand. Weka is free software available under the GNU General Public License. Weka
is a comprehensive set of advanced data mining and analysis tools. The strength of
Weka lies in the area of classification where it covers many of the most current
machine learning (ML) approaches. At its simplest, it provides a quick and easy way to
explore and analyze data. Weka is also suitable for dealing with large data where the
resources of many computers and/or multi-processor computers can be used in
parallel. Weka also allows for data to be pulled directly from database servers as well
as web servers. Its native data format is known as the ARFF format.
WEKA consists of
• Explorer
• Experimenter
• Knowledge flow
• Simple Command Line Interface
• Java interface
Weka has a comprehensive set of classification tools. Many of these algorithms are
very new and reflect an area of active development. We will only be examining the
tree-based classifiers, but this is only a very small part of all the classification methods
available in Weka. There are 11 tree algorithms, and 71 algorithms in all.
MOZENDA SOFTWARE
Intuitive software that allows you to mine data in just minutes. Mozenda is a
Software-as-a-Service company that enables users of all types to easily and affordably extract and
manage web data. With Mozenda, users can set up agents that routinely extract data,
store data, and publish data to multiple destinations. Once information is in the
Mozenda systems users can format, repurpose, and mash up the data to be used in other
online/offline applications.
CURRENT AND OPEN PROBLEMS
In today's era, where the entire world has become a global village and the driving force is the
Internet, spanning e-business, internet blogs and search engines, the major question in front
of business users is how to retain existing customers and understand the patterns and trends
of customer behaviour, so that their decisions can be supported with facts represented
through visualizations and appropriate reporting made possible with web mining. There is
also huge competition amongst companies, and in order to stay ahead of others one needs to
differentiate the products one is selling and identify the strong and weak points of the
competitors.
Thus, some relevant problems are listed as follows:
Very high data volumes and data flow rates
Complex, structured, semi-structured, and unstructured data
A growing trend among companies, organizations and individuals alike to gather
information to utilize it for their interest.
Need to unearth hidden relationships among various attributes of data and between
several snapshots of data over a period of time. These hidden patterns have enormous
potential in predictions and personalisation in e-commerce
Need of organized data for analysis in order to improve market strategies
Information Extraction for Catalogue Creation, Service Discovery
APPROACH TO THE PROBLEM IN TERMS OF TECHNOLOGY
USED
INTRODUCTION TO .NET Framework
The .NET Framework is a new computing platform that simplifies application development
in the highly distributed environment of the Internet. The .NET Framework is designed to
fulfill the following objectives:
To provide a consistent object-oriented programming environment whether object code is
stored and executed locally, executed locally but Internet-distributed, or executed
remotely.
To provide a code-execution environment that minimizes software deployment and
versioning conflicts.
To provide a code-execution environment that guarantees safe execution of code,
including code created by an unknown or semi-trusted third party.
To provide a code-execution environment that eliminates the performance problems of
scripted or interpreted environments.
To make the developer experience consistent across widely varying types of applications,
such as Windows-based applications and Web-based applications.
To build all communication on industry standards to ensure that code based on the .NET
Framework can integrate with any other code.
.NET FRAMEWORK CLASS LIBRARY
The .NET Framework class library is a collection of reusable types that tightly integrate with
the common language runtime. The class library is object oriented, providing types from
which your own managed code can derive functionality. This not only makes the .NET
Framework types easy to use, but also reduces the time associated with learning new features
of the .NET Framework. In addition, third-party components can integrate seamlessly with
classes in the .NET Framework. For example, the .NET Framework collection classes
implement a set of interfaces that you can use to develop your own collection classes. Your
collection classes will blend seamlessly with the classes in the .NET Framework. As you
would expect from an object-oriented class library, the .NET Framework types enable you to
accomplish a range of common programming tasks, including tasks such as string
management, data collection, database connectivity, and file access. In addition to these
common tasks, the class library includes types that support a variety of specialized
development scenarios. For example, you can use the .NET Framework to develop the
following types of applications and services:
Console applications.
Scripted or hosted applications.
Windows GUI applications (Windows Forms).
ASP.NET applications.
XML Web services.
Windows services.
ACTIVE SERVER PAGES.NET
ASP.NET is a programming framework built on the common language runtime that can be
used on a server to build powerful Web applications. ASP.NET offers several important
advantages over previous Web development models:
Enhanced Performance. ASP.NET is compiled common language runtime code running
on the server.
World-Class Tool Support. The ASP.NET framework is complemented by a rich
toolbox and designer in the Visual Studio integrated development environment.
WYSIWYG editing, drag-and-drop server controls, and automatic deployment are just a
few of the features this powerful tool provides.
Power and Flexibility. Because ASP.NET is based on the common language runtime,
the power and flexibility of that entire platform is available to Web application
developers. The .NET Framework class library, Messaging, and Data Access solutions
are all seamlessly accessible from the Web. ASP.NET is also language-independent, so
you can choose the language that best applies to your application or partition your
application across many languages.
Simplicity. ASP.NET makes it easy to perform common tasks, from simple form
submission and client authentication to deployment and site configuration.
Manageability. ASP.NET employs a text-based, hierarchical configuration system,
which simplifies applying settings to your server environment and Web applications.
Scalability and Availability. ASP.NET has been designed with scalability in mind, with
features specifically tailored to improve performance in clustered and multiprocessor
environments.
Customizability and Extensibility. ASP.NET delivers a well-factored architecture that
allows developers to "plug-in" their code at the appropriate level.
Security. With built in Windows authentication and per-application configuration, you
can be assured that your applications are secure.
ASP.NET WEB FORMS
The ASP.NET Web Forms page framework is a scalable common language runtime
programming model that can be used on the server to dynamically generate Web pages.
ASP.NET Web Forms pages are text files with an .aspx file name extension. They can be
deployed throughout an IIS virtual root directory tree. When a browser client requests .aspx
resources, the ASP.NET runtime parses and compiles the target file into a .NET Framework
class. This class can then be used to dynamically process incoming requests. ASP.NET
provides syntax compatibility with existing ASP pages. This includes support for <% %>
code render blocks that can be intermixed with HTML content within an .aspx file. These
code blocks execute in a top-down manner at page render time.
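To illustrate, a minimal sketch of an .aspx page using <% %> code render blocks follows; the page content and the loop are purely hypothetical and are not taken from the project.
<%@ Page Language="C#" %>
<html>
<body>
  <h1>Product Catalogue</h1>
  <!-- The render block below executes top-down on the server when the page renders. -->
  <% for (int i = 1; i <= 3; i++) { %>
    <p>Featured product number <%= i %></p>
  <% } %>
  <p>Rendered at <%= DateTime.Now %></p>
</body>
</html>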
INTRODUCTION TO ASP.NET SERVER CONTROLS
In addition to (or instead of) using <% %> code blocks to program dynamic content,
ASP.NET page developers can use ASP.NET server controls to program Web pages. Server
controls are declared within an .aspx file using custom tags or intrinsic HTML tags that
contain a runat="server" attribute value. Intrinsic HTML tags are handled by one of the
controls in the System.Web.UI.HtmlControls namespace. Any tag that doesn't explicitly
map to one of the controls is assigned the type of
System.Web.UI.HtmlControls.HtmlGenericControl. Server controls automatically
maintain any client-entered values between round trips to the server. This control state is not
stored on the server (it is instead stored within an <input type="hidden"> form field that is
round-tripped between requests). Note also that no client-side script is required.
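A minimal sketch of a page using server controls is given below; the control names and the greeting logic are hypothetical. The <asp:...> controls and the span with runat="server" keep their values across postbacks without any client-side script, as described above.
<%@ Page Language="C#" %>
<script runat="server">
  // Runs on the server when the button is clicked; control values survive the round trip.
  void Submit_Click(object sender, EventArgs e)
  {
      Greeting.InnerText = "Hello, " + NameBox.Text;
  }
</script>
<html>
<body>
  <form runat="server">
    <asp:TextBox ID="NameBox" runat="server" />
    <asp:Button ID="SubmitButton" Text="Submit" OnClick="Submit_Click" runat="server" />
    <span id="Greeting" runat="server"></span>
  </form>
</body>
</html>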
ADO.NET
ADO.NET is an evolution of the ADO data access model that directly addresses user
requirements for developing scalable applications. It was designed specifically for the web
with scalability, statelessness, and XML in mind. ADO.NET uses some ADO objects, such as
the Connection and Command objects, and also introduces new objects. Key new
ADO.NET objects include the DataSet, DataReader, and DataAdapter. Some objects are:
Connections. For connection to and managing transactions against a database.
Commands. For issuing SQL commands against a database.
Data Readers. For reading a forward-only stream of data records from a SQL Server data
source.
Datasets. For storing, Removing and programming against flat data, XML data and
relational data.
Data Adapters. For pushing data into a Dataset, and reconciling data against a database.
When dealing with connections to a database, there are two different options: SQL Server
.NET Data Provider (System.Data.SqlClient) and OLE DB .NET Data Provider
(System.Data.OleDb). In these samples we will use the SQL Server .NET Data Provider.
These are written to talk directly to Microsoft SQL Server.
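As a minimal sketch of this pattern (the connection string and the Products table are placeholders, not the project's actual schema), a Connection, Command and DataReader are typically combined as follows:
using System;
using System.Data.SqlClient;

class AdoNetSketch
{
    static void Main()
    {
        // Connection string and table name are hypothetical.
        string connStr = "Server=localhost;Database=Shop;Integrated Security=true;";
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            // Command issues a SQL statement against the database.
            SqlCommand cmd = new SqlCommand("SELECT Name, Price FROM Products", conn);
            // DataReader reads a forward-only stream of records.
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetString(0), reader.GetDecimal(1));
            }
        }
    }
}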
SQL SERVER
SQL Server stores each data item in its own fields. In SQL Server, the fields relating to a
particular person, thing or event are bundled together to form a single complete unit of data,
called a record (it can also be referred to as a row or an occurrence). Each record is made up of
a number of fields. No two fields in a record can have the same field name. During a SQL
Server Database design project, the analysis of your business needs identifies all the fields or
attributes of interest. If your business needs change over time, you define any additional
fields or change the definition of existing fields.
JAVA APPLET
Applets are used to provide interactive features to web applications that cannot be provided
by HTML alone. They can capture mouse input and also have controls like buttons or check
boxes. In response to the user action an applet can change the provided graphic content. This
makes applets well suited for demonstration, visualization and teaching. There are online
applet collections for studying various subjects. An applet can also be a text area only,
providing, for instance, a cross platform command-line interface to some remote system. If
needed, an applet can leave the dedicated area and run as a separate window. A Java applet
extends the class java.applet.Applet.
SUPPORT FOR THE NOVELTY OF THE PROBLEM
Why E-business?
In e-commerce websites you have the ability to sell, advertise, and introduce different kinds
of services and products on the web. E-commerce websites have the advantage of reaching a
large number of customers regardless of distance and time limitations. Furthermore, an
advantage of e-commerce over traditional businesses is the faster speed and the lower
expense for both e-commerce website owners and customers in completing customers'
transactions and orders. Retail websites aim to inspire, reflect a good image of the business
and improve it online. An important factor in having a successful retail website is to know
your competitors: on one hand, identifying their points of strength and trying to benefit from
them by improving on those points and adopting powerful strategies; on the other hand,
identifying your competitors' weak points and avoiding them is good practice in having a
successful retail website.
Web Mining versus Data Mining
Web mining is the use of data mining techniques to automatically discover and extract
information from Web documents and services. When comparing web mining with traditional
data mining, there are three main differences to consider:
1. Scale – In traditional data mining, processing 1 million records from a database would
be a large job. In web mining, even 10 million pages wouldn't be a big number.
2. Access – When doing data mining of corporate information, the data is private and
often requires access rights to read. For web mining, the data is public and rarely
requires access rights.
3. Structure – A traditional data mining task gets information from a database, which
provides some level of explicit structure. A typical web mining task is processing
unstructured or semi-structured data from web pages. Even when the underlying
information for web pages comes from a database, this is often obscured by HTML
mark-up.
Thus, Web Mining can be used to support enterprises to create marketable products.
SOLUTION APPROACH
Developing Business Management Software
Implementing Web Crawler
Implementing Web Extractor
Implementation of Data Mining Techniques
Developing Business Management Software
A business process model for the marketing team, in order to inform the target groups about
its products and services and to enable them to place online orders that can be viewed by the
business managers; in other words, to automate a whole retail store for the sale of products
while providing the customers with the best of services. Standards of security and data
protection mechanisms have been considered for proper usage. The application takes care of
different modules and their associated reports, which are produced as per the applicable
strategies and standards put forward by the administrative staff.
Implementing Web crawler for collection of web data
A web crawler based on path-incremental crawling that applies breadth-first search for
searching the pages linked to a URL. It starts to search as soon as the crawl button on its
interface is pressed. The crawler application is designed in C#, and the searching algorithm,
based on the pseudo code provided in a research paper, is also implemented in C# in
Microsoft Visual Studio 2010.
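A rough sketch of the breadth-first crawling idea is given below; it is not the project's actual code and uses a simple regular expression for link extraction, a placeholder seed URL and a small page limit purely for illustration.
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class CrawlerSketch
{
    static void Main()
    {
        var queue = new Queue<string>();          // frontier of URLs to visit (FIFO = breadth first)
        var visited = new HashSet<string>();      // URLs already fetched
        queue.Enqueue("http://example.com/");     // seed URL (placeholder)

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited.Count < 50)   // small page limit for the sketch
            {
                string url = queue.Dequeue();
                if (!visited.Add(url)) continue;

                string html;
                try { html = client.DownloadString(url); }
                catch (Exception) { continue; }             // skip pages that fail to download

                // Extract absolute links and add unseen ones to the frontier.
                foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
                {
                    string link = m.Groups[1].Value;
                    if (!visited.Contains(link)) queue.Enqueue(link);
                }
                Console.WriteLine("Crawled: " + url);
            }
        }
    }
}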
Implementing Web Extractor/Parser
One of the critical problems in building an extractor is defining a set of extraction rules that
precisely define how to locate the information on the page. For any given item to be extracted
from a page, one needs an extraction rule to locate both the beginning and end of that item.
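As an illustration of such extraction rules (the landmark strings below are hypothetical HTML fragments, not the project's actual rules), locating an item reduces to finding the text between a start landmark and an end landmark:
using System;

class ExtractorSketch
{
    // Returns the text located between a start landmark and an end landmark,
    // or null if either landmark is missing from the page.
    static string ExtractBetween(string page, string startLandmark, string endLandmark)
    {
        int start = page.IndexOf(startLandmark);
        if (start < 0) return null;
        start += startLandmark.Length;
        int end = page.IndexOf(endLandmark, start);
        if (end < 0) return null;
        return page.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        string html = "<div class=\"price\">Rs. 499</div>";
        // Hypothetical extraction rule: the price lies between these two landmarks.
        Console.WriteLine(ExtractBetween(html, "<div class=\"price\">", "</div>"));
    }
}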
Implementation of Data Mining Techniques
The data mining techniques will be applied on the data sets so extracted in order to retrieve
useful information and solve the required queries related to customers in order to enhance the
market strategies and combat the issue of competition by comparing various products and
services. The data can be categorized on the basis of similarity and relationships: the
categorization can be obtained by using classification techniques, while association is an
exploratory method of discovering previously unknown relationships. Thus, applying data
mining techniques to the business data will lead us to achieve the following:
Build unique market segments identifying the attributes of high-value prospects
Select promotional strategies that best reach the client's Web customer segments
Analyze online sales to improve targeting of the client's high-value customers
Test and determine which marketing activities have the greatest impact
Identify client customers most likely to be interested in their new products
LITERATURE SURVEY
SUMMARY OF PAPERS
TITLE: RESEARCH ON DATA MINING IN E-BUSINESS
AUTHOR: Luo Hanyang, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, P.R. China; Gao Jinling, Ji Wenli, College of Management, Shenzhen University, Shenzhen, P.R. China
YEAR OF PUBLICATION: 22 December 2008
PUBLISHING DETAILS: Computer Science and Software Engineering, 2008 International Conference
SUMMARY: Data mining is an emerging technology that can be applied to search for valuable business information in the huge details available in an e-business website's background database. The architecture and sources of information in e-business websites, such as server logs and customer registration information, are introduced along with a brief mention of the data mining techniques applicable in such a scenario. The paper lays emphasis on the main goal of data mining in e-business, which is to mine the customer visiting information, to understand customers' browsing actions and modes, and to find useful market information and provide personalized services. Data mining adopts many techniques, the main methods being: discrimination, association analysis, classification and prediction, cluster analysis and evolution analysis.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4722629&tag=1
TITLE: DATA MINING ON SYMBOLIC KNOWLEDGE EXTRACTED FROM THE WEB
AUTHOR: Rayid Ghani, Rosie Jones, Dunja Mladenić, Kamal Nigam, Seán Slattery; School of Computer Science, Carnegie Mellon University, Pittsburgh; Department for Intelligent Systems, J. Stefan Institute
YEAR OF PUBLICATION: 2000
PUBLISHING DETAILS: Workshop on Text Mining at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000
SUMMARY: As a part of e-business, since it is crucial to know about competitors, one needs to know details such as the products and services offered by them in various domains. The paper discusses creating a dataset by spidering sources on the web and then applying data mining techniques on it. A brief overview of data mining techniques applicable to corporate databases is highlighted. It discusses the need for a web crawler for extracting information from companies' websites and also the need for a wrapper to extract information to augment the crawler's information. It is only once a dataset is available that data mining techniques such as clustering, classification and association can be applied and interesting regularities can be discovered in a company's dataset according to the requirements.
WEB LINK: http://www.kamalnigam.com/papers/shield-kddws00
TITLE: INTEGRATING E-COMMERCE AND DATA MINING: ARCHITECTURE AND CHALLENGES
AUTHOR: Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng
YEAR OF PUBLICATION: 2000
PUBLISHING DETAILS: Appeared in WEBKDD'2000 workshop: Web Mining for E-Commerce -- Challenges and Opportunities; also appeared in ICDM'01: The 2001 IEEE International Conference on Data Mining
SUMMARY: The paper discusses the integration of data mining and e-business, mainly focusing on e-business being a killer domain for data mining. An architecture that successfully integrates data mining with an e-commerce system has been proposed, consisting of three main parts: Business Data Definition, Customer Interaction and Analysis. Business Data Definition covers the data and metadata associated with e-business and the ability to define a rich set of attributes; for example, products can have attributes like size, colour, etc. For a business to be successful, customer interaction plays a major role, and this gives rise to the need for an efficient e-business website. The third component lays emphasis on the analysis of the collected data by various data mining techniques, concerned mainly with customer data. It is through an analysis tool that reports can be generated to obtain varied knowledge about different points, such as the top-selling and worst-selling products. Finally, several challenging problems that need to be addressed for further enhancement of this architecture are highlighted.
WEB LINK: http://ai.stanford.edu/~ronnyk/icdmIntegratingEcom
TITLE: DATA MINING TECHNIQUES AND APPLICATIONS
AUTHOR: Mrs. Bharati M. Ramageri, Lecturer, Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi, Pune, Maharashtra
YEAR OF PUBLICATION: 2009
PUBLISHING DETAILS: Indian Journal of Computer Science and Engineering, Vol. 1 No. 4, 301-305
SUMMARY: Data mining is a process which finds useful patterns from large amounts of data. The goal of this technique is to find patterns that were previously unknown. Once these patterns are found they can further be used to make certain decisions for the development of businesses. Data mining techniques and algorithms such as classification and clustering help in finding the patterns needed to decide upon future business trends. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. Clustering can be described as the identification of similar classes of objects. Data mining has a wide application domain in almost every industry where data is generated, which is why data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in Information Technology.
WEB LINK: http://www.ijcse.com/docs/IJCSE10-01-04-51
TITLE: A SURVEY ON WEB CONTENT MINING AND EXTRACTION OF STRUCTURED AND SEMI-STRUCTURED DATA
AUTHOR: Kshitija Pol, Nita Patil, Shreya Patankar, Chhaya Das; Datta Meghe College of Engineering, Airoli, Navi Mumbai-400708
YEAR OF PUBLICATION: 2008
PUBLISHING DETAILS: Emerging Trends in Engineering and Technology, 2008. ICETET '08. First International Conference, Nagpur, Maharashtra
SUMMARY: Information available on the web is mostly in the form of unstructured data, and as the data on the web is growing at an explosive rate this has led to problems such as extracting potentially useful knowledge and learning about customers and individuals. This paper discusses techniques to represent such data in a structured form, as tables that can be queried for further information, by using web content mining. It explains in detail the unstructured, structured and semi-structured data of the web and techniques for extraction by means of a web crawler and web scraper, highlighting an example of building a structured XML document from web page data. It also discusses the various problems and major challenges of web content mining.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4579960&tag=1
TITLE: WEB MINING IN E-COMMERCE
AUTHOR: Istrate Mihai
YEAR OF PUBLICATION: 2009
PUBLISHING DETAILS: Annals of Faculty of Economics, 2009, vol. 4, issue 1
SUMMARY: The web is a very good place to run a successful business, and it is important to have a successful website to serve as a sales and marketing tool. One of the effective technologies used for this purpose is data mining. Web mining is the usage of data mining techniques to extract interesting information from web data; web content mining is the mining of the data a web page contains, such as a list of products and services. A lot of information needs to be defined before starting to build an e-commerce website, such as identifying the business goals, whether the website is supposed to attract new customers or increase the sales to current customers, whether the proposed website will increase the business's overall profit, and also the most suitable tools and techniques that need to be used or followed in order to meet those requirements. Retail websites aim to inspire, reflect a good image of the business and improve it online. An important factor in having a successful retail website is to know your competitors: identifying their weak and strong points and accordingly acting on them in one's own website is good practice in having a successful retail website.
WEB LINK: http://steconomice.uoradea.ro/anale/volume/2009/v4-management-and-marketing/196.pdf
TITLE: IMPLEMENTATION OF WEB CRAWLER
AUTHOR: Pooja Gupta, Assistant Professor, Lingaya's University; Mrs. Kalpana Johari, Sr. Lecturer, Centre for Development of Advanced Computing, Noida
YEAR OF PUBLICATION: 16-18 Dec. 2009
PUBLISHING DETAILS: Emerging Trends in Engineering and Technology (ICETET), 2009 2nd International Conference, Nagpur
SUMMARY: A web crawler continuously keeps on crawling the web and finds any new web pages that have been added to the web. Crawlers continue visiting the web until local resources, such as storage, are exhausted. The paper sheds some light on the design of the crawler and also on various implementation techniques. In this paper, pattern recognition is also applied to the crawler: when the crawler is started with a keyword, it returns the links related to that keyword, reads the web pages extracted from those links, and while reading each web page it extracts only the content, where content means only the text that is available on the web page.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5395052&tag=1
INTEGRATED SUMMARY OF LITERATURE
THE WEB: OPPORTUNITIES & CHALLENGES
Web offers an unprecedented opportunity and challenge to data mining
The amount of information on the Web is huge, and easily accessible.
The coverage of Web information is very wide and diverse. One can find information
about almost anything.
Information/data of almost all types exist on the Web, e.g., structured tables, texts,
multimedia data, etc.
Much of the Web information is semi-structured due to the nested structure of HTML
code.
Much of the Web information is linked. There are hyperlinks among pages within a
site, and across different sites.
Much of the Web information is redundant. The same piece of information or its
variants may appear in many pages.
The Web is noisy. A Web page typically contains a mixture of many kinds of
information, e.g., main contents, advertisements, navigation panels, copyright notices,
etc.
The Web is also about services. Many Web sites and pages enable people to perform
operations with input parameters, i.e., they provide services.
The Web is dynamic. Information on the Web changes constantly. Keeping up with the
changes and monitoring the changes are important issues.
Above all, the Web is a virtual society. It is not only about data, information and
services, but also about interactions among people, organizations and automatic
systems, i.e., communities.
MINING THE WEB
When extracting Web content information using web mining, there are four typical steps.
1. Collect – fetch the content from the Web
2. Parse – extract usable data from formatted data (HTML, PDF, etc)
3. Analyze – tokenize, rate, classify, cluster, filter, sort, etc.
4. Produce – turn the results of analysis into something useful (report, search index,
etc)
CRAWLING
A Web crawler (also known as a Web spider or Web robot) is a program or automated script
which browses the World Wide Web in a methodical and automated manner. This process is
called Web crawling or spidering. Many legitimate sites, in particular search engines, use
spidering as a means of providing up-to-date data. Following are some reasons to use a web
crawler:
To maintain mirror sites for popular Web sites.
To test web pages and links for valid syntax and structure.
A typical web crawler starts by parsing a specified web page: noting any hypertext links on
that page that point to other web pages. The Crawler then parses those pages for new links,
and so on, recursively. The crawler simply sends HTTP requests for documents to other
machines on the Internet, just as a web browser does when the user clicks on links. All the
crawler really does is to automate the process of following links. There are two important
characteristics of the Web that generate a scenario in which Web crawling is very difficult:
1. Large volume of Web pages.
2. Rate of change on web pages.
The difficulties in implementing an efficient web crawler make it clear that bandwidth for
conducting crawls is neither infinite nor free. So, it is becoming essential to crawl the web in
a way that is not only scalable but also efficient, if some reasonable quality or freshness of
web pages is to be maintained. This means that a crawler must carefully choose at each step
which pages to visit next. Thus the implementer of a web crawler must define its behaviour.
Defining the behaviour of a Web crawler is the outcome of a combination of the
below-mentioned strategies:
Selecting the better algorithm to decide which page to download.
Strategizing how to re-visit pages to check for updates.
Strategizing how to avoid overloading websites.
We intend the crawler to download as many resources as possible from a particular Web site.
To that end, a crawler would ascend to every path in each URL that it intends to crawl. For
example, when given a seed URL of http://foo.org/a/b/page.html, it will attempt to crawl
/a/b/, /a/, and /. The advantage of a path-ascending crawler is that it is very effective in
finding isolated resources, or resources for which no inbound link would have been found in
regular crawling. Thus, a crawler must have a good crawling strategy, as noted in the
previous sections, but it also needs a highly optimized architecture.
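A minimal sketch of how the ascending paths could be derived from a seed URL is given below; the helper name is hypothetical, and it simply illustrates the /a/b/, /a/, / expansion described above.
using System;
using System.Collections.Generic;

class PathAscendingSketch
{
    // Given a seed URL, returns every ancestor path up to the site root,
    // e.g. http://foo.org/a/b/page.html -> .../a/b/, .../a/, .../
    static IEnumerable<string> AscendingPaths(string seedUrl)
    {
        var uri = new Uri(seedUrl);
        string path = uri.AbsolutePath;                    // "/a/b/page.html"
        int slash = path.LastIndexOf('/');
        while (slash >= 0)
        {
            yield return uri.GetLeftPart(UriPartial.Authority) + path.Substring(0, slash + 1);
            path = path.Substring(0, slash);               // drop the last segment
            slash = path.LastIndexOf('/');
        }
    }

    static void Main()
    {
        foreach (string p in AscendingPaths("http://foo.org/a/b/page.html"))
            Console.WriteLine(p);                          // prints .../a/b/, .../a/, .../
    }
}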
EXTRACTING DATA FROM WEB PAGE
In order to gather data in structured form from the highly unstructured web data, we need to
extract the contents of a web page excluding the advertisements and other needless
information, and gather important information such as the product catalogues of various
websites. A wrapper is a piece of software that enables a semi-structured Web source to
be queried as if it were a database. These are sources where there is no explicit structure or
schema, but there is an implicit underlying structure. One of the critical problems in building
a wrapper is defining a set of extraction rules that precisely define how to locate the
information on the page. For any given item to be extracted from a page, one needs an
extraction rule to locate both the beginning and end of that item. Since, in our framework,
each document consists of a sequence of tokens (e.g., words, numbers, HTML tags, etc), this
is equivalent to finding the first and last tokens of an item. A key idea underlying our work is
that the extraction rules are based on "landmarks" (i.e., groups of consecutive tokens) that
enable a wrapper to locate the start and end of the item within the page. XML has made it
possible to improve data presentation and redefine the way in which documents and data are
exchanged. Most websites are in HTML and need to be converted to XML, as data sets
available in XML can be converted to CSV (Comma Separated Values) or ARFF (Attribute-
Relation File Format). Conversion to these file formats makes it easy to use XML datasets.
One can exploit XML hierarchy levels using these file formats. An ARFF (Attribute-Relation
File Format) file is an ASCII text file that describes a list of instances sharing a set of
attributes. ARFF files have two distinct sections. The first section is the Header information,
which is followed by the Data information. The Header of the ARFF file contains
the name of the relation, a list of the attributes (the columns in the data), and their types. The
CSV file is used and is stored in the database. XML files can also be stored directly to the
database at different levels. An XML document along with its associated schema is input into
an XML parser. The parser checks that the document is well formed and, if the schema is also
available, checks that the XML is valid according to what has been defined in the schema.
Because the schema is itself an XML document, it can in turn be validated against another
schema. The parser then provides access methods for another application to
access the data that was contained within the original XML document.
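A minimal sketch of this parse-and-validate step using the .NET XML APIs is given below; the file names are placeholders and the schema is assumed to exist.
using System;
using System.Xml;
using System.Xml.Schema;

class XmlValidationSketch
{
    static void Main()
    {
        var settings = new XmlReaderSettings();
        // Associate the (hypothetical) schema with the reader and enable validation.
        settings.Schemas.Add(null, "products.xsd");
        settings.ValidationType = ValidationType.Schema;
        settings.ValidationEventHandler += (sender, e) =>
            Console.WriteLine("Validation problem: " + e.Message);

        using (XmlReader reader = XmlReader.Create("products.xml", settings))
        {
            // Reading the document checks well-formedness and schema validity,
            // and gives the application access to the contained data.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
                    Console.WriteLine("Found product element");
            }
        }
    }
}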
DATA MINING TECHNIQUES
Data mining is needed in many fields to extract the useful information from the large amount
of database. Large amount of data is maintained in every field to keep different records.
Scientific data, medical data, demographic data, financial data, marketing data etc are the
type of database maintained in different fields. So, different ways were found to
automatically analyze the data, to summarize it, to discover and characterize trends in it and
to automatically flag anomalies. Various techniques were introduced by the different
researchers. These techniques were used to do classification, to do clustering, to find
interesting patterns etc. Data mining is the discovery of knowledge and useful information
from the large amounts of data stored in databases. Also referred to as knowledge discovery
from databases (KDD), it is the automated or convenient extraction of patterns representing
knowledge implicitly stored in large databases. Data mining tools predict future trends and
behaviours, allowing businesses to make proactive, knowledge driven decisions. Data mining
is becoming an increasingly important tool to transform these data into information. Data
mining can also be referred as knowledge mining or knowledge discovery from data. Many
techniques are used in data mining to extract patterns from large amount of database, for
example: Association rule Analysis, Classification.
CLASSIFICATION METHODS
Classification is a data mining technique used to predict group membership for data
instances.
Naïve Bayes Classifier: The Naïve Bayes classifier works on a simple, but comparatively
intuitive concept. In some cases it is also seen that Naïve Bayes outperforms many
other comparatively complex algorithms. It makes use of the variables contained in the data
sample, by observing them individually, independent of each other. The Naïve Bayes
classifier is based on the Bayes rule of conditional probability. It makes use of all the
attributes contained in the data, and analyses them individually as though they are equally
important and independent of each other.
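For reference, the standard form of this rule (stated here for completeness, not reproduced from the report) is: given attribute values a1, ..., an and a class C, the classifier picks the class maximising the product of the class prior and the individual attribute likelihoods, treating the attributes as independent:
P(C \mid a_1, \ldots, a_n) \propto P(C) \prod_{i=1}^{n} P(a_i \mid C), \qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(a_i \mid C)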
J48 Decision Trees: A decision tree is a predictive machine-learning model that decides the
target value (dependent variable) of a new sample based on various attribute values of the
available data. The internal nodes of a decision tree denote the different attributes; the
branches between the nodes tell us the possible values that these attributes can have in the
observed samples, while the terminal nodes tell us the final value (classification) of the
dependent variable. The attribute that is to be predicted is known as the dependent variable,
since its value depends upon, or is decided by, the values of all the other attributes. The other
attributes, which help in predicting the value of the dependent variable, are known as the
independent variables in the dataset. The J48 Decision tree classifier follows the following
simple algorithm. In order to classify a new item, it first needs to create a decision tree based
on the attribute values of the available training data. So, whenever it encounters a set of items
(training set) it identifies the attribute that discriminates the various instances most clearly.
The feature that is able to tell us the most about the data instances, so that we can classify
them best, is said to have the highest information gain. Now, among the possible values of this
feature, if there is any value for which there is no ambiguity, that is, for which the data
instances falling within its category have the same value for the target variable, then we
terminate that branch and assign to it the target value that we have obtained. For the other
cases, we then look for another attribute that gives us the highest information gain. Hence we
continue in this manner until we either get a clear decision of what combination of attributes
gives us a particular target value, or we run out of attributes. In the event that we run out of
attributes, or if we cannot get an unambiguous result from the available information, we
assign this branch a target value that the majority of the items under this branch possess. Now
that we have the decision tree, we follow the order of attribute selection as we have obtained
for the tree. By checking all the respective attributes and their values with those seen in the
decision tree model, we can assign or predict the target value of this new instance.
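To make the notion of information gain concrete, the following small sketch (illustrative only, not project code; the attribute and class values are made up) computes the entropy of a set of class labels and the information gain obtained by splitting on one attribute:
using System;
using System.Collections.Generic;
using System.Linq;

class InformationGainSketch
{
    // Entropy of a list of class labels: -sum(p * log2(p)).
    static double Entropy(IEnumerable<string> labels)
    {
        var list = labels.ToList();
        return list.GroupBy(l => l)
                   .Select(g => (double)g.Count() / list.Count)
                   .Sum(p => -p * Math.Log(p, 2));
    }

    // Information gain of splitting (attributeValue, classLabel) pairs by the attribute.
    static double InformationGain(IList<Tuple<string, string>> rows)
    {
        double before = Entropy(rows.Select(r => r.Item2));
        double after = rows.GroupBy(r => r.Item1)
                           .Sum(g => ((double)g.Count() / rows.Count) *
                                     Entropy(g.Select(r => r.Item2)));
        return before - after;
    }

    static void Main()
    {
        // Hypothetical training fragment: (outlook, buysProduct).
        var rows = new List<Tuple<string, string>>
        {
            Tuple.Create("sunny", "no"),  Tuple.Create("sunny", "no"),
            Tuple.Create("rainy", "yes"), Tuple.Create("overcast", "yes"),
            Tuple.Create("rainy", "yes"), Tuple.Create("overcast", "yes"),
        };
        Console.WriteLine("Information gain = " + InformationGain(rows));
    }
}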
ASSOCIATION RULES
Association rule mining, one of the most important and well-researched techniques of data
mining, aims to extract interesting correlations, frequent patterns, associations or causal
structures among sets of items in transaction databases or other data repositories.
Let I = {i1, i2, ..., im} be a set of m distinct items (attributes), and let T be a transaction that
contains a set of items such that T ⊆ I. An association rule is an implication of the form
X ⇒ Y, where X, Y ⊂ I are sets of items called itemsets, and X ∩ Y = ∅. X is called the
antecedent while Y is called the consequent; the rule means X implies Y. There are two important basic
measures for association rules: support(s) and confidence(c). Since the database is large and
users are concerned only with frequently purchased items, thresholds of support and
confidence are usually pre-defined by users to drop those rules that are not so interesting or useful.
The two thresholds are called minimal support and minimal confidence, respectively;
additional constraints for interesting rules can also be specified by the users. The two basic
parameters of Association Rule Mining (ARM) are support and confidence.
Support(s) of an association rule is defined as the percentage/fraction of records that contain
X U Y to the total number of records in the database. The count for each item is increased by
one every time the item is encountered in a different transaction T in database D during the
scanning process. This means the support count does not take the quantity of the item into
account. For example, if in a transaction a customer buys three bottles of milk, we still only
increase the support count of {milk} by one; in other words, if a transaction contains an item
then the support count of that item is increased by one. Support(s) is calculated by the
following formula:
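The formula itself is not reproduced in this copy; consistent with the definition above, it can be written as:
\mathrm{support}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|D|}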
From the definition we can see that the support of an item is a statistical significance measure
of an association rule. Suppose the support of an item is 0.1%; this means that only 0.1 percent
of the transactions contain a purchase of this item. The retailer will not pay much attention to
items that are not bought so frequently; obviously, a high support is desired for more
interesting association rules. Before the mining process, users can specify a minimum support
as a threshold, which means they are only interested in association rules generated from those
item sets whose supports exceed that threshold. However, sometimes even when the item sets
are not as frequent as the threshold requires, the association rules generated from them are
still important. For example, in a supermarket some items are very expensive and
consequently are not purchased as often as the threshold requires, but association rules
between those expensive items are as important to the retailer as those for other frequently
bought items.
Confidence(c) of an association rule is defined as the percentage/fraction of the number of
transactions that contain X U Y to the total number of records that contain X, where if the
percentage exceeds the threshold of confidence an interesting association rule X=>Y can be
generated.
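Written as a formula, consistent with the definition above:
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
= \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|\{T \in D : X \subseteq T\}|}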
Confidence is a measure of the strength of an association rule. Suppose the confidence of the
association rule X=>Y is 80%; this means that 80% of the transactions that contain X also
contain Y. Similarly, to ensure the interestingness of the rules, a minimum confidence is also
pre-defined by users.
APRIORI ALGORITHM: In computer science and data mining, Apriori is a classic
algorithm for learning association rules. Apriori is designed to operate on databases
containing transactions (for example, collections of items bought by customers, or details of a
website visits). The algorithm attempts to find subsets of items which are common to at least a
minimum number C (the cutoff, or minimum support) of the transactions. Apriori uses a
"bottom up" approach, where frequent subsets are extended one item at a time (a step known
as candidate generation), and groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found. Apriori uses breadth-first search
and a hash tree structure to count candidate item sets efficiently.
APRIORI ADVANTAGES/DISADVANTAGES
Advantages
o Uses large item set property
o Easily parallelized
o Easy to implement
Disadvantages
o Assumes transaction database is memory resident.
o Requires many database scans.
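As a rough, memory-resident sketch of the idea (not the project's implementation; the transactions are made up and candidate generation is simplified), frequent itemsets can be grown level by level as follows:
using System;
using System.Collections.Generic;
using System.Linq;

class AprioriSketch
{
    static void Main()
    {
        // Hypothetical transaction database (memory resident, as Apriori assumes).
        var transactions = new List<HashSet<string>>
        {
            new HashSet<string> { "milk", "bread", "butter" },
            new HashSet<string> { "milk", "bread" },
            new HashSet<string> { "bread", "butter" },
            new HashSet<string> { "milk", "bread", "butter" },
        };
        int minSupportCount = 2;

        // Frequent 1-itemsets.
        var frequent = transactions.SelectMany(t => t).Distinct()
            .Select(item => new HashSet<string> { item })
            .Where(c => SupportCount(c, transactions) >= minSupportCount)
            .ToList();

        while (frequent.Count > 0)
        {
            foreach (var itemset in frequent)
                Console.WriteLine("{{{0}}} support={1}",
                    string.Join(",", itemset), SupportCount(itemset, transactions));

            // Candidate generation: extend each frequent k-itemset by one item,
            // then keep only the candidates that still meet the minimum support.
            var items = frequent.SelectMany(s => s).Distinct().ToList();
            frequent = frequent
                .SelectMany(s => items.Where(i => !s.Contains(i))
                                      .Select(i => new HashSet<string>(s) { i }))
                .Distinct(HashSet<string>.CreateSetComparer())
                .Where(c => SupportCount(c, transactions) >= minSupportCount)
                .ToList();
        }
    }

    // Number of transactions that contain every item of the candidate itemset.
    static int SupportCount(HashSet<string> itemset, List<HashSet<string>> transactions)
    {
        return transactions.Count(t => itemset.IsSubsetOf(t));
    }
}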
ANALYSIS, DESIGN AND MODELLING
OVERALL DESCRIPTION OF THE PROJECT
System Interface:
This project aims at improving the market strategies of an e-business website and also at
providing its customers with better options for selecting a product from the various choices.
Figure 1: Process Management System environment
The Business Process Management System will be a hierarchical based system for the
marketing team. The system after careful analysis has been identified to be presented with the
following modules:
Administrator:- In this module the Administrator has the privileges to add all the Target
Groups, Newsletters, and Metrics. He can search all the information about the Target Groups,
and he assigns the work to the Target Group person (Group Manager).
Target Groups:-In this module the Target Groups person has the task given by admin.
Target group persons are Group Manager, Group Leader, and Group Member.
Newsletters:- In this module the admin provides all the product information in the form of
advertisements, and that will be visible to all the group members.
Metrics:- In this module all the Target Groups and the admin can give surveys to each other
based on the products and customers.
Reports:-This module contains all the information about the reports generated by the admin
based on a particular user, a particular quotation, all customers or users, or all quotations
generated by the users.
Authentication:- This module contains all the information about the authenticated users. A
user cannot log in without his username and password; only an authenticated user can enter
his login, where he can see quotations and give a quotation for particular products.
Web crawlers typically identify themselves to a Web server by using the User-agent field of an
HTTP request. Web site administrators typically examine their Web servers' logs and use the user
agent field to determine which crawlers have visited the web server and how often. The user agent
field may include a URL where the Web site administrator may find out more information about the
crawler. It is important for Web crawlers to identify themselves so that Web site administrators can
contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or
they may be overloading a Web server with requests, and the owner needs to stop the crawler.
Identification is also useful for administrators that are interested in knowing when they may expect
their Web pages to be indexed by a particular search engine.
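For instance, a crawler built on the .NET HTTP stack could identify itself roughly as follows; the agent string and contact URL are placeholders, not the project's actual values.
using System;
using System.Net;

class UserAgentSketch
{
    static void Main()
    {
        // Identify the crawler to the web server via the User-agent field.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.UserAgent = "ProjectCrawler/1.0 (+http://example.com/about-crawler)";

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine("Status: " + response.StatusCode);
        }
    }
}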
Web scraping or Web data extraction is a computer software technique of extracting
information from websites. Usually, such software programs simulate human exploration of
the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP),
or embedding certain full-fledged Web browsers, such as Internet Explorer or Mozilla
Firefox. Web scraping focuses more on the transformation of unstructured data on the Web,
typically in HTML format, into structured data that can be stored and analyzed in a central
local database or spreadsheet, which can then be used to compare the sales of various
products and other relevant information to improve our strategies.
User interface:
It is essential to consult the system users and discuss their needs while designing the user
interface. User interface systems can be broadly classified as:
1. User initiated interface: The user is in charge, controlling the progress of the
user/computer dialogue.
2. Computer initiated interface: The computer guides the progress of the user/computer
dialogue; information is displayed and, based on the user's response, the computer takes
action or displays further information.
User initiated interfaces:
User initiated interfaces fall into two approximate classes:
1. Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
2. Forms oriented interface: The user calls up an image of the form on his/her screen and
fills in the form. The forms-oriented interface was chosen for this project; the following
forms are used:
3. A form named 'Web Frame' is used for the crawler. It contains an entry field where the
user can type a valid URL (web address), including the "http://" portion, in the text
field at the top of the application window.
4. A search/stop button is provided in the form, with the help of which the user can either
retrieve the search results or stop the crawler processing whenever required. Hence, to
search from a URL, the search button is clicked.
5. A screen containing a search panel providing an area for the user to input his search
query.
6. A results page which lists the links of the documents relevant to the given query.
Computer-Initiated Interfaces:
The following computer – initiated interfaces were used:
1. The menu system, in which the user is presented with a list of alternatives and chooses
one of them.
2. A question-answer type dialog system, where the computer asks a question and takes
action on the basis of the user's reply.
Communication Interfaces:
The crawler module of the search engine software uses the HTTP protocol to download the
pages from WWW. The user uses the search engine through browser.
Right from the start the system is menu driven: the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the user to a data entry form where the user can key in the data.
Error Message Design:
The design of error messages is an important part of the user interface design. As the user is
bound to commit some error or other while using the system, the system should be
designed to be helpful by providing the user with information regarding the error he/she has
committed.
SPECIFIC REQUIREMENTS
EXTERNAL INTERFACE REQUIREMENTS:
The whole system is controlled by various user-initiated and computer-initiated interfaces,
mainly used for logging into the system and for performing the various operations needed to
carry on the marketing process.
Name of the item Login
Description of purpose The Actor will give the user name and password to the
system. The system will verify the authentication.
Participating actors Admin, User
Entry conditions The actor will enter the system by using username and
password
Exit Conditions If not authenticated, the user should be exited from the system
Quality Requirements Password must satisfy the complexity requirements
Name of the item Admin Registration
Description of purpose The Admin will submit all the details and place in the
application.
Participating actors Admin
Entry conditions Must satisfy all the norms given by the interface site.
Exit Conditions Successful or unsuccessful completion of account creation.
Quality Requirements All fields are mandatory.
Name of the item User Registration
Description of purpose The User must enter all his personal details.
Participating actors User
Entry conditions View Home page
Exit Conditions Registered user should be successfully logged out. An error
message should be displayed on unsuccessful creation.
Quality Requirements Best Error Handling techniques. Check on Mandatory fields.
Name of the item Web crawling
Description of purpose The crawler starts searching the linked web pages from the
given URL.
Participating actors User
Entry conditions View Crawler application
Exit Conditions Error messages displayed if difficulty in searching
Quality Requirements Proper input format of the URL. Check on
mandatory fields.
Name of the item Web Scraping
Description of purpose The scraper application begins extracting content from the
web pages on being given the desired address of the page.
Participating actors User
Entry conditions View scraper application
Exit Conditions Error messages displayed if difficulty in extracting
Quality Requirements Proper input format of the URL of the web page. Check on
mandatory fields.
Name of the item Data Mining Application
Description of purpose To apply techniques of data mining in order to gain
knowledge from business data
Participating actors User
Entry conditions Data set in arff format or xml format
Exit Conditions Error message displayed if improper format
Quality Requirements Proper input of the dataset.
FUNCTIONAL REQUIREMENTS
INPUT DESIGN:
Input design is a part of overall system design. The main objectives during the input design
are as given below:
To produce a cost-effective method of input.
To achieve the highest possible level of accuracy.
To ensure that the input is acceptable and understood by the user.
INPUT STAGES:
The main input stages can be listed as below:
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
INPUT TYPES:
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
External inputs, which are prime inputs for the system.
Internal inputs, which are user communications with the system.
Operational, which are the computer department's communications to the system.
Interactive, which are inputs entered during a dialogue.
INPUT MEDIA:
At this stage a choice has to be made about the input media. To decide on the input
media, consideration has to be given to:
Type of input
Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Storage and handling requirements
Security
Easy to use
Portability
Keeping in view the above description of the input types and input media, it can be said that
most of the inputs are internal and interactive. Since input data is keyed in directly by the
user, the keyboard can be considered the most suitable input device.
OUTPUT DESIGN
Outputs from computer systems are required primarily to communicate the results of
processing to users. They are also used to provide a permanent copy of the results for later
consultation. The various types of outputs in general are:
External Outputs, whose destination is outside the organization.
Internal outputs, whose destination is within the organization and which are the
user's main interface with the computer.
Operational outputs whose use is purely within the computer department.
Interface outputs, which involve the user in communicating directly.
OUTPUT DEFINITION
The outputs should be defined in terms of the following points:
Type of the output
Content of the output
Format of the output
Location of the output
Frequency of the output
Volume of the output
Sequence of the output
It is not always desirable to print or display data exactly as it is held on a computer. It should
be decided which form of output is the most suitable. For example:
Will decimal points need to be inserted?
Should leading zeros be suppressed?
OUTPUT MEDIA:
In the next stage it is to be decided which medium is the most appropriate for the output.
The main considerations when deciding about the output media are:
The suitability for the device to the particular application.
The need for a hard copy.
The response time required.
The location of the users
The software and hardware available.
ERROR AVOIDANCE:
At this stage care is to be taken to ensure that the input data remains accurate from the stage at
which it is recorded up to the stage at which it is accepted by the system. This can be
achieved only by means of careful control each time the data is handled.
ERROR DETECTION:
Even though every effort is made to avoid the occurrence of errors, a small proportion of
errors is still likely to occur. These errors can be discovered by using validations
to check the input data.
DATA VALIDATION:
Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever invalid data is
keyed in, the system immediately prompts the user and the user has to key in the data again,
and the system will accept the data only if the data is correct. Validations have been included
where necessary.
The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed with
popup menus.
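As a minimal sketch of this kind of server-side check (the method name and the tested field are hypothetical, not the project's actual code), the URL entered in the crawler form could be validated before it is accepted:

using System;

class InputValidation
{
    // Returns true only for a well-formed absolute http/https URL;
    // otherwise the caller re-prompts the user for the field.
    static bool IsValidUrl(string input)
    {
        Uri parsed;
        return Uri.TryCreate(input, UriKind.Absolute, out parsed)
               && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }

    static void Main()
    {
        Console.WriteLine(IsValidUrl("http://www.amazon.com"));   // True
        Console.WriteLine(IsValidUrl("not a url"));               // False - the user is asked to re-enter
    }
}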
PERFORMANCE REQUIREMENTS
The product will be sensitive to bottlenecks, particularly if the number of accesses to the
services of the system is large and becomes difficult to manage. The number of crawlers
working at a time is determined dynamically depending on the available bandwidth. The average
response time for a user is 0.36 seconds. The expected accuracy of the output is 90%.
SAFETY REQUIREMENTS
If the crawler issues requests faster than the web server can handle, it may cause the web
server to crash. Hence a website developer should specify the speed supported.
LOGICAL DATABASE REQUIREMENTS
The product makes use of the inbuilt database of Microsoft Visual Studio 2010, which
consists of tables. The tables store the attributes related to each tool available on the tool
box. In the case of the web crawler, the process responds to the client message requesting a list of
URLs to retrieve. The retriever process opens many connections to web servers
simultaneously and downloads contents. The retrieved contents are stored on a local disk of
the client. The retriever returns two lists, retrieved URLs and found URLs. The found URLs
are the links which have been found in retrieved pages. It also extracts new URLs which have
not been retrieved yet and enqueues them. When the scraper is used to extract the various
contents from the web pages, the content is stored in the form of a spreadsheet, and
the large amount of data extracted acts as a database.
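The sketch below illustrates, under simplified assumptions, the enqueue/dequeue behaviour described above: a queue of URLs still to retrieve and a set of URLs already seen, with newly found links added only if they have not been retrieved yet. It is an outline of the idea only, not the project's actual retriever process.

using System;
using System.Collections.Generic;

class UrlFrontier
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Enqueue a URL only if it has not already been scheduled or retrieved.
    public void Enqueue(string url)
    {
        if (seen.Add(url))
            pending.Enqueue(url);
    }

    public bool TryDequeue(out string url)
    {
        if (pending.Count > 0) { url = pending.Dequeue(); return true; }
        url = null;
        return false;
    }

    static void Main()
    {
        var frontier = new UrlFrontier();
        frontier.Enqueue("http://example.org/");          // seed URL from the client
        frontier.Enqueue("http://example.org/page1");     // URL found in a retrieved page
        frontier.Enqueue("http://example.org/");          // duplicate - ignored

        string next;
        while (frontier.TryDequeue(out next))
            Console.WriteLine("retrieve: " + next);       // downloading and link extraction would happen here
    }
}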
GENERAL CONSTRAINTS AND ASSUMPTIONS
DESIGN CONSTRAINTS:
The product will face the following design constraints:
There is limited access for target group members as compared to the administrator.
The web crawler may not succeed in crawling all web sites, depending on their
structure; the same constraint applies to the web scraper.
OTHER CONSTRAINTS:
This product is a web based application; hence a major constraint on the performance will
be due to the bandwidth of the server's web connection. A faster bandwidth will result in
faster crawling of web pages.
SYSTEM ATTRIBUTES
Hardware Requirements:
PC with 160 GB hard-disk and 1 GB RAM
Software Requirements:
WINDOWS OS (Vista)
Visual Studio .Net 2005 Enterprise Edition
Internet Information Server 5.0 (IIS)
Visual Studio .Net Framework (Minimal for Deployment)
SQL Server 2000 Enterprise Edition
DESIGN DIAGRAMS
DATA FLOW DIAGRAMS
Figure 2: Business Process Management
Figure 3: Login Details in System
Figure 4: Web Crawler
Figure 5: Web Scraper
Figure 6: Steps to perform Apriori association mining
USE CASE DIAGRAMS
The business process management use case diagram shows the use cases Add Group, Add
Customer, Place an Order, View Customer Details, Generate Analysis Report, Update Order
Info, Update Product Details and Confirm Order (which <<includes>> Check Prerequisites
Met), with the actors Admin, Manager, Customer, Target Group, Product Services and the
Database. Use case 1 covers Login, with the actors Admin, Group Leader, Group Manager,
Group Member and the Database. Use cases 2-5 cover the crawler and extractor: Pass URL
to Download, Download the Page and Apply Extractor on Web Page, with the actors User
and File.
ACTIVITY DIAGRAMS
Registration Activity Diagram: the user's details are obtained and validated on submission;
on success the user is successfully registered.
Login Activity Diagram: the user name and password are obtained and validated on
submission; the login is then accepted or rejected.
Admin Activity Diagram: the admin's login details are validated; the admin can then process
project details or generate reports, each step being validated on submission.
Employee Activity Diagram: the employee's login details are validated; the employee can
then view task allocation and submit the project status.
Web Crawler Activity Diagram
Web Extractor Activity Diagram
E-R DIAGRAMS
SEQUENCE DIAGRAMS
SEQUENCE DIAGRAM FOR ADMINISTRATOR OF THE SYSTEM: the admin uses the URL to reach the
home page, presses the login button on the login page, and on successful validation is taken
to the admin home page; otherwise control returns to the login page.
SEQUENCE DIAGRAM FOR ADDING EMPLOYEE: from the admin home page the admin clicks the link
for the Add Employee page, enters the employee information and presses the button to save
the data; if validation fails control returns to the Add Employee Info page, otherwise the
data is written to the database and the confirmation page is shown.
SEQUENCE DIAGRAM FOR ADDING PRODUCTS: from the admin home page the admin clicks the link
for the Add Product page, enters the product information and presses the button to save the
data; if validation fails control returns to the Add Product Info page, otherwise the data is
written to the database and the confirmation page is shown.
SEQUENCE DIAGRAM FOR WEB CRAWLER
SEQUENCE DIAGRAM FOR WEB PARSER
RISK ANALYSIS
Risk Id: R-01
Risk Area: Requirements
Classification: Stability
Description: It refers to the degree to which the requirements are changing. The attribute
also includes issues that arise from the inability to control rapidly changing requirements.
Remarks: We estimate 3 possible projected changes to the requirements. These will be a
result of our realization of what is required and not required as we get further into
implementation, as well as a result of interaction with the customer and verification of the
customer's requirements.

Risk Id: R-02
Risk Area: Requirements
Classification: Feasibility
Description: The feasibility attribute refers to the difficulty of implementing a single
technical or operational requirement, or of simultaneously meeting conflicting requirements.
Remarks: Presently no such issue has arisen, as the system is a web interface that can be
implemented with very low risk estimates and provides easy access to users. The system
meets the organization's operating requirements and is also economically feasible.

Risk Id: R-03
Risk Area: Design
Classification: Functionality
Description: It covers functional requirements that may not submit to a feasible design, or
the use of specified algorithms or designs without a high degree of certainty.
Remarks: The techniques of crawling used in the implementation differ slightly from the
ones that were formulated at the time of design, as they had a higher degree of certainty of
satisfying the source requirements.

Risk Id: R-04
Risk Area: Design
Classification: Performance
Description: The performance attribute refers to time-critical performance: user and
real-time response requirements, throughput requirements, performance analyses, and
performance modelling throughout the development cycle.
Remarks: The performance of the process management system may decrease as the number
of user transactions increases, and in the case of the web crawler the searching time may
vary depending on the network traffic.
RISKS | CATEGORY | PROBABILITY | IMPACT | RE (P*I)
Changes in requirements | Stability | 20% | 5 | 1
Requirements are not properly stated | Clarity | 40% | 5 | 2
Technology will not meet expectations | Feasibility | 25% | 3 | 0.75
Lack of development experience | Coding & Implementation | 20% | 3 | 0.6
More stress of users than expected | Safety | 20% | 1 | 0.2
Less reuse than expected | Reliability | 20% | 1 | 0.2
Lack of database stability | Capacity | 40% | 3 | 1.2
Too many development errors | Testing | 50% | 3 | 1.5
Poor quality documentation | Maintainability | 35% | 1 | 0.35
Low estimation of time | Scale | 50% | 1 | 0.5
Poor comments in code | Maintainability | 20% | 1 | 0.2
Impact values: High = 5, Medium = 3, Low = 1
RISK MITIGATION PLAN
Risk | Mitigation Plan
Stability | Re-evaluate user requirements by interacting with the user. Document requirement
and operational procedure deviations.
Clarity | Request for information.
Functionality and Performance | Evaluate through prototyping. Consult other users with
similar requirements to see what their experiences have been with the product.
Safety Issues | Analyze the vulnerability of the system due to untrusted components and
determine if the system can be designed to reduce the vulnerability to an acceptable level.
Capacity and Maintainability | Use market research to determine the size and satisfaction of
the customer base. Conduct demonstrations and prototyping before final selection. Consult
other users with similar requirements.
Security | Select certified products in accordance with the system requirements if such
products are available. Design the system to encapsulate the non-secure products and limit
the vulnerabilities they create.
Morale | Task statements should specify: do early and frequent prototyping; do continuous
market research.
IMPLEMENTATION AND TESTING
IMPLEMENTATION
Business process management system
It was implemented using ASP.NET technologies. The database uses the constructs of MS SQL
Server, and all the user interfaces have been designed using ASP.NET. The database
connectivity is established using the "SQL Connection" methodology.
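A minimal sketch of this style of data access is given below; the connection string, table and column names are assumptions for illustration, not the project's actual schema.

using System;
using System.Data.SqlClient;

class OrderLookup
{
    static void Main()
    {
        // Illustrative connection string; the real one points at the project's MS SQL Server database.
        string connectionString = "Data Source=localhost;Initial Catalog=BusinessDB;Integrated Security=True";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(
                   "SELECT OrderId, Status FROM Orders WHERE CustomerId = @customerId", connection))
        {
            command.Parameters.AddWithValue("@customerId", 42);   // parameterised to avoid SQL injection
            connection.Open();

            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader["OrderId"], reader["Status"]);
            }
        }
    }
}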
Web crawler
The web crawler is implemented as a project in Microsoft Visual Studio and connects to the
web using the relevant constructs and functions available. Many inbuilt classes in Visual
Studio were also used.
Web extractor
The extraction techniques were implemented in order to convert the HTML contents into XML
form. One is a console application developed to gather all the contents into one XML
document, and the other is a Windows Forms application developed with regular-expression
techniques to extract the content. It makes use of the inbuilt Regex class in Visual
Studio.
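As a rough sketch of the regular-expression approach (the pattern below, which pulls image URLs out of downloaded HTML and writes them into an XML document, is only an illustration; the project's actual expressions for images, keywords, phone numbers and so on are more involved):

using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

class RegexExtractor
{
    static void Main()
    {
        // Illustrative page; the project extracts from the pages the crawler has downloaded.
        string html = new WebClient().DownloadString("http://example.org/");

        // Capture the src attribute of every <img> tag (simplified pattern).
        MatchCollection matches = Regex.Matches(html, "<img[^>]*src=[\"'](?<src>[^\"']+)[\"']",
                                                RegexOptions.IgnoreCase);

        XElement root = new XElement("images");
        foreach (Match m in matches)
            root.Add(new XElement("image", m.Groups["src"].Value));

        // The extracted content is stored as an XML document for later mining.
        new XDocument(root).Save("images.xml");
    }
}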
Implementation of data mining techniques
Data mining techniques are used to find useful knowledge in the XML output. We
have implemented the Apriori association algorithm, which takes an XML file as input and
develops the association rules that can be analysed later to obtain useful knowledge. In order
to gain a deeper knowledge of the data mining techniques applied in the field of e-business, we
have also implemented applets in Java that make use of the Weka tool package; they take the
available datasets in ARFF format as input and generate the expected output
that can be used for analysis.
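The core step of the Apriori idea, counting the support of candidate itemsets and keeping only those that meet a minimum support threshold, can be sketched as below. The transactions and the threshold are made-up toy data; the project's implementation additionally reads the items from the XML/ARFF input and generates association rules from the frequent itemsets.

using System;
using System.Collections.Generic;
using System.Linq;

class AprioriSketch
{
    static void Main()
    {
        // Toy transactions; in the project these come from the parsed web data.
        var transactions = new List<string[]>
        {
            new[] { "camera", "memory card", "tripod" },
            new[] { "camera", "memory card" },
            new[] { "camera", "battery" },
            new[] { "memory card", "battery" }
        };
        const int minSupport = 2;   // minimum number of transactions an itemset must appear in

        // Frequent 1-itemsets.
        var frequentItems = transactions.SelectMany(t => t)
                                        .GroupBy(item => item)
                                        .Where(g => g.Count() >= minSupport)
                                        .Select(g => g.Key)
                                        .ToList();

        // Candidate 2-itemsets are built only from frequent items (the Apriori pruning step),
        // then kept if their joint support also meets the threshold.
        for (int i = 0; i < frequentItems.Count; i++)
            for (int j = i + 1; j < frequentItems.Count; j++)
            {
                string a = frequentItems[i], b = frequentItems[j];
                int support = transactions.Count(t => t.Contains(a) && t.Contains(b));
                if (support >= minSupport)
                    Console.WriteLine("{{{0}, {1}}} support = {2}", a, b, support);
            }
    }
}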
ERROR AND EXCEPTION HANDLING
ASP.NET and .NET support a rich error-handling architecture that provides a flexible way to
catch and handle errors at multiple levels within an application. Specifically, a runtime
exception can be caught and handled within a try/catch block, within a page, or at the global
application level using the Application_Error event handler within the Global.asax class.
When the database is down or if the credentials in the connection string are invalid
then the method throws a SqlException. Exceptions were handled by the use of
try/catch/finally blocks.
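A minimal sketch of this pattern, reusing the illustrative connection string from the earlier example (not the project's real one):

using System;
using System.Data.SqlClient;

class DatabaseErrorHandling
{
    static void Main()
    {
        SqlConnection connection = new SqlConnection(
            "Data Source=localhost;Initial Catalog=BusinessDB;Integrated Security=True");
        try
        {
            connection.Open();   // throws SqlException if the server is down or the credentials are invalid
            // ... execute commands here ...
        }
        catch (SqlException ex)
        {
            // Log the database error and show a friendly message instead of crashing the page.
            Console.WriteLine("Database error: " + ex.Message);
        }
        finally
        {
            connection.Close();  // always release the connection
        }
    }
}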
System.NullReferenceException: Object reference not set to an instance of an object.
Compiler Error CS0071: An explicit interface implementation of an event must use
event accessor syntax. When explicitly implementing an event that was declared in an
interface, you must manually provide the add and remove event accessors that are
typically provided by the compiler.
Sys.WebForms.PageRequestManagerParserErrorException: The message received
from the server could not be parsed. Common causes for this error are when the
response is modified by calls to Response.Write(), response filters, HttpModules, or
server trace is enabled. Details: Error parsing near '.It was solved by removing the
button from the update panel.
System.Web.HttpException: An unhandled exception was generated during the
execution of the current web request.
Logic errors were a hurdle in getting the expected results.
For an ArgumentOutOfRangeException exception, the handler writes some text on
the page, provides a link back to the page, logs the error, and notifies system
administrators. For an InvalidOperationException exception, the handler simply
transfers the exception to the Generic Error Page. For any other kind of exception, the
handler does nothing, which allows your site to automatically redirect to the generic
page specified in the Web.config file.
RISK MANAGEMENT
The relevant stages in risk management are risk identification, risk planning and risk
monitoring. Of course, no retrospective assessment could be entirely accurate as risk
management is a process that starts at the beginning of the project and continues
throughout. For risk identification, we examine what possible risks could have
occurred as well as identify the setbacks that did occur. For risk planning, we consider
which risks could have been and can be avoided and what contingency plans could be and have
been made. For risk monitoring, we look at how we have monitored the risks
throughout the project and how we shall continue to do so.
WEIGHTED INTERRELATIONSHIP GRAPH
The weighted interrelationship graph relates the risk areas Performance, Stability, Clarity,
Feasibility, Coding & Implementation, Safety, Open Source Code, External Inputs, Personnel
Related, Scale, Maintainability, Testing, Reliability and Capacity, with edge weights of 9
(high), 3 (medium) and 1 (low) indicating the strength of the interrelationships.
Risk Area Wise Total Weighting Factor
S.No. | Risk Area | # of Risk Statements | Weights (In + Out) | Total Weight | Priority
1 | Performance | 9 | 9+3+3+3 | 18 | 1
2 | Stability | 4 | 9+3 | 12 | 6
3 | Clarity | 3 | 3+3 | 6 | 8
4 | Feasibility | 2 | 3+9 | 12 | 9
5 | Coding & Implementation | 8 | 9+9+3+3 | 24 | 3
6 | Safety | 5 | 3+3 | 6 | 7
7 | Reliability | 3 | 1+3+3 | 7 | 5
8 | Open Source Code | 4 | 3+3+1 | 7 | 4
9 | Testing | 4 | 9+3+1 | 13 | 10
10 | External Input | 3 | 1+3+3 | 7 | 2
TESTING
Type of Test | Will the test be performed? | Comments/Explanations | Software Component

Requirements Testing | Yes | Requirements are testable, clear, consistent and complete with
respect to the specifications; they should not be ambiguous, incomplete or invalid. Ideal
requirements clearly define the expected behaviour under normal usage and exceptional
workflows. | It is done before implementation to check that the requirements are written in a
simple manner, emphasizing the business need only, without forcing implementation methods.

Unit Testing | Yes | Individual units of source code are tested, and the goal is to isolate
each part of the program and show that the individual parts are correct. Unit testing by
definition only tests the functionality of the units themselves; therefore, it will not catch
integration errors or broader system-level errors. | Write test cases for all functions and
methods so that whenever a change causes a fault, it can be quickly identified and fixed.
Performed on all the classes that are independent and not linked with the database and other
classes.

Integration Testing | Yes | The objective of integration testing is to make sure that the
interaction of two or more components produces results that satisfy the functional
requirements. In integration testing, test cases are developed with the express purpose of
exercising the interfaces between the components. Integration testing is complete when all
the interfaces where components interact with each other are covered. | Assumptions are made
on receiving data from different components and passing data to different components.
Integration testing tests a class while it is integrated with other classes and those linked
with the MySQL database, like the target group data and the orders that will be placed.

Performance Testing | Yes | Designed to test the runtime performance of software within the
context of an integrated system. It is used to determine the speed or effectiveness of a
computer, network, software program or device. | Test application performance on different
internet connection speeds. Performed on the web crawler and extractor applications to check
how effectively they search and produce the results, and to evaluate qualitative attributes
such as reliability, scalability and interoperability.

Stress Testing | Yes | Greater emphasis on robustness, availability and error handling under
a heavy load, rather than on what would be considered correct behaviour under normal
circumstances. | Identify the maximum expected number of users during peak load conditions
for the application.

Compliance Testing | Yes | It is basically an audit of a system carried out against a known
criterion. It is related to the IT standards followed by the company, and it is the testing
done to find the deviations from the company's prescribed standards. | Verification that the
intended system under development meets the configuration and lockdown standards requested
by the customer. Database servers: Microsoft SQL Server. Operating systems: Microsoft
Windows Vista, Microsoft Windows Server 2003.

Security Testing | Yes | Attempts to verify the protection mechanisms built into the system.
It is an indispensable part of the Web application development life cycle due to the increase
in privacy breaches in businesses and organizations. | Test by pasting an internal URL
directly into the browser address bar without logging in; internal pages should not open. Try
some invalid inputs in input fields like the login username, password and input text boxes,
and check the system's reaction to all invalid inputs.

Load Testing | Yes | Load testing helps to identify the maximum operating capacity of the
application and any bottlenecks that might be degrading performance. | Response time: for
example, the product catalogue must be displayed in less than 3 seconds. Resource
utilization: a frequently overlooked aspect is the amount of resources the application is
consuming, in terms of processor, memory, disk input/output (I/O) and network I/O.

Volume Testing | Yes | Volume testing refers to testing a software application with a certain
amount of data. | The application is tested with a specific database size by expanding the
database to a particular size and then testing the application's performance on it.

Functionality Testing | Yes | Functionality testing is done for all the links in the web
pages, the database connection, and the forms used in the web pages for submitting or getting
information from the user. | Test the outgoing links from all the pages of the specific domain
under test and test all internal links, links jumping within the same pages, and links used to
send email to the admin or other users from the web pages. Check that data is retrieved
correctly and also updated correctly from the database.
TEST TEAM DETAILS
ROLE NAME SPECIFIC RESPONSIBILITIES
Software Tester Dhruv Goel Performance, Unit, Integration Testing
Software Tester Shraddha Singh Performance, Unit, Integration Testing
Software Tester Dhruv, Shraddha Requirement Testing, Unit Testing, Stress & Load
Testing
TEST SCHEDULE
ACTIVITY | START DATE | COMPLETION DATE | HOURS | COMMENTS
Requirements were analysed | 10/09/2011 | 10/09/2011 | 2 hrs | Gathered requirements were found to be clear, consistent and complete
Requirements were analysed | 10/10/2011 | 10/10/2011 | 2 hrs | Gathered requirements were found to be clear, consistent and complete
Login as administrator with username as admin and password as admin | 10/20/2011 | 10/20/2011 | 30 min | Login successful
Admin employee details | 10/29/2011 | 10/29/2011 | 45 min | Employee details were added and account created; also viewed all employee details
Tasks list and order information | 11/03/2011 | 11/03/2011 | 1 hr | View assigned tasks and view order details, order status
Web crawler connection with URLs | 11/11/2011 | 11/11/2011 | 30 min | Crawler could connect with the web
Crawling process output | 11/15/2011 | 11/15/2011 | 1 hr | Web pages viewed sequentially
Parsing of HTML into XML for amazon.com specifically | 11/18/2011 | 11/18/2011 | 1 hr | XML generated
Web extraction based on regular expressions | 11/20/2011 | 11/20/2011 | 45 min | Got required results for every module like images, keywords, phones etc.
Apriori application | 25/11/2011 | 25/11/2011 | 30 min | Results verified but with simple XML
Data mining applets | 30/11/2011 | 30/11/2011 | 45 min | Got expected results with both algorithms
TEST ENVIRONMENT
SOFTWARE ITEMS
Operating System: Windows Vista
Visual Studio .Net 2005 Enterprise Edition
Internet Information Server 5.0 (IIS)
SQL Server 2000 Enterprise Edition
HARDWARE ITEMS
HDD 20 GB Hard Disk Space and Above
RAM 512MB and Above
FEATURES TO BE TESTED
Admin login
Employee Login
Customer Login
Customer and Group member Interaction via email
Add/update employee information
Add/update customer information
Search / Lookup employee information
Escape to return to Main Menu
Security features
Scaling to 700 employee records
Error messages
Report Printing
Screen mappings (GUI flow). Includes default settings
Order placement by customer
Order confirmation by Group manager/Administrator
Check the resources are efficiently used like processor and network bandwidth.
Web crawler should operate in continuous mode: it should obtain fresh copies of
previously fetched pages.
Web crawler is searching and downloading the web pages efficiently
Extractor is getting the contents of the HTML webpage properly in XML format. The
data mining algorithms are generating the results in the proper manner as expected.
FEATURES NOT TO BE TESTED
Order Entry processes.
Only the data interface of the Order Entry application will be verified. Changes to the
interface to support reassigned sales are not anticipated to have an impact on the Order
Processing application. Order Entry is a separate application sharing the data interface only;
orders will continue to be processed in the same manner.
PC based dataset analysis applications using products data.
These applications are completely under the control of the administrator and are outside the
scope of this project. The necessary data base format information will be provided to the
customers to allow them to extract data. Testing of their applications is the responsibility of
the application maintainer/developer.
Business Analysis functions.
These applications are completely under the control of the management support team and are
outside the scope of this project. The necessary data base format information will be provided
to the support team to allow them to extract data. Testing of their applications is the
responsibility of the application maintainer/developer.
APPROACH FOR TESTING
Unit Testing: Unit testing will be done by the developer and will be approved by the
development team leader.
Validation Testing: At the end of integration testing the software is completely assembled as a
package. Validation testing is the next stage, which can be defined as successful when the
software functions in the manner reasonably expected by the customer. Reasonable
expectations are those defined in the software requirements specifications; the information
contained in those sections forms a basis for the validation testing approach.
System Testing: System testing is actually a series of different tests whose primary purpose is
to fully exercise the computer-based system. Although each test has a different purpose, all
work to verify that all system elements have been properly integrated to perform allocated
functions.
Recovery Testing: It is a system test that forces the system to fail in a variety of ways and
verifies that the recovery is properly performed.
Security Testing: Attempts to verify the protection mechanisms built into the system.
Performance Testing: This method is designed to test runtime performance of software
within the context of an integrated system.
Stress Testing: Stress testing can be defined as performing the sequences of actions at larger
than normal volumes, at faster than normal speeds and for longer than normal periods of time
as a method to accelerate the rate of finding defects and verify the robustness of the product.
Stress testing in its simplest form is any test that repeats a set of actions over and over with the
purpose of "breaking the product".
ITEM PASS/FAIL CRITERIA
The whole project consists of four modules from the business project management system to
the data analysis with data mining techniques. The system is declared pass when it
successfully allows the administrator, group members and customers to login in to the system
with the respective restrictions applicable to all. The customer is able to successfully select
products and place their order with the respective group manager. Further, the group members
are able to view the orders and accordingly accept or reject them depending on the
availability. Moreover, the mining process consists of gathering data by means of the crawler
and then generating useful knowledge from the extracted web data. Thus, the output of the
algorithms shows that the system runs successfully.
TEST CASES
S.No. | Input | Expected behaviour | Status (P = Passed, F = Failed)
1 | Login as administrator with username as admin and password as admin | Home page for administrator should be displayed | Passed
2 | Admin employee details | Add employee details and create the account; view all employee details | Passed
3 | Admin groups | Add new group details and group schedule of employees | Passed
4 | Product details | Adding a new product and viewing all product details | Passed
5 | Reports | Should display all the details of the matrices of customers | Passed
6 | Search | Admin can perform all types of group search and product search | Passed
7 | Group manager details | Group manager can maintain all details of new leaders and assign tasks | Passed
8 | Group leader details | Group leaders can maintain all details of new leaders, assign tasks and view all members | Passed
9 | Tasks list and order information | View assigned tasks and view order details, order status | Passed
10 | Group member details | View all the details of customers | Passed
11 | Order details | View all the details of the order list and order status | Passed
12 | Validate the user inputs | Validate all input and output details | Passed
13 | Target information | View all the customer target information of products | Passed
14 | Customer details | Register all the details of the customer, order the product and check the order status | Passed
15 | Web crawler | Starts to download web pages on taking a URL as input | Passed
16 | Extraction HTML to XML | Takes an HTML web page as input and generates an XML document | Passed
17 | Converting XML to CSV to ARFF | The XML document cannot be converted in all cases because of structures that could not be parsed | Failed
18 | Data mining application | Takes XML or ARFF format as input and gives the mining result for the algorithm used | Passed
CONCLUSION
It has been our great pleasure to work on this exciting and challenging project. This project
proved good for us as it provided practical knowledge not only of programming web-based
applications in ASP.NET and C#.NET, and to some extent Windows applications and SQL
Server, crawling techniques, parsing and various data representation formats, but also
about all the handling procedures related to Business Process Management. It also provides
knowledge about the latest technology used in developing web-enabled applications and client-
server technology that will be in great demand in the future. This will provide better
opportunities and guidance for developing projects independently in the future.
BENEFITS:
The project is identified by the merits of the system offered to the user. The merits of this
project are as follows: -
It's a web-enabled project.
This project allows the user to enter data through simple and interactive forms. This is
very helpful for the client, who can enter the desired information with great simplicity.
The user is mainly concerned with the validity of the data he is entering.
There are checks at every stage of any new creation, data entry or updation, so that the
user cannot enter invalid data, which could create problems at a later date.
Sometimes the user finds in the later stages of using the project that he needs to update some
of the information that he entered earlier. There are options by which he can
update the records. Moreover, there is a restriction that he cannot change the primary
data field. This preserves the validity of the data to a longer extent.
The user is provided with the option of monitoring the records he entered earlier. He can see the
desired records with the variety of options provided to him.
From every part of the project the user is provided with links through framing so that
he can go from one option of the project to another as required. This is bound to
be simple and very friendly as far as the user is concerned. That is, we can say that the
project is user friendly, which is one of the primary concerns of any good project.
Data storage and retrieval will become faster and easier to maintain because data is stored
in a systematic manner and in a single database.
The decision-making process would be greatly enhanced because of the faster processing of
information, since data collection from information available on the computer takes much less
time than in a manual system.
Allocation of sample results becomes much faster because the user can see the
records of previous years at one time.
Easier and faster data transfer through latest technology associated with the computer and
communication.
Through these features it will increase the efficiency, accuracy and transparency.
LIMITATIONS:
The size of the database increases day-by-day, increasing the load on the database back
up and data maintenance activity.
Training for simple computer operations is necessary for the users working on the
system.
Certain websites have quite different structures, and thus the XML document
generated by the HTML parser proves to be a hurdle in the way of getting a proper dataset
in ARFF format to be analyzed using the data mining application.
FUTURE WORK
The system is just in its initial phase and has to be made comparable to other e-business
sites dominant in the market and provide the best of facilities to its customers and also
to the member teams.
This System being web-based and an undertaking of Cyber Security Division, needs to be
thoroughly tested to find out any security gaps.
A console for the data centre may be made available to allow the personnel to monitor on
the sites which were cleared for hosting during a particular period.
Moreover, it is just a beginning; further the system may be utilized in various other types
of auditing operation viz. Network auditing or similar process/workflow based
applications.
We have focused only on the web content mining area of web mining; it can further be
extended to web structure and web usage mining in order to automatically identify the
location of the customers buying the products by means of networking and IP address logs.
Currently, a lot of research work is active in this field.
REFERENCES
1. Luo Hanyang, Gao Jinling, Ji Wenli, "Research on Data Mining in E-business",
International Conference on Computer Science and Software Engineering, 2008.
2. Ansari S., Kohavi R., Mason L., Zijian Zheng, "Integrating E-commerce and Data
Mining: Architecture and Challenges", Proceedings of the IEEE International Conference on
Data Mining, 2001, pp. 27-34.
3. Bing Liu, Kevin Chen-Chuan Chang, "Editorial: Special Issue on Web Content Mining",
Issue 2, 2004.
4. Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, Sean Slattery, "Data
Mining on Symbolic Knowledge Extracted from the Web", 2009.
5. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S.
Slattery, "Learning to Construct Knowledge Bases from the World Wide Web", 2000.
6. Istrate Mihai, "Web Mining in E-Commerce", 2008.
7. N. Girija, "Web Mining", ICFAI University Press, Hyderabad, India, 2006.
8. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd ed.,
Morgan Kaufmann, 2006.
9. Tutorial on Web Scraping:
http://www.codediesel.com/php/web-scraping-in-php-tutorial/
10. www.mozenda.com
11. www.kdnuggets.com
12. http://www.devbistro.com/articles/Misc/Implementing-Effective-Web-Crawler
13. http://www.codeproject.com/KB/IP/Crawler.aspx
14. http://www.progmic.com/2010/03/how-to-make-web-crawler-in-c#/
15. Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
16. Dr. Lee, Sin-Min, San Jose State University
17. Mu-Yu Lu, SJSU
18. Silberschatz, Korth, Sudarshan, "Database System Concepts"
19. D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, Li Xu,
Department of Computer Science
20. Kotb, Y., Gondow, K., Katayama, T., "XML Semantics", in: Scime, A. (Ed.), Web
Mining: Applications and Techniques, Idea, London, pp. 169-188.
APPENDICES
APPENDIX A: WORK PLAN
S.No. START DATE END DATE WORK
DESCRIPTION
1 26/7/2011 31/7/2011 LITERATURE
SURVEY
2 4/8/2011 20/9/2011 Studied about the
various techniques
and approach
applicable
3 5/9/2011 15/9/2011 Worked on
developing the
business process
system different
modules with its
testing
4 5/10/2011 21/10/2011 Implemented the
web crawler class
+testing
5 5/11/2011 15/11/2011 Extraction tool
based on regular
expressions
6 7/11/2011 13/11/2011 Parser from html to
xml
7 15/11/2011 20/11/2011 Implemented the
Apriori algorithm
8 21/11/2011 23/11/2011 Gathering data sets
in arff format
9 22/11/2011 24/11/2011 Applets were
implemented for
data mining
algorithms +testing
10 26/11/2011 5/12/2011 Working on parsing
xml to arff
11 3/12/2011 6/12/2011 Debugging errors of
parsing
12 6/12/2011 7/12/2011 Testing of the whole
project
APPENDIX –B
DESCRIPTION OF TOOLS USED IN IMPLEMENTATION
Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. It
is used to develop console and graphical user interface applications along with Windows
Forms applications, web sites, web applications, and web services in both native code
together with managed code for all platforms supported by Microsoft Windows, Windows
Mobile, Windows CE, .NET Framework, .NET Compact Framework and Microsoft
Silverlight. Visual Studio includes a code editor supporting IntelliSense as well as code
refactoring. The integrated debugger works both as a source-level debugger and a machine-
level debugger. Other built-in tools include a forms designer for building GUI applications,
web designer, class designer, and database schema designer. It accepts plug-ins that enhance
the functionality at almost every level—including adding support for source-control systems
(like Subversion and Visual SourceSafe) and adding new toolsets like editors and visual
designers for domain-specific languages or toolsets for other aspects of the software
development lifecycle (like the Team Foundation Server client: Team Explorer).Visual
Studio supports different programming languages by means of language services, which
allow the code editor and debugger to support (to varying degrees) nearly any programming
language, provided a language-specific service exists. Built-in languages include C/C++ (via
Visual C++), VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual
Studio 2010). Support for other languages such as M, Python, and Ruby among others is
available via language services installed separately. It also supports XML/XSLT,
available via language services installed separately. It also supports XML/XSLT,
HTML/XHTML, JavaScript and CSS. Individual language-specific versions of Visual Studio
also exist which provide more limited language services to the user: Microsoft Visual Basic,
Visual J#, Visual C#, and Visual C++. Microsoft provides "Express" editions of its Visual
Studio 2010 components Visual Basic, Visual C#, Visual C++, and Visual Web Developer at
no cost. Visual Studio 2010, 2008 and 2005 Professional Editions, along with language-
specific versions (Visual Basic, C++, C#, J#) of Visual Studio Express 2010 are available for
free to students as downloads via Microsoft's DreamSpark program.
NetBeans refers to both a platform framework for Java desktop applications, and an
integrated development environment (IDE) for developing with Java, JavaScript, PHP,
Python (no longer supported after NetBeans 7), Groovy, C, C++, Clojure, and others. The
NetBeans IDE 7.0 no longer supports Ruby and Ruby on Rails, but a third party has begun
work on a separate plug-in. The NetBeans IDE is written in Java and can run anywhere a
compatible JVM is installed, including Windows, Mac OS, Linux, and Solaris. A JDK is
required for Java development functionality, but is not required for development in other
programming languages. The NetBeans platform allows applications to be developed from a
set of modular software components called modules. Applications based on the NetBeans
platform (including the NetBeans IDE) can be extended by third party developers. The
NetBeans Platform is a reusable framework for simplifying the development of Java Swing
desktop applications. The NetBeans IDE bundle for Java SE contains what is needed to start
developing NetBeans plug-in and NetBeans Platform based applications; no additional SDK
is required. Applications can install modules dynamically. Any application can include the
Update Center module to allow users of the application to download digitally-signed
upgrades and new features directly into the running application. Reinstalling an upgrade or a
new release does not force users to download the entire application again. The platform offers
reusable services common to desktop applications, allowing developers to focus on the logic
specific to their application.