MINING THE E-BUSINESS TO ENHANCE THE MARKET
STRATEGIES OF A COMPANY
ENROLMENT No. - 8103532 8103592
NAME OF THE STUDENT - SHRADDHA SINGH DHRUV GOEL
NAME OF THE SUPERVISOR - Mrs. ARTI GUPTA
May- 2012
Submitted in partial fulfilment of the Degree of
Bachelor of Technology
In
Computer Science Engineering
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING &
INFORMATION TECHNOLOGY
JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, NOIDA
TABLE OF CONTENTS
Chapter No. Topics Page No.
Cover Page
Student Declaration II
Certificate From The Supervisor III
Acknowledgement IV
Abstract V
List Of Figures VI
List Of Tables VII
List Of Symbols And Acronyms VIII
Chapter -1 Introduction 1-8
1.1 General Introduction
1.2 Problem Statement
1.3 Empirical Study
1.4 Current and Open Problems
1.5 Approach To Problem In Terms Of Technology /Platform
To Be Used
1.6 Support For Novelty/ Significance Of Problem
1.7 Solution Approach
Chapter -2 Literature Survey 9-18
2.1 Summary Of Papers
2.2 Diagrammatic Integrated Summary Of The Literature
Studied
Chapter -3 Analysis, Design And Modelling 19-44
3.1 Overall Description Of The Project
3.2 Specific Requirements
3.2.1 External Interfaces
3.2.2 Functions
3.2.3 Performance Requirements
3.2.4 Logical Database Requirements
3.2.5 Design Constraints
3.2.6 Software Attributes (H/W, S/W)
3.3 Design Diagrams
3.3.1 Use Case Diagrams
3.3.2 Class Diagrams / Control Flow Diagrams
3.3.3 Sequence Diagram/Activity Diagrams
3.4 Risk Analysis
3.5 Risk Mitigation Plan
Chapter-4 Implementation And Testing 45-62
4.1 Implementation Details And Issues
4.1.1 Implementation
4.1.2 Debugging
4.1.3 Error And Exception Handling
4.2 Risk Management
4.3 Testing
4.3.1 Testing Plan
4.3.2 Features To Be Tested
4.3.3 Features Not To Be Tested
4.3.4 Approach Taken For Testing
4.3.5 Item Pass/Fail Criteria
4.3.6 Test Cases: For All Features To Be Tested
Chapter -5 Conclusion 63-64
5.1 Conclusion
5.2 Future Work
References 65
Appendices 66-67
Appendix A Work Plan
Appendix B Description of Tool
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another
person nor material which has been accepted for the award of any other degree or diploma of
the university or other institute of higher learning, except where due acknowledgment has
been made in the text.
Place: Signature
Name: Shraddha Singh Dhruv Goel
Date: Enrolment No: 8103532 8103592
CERTIFICATE
This is to certify that the work titled “Mining The E-Business To Enhance The Market
Strategies Of A Company” submitted by Dhruv Goel & Shraddha Singh in partial
fulfilment for the award of Degree of Bachelor of Technology of Jaypee Institute of
Information Technology, Noida has been carried out under my supervision. This work has
not been submitted partially or wholly to any other university or institute for the award of this
or any other degree or diploma.
Signature of Supervisor
Name of Supervisor Mrs. Arti Gupta
Designation Lecturer
Date
ACKNOWLEDGEMENT
A project is an attempt by a student to put his or her best skills to use and conclude with
something productive or useful for understanding the field. This project too has brought us
many ideas and added to our knowledge of the topics covered.
We express our deepest gratitude to our supervisor Mrs. Arti Gupta for her invaluable
guidance and blessings. We are very grateful to her for providing us with an environment to
work on this project successfully. We would like to thank her for her unwavering support
during the entire course of this project work.
Signature of the student :
Name of the student :
Dhruv Goel Shraddha Singh
Enrolment No. : 8103592 8103532
Date :
ABSTRACT
The rapid growth of the Internet is reshaping industries and bringing a massive change to the
business market. Traditional business is undergoing a major transformation into e-business.
Unfortunately, the enormous volume of largely unstructured data on the web, even for a
single commodity, has become a cause of ambiguity for consumers. Extracting valuable
information from such ever-increasing data is an extremely tedious task and is fast becoming
critical to the success of businesses. Data mining is an emerging technology aimed at
discovering patterns in the underlying historical data and identifying trends within data that
go beyond simple analysis. Through the use of sophisticated algorithms, it provides users an
opportunity to identify key attributes of business processes and target opportunities. A new
dimension has been added to data mining by extending this technique to the realm of
e-business, as e-business provides all the right ingredients for successful data mining. Data
mining techniques assist e-businesses to seek and retain the most profitable customers by
analysing customer buying and traversal patterns collected online or offline. Essentially,
e-business companies can improve product quality or sales by anticipating problems before
they occur with the use of data mining techniques. Data mining, in general, is the task of
extracting implicit, previously unknown, valid and potentially useful information from data.
Web mining is the use of data mining techniques to automatically discover and extract useful
information from Web documents and services. Application of web content mining can be
very encouraging in the areas of customer relations modelling, billing records, product
cataloguing and quality management.
Thus, in our project we have worked in the field of WEB TECHNOLOGY AND WEB
MINING: we have developed an efficient e-business process management system,
implemented techniques from the field of web content mining, and studied their impact on
areas specific to business user needs, focusing on the customer as well as the producer. Our
system aims at applying various data mining techniques to the business data extracted from
the web and analysing it, which will in turn help in improving the company's marketing
strategies.
Signature of the students:
Name of the students: Shraddha Singh, Dhruv Goel
Date:
Signature of the supervisor:
Name of the supervisor: Mrs. Arti Gupta
Date:
LIST OF SYMBOLS AND ACRONYMS
HTML – Hyper Text Markup Language
XML – eXtensible Markup Language. XML is a markup language much like HTML; it was
designed to carry data, not to display data.
ARFF – Attribute-Relation File Format. An ARFF file is an ASCII text file that describes a
list of instances sharing a set of attributes. ARFF files have two distinct sections: the first
section is the Header information, which is followed by the Data information.
E.g., the Header of an ARFF file looks like the following:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
INTRODUCTION
GENERAL INTRODUCTION
This project is aimed at developing an efficient business process management system and
applying techniques of web content mining to support the customer's product search and to
gather useful information, so as to analyse the business data and further use it to improve the
market strategies of a company.
Web content mining aims to extract/mine useful information or knowledge from web page
contents. Web content mining is related to but different from data mining and text mining. It
is related to data mining because many data mining techniques can be applied in web content
mining. It is related to text mining because much of the web content is text. However, it is
also quite different from data mining because web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with structured data. Web content mining is
also different from text mining because of the semi-structured nature of the Web, while text
mining focuses on unstructured texts. Web content mining thus requires creative applications
of data mining and/or text mining techniques and also its own unique approaches. The
Internet is probably the world's biggest database. Moreover, its data is available using easily
accessible techniques. Often it is important and detailed data that lets people achieve goals or
use it in various realms. Data is held in various forms: text, multimedia, databases. Web pages
follow the HTML standard, which gives them a degree of structure, but not enough to use
them easily in data mining. A typical website contains, in addition to the main content and
links, various elements such as ads or navigation items. It is also widely known that most of
the data on the Internet is redundant: a lot of information appears on different sites, in more
or less similar form. In the web mining domain, web content mining essentially is an
analogue of data mining techniques for relational databases, since it is possible to find similar
types of knowledge from the unstructured data residing in Web documents. A Web document
usually contains several types of data, such as text, image, audio, video, metadata and
hyperlinks. Some of it is semi-structured, such as HTML documents, or more structured, like
the data in tables or database-generated HTML pages, but most of the data is unstructured
text. The unstructured character of Web data forces web content mining towards a more
complicated approach.
PROBLEM STATEMENT
E-commerce has changed the face of most business functions in competitive enterprises.
Web-Enabled Electronic Business is generating massive amount of data on customer
purchases, browsing patterns, usage times and preferences at an increasing rate.
Unfortunately, the enormous volume of largely unstructured data on the web, even for a
single commodity, has become a cause of ambiguity for consumers. Gathering information
from the web and then extracting valuable information from such ever-increasing data in
order to make proper decisions is an extremely tedious task and is fast becoming critical to
the success of businesses.
EMPIRICAL STUDY
Various tools and software packages are available for the purpose of data mining and web content mining:
WEKA TOOL
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New
Zealand. Weka is free software available under the GNU General Public License. Weka
is a comprehensive set of advanced data mining and analysis tools. The strength of
Weka lies in the area of classification where it covers many of the most current
machine learning (ML) approaches. At its simplest, it provides a quick and easy way to
explore and analyze data. Weka is also suitable for dealing with large data where the
resources of many computers and/or multi-processor computers can be used in
parallel. Weka also allows for data to be pulled directly from database servers as well
as web servers. Its native data format is known as the ARFF format.
WEKA consists of
• Explorer
• Experimenter
• Knowledge flow
• Simple Command Line Interface
• Java interface
Weka has a comprehensive set of classification tools. Many of these algorithms are
very new and reflect an area of active development. We will only be examining the
tree-based classifiers, but this is only a very small part of all the classification methods
available in Weka. There are 11 tree algorithms, and 71 algorithms in all.
MOZENDA SOFTWARE
Intuitive software that allows you to mine data in just minutes. Mozenda is a
Software-as-a-Service company that enables users of all types to easily and affordably extract and
manage web data. With Mozenda, users can set up agents that routinely extract data,
store data, and publish data to multiple destinations. Once information is in the
Mozenda systems users can format, repurpose, and mash up the data to be used in other
online/offline applications.
CURRENT AND OPEN PROBLEMS
In today's era, where the entire world has become a global village and the driving force is the
Internet, spanning e-business, internet blogs and search engines, the major question in front
of business users is how to retain existing customers and understand the patterns and trends
of customer behaviour, so that their decisions can be supported with facts represented
through visualizations and appropriate reporting made possible with web mining. There is
also huge competition amongst companies, and in order to stay ahead of others one needs to
differentiate the products one is selling and identify the strong and weak points of the
competitors.
Thus, some relevant problems are listed as follows:
Very high data volumes and data flow rates
Complex, structured, semi-structured, and unstructured data
A growing trend among companies, organizations and individuals alike to gather
information to utilize it for their interest.
Need to unearth hidden relationships among various attributes of data and between
several snapshots of data over a period of time. These hidden patterns have enormous
potential in predictions and personalisation in e-commerce
Need of organized data for analysis in order to improve market strategies
Information Extraction for Catalogue Creation, Service Discovery
APPROACH TO THE PROBLEM IN TERMS OF TECHNOLOGY
USED
INTRODUCTION TO .NET Framework
The .NET Framework is a new computing platform that simplifies application development
in the highly distributed environment of the Internet. The .NET Framework is designed to
fulfill the following objectives:
To provide a consistent object-oriented programming environment whether object code is
stored and executed locally, executed locally but Internet-distributed, or executed
remotely.
To provide a code-execution environment that minimizes software deployment and
versioning conflicts.
To provide a code-execution environment that guarantees safe execution of code,
including code created by an unknown or semi-trusted third party.
To provide a code-execution environment that eliminates the performance problems of
scripted or interpreted environments.
To make the developer experience consistent across widely varying types of applications,
such as Windows-based applications and Web-based applications.
To build all communication on industry standards to ensure that code based on the .NET
Framework can integrate with any other code.
.NET FRAMEWORK CLASS LIBRARY
The .NET Framework class library is a collection of reusable types that tightly integrate with
the common language runtime. The class library is object oriented, providing types from
which your own managed code can derive functionality. This not only makes the .NET
Framework types easy to use, but also reduces the time associated with learning new features
of the .NET Framework. In addition, third-party components can integrate seamlessly with
classes in the .NET Framework. For example, the .NET Framework collection classes
implement a set of interfaces that you can use to develop your own collection classes. Your
collection classes will blend seamlessly with the classes in the .NET Framework. As you
would expect from an object-oriented class library, the .NET Framework types enable you to
accomplish a range of common programming tasks, including tasks such as string
management, data collection, database connectivity, and file access. In addition to these
common tasks, the class library includes types that support a variety of specialized
development scenarios. For example, you can use the .NET Framework to develop the
following types of applications and services:
Console applications.
Scripted or hosted applications.
Windows GUI applications (Windows Forms).
ASP.NET applications.
XML Web services.
Windows services.
ACTIVE SERVER PAGES.NET
ASP.NET is a programming framework built on the common language runtime that can be
used on a server to build powerful Web applications. ASP.NET offers several important
advantages over previous Web development models:
Enhanced Performance. ASP.NET is compiled common language runtime code running
on the server.
World-Class Tool Support. The ASP.NET framework is complemented by a rich
toolbox and designer in the Visual Studio integrated development environment.
WYSIWYG editing, drag-and-drop server controls, and automatic deployment are just a
few of the features this powerful tool provides.
Power and Flexibility. Because ASP.NET is based on the common language runtime,
the power and flexibility of that entire platform is available to Web application
developers. The .NET Framework class library, Messaging, and Data Access solutions
are all seamlessly accessible from the Web. ASP.NET is also language-independent, so
you can choose the language that best applies to your application or partition your
application across many languages.
Simplicity. ASP.NET makes it easy to perform common tasks, from simple form
submission and client authentication to deployment and site configuration.
Manageability. ASP.NET employs a text-based, hierarchical configuration system,
which simplifies applying settings to your server environment and Web applications.
Scalability and Availability. ASP.NET has been designed with scalability in mind, with
features specifically tailored to improve performance in clustered and multiprocessor
environments.
Customizability and Extensibility. ASP.NET delivers a well-factored architecture that
allows developers to "plug-in" their code at the appropriate level.
Security. With built in Windows authentication and per-application configuration, you
can be assured that your applications are secure.
ASP.NET WEB FORMS
The ASP.NET Web Forms page framework is a scalable common language runtime
programming model that can be used on the server to dynamically generate Web pages.
ASP.NET Web Forms pages are text files with an .aspx file name extension. They can be
deployed throughout an IIS virtual root directory tree. When a browser client requests .aspx
resources, the ASP.NET runtime parses and compiles the target file into a .NET Framework
class. This class can then be used to dynamically process incoming requests. ASP.NET
provides syntax compatibility with existing ASP pages. This includes support for <% %>
code render blocks that can be intermixed with HTML content within an .aspx file. These
code blocks execute in a top-down manner at page render time.
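To illustrate, a minimal sketch of an .aspx page using <% %> code render blocks follows; the page content and the loop are purely hypothetical and are not taken from the project.
<%@ Page Language="C#" %>
<html>
<body>
  <h1>Product Catalogue</h1>
  <!-- The render block below executes top-down on the server when the page renders. -->
  <% for (int i = 1; i <= 3; i++) { %>
    <p>Featured product number <%= i %></p>
  <% } %>
  <p>Rendered at <%= DateTime.Now %></p>
</body>
</html>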
INTRODUCTION TO ASP.NET SERVER CONTROLS
In addition to (or instead of) using <% %> code blocks to program dynamic content,
ASP.NET page developers can use ASP.NET server controls to program Web pages. Server
controls are declared within an .aspx file using custom tags or intrinsic HTML tags that
contain a runat="server" attribute value. Intrinsic HTML tags are handled by one of the
controls in the System.Web.UI.HtmlControls namespace. Any tag that doesn't explicitly
map to one of the controls is assigned the type of
System.Web.UI.HtmlControls.HtmlGenericControl. Server controls automatically
maintain any client-entered values between round trips to the server. This control state is not
stored on the server (it is instead stored within an <input type="hidden"> form field that is
round-tripped between requests). Note also that no client-side script is required.
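A minimal sketch of a page using server controls is given below; the control names and the greeting logic are hypothetical. The <asp:...> controls and the span with runat="server" keep their values across postbacks without any client-side script, as described above.
<%@ Page Language="C#" %>
<script runat="server">
  // Runs on the server when the button is clicked; control values survive the round trip.
  void Submit_Click(object sender, EventArgs e)
  {
      Greeting.InnerText = "Hello, " + NameBox.Text;
  }
</script>
<html>
<body>
  <form runat="server">
    <asp:TextBox ID="NameBox" runat="server" />
    <asp:Button ID="SubmitButton" Text="Submit" OnClick="Submit_Click" runat="server" />
    <span id="Greeting" runat="server"></span>
  </form>
</body>
</html>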
ADO.NET
ADO.NET is an evolution of the ADO data access model that directly addresses user
requirements for developing scalable applications. It was designed specifically for the web
with scalability, statelessness, and XML in mind. ADO.NET uses some ADO objects, such as
the Connection and Command objects, and also introduces new objects. Key new
ADO.NET objects include the DataSet, DataReader, and DataAdapter. Some objects are:
Connections. For connection to and managing transactions against a database.
Commands. For issuing SQL commands against a database.
Data Readers. For reading a forward-only stream of data records from a SQL Server data
source.
Datasets. For storing, Removing and programming against flat data, XML data and
relational data.
Data Adapters. For pushing data into a Dataset, and reconciling data against a database.
When dealing with connections to a database, there are two different options: SQL Server
.NET Data Provider (System.Data.SqlClient) and OLE DB .NET Data Provider
(System.Data.OleDb). In these samples we will use the SQL Server .NET Data Provider.
These are written to talk directly to Microsoft SQL Server.
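As a minimal sketch of this pattern (the connection string and the Products table are placeholders, not the project's actual schema), a Connection, Command and DataReader are typically combined as follows:
using System;
using System.Data.SqlClient;

class AdoNetSketch
{
    static void Main()
    {
        // Connection string and table name are hypothetical.
        string connStr = "Server=localhost;Database=Shop;Integrated Security=true;";
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            // Command issues a SQL statement against the database.
            SqlCommand cmd = new SqlCommand("SELECT Name, Price FROM Products", conn);
            // DataReader reads a forward-only stream of records.
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetString(0), reader.GetDecimal(1));
            }
        }
    }
}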
SQL SERVER
SQL Server stores each data item in its own fields. In SQL Server, the fields relating to a
particular person, thing or event are bundled together to form a single complete unit of data,
called a record (it can also be referred to as a row or an occurrence). Each record is made up of
a number of fields. No two fields in a record can have the same field name. During a SQL
Server Database design project, the analysis of your business needs identifies all the fields or
attributes of interest. If your business needs change over time, you define any additional
fields or change the definition of existing fields.
JAVA APPLET
Applets are used to provide interactive features to web applications that cannot be provided
by HTML alone. They can capture mouse input and also have controls like buttons or check
boxes. In response to the user action an applet can change the provided graphic content. This
makes applets well suited for demonstration, visualization and teaching. There are online
applet collections for studying various subjects. An applet can also be a text area only,
providing, for instance, a cross platform command-line interface to some remote system. If
needed, an applet can leave the dedicated area and run as a separate window. A Java applet
extends the class java.applet.Applet.
SUPPORT FOR THE NOVELTY OF THE PROBLEM
Why E-business?
In e-commerce websites you have the ability to sell, advertise, and introduce different kinds
of services and products on the web. E-commerce websites have the advantage of reaching a
large number of customers regardless of distance and time limitations. Furthermore, an
advantage of e-commerce over traditional businesses is the faster speed and the lower
expense for both e-commerce website owners and customers in completing customers'
transactions and orders. Retail websites aim to inspire, reflect a good image of the business
and improve it online. An important factor in having a successful retail website is to know
your competitors: on one hand, identifying their points of strength and trying to benefit from
them by improving on those points and adopting powerful strategies; on the other hand,
identifying your competitors' weak points and avoiding them is good practice in having a
successful retail website.
Web Mining versus Data Mining
Web mining is the use of data mining techniques to automatically discover and extract
information from Web documents and services. When comparing web mining with traditional
data mining, there are three main differences to consider:
1. Scale – In traditional data mining, processing 1 million records from a database would
be a large job. In web mining, even 10 million pages wouldn't be a big number.
2. Access – When doing data mining of corporate information, the data is private and
often requires access rights to read. For web mining, the data is public and rarely
requires access rights.
3. Structure – A traditional data mining task gets information from a database, which
provides some level of explicit structure. A typical web mining task is processing
unstructured or semi-structured data from web pages. Even when the underlying
information for web pages comes from a database, this is often obscured by HTML
mark-up.
Thus, Web Mining can be used to support enterprises to create marketable products.
SOLUTION APPROACH
Developing Business Management Software
Implementing Web Crawler
Implementing Web Extractor
Implementation of Data Mining Techniques
Developing Business Management Software
A business process model for the marketing team, in order to inform the target groups about
its products and services and to enable them to place online orders that can be viewed by the
business managers; in other words, to automate a whole retail store for the sale of products
while providing the customers with the best of services. Standards of security and data
protection mechanisms have been considered for proper usage. The application takes care of
different modules and their associated reports, which are produced as per the applicable
strategies and standards put forward by the administrative staff.
Implementing Web crawler for collection of web data
A web crawler based on path-incremental crawling that applies breadth-first search for
searching the pages linked to a URL. It starts to search as soon as the crawl button on its
interface is pressed. The crawler application is designed in C#, and the searching algorithm,
based on the pseudo code provided in a research paper, is also implemented in C# in
Microsoft Visual Studio 2010.
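A rough sketch of the breadth-first crawling idea is given below; it is not the project's actual code and uses a simple regular expression for link extraction, a placeholder seed URL and a small page limit purely for illustration.
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class CrawlerSketch
{
    static void Main()
    {
        var queue = new Queue<string>();          // frontier of URLs to visit (FIFO = breadth first)
        var visited = new HashSet<string>();      // URLs already fetched
        queue.Enqueue("http://example.com/");     // seed URL (placeholder)

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited.Count < 50)   // small page limit for the sketch
            {
                string url = queue.Dequeue();
                if (!visited.Add(url)) continue;

                string html;
                try { html = client.DownloadString(url); }
                catch (Exception) { continue; }             // skip pages that fail to download

                // Extract absolute links and add unseen ones to the frontier.
                foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
                {
                    string link = m.Groups[1].Value;
                    if (!visited.Contains(link)) queue.Enqueue(link);
                }
                Console.WriteLine("Crawled: " + url);
            }
        }
    }
}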
Implementing Web Extractor/Parser
One of the critical problems in building an extractor is defining a set of extraction rules that
precisely define how to locate the information on the page. For any given item to be extracted
from a page, one needs an extraction rule to locate both the beginning and end of that item.
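As an illustration of such extraction rules (the landmark strings below are hypothetical HTML fragments, not the project's actual rules), locating an item reduces to finding the text between a start landmark and an end landmark:
using System;

class ExtractorSketch
{
    // Returns the text located between a start landmark and an end landmark,
    // or null if either landmark is missing from the page.
    static string ExtractBetween(string page, string startLandmark, string endLandmark)
    {
        int start = page.IndexOf(startLandmark);
        if (start < 0) return null;
        start += startLandmark.Length;
        int end = page.IndexOf(endLandmark, start);
        if (end < 0) return null;
        return page.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        string html = "<div class=\"price\">Rs. 499</div>";
        // Hypothetical extraction rule: the price lies between these two landmarks.
        Console.WriteLine(ExtractBetween(html, "<div class=\"price\">", "</div>"));
    }
}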
Implementation of Data Mining Techniques
The data mining techniques will be applied on the data sets so extracted in order to retrieve
useful information and solve the required queries related to customers in order to enhance the
market strategies and combat the issue of competition by comparing various products and
services. The data can be categorized on the basis of similarity and relationships: the
categorization can be obtained by using classification techniques, while association is an
exploratory method of discovering previously unknown relationships. Thus, applying data
mining techniques to the business data will lead us to achieve the following:
Build unique market segments identifying the attributes of high-value prospects
Select promotional strategies that best reach the client's Web customer segments
Analyze online sales to improve targeting of the client's high-value customers
Test and determine which marketing activities have the greatest impact
Identify client customers most likely to be interested in their new products
LITERATURE SURVEY
SUMMARY OF PAPERS
TITLE: RESEARCH ON DATA MINING IN E-BUSINESS
AUTHOR: Luo Hanyang, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, P.R. China; Gao Jinling, Ji Wenli, College of Management, Shenzhen University, Shenzhen, P.R. China
YEAR OF PUBLICATION: 22 December 2008
PUBLISHING DETAILS: Computer Science and Software Engineering, 2008 International Conference
SUMMARY: Data mining is an emerging technology that can be applied to search for valuable business information in the huge details available in an e-business website's background database. The architecture and sources of information in e-business websites, such as server logs and customer registration information, are introduced along with a brief mention of the data mining techniques applicable in such a scenario. The paper lays emphasis on the main goal of data mining in e-business, which is to mine the customer visiting information, to understand customers' browsing actions and modes, and to find useful market information and provide personalized services. Data mining adopts many techniques, the main methods being: discrimination, association analysis, classification and prediction, cluster analysis and evolution analysis.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4722629&tag=1
TITLE: DATA MINING ON SYMBOLIC KNOWLEDGE EXTRACTED FROM THE WEB
AUTHOR: Rayid Ghani, Rosie Jones, Dunja Mladenić, Kamal Nigam, Seán Slattery; School of Computer Science, Carnegie Mellon University, Pittsburgh; Department for Intelligent Systems, J. Stefan Institute
YEAR OF PUBLICATION: 2000
PUBLISHING DETAILS: Workshop on Text Mining at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000
SUMMARY: As a part of e-business, since it is crucial to know about competitors, one needs to know details such as the products and services offered by them in various domains. The paper discusses creating a dataset by spidering sources on the web and then applying data mining techniques on it. A brief overview of data mining techniques applicable to corporate databases is highlighted. It discusses the need for a web crawler for extracting information from companies' websites and also the need for a wrapper to extract information to augment the crawler's information. It is only once a dataset is available that data mining techniques such as clustering, classification and association can be applied and interesting regularities can be discovered in a company's dataset according to the requirements.
WEB LINK: http://www.kamalnigam.com/papers/shield-kddws00
TITLE: INTEGRATING E-COMMERCE AND DATA MINING: ARCHITECTURE AND CHALLENGES
AUTHOR: Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng
YEAR OF PUBLICATION: 2000
PUBLISHING DETAILS: Appeared in WEBKDD'2000 workshop: Web Mining for E-Commerce -- Challenges and Opportunities; also appeared in ICDM'01: The 2001 IEEE International Conference on Data Mining
SUMMARY: The paper discusses the integration of data mining and e-business, mainly focusing on e-business being a killer domain for data mining. An architecture that successfully integrates data mining with an e-commerce system has been proposed, consisting of three main parts: Business Data Definition, Customer Interaction and Analysis. Business Data Definition covers the data and metadata associated with e-business and the ability to define a rich set of attributes; for example, products can have attributes like size, colour, etc. For a business to be successful, customer interaction plays a major role, and this gives rise to the need for an efficient e-business website. The third component lays emphasis on the analysis of the collected data by various data mining techniques, concerned mainly with customer data. It is through an analysis tool that reports can be generated to obtain varied knowledge about different points, such as the top-selling and worst-selling products. Finally, several challenging problems that need to be addressed for further enhancement of this architecture are highlighted.
WEB LINK: http://ai.stanford.edu/~ronnyk/icdmIntegratingEcom
TITLE: DATA MINING TECHNIQUES AND APPLICATIONS
AUTHOR: Mrs. Bharati M. Ramageri, Lecturer, Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi, Pune, Maharashtra
YEAR OF PUBLICATION: 2009
PUBLISHING DETAILS: Indian Journal of Computer Science and Engineering, Vol. 1 No. 4, 301-305
SUMMARY: Data mining is a process which finds useful patterns from large amounts of data. The goal of this technique is to find patterns that were previously unknown. Once these patterns are found they can further be used to make certain decisions for the development of businesses. Data mining techniques and algorithms such as classification and clustering help in finding the patterns needed to decide upon future business trends. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. Clustering can be described as the identification of similar classes of objects. Data mining has a wide application domain in almost every industry where data is generated, which is why data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in Information Technology.
WEB LINK: http://www.ijcse.com/docs/IJCSE10-01-04-51
TITLE: A SURVEY ON WEB CONTENT MINING AND EXTRACTION OF STRUCTURED AND SEMI-STRUCTURED DATA
AUTHOR: Kshitija Pol, Nita Patil, Shreya Patankar, Chhaya Das; Datta Meghe College of Engineering, Airoli, Navi Mumbai-400708
YEAR OF PUBLICATION: 2008
PUBLISHING DETAILS: Emerging Trends in Engineering and Technology, 2008. ICETET '08. First International Conference, Nagpur, Maharashtra
SUMMARY: Information available on the web is mostly in the form of unstructured data, and as the data on the web is growing at an explosive rate this has led to problems such as extracting potentially useful knowledge and learning about customers and individuals. This paper discusses techniques to represent such data in a structured form, as tables that can be queried for further information, by using web content mining. It explains in detail the unstructured, structured and semi-structured data of the web and techniques for extraction by means of a web crawler and web scraper, highlighting an example of building a structured XML document from web page data. It also discusses the various problems and major challenges of web content mining.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4579960&tag=1
TITLE: WEB MINING IN E-COMMERCE
AUTHOR: Istrate Mihai
YEAR OF PUBLICATION: 2009
PUBLISHING DETAILS: Annals of Faculty of Economics, 2009, vol. 4, issue 1
SUMMARY: The web is a very good place to run a successful business, and it is important to have a successful website to serve as a sales and marketing tool. One of the effective technologies used for this purpose is data mining. Web mining is the usage of data mining techniques to extract interesting information from web data; web content mining is the mining of the data a web page contains, such as a list of products and services. A lot of information needs to be defined before starting to build an e-commerce website, such as identifying the business goals, whether the website is supposed to attract new customers or increase the sales to current customers, whether the proposed website will increase the business's overall profit, and also the most suitable tools and techniques that need to be used or followed in order to meet those requirements. Retail websites aim to inspire, reflect a good image of the business and improve it online. An important factor in having a successful retail website is to know your competitors: identifying their weak and strong points and accordingly acting on them in one's own website is good practice in having a successful retail website.
WEB LINK: http://steconomice.uoradea.ro/anale/volume/2009/v4-management-and-marketing/196.pdf
TITLE: IMPLEMENTATION OF WEB CRAWLER
AUTHOR: Pooja Gupta, Assistant Professor, Lingaya's University; Mrs. Kalpana Johari, Sr. Lecturer, Centre for Development of Advanced Computing, Noida
YEAR OF PUBLICATION: 16-18 Dec. 2009
PUBLISHING DETAILS: Emerging Trends in Engineering and Technology (ICETET), 2009 2nd International Conference, Nagpur
SUMMARY: A web crawler continuously keeps on crawling the web and finds any new web pages that have been added to the web. Crawlers continue visiting the web until local resources, such as storage, are exhausted. The paper sheds some light on the design of the crawler and also on various implementation techniques. In this paper, pattern recognition is also applied to the crawler: when the crawler is started with a keyword, it returns the links related to that keyword, reads the web pages extracted from those links, and while reading each web page it extracts only the content, where content means only the text that is available on the web page.
WEB LINK: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5395052&tag=1
INTEGRATED SUMMARY OF LITERATURE
THE WEB: OPPORTUNITIES & CHALLENGES
Web offers an unprecedented opportunity and challenge to data mining
The amount of information on the Web is huge, and easily accessible.
The coverage of Web information is very wide and diverse. One can find information
about almost anything.
Information/data of almost all types exist on the Web, e.g., structured tables, texts,
multimedia data, etc.
Much of the Web information is semi-structured due to the nested structure of HTML
code.
Much of the Web information is linked. There are hyperlinks among pages within a
site, and across different sites.
Much of the Web information is redundant. The same piece of information or its
variants may appear in many pages.
The Web is noisy. A Web page typically contains a mixture of many kinds of
information, e.g., main contents, advertisements, navigation panels, copyright notices,
etc.
The Web is also about services. Many Web sites and pages enable people to perform
operations with input parameters, i.e., they provide services.
The Web is dynamic. Information on the Web changes constantly. Keeping up with the
changes and monitoring the changes are important issues.
Above all, the Web is a virtual society. It is not only about data, information and
services, but also about interactions among people, organizations and automatic
systems, i.e., communities.
MINING THE WEB
When extracting Web content information using web mining, there are four typical steps.
1. Collect – fetch the content from the Web
2. Parse – extract usable data from formatted data (HTML, PDF, etc)
3. Analyze – tokenize, rate, classify, cluster, filter, sort, etc.
4. Produce – turn the results of analysis into something useful (report, search index,
etc)
CRAWLING
A Web crawler (also known as a Web spider or Web robot) is a program or automated script
which browses the World Wide Web in a methodical and automated manner. This process is
called Web crawling or spidering. Many legitimate sites, in particular search engines, use
spidering as a means of providing up-to-date data. Following are some reasons to use a web
crawler:
To maintain mirror sites for popular Web sites.
To test web pages and links for valid syntax and structure.
A typical web crawler starts by parsing a specified web page: noting any hypertext links on
that page that point to other web pages. The Crawler then parses those pages for new links,
and so on, recursively. The crawler simply sends HTTP requests for documents to other
machines on the Internet, just as a web browser does when the user clicks on links. All the
crawler really does is to automate the process of following links. There are two important
characteristics of the Web that generate a scenario in which Web crawling is very difficult:
1. Large volume of Web pages.
2. Rate of change on web pages.
The difficulties in implementing an efficient web crawler make it clear that bandwidth for
conducting crawls is neither infinite nor free. So, it is becoming essential to crawl the web in
a way that is not only scalable but also efficient, if some reasonable quality or freshness of
web pages is to be maintained. This means that a crawler must carefully choose at each step
which pages to visit next. Thus the implementer of a web crawler must define its behaviour.
Defining the behaviour of a Web crawler is the outcome of a combination of the
below-mentioned strategies:
Selecting the better algorithm to decide which page to download.
Strategizing how to re-visit pages to check for updates.
Strategizing how to avoid overloading websites.
We intend the crawler to download as many resources as possible from a particular Web site.
To that end, a crawler would ascend to every path in each URL that it intends to crawl. For
example, when given a seed URL of http://foo.org/a/b/page.html, it will attempt to crawl
/a/b/, /a/, and /. The advantage of a path-ascending crawler is that it is very effective in
finding isolated resources, or resources for which no inbound link would have been found in
regular crawling. Thus, a crawler must have a good crawling strategy, as noted in the
previous sections, but it also needs a highly optimized architecture.
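A minimal sketch of how the ascending paths could be derived from a seed URL is given below; the helper name is hypothetical, and it simply illustrates the /a/b/, /a/, / expansion described above.
using System;
using System.Collections.Generic;

class PathAscendingSketch
{
    // Given a seed URL, returns every ancestor path up to the site root,
    // e.g. http://foo.org/a/b/page.html -> .../a/b/, .../a/, .../
    static IEnumerable<string> AscendingPaths(string seedUrl)
    {
        var uri = new Uri(seedUrl);
        string path = uri.AbsolutePath;                    // "/a/b/page.html"
        int slash = path.LastIndexOf('/');
        while (slash >= 0)
        {
            yield return uri.GetLeftPart(UriPartial.Authority) + path.Substring(0, slash + 1);
            path = path.Substring(0, slash);               // drop the last segment
            slash = path.LastIndexOf('/');
        }
    }

    static void Main()
    {
        foreach (string p in AscendingPaths("http://foo.org/a/b/page.html"))
            Console.WriteLine(p);                          // prints .../a/b/, .../a/, .../
    }
}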
EXTRACTING DATA FROM WEB PAGE
In order to gather data in structured form from the highly unstructured web data, we need to
extract the contents of a web page excluding the advertisements and other needless
information, and gather important information such as the product catalogues of various
websites. A wrapper is a piece of software that enables a semi-structured Web source to
be queried as if it were a database. These are sources where there is no explicit structure or
schema, but there is an implicit underlying structure. One of the critical problems in building
a wrapper is defining a set of extraction rules that precisely define how to locate the
information on the page. For any given item to be extracted from a page, one needs an
extraction rule to locate both the beginning and end of that item. Since, in our framework,
each document consists of a sequence of tokens (e.g., words, numbers, HTML tags, etc), this
is equivalent to finding the first and last tokens of an item. A key idea underlying our work is
that the extraction rules are based on "landmarks" (i.e., groups of consecutive tokens) that
enable a wrapper to locate the start and end of the item within the page. XML has made it
possible to improve data presentation and redefine the way in which documents and data are
exchanged. Most websites are in HTML and need to be converted to XML, as data sets
available in XML can be converted to CSV (Comma Separated Values) or ARFF (Attribute-
Relation File Format). Conversion to these file formats makes it easy to use XML datasets.
One can exploit XML hierarchy levels using these file formats. An ARFF (Attribute-Relation
File Format) file is an ASCII text file that describes a list of instances sharing a set of
attributes. ARFF files have two distinct sections. The first section is the Header information,
which is followed by the Data information. The Header of the ARFF file contains
the name of the relation, a list of the attributes (the columns in the data), and their types. The
CSV file is used and is stored in the database. XML files can also be stored directly to the
database at different levels. An XML document along with its associated schema is input into
an XML parser. The parser checks that the document is well formed and, if the schema is also
available, checks that the XML is valid according to what has been defined in the schema.
Because the schema is itself an XML document, it can in turn be validated against another
schema. The parser then provides access methods for another application to
access the data that was contained within the original XML document.
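A minimal sketch of this parse-and-validate step using the .NET XML APIs is given below; the file names are placeholders and the schema is assumed to exist.
using System;
using System.Xml;
using System.Xml.Schema;

class XmlValidationSketch
{
    static void Main()
    {
        var settings = new XmlReaderSettings();
        // Associate the (hypothetical) schema with the reader and enable validation.
        settings.Schemas.Add(null, "products.xsd");
        settings.ValidationType = ValidationType.Schema;
        settings.ValidationEventHandler += (sender, e) =>
            Console.WriteLine("Validation problem: " + e.Message);

        using (XmlReader reader = XmlReader.Create("products.xml", settings))
        {
            // Reading the document checks well-formedness and schema validity,
            // and gives the application access to the contained data.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "product")
                    Console.WriteLine("Found product element");
            }
        }
    }
}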
DATA MINING TECHNIQUES
Data mining is needed in many fields to extract the useful information from the large amount
of database. Large amount of data is maintained in every field to keep different records.
Scientific data, medical data, demographic data, financial data, marketing data etc are the
type of database maintained in different fields. So, different ways were found to
automatically analyze the data, to summarize it, to discover and characterize trends in it and
to automatically flag anomalies. Various techniques were introduced by the different
researchers. These techniques were used to do classification, to do clustering, to find
interesting patterns etc. Data mining is the discovery of knowledge and useful information
from the large amounts of data stored in databases. Also referred to as knowledge discovery
from databases (KDD), it is the automated or convenient extraction of patterns representing
knowledge implicitly stored in large databases. Data mining tools predict future trends and
behaviours, allowing businesses to make proactive, knowledge driven decisions. Data mining
is becoming an increasingly important tool to transform these data into information. Data
mining can also be referred as knowledge mining or knowledge discovery from data. Many
techniques are used in data mining to extract patterns from large amount of database, for
example: Association rule Analysis, Classification.
CLASSIFICATION METHODS
Classification is a data mining technique used to predict group membership for data
instances.
Naïve Bayes Classifier: The Naïve Bayes classifier works on a simple, but comparatively
intuitive concept. In some cases it is also seen that Naïve Bayes outperforms many
other comparatively complex algorithms. It makes use of the variables contained in the data
sample, by observing them individually, independent of each other. The Naïve Bayes
classifier is based on the Bayes rule of conditional probability. It makes use of all the
attributes contained in the data, and analyses them individually as though they are equally
important and independent of each other.
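For reference, the standard form of this rule (stated here for completeness, not reproduced from the report) is: given attribute values a1, ..., an and a class C, the classifier picks the class maximising the product of the class prior and the individual attribute likelihoods, treating the attributes as independent:
P(C \mid a_1, \ldots, a_n) \propto P(C) \prod_{i=1}^{n} P(a_i \mid C), \qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(a_i \mid C)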
J48 Decision Trees: A decision tree is a predictive machine-learning model that decides the
target value (dependent variable) of a new sample based on various attribute values of the
available data. The internal nodes of a decision tree denote the different attributes; the
branches between the nodes tell us the possible values that these attributes can have in the
observed samples, while the terminal nodes tell us the final value (classification) of the
dependent variable. The attribute that is to be predicted is known as the dependent variable,
since its value depends upon, or is decided by, the values of all the other attributes. The other
attributes, which help in predicting the value of the dependent variable, are known as the
independent variables in the dataset. The J48 Decision tree classifier follows the following
simple algorithm. In order to classify a new item, it first needs to create a decision tree based
on the attribute values of the available training data. So, whenever it encounters a set of items
(training set) it identifies the attribute that discriminates the various instances most clearly.
The feature that is able to tell us the most about the data instances, so that we can classify
them best, is said to have the highest information gain. Now, among the possible values of this
feature, if there is any value for which there is no ambiguity, that is, for which the data
instances falling within its category have the same value for the target variable, then we
terminate that branch and assign to it the target value that we have obtained. For the other
cases, we then look for another attribute that gives us the highest information gain. Hence we
continue in this manner until we either get a clear decision of what combination of attributes
gives us a particular target value, or we run out of attributes. In the event that we run out of
attributes, or if we cannot get an unambiguous result from the available information, we
assign this branch a target value that the majority of the items under this branch possess. Now
that we have the decision tree, we follow the order of attribute selection as we have obtained
for the tree. By checking all the respective attributes and their values with those seen in the
decision tree model, we can assign or predict the target value of this new instance.
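To make the notion of information gain concrete, the following small sketch (illustrative only, not project code; the attribute and class values are made up) computes the entropy of a set of class labels and the information gain obtained by splitting on one attribute:
using System;
using System.Collections.Generic;
using System.Linq;

class InformationGainSketch
{
    // Entropy of a list of class labels: -sum(p * log2(p)).
    static double Entropy(IEnumerable<string> labels)
    {
        var list = labels.ToList();
        return list.GroupBy(l => l)
                   .Select(g => (double)g.Count() / list.Count)
                   .Sum(p => -p * Math.Log(p, 2));
    }

    // Information gain of splitting (attributeValue, classLabel) pairs by the attribute.
    static double InformationGain(IList<Tuple<string, string>> rows)
    {
        double before = Entropy(rows.Select(r => r.Item2));
        double after = rows.GroupBy(r => r.Item1)
                           .Sum(g => ((double)g.Count() / rows.Count) *
                                     Entropy(g.Select(r => r.Item2)));
        return before - after;
    }

    static void Main()
    {
        // Hypothetical training fragment: (outlook, buysProduct).
        var rows = new List<Tuple<string, string>>
        {
            Tuple.Create("sunny", "no"),  Tuple.Create("sunny", "no"),
            Tuple.Create("rainy", "yes"), Tuple.Create("overcast", "yes"),
            Tuple.Create("rainy", "yes"), Tuple.Create("overcast", "yes"),
        };
        Console.WriteLine("Information gain = " + InformationGain(rows));
    }
}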
ASSOCIATION RULES
Association rule mining, one of the most important and well-researched techniques of data
mining, aims to extract interesting correlations, frequent patterns, associations or causal
structures among sets of items in transaction databases or other data repositories.
Let I = {i1, i2, ..., im} be a set of m distinct items (attributes), and let T be a transaction that
contains a set of items such that T ⊆ I. An association rule is an implication of the form
X ⇒ Y, where X, Y ⊂ I are sets of items called itemsets, and X ∩ Y = ∅. X is called the
antecedent while Y is called the consequent; the rule means X implies Y. There are two important basic
measures for association rules: support(s) and confidence(c). Since the database is large and
users are concerned only with frequently purchased items, thresholds of support and
confidence are usually pre-defined by users to drop those rules that are not so interesting or useful.
The two thresholds are called minimal support and minimal confidence, respectively;
additional constraints for interesting rules can also be specified by the users. The two basic
parameters of Association Rule Mining (ARM) are support and confidence.
Support(s) of an association rule is defined as the percentage/fraction of records that contain
X U Y to the total number of records in the database. The count for each item is increased by
one every time the item is encountered in a different transaction T in database D during the
scanning process. This means the support count does not take the quantity of the item into
account. For example, if in a transaction a customer buys three bottles of milk, we still only
increase the support count of {milk} by one; in other words, if a transaction contains an item
then the support count of that item is increased by one. Support(s) is calculated by the
following formula:
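The formula itself is not reproduced in this copy; consistent with the definition above, it can be written as:
\mathrm{support}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|D|}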
From the definition we can see that the support of an item is a statistical significance measure
of an association rule. Suppose the support of an item is 0.1%; this means that only 0.1 percent
of the transactions contain a purchase of this item. The retailer will not pay much attention to
items that are not bought so frequently; obviously, a high support is desired for more
interesting association rules. Before the mining process, users can specify a minimum support
as a threshold, which means they are only interested in association rules generated from those
item sets whose supports exceed that threshold. However, sometimes even when the item sets
are not as frequent as the threshold requires, the association rules generated from them are
still important. For example, in a supermarket some items are very expensive and
consequently are not purchased as often as the threshold requires, but association rules
between those expensive items are as important to the retailer as those for other frequently
bought items.
Confidence(c) of an association rule is defined as the percentage/fraction of the number of
transactions that contain X U Y to the total number of records that contain X, where if the
percentage exceeds the threshold of confidence an interesting association rule X=>Y can be
generated.
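Written as a formula, consistent with the definition above:
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
= \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|\{T \in D : X \subseteq T\}|}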
Confidence is a measure of the strength of an association rule. Suppose the confidence of the
association rule X=>Y is 80%; this means that 80% of the transactions that contain X also
contain Y. Similarly, to ensure the interestingness of the rules, a minimum confidence is also
pre-defined by users.
APRIORI ALGORITHM: In computer science and data mining, Apriori is a classic
algorithm for learning association rules. Apriori is designed to operate on databases
containing transactions (for example, collections of items bought by customers, or details of a
website visits). The algorithm attempts to find subsets of items which are common to at least a
minimum number C (the cutoff, or minimum support) of the transactions. Apriori uses a
"bottom up" approach, where frequent subsets are extended one item at a time (a step known
as candidate generation), and groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found. Apriori uses breadth-first search
and a hash tree structure to count candidate item sets efficiently.
APRIORI ADVANTAGES/DISADVANTAGES
Advantages
o Uses large item set property
o Easily parallelized
o Easy to implement
Disadvantages
o Assumes transaction database is memory resident.
o Requires many database scans.
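As a rough, memory-resident sketch of the idea (not the project's implementation; the transactions are made up and candidate generation is simplified), frequent itemsets can be grown level by level as follows:
using System;
using System.Collections.Generic;
using System.Linq;

class AprioriSketch
{
    static void Main()
    {
        // Hypothetical transaction database (memory resident, as Apriori assumes).
        var transactions = new List<HashSet<string>>
        {
            new HashSet<string> { "milk", "bread", "butter" },
            new HashSet<string> { "milk", "bread" },
            new HashSet<string> { "bread", "butter" },
            new HashSet<string> { "milk", "bread", "butter" },
        };
        int minSupportCount = 2;

        // Frequent 1-itemsets.
        var frequent = transactions.SelectMany(t => t).Distinct()
            .Select(item => new HashSet<string> { item })
            .Where(c => SupportCount(c, transactions) >= minSupportCount)
            .ToList();

        while (frequent.Count > 0)
        {
            foreach (var itemset in frequent)
                Console.WriteLine("{{{0}}} support={1}",
                    string.Join(",", itemset), SupportCount(itemset, transactions));

            // Candidate generation: extend each frequent k-itemset by one item,
            // then keep only the candidates that still meet the minimum support.
            var items = frequent.SelectMany(s => s).Distinct().ToList();
            frequent = frequent
                .SelectMany(s => items.Where(i => !s.Contains(i))
                                      .Select(i => new HashSet<string>(s) { i }))
                .Distinct(HashSet<string>.CreateSetComparer())
                .Where(c => SupportCount(c, transactions) >= minSupportCount)
                .ToList();
        }
    }

    // Number of transactions that contain every item of the candidate itemset.
    static int SupportCount(HashSet<string> itemset, List<HashSet<string>> transactions)
    {
        return transactions.Count(t => itemset.IsSubsetOf(t));
    }
}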
ANALYSIS, DESIGN AND MODELLING
OVERALL DESCRIPTION OF THE PROJECT
System Interface:
This project aims at improving the market strategies of an e-business website and also at
providing its customers with better options for selecting a product from the various choices.
Figure 1: Process Management System environment
The Business Process Management System will be a hierarchical based system for the
marketing team. The system after careful analysis has been identified to be presented with the
following modules:
Administrator:- In this module the Administrator has the privileges to add all the Target
Groups, Newsletters, and Metrics. He can search all the information about the Target Groups,
and he assigns the work to the Target Group person (Group Manager).
Target Groups:-In this module the Target Groups person has the task given by admin.
Target group persons are Group Manager, Group Leader, and Group Member.
Newsletters:- In this module the admin provides all the product information in the form of
advertisements, and that will be visible to all the group members.
Metrics:- In this module all the Target Groups and the admin can give surveys to each other
based on the products and customers.
Reports:-This module contains all the information about the reports generated by the admin
based on a particular user, a particular quotation, all customers or users, or all quotations
generated by the users.
Authentication:- This module contains all the information about the authenticated users. A
user cannot log in without his username and password; only an authenticated user can enter
his login, where he can see quotations and give a quotation for particular products.
Web crawlers typically identify themselves to a Web server by using the User-agent field of an
HTTP request. Web site administrators typically examine their Web servers' logs and use the user
agent field to determine which crawlers have visited the web server and how often. The user agent
field may include a URL where the Web site administrator may find out more information about the
crawler. It is important for Web crawlers to identify themselves so that Web site administrators can
contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or
they may be overloading a Web server with requests, and the owner needs to stop the crawler.
Identification is also useful for administrators that are interested in knowing when they may expect
their Web pages to be indexed by a particular search engine.
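For instance, a crawler built on the .NET HTTP stack could identify itself roughly as follows; the agent string and contact URL are placeholders, not the project's actual values.
using System;
using System.Net;

class UserAgentSketch
{
    static void Main()
    {
        // Identify the crawler to the web server via the User-agent field.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.UserAgent = "ProjectCrawler/1.0 (+http://example.com/about-crawler)";

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine("Status: " + response.StatusCode);
        }
    }
}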
Web scraping or Web data extraction is a computer software technique of extracting
information from websites. Usually, such software programs simulate human exploration of
the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP),
or embedding certain full-fledged Web browsers, such as Internet Explorer or Mozilla
Firefox. Web scraping focuses more on the transformation of unstructured data on the Web,
typically in HTML format, into structured data that can be stored and analyzed in a central
local database or spreadsheet, which can then be used to compare the sales of various
products and other relevant information to improve our strategies.
User interface:
It is essential to consult the system users and discuss their needs while designing the user
interface. User interface systems can be broadly classified as:
1. User initiated interface: The user is in charge, controlling the progress of the
user/computer dialogue.
2. Computer initiated interface: The computer guides the progress of the user/computer
dialogue; information is displayed and, based on the user's response, the computer takes
action or displays further information.
User initiated interfaces:
User initiated interfaces fall into two approximate classes:
1. Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
2. Forms oriented interface: The user calls up an image of the form on his/her screen and
fills in the form. The forms-oriented interface was chosen for this project; the following
forms are used:
3. A form named 'Web Frame' is used for the crawler. It contains an entry field where the
user can type a valid URL (web address), including the "http://" portion, in the text
field at the top of the application window.
4. A search/stop button is provided in the form, with the help of which the user can either
retrieve the search results or stop the crawler processing whenever required. Hence, to
search from a URL, the search button is clicked.
5. A screen containing a search panel providing an area for the user to input his search
query.
6. A results page which lists the links of the documents relevant to the given query.
Computer-Initiated Interfaces:
The following computer – initiated interfaces were used:
1. The menu system, in which the user is presented with a list of alternatives and chooses
one of them.
2. A question-answer type dialog system, where the computer asks a question and takes
action on the basis of the user's reply.
Communication Interfaces:
The crawler module of the search engine software uses the HTTP protocol to download the
pages from WWW. The user uses the search engine through browser.
Right from the start the system is menu driven: the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the user to a data entry form where the user can key in the data.
Error Message Design:
The design of error messages is an important part of the user interface design. As the user is
bound to commit some error or other while using the system, the system should be
designed to be helpful by providing the user with information regarding the error he/she has
committed.
SPECIFIC REQUIREMENTS
EXTERNAL INTERFACE REQUIREMENTS:
The whole system is controlled by various user-initiated and computer-initiated interfaces,
mainly used for logging into the system and for performing the various operations needed to
carry on the marketing process.
Name of the item Login
Description of purpose The Actor will give the user name and password to the
system. The system will verify the authentication.
Participating actors Admin, User
Entry conditions The actor will enter the system by using username and
password
Exit Conditions If not authenticated, the user should be exited from the system
Quality Requirements Password must satisfy the complexity requirements
Name of the item Admin Registration
Description of purpose The Admin will submit all the details and place in the
application.
Participating actors Admin
Entry conditions Must satisfy all the norms given by the interface site.
Exit Conditions Successful or unsuccessful completion of account creation.
Quality Requirements All fields are mandatory.
Name of the item User Registration
Description of purpose The User must enter all his personal details.
Participating actors User
Entry conditions View Home page
Exit Conditions Registered user should be successfully logged out. An error
message should be displayed on unsuccessful creation.
Quality Requirements Best Error Handling techniques. Check on Mandatory fields.
Name of the item Web crawling
Description of purpose The crawler starts searching the linked web pages from the
given URL.
Participating actors User
Entry conditions View Crawler application
Exit Conditions Error messages displayed if difficulty in searching
Quality Requirements Proper input format of the URL. Check on
mandatory fields.
Name of the item Web Scraping
Description of purpose The scraper application begins extracting content from the
web pages on being given the desired address of the page.
Participating actors User
Entry conditions View scraper application
Exit Conditions Error messages displayed if difficulty in extracting
Quality Requirements Proper input format of the URL of the web page. Check on
mandatory fields.
Name of the item Data Mining Application
Description of purpose To apply techniques of data mining in order to gain
knowledge from business data
Participating actors User
Entry conditions Data set in arff format or xml format
Exit Conditions Error message displayed if improper format
Quality Requirements Proper input of the dataset.
FUNCTIONAL REQUIREMENTS
INPUT DESIGN:
Input design is a part of overall system design. The main objectives during the input design
are as given below:
To produce a cost-effective method of input.
To achieve the highest possible level of accuracy.
To ensure that the input is acceptable and understood by the user.
INPUT STAGES:
The main input stages can be listed as below:
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
INPUT TYPES:
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
External inputs, which are prime inputs for the system.
Internal inputs, which are user communications with the system.
Operational, which are the computer department's communications to the system.
Interactive, which are inputs entered during a dialogue.
INPUT MEDIA:
At this stage a choice has to be made about the input media. To decide on the input
media, consideration has to be given to:
Type of input
Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Storage and handling requirements
Security
Easy to use
Portability
Keeping in view the above description of the input types and input media, it can be said that
most of the inputs are internal and interactive. Since input data is keyed in directly by the
user, the keyboard can be considered the most suitable input device.
OUTPUT DESIGN
Outputs from computer systems are required primarily to communicate the results of
processing to users. They are also used to provide a permanent copy of the results for later
consultation. The various types of outputs in general are:
External Outputs, whose destination is outside the organization.
Internal outputs, whose destination is within the organization and which are the
user's main interface with the computer.
Operational outputs whose use is purely within the computer department.
Interface outputs, which involve the user in communicating directly.
OUTPUT DEFINITION
The outputs should be defined in terms of the following points:
Type of the output
Content of the output
Format of the output
Location of the output
Frequency of the output
Volume of the output
Sequence of the output
It is not always desirable to print or display data exactly as it is held on a computer. It should
be decided which form of output is the most suitable. For example:
Will decimal points need to be inserted?
Should leading zeros be suppressed?
OUTPUT MEDIA:
In the next stage it is to be decided which medium is the most appropriate for the output.
The main considerations when deciding about the output media are:
The suitability for the device to the particular application.
The need for a hard copy.
The response time required.
The location of the users
The software and hardware available.
ERROR AVOIDANCE:
At this stage care is to be taken to ensure that the input data remains accurate from the stage at
which it is recorded up to the stage at which it is accepted by the system. This can be
achieved only by means of careful control each time the data is handled.
ERROR DETECTION:
Even though every effort is made to avoid the occurrence of errors, a small proportion of
errors is still likely to occur. These errors can be discovered by using validations
to check the input data.
DATA VALIDATION:
Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever invalid data is
keyed in, the system immediately prompts the user and the user has to key in the data again,
and the system will accept the data only if the data is correct. Validations have been included
where necessary.
The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed with
popup menus.
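As a minimal sketch of this kind of server-side check (the method name and the tested field are hypothetical, not the project's actual code), the URL entered in the crawler form could be validated before it is accepted:

using System;

class InputValidation
{
    // Returns true only for a well-formed absolute http/https URL;
    // otherwise the caller re-prompts the user for the field.
    static bool IsValidUrl(string input)
    {
        Uri parsed;
        return Uri.TryCreate(input, UriKind.Absolute, out parsed)
               && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
    }

    static void Main()
    {
        Console.WriteLine(IsValidUrl("http://www.amazon.com"));   // True
        Console.WriteLine(IsValidUrl("not a url"));               // False - the user is asked to re-enter
    }
}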
PERFORMANCE REQUIREMENTS
The product will be sensitive to bottlenecks, particularly if the number of accesses to the
services of the system is large and becomes difficult to manage. The number of crawlers
working at a time is determined dynamically depending on the available bandwidth. The average
response time for a user is 0.36 seconds. The expected accuracy of the output is 90%.
SAFETY REQUIREMENTS
If the crawler issues requests faster than the web server can handle, it may cause the web
server to crash. Hence a website developer should specify the speed supported.
LOGICAL DATABASE REQUIREMENTS
The product makes use of the inbuilt database of Microsoft Visual Studio 2010, which
consists of tables. The tables store the attributes related to each tool available on the tool
box. In the case of the web crawler, the process responds to the client message requesting a list of
URLs to retrieve. The retriever process opens many connections to web servers
simultaneously and downloads contents. The retrieved contents are stored on a local disk of
the client. The retriever returns two lists, retrieved URLs and found URLs. The found URLs
are the links which have been found in retrieved pages. It also extracts new URLs which have
not been retrieved yet and enqueues them. When the scraper is used to extract the various
contents from the web pages, the content is stored in the form of a spreadsheet, and
the large amount of data extracted acts as a database.
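The sketch below illustrates, under simplified assumptions, the enqueue/dequeue behaviour described above: a queue of URLs still to retrieve and a set of URLs already seen, with newly found links added only if they have not been retrieved yet. It is an outline of the idea only, not the project's actual retriever process.

using System;
using System.Collections.Generic;

class UrlFrontier
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Enqueue a URL only if it has not already been scheduled or retrieved.
    public void Enqueue(string url)
    {
        if (seen.Add(url))
            pending.Enqueue(url);
    }

    public bool TryDequeue(out string url)
    {
        if (pending.Count > 0) { url = pending.Dequeue(); return true; }
        url = null;
        return false;
    }

    static void Main()
    {
        var frontier = new UrlFrontier();
        frontier.Enqueue("http://example.org/");          // seed URL from the client
        frontier.Enqueue("http://example.org/page1");     // URL found in a retrieved page
        frontier.Enqueue("http://example.org/");          // duplicate - ignored

        string next;
        while (frontier.TryDequeue(out next))
            Console.WriteLine("retrieve: " + next);       // downloading and link extraction would happen here
    }
}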
GENERAL CONSTRAINTS AND ASSUMPTIONS
DESIGN CONSTRAINTS:
The product will face the following design constraints:
There is limited access for target group members as compared to the administrator.
The web crawler may not succeed in crawling all web sites, depending on their
structure; the same constraint applies to the web scraper.
OTHER CONSTRAINTS:
This product is a web based application; hence a major constraint on the performance will
be due to the bandwidth of the server's web connection. A faster bandwidth will result in
faster crawling of web pages.
SYSTEM ATTRIBUTES
Hardware Requirements:
PC with 160 GB hard-disk and 1 GB RAM
Software Requirements:
WINDOWS OS (Vista)
Visual Studio .Net 2005 Enterprise Edition
Internet Information Server 5.0 (IIS)
Visual Studio .Net Framework (Minimal for Deployment)
SQL Server 2000 Enterprise Edition
DESIGN DIAGRAMS
DATA FLOW DIAGRAMS
Figure 2: Business Process Management
Figure 3: Login Details in System
Figure 4: Web Crawler
Figure 5: Web Scraper
Figure 6: Steps to perform Apriori association mining
USE CASE DIAGRAMS
The business process management use case diagram shows the use cases Add Group, Add
Customer, Place an Order, View Customer Details, Generate Analysis Report, Update Order
Info, Update Product Details and Confirm Order (which <<includes>> Check Prerequisites
Met), with the actors Admin, Manager, Customer, Target Group, Product Services and the
Database. Use case 1 covers Login, with the actors Admin, Group Leader, Group Manager,
Group Member and the Database. Use cases 2-5 cover the crawler and extractor: Pass URL
to Download, Download the Page and Apply Extractor on Web Page, with the actors User
and File.
ACTIVITY DIAGRAMS
Registration Activity Diagram: the user's details are obtained and validated on submission;
on success the user is successfully registered.
Login Activity Diagram: the user name and password are obtained and validated on
submission; the login is then accepted or rejected.
Admin Activity Diagram: the admin's login details are validated; the admin can then process
project details or generate reports, each step being validated on submission.
Employee Activity Diagram: the employee's login details are validated; the employee can
then view task allocation and submit the project status.
Web Crawler Activity Diagram
Web Extractor Activity Diagram
E-R DIAGRAMS
SEQUENCE DIAGRAMS
SEQUENCE DIAGRAM FOR ADMINISTRATOR OF THE SYSTEM: the admin uses the URL to reach the
home page, presses the login button on the login page, and on successful validation is taken
to the admin home page; otherwise control returns to the login page.
SEQUENCE DIAGRAM FOR ADDING EMPLOYEE: from the admin home page the admin clicks the link
for the Add Employee page, enters the employee information and presses the button to save
the data; if validation fails control returns to the Add Employee Info page, otherwise the
data is written to the database and the confirmation page is shown.
SEQUENCE DIAGRAM FOR ADDING PRODUCTS: from the admin home page the admin clicks the link
for the Add Product page, enters the product information and presses the button to save the
data; if validation fails control returns to the Add Product Info page, otherwise the data is
written to the database and the confirmation page is shown.
SEQUENCE DIAGRAM FOR WEB CRAWLER
SEQUENCE DIAGRAM FOR WEB PARSER
RISK ANALYSIS
Risk Id: R-01
Risk Area: Requirements
Classification: Stability
Description: It refers to the degree to which the requirements are changing. The attribute
also includes issues that arise from the inability to control rapidly changing requirements.
Remarks: We estimate 3 possible projected changes to the requirements. These will be a
result of our realization of what is required and not required as we get further into
implementation, as well as a result of interaction with the customer and verification of the
customer's requirements.

Risk Id: R-02
Risk Area: Requirements
Classification: Feasibility
Description: The feasibility attribute refers to the difficulty of implementing a single
technical or operational requirement, or of simultaneously meeting conflicting requirements.
Remarks: Presently no such issue has arisen, as the system is a web interface that can be
implemented with very low risk estimates and provides easy access to users. The system
meets the organization's operating requirements and is also economically feasible.

Risk Id: R-03
Risk Area: Design
Classification: Functionality
Description: It covers functional requirements that may not submit to a feasible design, or
the use of specified algorithms or designs without a high degree of certainty.
Remarks: The techniques of crawling used in the implementation differ slightly from the
ones that were formulated at the time of design, as they had a higher degree of certainty of
satisfying the source requirements.

Risk Id: R-04
Risk Area: Design
Classification: Performance
Description: The performance attribute refers to time-critical performance: user and
real-time response requirements, throughput requirements, performance analyses, and
performance modelling throughout the development cycle.
Remarks: The performance of the process management system may decrease as the number
of user transactions increases, and in the case of the web crawler the searching time may
vary depending on the network traffic.
RISKS | CATEGORY | PROBABILITY | IMPACT | RE (P*I)
Changes in requirements | Stability | 20% | 5 | 1
Requirements are not properly stated | Clarity | 40% | 5 | 2
Technology will not meet expectations | Feasibility | 25% | 3 | 0.75
Lack of development experience | Coding & Implementation | 20% | 3 | 0.6
More stress of users than expected | Safety | 20% | 1 | 0.2
Less reuse than expected | Reliability | 20% | 1 | 0.2
Lack of database stability | Capacity | 40% | 3 | 1.2
Too many development errors | Testing | 50% | 3 | 1.5
Poor quality documentation | Maintainability | 35% | 1 | 0.35
Low estimation of time | Scale | 50% | 1 | 0.5
Poor comments in code | Maintainability | 20% | 1 | 0.2
Impact values: High = 5, Medium = 3, Low = 1
RISK MITIGATION PLAN
Risk | Mitigation Plan
Stability | Re-evaluate user requirements by interacting with the user. Document requirement
and operational procedure deviations.
Clarity | Request for information.
Functionality and Performance | Evaluate through prototyping. Consult other users with
similar requirements to see what their experiences have been with the product.
Safety Issues | Analyze the vulnerability of the system due to untrusted components and
determine if the system can be designed to reduce the vulnerability to an acceptable level.
Capacity and Maintainability | Use market research to determine the size and satisfaction of
the customer base. Conduct demonstrations and prototyping before final selection. Consult
other users with similar requirements.
Security | Select certified products in accordance with the system requirements if such
products are available. Design the system to encapsulate the non-secure products and limit
the vulnerabilities they create.
Morale | Task statements should specify: do early and frequent prototyping; do continuous
market research.
IMPLEMENTATION AND TESTING
IMPLEMENTATION
Business process management system
It was implemented using ASP.NET technologies. The database uses the constructs of MS SQL
Server, and all the user interfaces have been designed using ASP.NET. The database
connectivity is established using the "SQL Connection" methodology.
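A minimal sketch of this style of data access is given below; the connection string, table and column names are assumptions for illustration, not the project's actual schema.

using System;
using System.Data.SqlClient;

class OrderLookup
{
    static void Main()
    {
        // Illustrative connection string; the real one points at the project's MS SQL Server database.
        string connectionString = "Data Source=localhost;Initial Catalog=BusinessDB;Integrated Security=True";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(
                   "SELECT OrderId, Status FROM Orders WHERE CustomerId = @customerId", connection))
        {
            command.Parameters.AddWithValue("@customerId", 42);   // parameterised to avoid SQL injection
            connection.Open();

            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader["OrderId"], reader["Status"]);
            }
        }
    }
}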
Web crawler
The web crawler is implemented as a project in Microsoft Visual Studio and connects to the
web using the relevant constructs and functions available. Many inbuilt classes in Visual
Studio were also used.
Web extractor
The extraction techniques were implemented in order to convert the HTML contents into XML
form. One is a console application developed to gather all the contents into one XML
document, and the other is a Windows Forms application developed with regular-expression
techniques to extract the content. It makes use of the inbuilt Regex class in Visual
Studio.
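As a rough sketch of the regular-expression approach (the pattern below, which pulls image URLs out of downloaded HTML and writes them into an XML document, is only an illustration; the project's actual expressions for images, keywords, phone numbers and so on are more involved):

using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

class RegexExtractor
{
    static void Main()
    {
        // Illustrative page; the project extracts from the pages the crawler has downloaded.
        string html = new WebClient().DownloadString("http://example.org/");

        // Capture the src attribute of every <img> tag (simplified pattern).
        MatchCollection matches = Regex.Matches(html, "<img[^>]*src=[\"'](?<src>[^\"']+)[\"']",
                                                RegexOptions.IgnoreCase);

        XElement root = new XElement("images");
        foreach (Match m in matches)
            root.Add(new XElement("image", m.Groups["src"].Value));

        // The extracted content is stored as an XML document for later mining.
        new XDocument(root).Save("images.xml");
    }
}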
Implementation of data mining techniques
Data mining techniques are used to find useful knowledge in the XML output. We
have implemented the Apriori association algorithm, which takes an XML file as input and
develops the association rules that can be analysed later to obtain useful knowledge. In order
to gain a deeper knowledge of the data mining techniques applied in the field of e-business, we
have also implemented applets in Java that make use of the Weka tool package; they take the
available datasets in ARFF format as input and generate the expected output
that can be used for analysis.
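The core step of the Apriori idea, counting the support of candidate itemsets and keeping only those that meet a minimum support threshold, can be sketched as below. The transactions and the threshold are made-up toy data; the project's implementation additionally reads the items from the XML/ARFF input and generates association rules from the frequent itemsets.

using System;
using System.Collections.Generic;
using System.Linq;

class AprioriSketch
{
    static void Main()
    {
        // Toy transactions; in the project these come from the parsed web data.
        var transactions = new List<string[]>
        {
            new[] { "camera", "memory card", "tripod" },
            new[] { "camera", "memory card" },
            new[] { "camera", "battery" },
            new[] { "memory card", "battery" }
        };
        const int minSupport = 2;   // minimum number of transactions an itemset must appear in

        // Frequent 1-itemsets.
        var frequentItems = transactions.SelectMany(t => t)
                                        .GroupBy(item => item)
                                        .Where(g => g.Count() >= minSupport)
                                        .Select(g => g.Key)
                                        .ToList();

        // Candidate 2-itemsets are built only from frequent items (the Apriori pruning step),
        // then kept if their joint support also meets the threshold.
        for (int i = 0; i < frequentItems.Count; i++)
            for (int j = i + 1; j < frequentItems.Count; j++)
            {
                string a = frequentItems[i], b = frequentItems[j];
                int support = transactions.Count(t => t.Contains(a) && t.Contains(b));
                if (support >= minSupport)
                    Console.WriteLine("{{{0}, {1}}} support = {2}", a, b, support);
            }
    }
}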
ERROR AND EXCEPTION HANDLING
ASP.NET and .NET support a rich error-handling architecture that provides a flexible way to
catch and handle errors at multiple levels within an application. Specifically, a runtime
exception can be caught and handled within a try/catch block, within a page, or at the global
application level using the Application_Error event handler within the Global.asax class.
When the database is down or if the credentials in the connection string are invalid
then the method throws a SqlException. Exceptions were handled by the use of
try/catch/finally blocks.
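A minimal sketch of this pattern, reusing the illustrative connection string from the earlier example (not the project's real one):

using System;
using System.Data.SqlClient;

class DatabaseErrorHandling
{
    static void Main()
    {
        SqlConnection connection = new SqlConnection(
            "Data Source=localhost;Initial Catalog=BusinessDB;Integrated Security=True");
        try
        {
            connection.Open();   // throws SqlException if the server is down or the credentials are invalid
            // ... execute commands here ...
        }
        catch (SqlException ex)
        {
            // Log the database error and show a friendly message instead of crashing the page.
            Console.WriteLine("Database error: " + ex.Message);
        }
        finally
        {
            connection.Close();  // always release the connection
        }
    }
}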
System.NullReferenceException: Object reference not set to an instance of an object.
Compiler Error CS0071: An explicit interface implementation of an event must use
event accessor syntax. When explicitly implementing an event that was declared in an
interface, you must manually provide the add and remove event accessors that are
typically provided by the compiler.
Sys.WebForms.PageRequestManagerParserErrorException: The message received
from the server could not be parsed. Common causes for this error are when the
response is modified by calls to Response.Write(), response filters, HttpModules, or
server trace is enabled. Details: Error parsing near '.It was solved by removing the
button from the update panel.
System.Web.HttpException: An unhandled exception was generated during the
execution of the current web request.
Logic errors were a hurdle in getting the expected results.
For an ArgumentOutOfRangeException exception, the handler writes some text on
the page, provides a link back to the page, logs the error, and notifies system
administrators. For an InvalidOperationException exception, the handler simply
transfers the exception to the Generic Error Page. For any other kind of exception, the
handler does nothing, which allows your site to automatically redirect to the generic
page specified in the Web.config file.
RISK MANAGEMENT
The relevant stages in risk management are risk identification, risk planning and risk
monitoring. Of course, no retrospective assessment could be entirely accurate as risk
management is a process that starts at the beginning of the project and continues
throughout. For risk identification, we examine what possible risks could have
occurred as well as identify the setbacks that did occur. For risk planning, we consider
which risks could have been and can be avoided and what contingency plans could be and have
been made. For risk monitoring, we look at how we have monitored the risks
throughout the project and how we shall continue to do so.
WEIGHTED INTERRELATIONSHIP GRAPH
The weighted interrelationship graph relates the risk areas Performance, Stability, Clarity,
Feasibility, Coding & Implementation, Safety, Open Source Code, External Inputs, Personnel
Related, Scale, Maintainability, Testing, Reliability and Capacity, with edge weights of 9
(high), 3 (medium) and 1 (low) indicating the strength of the interrelationships.
Risk Area Wise Total Weighting Factor
S.No. | Risk Area | # of Risk Statements | Weights (In + Out) | Total Weight | Priority
1 | Performance | 9 | 9+3+3+3 | 18 | 1
2 | Stability | 4 | 9+3 | 12 | 6
3 | Clarity | 3 | 3+3 | 6 | 8
4 | Feasibility | 2 | 3+9 | 12 | 9
5 | Coding & Implementation | 8 | 9+9+3+3 | 24 | 3
6 | Safety | 5 | 3+3 | 6 | 7
7 | Reliability | 3 | 1+3+3 | 7 | 5
8 | Open Source Code | 4 | 3+3+1 | 7 | 4
9 | Testing | 4 | 9+3+1 | 13 | 10
10 | External Input | 3 | 1+3+3 | 7 | 2
TESTING
Type of Test | Will the test be performed? | Comments/Explanations | Software Component

Requirements Testing | Yes | Requirements are testable, clear, consistent and complete with
respect to the specifications; they should not be ambiguous, incomplete or invalid. Ideal
requirements clearly define the expected behaviour under normal usage and exceptional
workflows. | It is done before implementation to check that the requirements are written in a
simple manner, emphasizing the business need only, without forcing implementation methods.

Unit Testing | Yes | Individual units of source code are tested, and the goal is to isolate
each part of the program and show that the individual parts are correct. Unit testing by
definition only tests the functionality of the units themselves; therefore, it will not catch
integration errors or broader system-level errors. | Write test cases for all functions and
methods so that whenever a change causes a fault, it can be quickly identified and fixed.
Performed on all the classes that are independent and not linked with the database and other
classes.

Integration Testing | Yes | The objective of integration testing is to make sure that the
interaction of two or more components produces results that satisfy the functional
requirements. In integration testing, test cases are developed with the express purpose of
exercising the interfaces between the components. Integration testing is complete when all
the interfaces where components interact with each other are covered. | Assumptions are made
on receiving data from different components and passing data to different components.
Integration testing tests a class while it is integrated with other classes and those linked
with the MySQL database, like the target group data and the orders that will be placed.

Performance Testing | Yes | Designed to test the runtime performance of software within the
context of an integrated system. It is used to determine the speed or effectiveness of a
computer, network, software program or device. | Test application performance on different
internet connection speeds. Performed on the web crawler and extractor applications to check
how effectively they search and produce the results, and to evaluate qualitative attributes
such as reliability, scalability and interoperability.

Stress Testing | Yes | Greater emphasis on robustness, availability and error handling under
a heavy load, rather than on what would be considered correct behaviour under normal
circumstances. | Identify the maximum expected number of users during peak load conditions
for the application.

Compliance Testing | Yes | It is basically an audit of a system carried out against a known
criterion. It is related to the IT standards followed by the company, and it is the testing
done to find the deviations from the company's prescribed standards. | Verification that the
intended system under development meets the configuration and lockdown standards requested
by the customer. Database servers: Microsoft SQL Server. Operating systems: Microsoft
Windows Vista, Microsoft Windows Server 2003.

Security Testing | Yes | Attempts to verify the protection mechanisms built into the system.
It is an indispensable part of the Web application development life cycle due to the increase
in privacy breaches in businesses and organizations. | Test by pasting an internal URL
directly into the browser address bar without logging in; internal pages should not open. Try
some invalid inputs in input fields like the login username, password and input text boxes,
and check the system's reaction to all invalid inputs.

Load Testing | Yes | Load testing helps to identify the maximum operating capacity of the
application and any bottlenecks that might be degrading performance. | Response time: for
example, the product catalogue must be displayed in less than 3 seconds. Resource
utilization: a frequently overlooked aspect is the amount of resources the application is
consuming, in terms of processor, memory, disk input/output (I/O) and network I/O.

Volume Testing | Yes | Volume testing refers to testing a software application with a certain
amount of data. | The application is tested with a specific database size by expanding the
database to a particular size and then testing the application's performance on it.

Functionality Testing | Yes | Functionality testing is done for all the links in the web
pages, the database connection, and the forms used in the web pages for submitting or getting
information from the user. | Test the outgoing links from all the pages of the specific domain
under test and test all internal links, links jumping within the same pages, and links used to
send email to the admin or other users from the web pages. Check that data is retrieved
correctly and also updated correctly from the database.
TEST TEAM DETAILS
ROLE NAME SPECIFIC RESPONSIBILITIES
Software Tester Dhruv Goel Performance, Unit, Integration Testing
Software Tester Shraddha Singh Performance, Unit, Integration Testing
Software Tester Dhruv, Shraddha Requirement Testing, Unit Testing, Stress & Load
Testing
TEST SCHEDULE
ACTIVITY | START DATE | COMPLETION DATE | HOURS | COMMENTS
Requirements were analysed | 10/09/2011 | 10/09/2011 | 2 hrs | Gathered requirements were found to be clear, consistent and complete
Requirements were analysed | 10/10/2011 | 10/10/2011 | 2 hrs | Gathered requirements were found to be clear, consistent and complete
Login as administrator with username as admin and password as admin | 10/20/2011 | 10/20/2011 | 30 min | Login successful
Admin employee details | 10/29/2011 | 10/29/2011 | 45 min | Employee details were added and account created; also viewed all employee details
Tasks list and order information | 11/03/2011 | 11/03/2011 | 1 hr | View assigned tasks and view order details, order status
Web crawler connection with URLs | 11/11/2011 | 11/11/2011 | 30 min | Crawler could connect with the web
Crawling process output | 11/15/2011 | 11/15/2011 | 1 hr | Web pages viewed sequentially
Parsing of HTML into XML for amazon.com specifically | 11/18/2011 | 11/18/2011 | 1 hr | XML generated
Web extraction based on regular expressions | 11/20/2011 | 11/20/2011 | 45 min | Got required results for every module like images, keywords, phones etc.
Apriori application | 25/11/2011 | 25/11/2011 | 30 min | Results verified but with simple XML
Data mining applets | 30/11/2011 | 30/11/2011 | 45 min | Got expected results with both algorithms
TEST ENVIRONMENT
SOFTWARE ITEMS
Operating System: Windows Vista
Visual Studio .Net 2005 Enterprise Edition
Internet Information Server 5.0 (IIS)
SQL Server 2000 Enterprise Edition
HARDWARE ITEMS
HDD 20 GB Hard Disk Space and Above
RAM 512MB and Above
FEATURES TO BE TESTED
Admin login
Employee Login
Customer Login
Customer and Group member Interaction via email
Add/update employee information
Add/update customer information
Search / Lookup employee information
Escape to return to Main Menu
Security features
Scaling to 700 employee records
Error messages
Report Printing
Screen mappings (GUI flow). Includes default settings
Order placement by customer
Order confirmation by Group manager/Administrator
Check the resources are efficiently used like processor and network bandwidth.
Web crawler should operate in continuous mode: it should obtain fresh copies of
previously fetched pages.
Web crawler is searching and downloading the web pages efficiently
Extractor is getting the contents of the HTML webpage properly in XML format. The
data mining algorithms are generating the results in the proper manner as expected.
FEATURES NOT TO BE TESTED
Order Entry processes.
Only the data interface of the Order Entry application will be verified. Changes to the
interface to support reassigned sales are not anticipated to have an impact on the Order
Processing application. Order Entry is a separate application sharing the data interface only;
orders will continue to be processed in the same manner.
PC based dataset analysis applications using products data.
These applications are completely under the control of the administrator and are outside the
scope of this project. The necessary data base format information will be provided to the
customers to allow them to extract data. Testing of their applications is the responsibility of
the application maintainer/developer.
Business Analysis functions.
These applications are completely under the control of the management support team and are
outside the scope of this project. The necessary data base format information will be provided
to the support team to allow them to extract data. Testing of their applications is the
responsibility of the application maintainer/developer.
APPROACH FOR TESTING
Unit Testing: Unit testing will be done by the developer and will be approved by the
development team leader.
Validation Testing: At the end of integration testing the software is completely assembled as a
package. Validation testing is the next stage, which can be defined as successful when the
software functions in the manner reasonably expected by the customer. Reasonable
expectations are those defined in the software requirements specifications; the information
contained in those sections forms a basis for the validation testing approach.
System Testing: System testing is actually a series of different tests whose primary purpose is
to fully exercise the computer-based system. Although each test has a different purpose, all
work to verify that all system elements have been properly integrated to perform allocated
functions.
Recovery Testing: It is a system test that forces the system to fail in a variety of ways and
verifies that the recovery is properly performed.
Security Testing: Attempts to verify the protection mechanisms built into the system.
Performance Testing: This method is designed to test runtime performance of software
within the context of an integrated system.
Stress Testing: Stress testing can be defined as performing the sequences of actions at larger
than normal volumes, at faster than normal speeds and for longer than normal periods of time
as a method to accelerate the rate of finding defects and verify the robustness of the product.
Stress testing in its simplest form is any test that repeats a set of actions over and over with the
purpose of "breaking the product".
ITEM PASS/FAIL CRITERIA
The whole project consists of four modules from the business project management system to
the data analysis with data mining techniques. The system is declared pass when it
successfully allows the administrator, group members and customers to login in to the system
with the respective restrictions applicable to all. The customer is able to successfully select
products and place their order with the respective group manager. Further, the group members
are able to view the orders and accordingly accept or reject them depending on the
availability. Moreover, the mining process consists of gathering data by means of the crawler
and then generating useful knowledge from the extracted web data. Thus, the output of the
algorithms shows that the system runs successfully.
TEST CASES
S.No. | Input | Expected behaviour | Status (P = Passed, F = Failed)
1 | Login as administrator with username as admin and password as admin | Home page for administrator should be displayed | Passed
2 | Admin employee details | Add employee details and create the account; view all employee details | Passed
3 | Admin groups | Add new group details and group schedule of employees | Passed
4 | Product details | Adding a new product and viewing all product details | Passed
5 | Reports | Should display all the details of the matrices of customers | Passed
6 | Search | Admin can perform all types of group search and product search | Passed
7 | Group manager details | Group manager can maintain all details of new leaders and assign tasks | Passed
8 | Group leader details | Group leaders can maintain all details of new leaders, assign tasks and view all members | Passed
9 | Tasks list and order information | View assigned tasks and view order details, order status | Passed
10 | Group member details | View all the details of customers | Passed
11 | Order details | View all the details of the order list and order status | Passed
12 | Validate the user inputs | Validate all input and output details | Passed
13 | Target information | View all the customer target information of products | Passed
14 | Customer details | Register all the details of the customer, order the product and check the order status | Passed
15 | Web crawler | Starts to download web pages on taking a URL as input | Passed
16 | Extraction HTML to XML | Takes an HTML web page as input and generates an XML document | Passed
17 | Converting XML to CSV to ARFF | The XML document cannot be converted in all cases because of structures that could not be parsed | Failed
18 | Data mining application | Takes XML or ARFF format as input and gives the mining result for the algorithm used | Passed
CONCLUSION
It has been our great pleasure to work on this exciting and challenging project. This project
proved good for us as it provided practical knowledge not only of programming web-based
applications in ASP.NET and C#.NET, and to some extent Windows applications and SQL
Server, crawling techniques, parsing and various data representation formats, but also
about all the handling procedures related to Business Process Management. It also provides
knowledge about the latest technology used in developing web-enabled applications and client-
server technology that will be in great demand in the future. This will provide better
opportunities and guidance for developing projects independently in the future.
BENEFITS:
The project is identified by the merits of the system offered to the user. The merits of this
project are as follows: -
It's a web-enabled project.
This project allows the user to enter data through simple and interactive forms. This is
very helpful for the client, who can enter the desired information with great simplicity.
The user is mainly concerned with the validity of the data he is entering.
There are checks at every stage of any new creation, data entry or updation, so that the
user cannot enter invalid data, which could create problems at a later date.
Sometimes the user finds in the later stages of using the project that he needs to update some
of the information that he entered earlier. There are options by which he can
update the records. Moreover, there is a restriction that he cannot change the primary
data field. This preserves the validity of the data to a longer extent.
The user is provided with the option of monitoring the records he entered earlier. He can see the
desired records with the variety of options provided to him.
From every part of the project the user is provided with links through framing so that
he can go from one option of the project to another as required. This is bound to
be simple and very friendly as far as the user is concerned. That is, we can say that the
project is user friendly, which is one of the primary concerns of any good project.
Data storage and retrieval will become faster and easier to maintain because data is stored
in a systematic manner and in a single database.
The decision-making process would be greatly enhanced because of the faster processing of
information, since data collection from information available on the computer takes much less
time than in a manual system.
Allocation of sample results becomes much faster because the user can see the
records of previous years at one time.
Easier and faster data transfer through latest technology associated with the computer and
communication.
Through these features it will increase the efficiency, accuracy and transparency.
LIMITATIONS:
The size of the database increases day-by-day, increasing the load on the database back
up and data maintenance activity.
Training for simple computer operations is necessary for the users working on the
system.
Certain websites have quite different structures, and thus the XML document
generated by the HTML parser proves to be a hurdle in the way of getting a proper dataset
in ARFF format to be analyzed using the data mining application.
FUTURE WORK
The system is just in its initial phase and has to be made comparable to other e-business
sites dominant in the market and provide the best of facilities to its customers and also
to the member teams.
This System being web-based and an undertaking of Cyber Security Division, needs to be
thoroughly tested to find out any security gaps.
A console for the data centre may be made available to allow the personnel to monitor on
the sites which were cleared for hosting during a particular period.
Moreover, it is just a beginning; further the system may be utilized in various other types
of auditing operation viz. Network auditing or similar process/workflow based
applications.
We have focused only on the web content mining area of web mining; it can further be
extended to web structure and web usage mining in order to automatically identify the
location of the customers buying the products by means of networking and IP address logs.
Currently, a lot of research work is active in this field.
REFERENCES
1. Luo Hanyang, Gao Jinling, Ji Wenli, "Research on Data Mining in E-business",
International Conference on Computer Science and Software Engineering, 2008.
2. Ansari S., Kohavi R., Mason L., Zijian Zheng, "Integrating E-commerce and Data
Mining: Architecture and Challenges", Proceedings of the IEEE International Conference on
Data Mining, 2001, pp. 27-34.
3. Bing Liu, Kevin Chen-Chuan Chang, "Editorial: Special Issue on Web Content Mining",
Issue 2, 2004.
4. Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, Sean Slattery, "Data
Mining on Symbolic Knowledge Extracted from the Web", 2009.
5. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S.
Slattery, "Learning to Construct Knowledge Bases from the World Wide Web", 2000.
6. Istrate Mihai, "Web Mining in E-Commerce", 2008.
7. N. Girija, "Web Mining", ICFAI University Press, Hyderabad, India, 2006.
8. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd ed.,
Morgan Kaufmann, 2006.
9. Tutorial on Web Scraping:
http://www.codediesel.com/php/web-scraping-in-php-tutorial/
10. www.mozenda.com
11. www.kdnuggets.com
12. http://www.devbistro.com/articles/Misc/Implementing-Effective-Web-Crawler
13. http://www.codeproject.com/KB/IP/Crawler.aspx
14. http://www.progmic.com/2010/03/how-to-make-web-crawler-in-c#/
15. Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
16. Dr. Lee, Sin-Min, San Jose State University
17. Mu-Yu Lu, SJSU
18. Silberschatz, Korth, Sudarshan, "Database System Concepts"
19. D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, Li Xu,
Department of Computer Science
20. Kotb, Y., Gondow, K., Katayama, T., "XML Semantics", in: Scime, A. (Ed.), Web
Mining: Applications and Techniques, Idea, London, pp. 169-188.
APPENDICES
APPENDIX A: WORK PLAN
S.No. START DATE END DATE WORK
DESCRIPTION
1 26/7/2011 31/7/2011 LITERATURE
SURVEY
2 4/8/2011 20/9/2011 Studied about the
various techniques
and approach
applicable
3 5/9/2011 15/9/2011 Worked on
developing the
business process
system different
modules with its
testing
4 5/10/2011 21/10/2011 Implemented the
web crawler class
+testing
5 5/11/2011 15/11/2011 Extraction tool
based on regular
expressions
6 7/11/2011 13/11/2011 Parser from html to
xml
7 15/11/2011 20/11/2011 Implemented the
Apriori algorithm
8 21/11/2011 23/11/2011 Gathering data sets
in arff format
9 22/11/2011 24/11/2011 Applets were
implemented for
data mining
algorithms +testing
10 26/11/2011 5/12/2011 Working on parsing
xml to arff
11 3/12/2011 6/12/2011 Debugging errors of
parsing
12 6/12/2011 7/12/2011 Testing of the whole
project
APPENDIX –B
DESCRIPTION OF TOOLS USED IN IMPLEMENTATION
Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. It
is used to develop console and graphical user interface applications along with Windows
Forms applications, web sites, web applications, and web services in both native code
together with managed code for all platforms supported by Microsoft Windows, Windows
Mobile, Windows CE, .NET Framework, .NET Compact Framework and Microsoft
Silverlight. Visual Studio includes a code editor supporting IntelliSense as well as code
refactoring. The integrated debugger works both as a source-level debugger and a machine-
level debugger. Other built-in tools include a forms designer for building GUI applications,
web designer, class designer, and database schema designer. It accepts plug-ins that enhance
the functionality at almost every level—including adding support for source-control systems
(like Subversion and Visual SourceSafe) and adding new toolsets like editors and visual
designers for domain-specific languages or toolsets for other aspects of the software
development lifecycle (like the Team Foundation Server client: Team Explorer).Visual
Studio supports different programming languages by means of language services, which
allow the code editor and debugger to support (to varying degrees) nearly any programming
language, provided a language-specific service exists. Built-in languages include C/C++ (via
Visual C++), VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual
Studio 2010). Support for other languages such as M, Python, and Ruby among others is
available via language services installed separately. It also supports XML/XSLT,
available via language services installed separately. It also supports XML/XSLT,
HTML/XHTML, JavaScript and CSS. Individual language-specific versions of Visual Studio
also exist which provide more limited language services to the user: Microsoft Visual Basic,
Visual J#, Visual C#, and Visual C++. Microsoft provides "Express" editions of its Visual
Studio 2010 components Visual Basic, Visual C#, Visual C++, and Visual Web Developer at
no cost. Visual Studio 2010, 2008 and 2005 Professional Editions, along with language-
specific versions (Visual Basic, C++, C#, J#) of Visual Studio Express 2010 are available for
free to students as downloads via Microsoft's DreamSpark program.
NetBeans refers to both a platform framework for Java desktop applications, and an
integrated development environment (IDE) for developing with Java, JavaScript, PHP,
Python (no longer supported after NetBeans 7), Groovy, C, C++, Clojure, and others. The
NetBeans IDE 7.0 no longer supports Ruby and Ruby on Rails, but a third party has begun
work on a separate plug-in. The NetBeans IDE is written in Java and can run anywhere a
compatible JVM is installed, including Windows, Mac OS, Linux, and Solaris. A JDK is
required for Java development functionality, but is not required for development in other
programming languages. The NetBeans platform allows applications to be developed from a
set of modular software components called modules. Applications based on the NetBeans
platform (including the NetBeans IDE) can be extended by third party developers. The
NetBeans Platform is a reusable framework for simplifying the development of Java Swing
desktop applications. The NetBeans IDE bundle for Java SE contains what is needed to start
developing NetBeans plug-in and NetBeans Platform based applications; no additional SDK
is required. Applications can install modules dynamically. Any application can include the
Update Center module to allow users of the application to download digitally-signed
upgrades and new features directly into the running application. Reinstalling an upgrade or a
new release does not force users to download the entire application again. The platform offers
reusable services common to desktop applications, allowing developers to focus on the logic
specific to their application.