
1. INTRODUCTION

Data mining is the discovery of unknown patterns from both heterogeneous and homogeneous databases. Secure data mining helps to discover association rules that are shared by homogeneous databases (same schema, but the data is held by different entities). The algorithm finds the union and intersection of the association rules, with the support and confidence that hold in the total database, while ensuring that the data held by each player remains authenticated. It is estimated that the volume of data in the digital world increased from 161 exabytes in 2007 to 998 exabytes in 2011, about 18 times the amount of information present in all the books ever written, and it continues to grow exponentially. This large amount of data has a direct impact on Computer Data Inspection, which can be broadly defined as the discipline that combines several elements of data and computer science to collect and analyze data from computer systems in a way that is admissible, which requires consistency across the many collected data fields and means examining hundreds of thousands of files per computer. This activity exceeds the expert's capacity to analyze and interpret the data. Therefore, methods for automated data analysis, like those widely used in machine learning and data mining, are of paramount importance. In particular, algorithms for pattern recognition from the information present in text documents are promising.

Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data. This is precisely the case in several applications of Computer Data Inspection, including the one addressed in our work. From a more technical viewpoint, our datasets consist of unlabeled objects: the classes or categories of documents that can be found are a priori unknown. Moreover, even assuming that labeled datasets could be available from previous analyses, there is almost no hope that the same classes (possibly learned earlier by a classifier in a supervised learning setting) would still be valid for the upcoming data, obtained from other computers and associated with different investigation processes. More precisely, it is likely that the new data would come from a different population. In this context, the use of clustering algorithms, which are capable of finding latent patterns in the text documents found in seized computers, can enhance the analysis performed by the expert examiner. Clustering algorithms have been studied for decades, and the literature on the subject is huge. Therefore, we


decided to choose a set of representative algorithms in order to show the potential of the proposed approach, namely: the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link algorithms, the cluster ensemble algorithm known as CSPA, and the cosine similarity function. These algorithms were run with different combinations of their parameters, resulting in several different algorithmic instantiations. Thus, as a contribution of our work, we compare their relative performances on the studied application domain, using different sample text datasets containing information on topics such as sports, food habits, culture, and animals.

1.1 Background and motivation

The main scope of this project is computer data analysis, in which hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis.

It is well-known that the number of clusters is a critical parameter of many algorithms and it is

usually a priori unknown. As far as we know, however, the automatic estimation of the number

of clusters has not been investigated in the Computer Data Analysis literature. Actually, we

could not even locate one work that is reasonably close in its application domain and that reports

the use of algorithms capable of estimating the number of clusters. Perhaps even more surprising

is the lack of studies on hierarchical clustering algorithms, which date back to the sixties.

1.2 Problem Statement

The problem addressed is how to identify documents stored in scattered locations inside a computer during a computer inspection. Organizations carry out computer inspections regularly in order to locate particular data, and at present it is very difficult to identify that data with existing algorithms. We have therefore proposed a new system that identifies documents easily and clusters them according to the matched attributes present in the system.

2. LITERATURE SURVEY

The literature survey is an important step in the software development process. Before developing the tool it is necessary to determine the time factor, the economy, and the company's strength. Once these things are satisfied, the next step is to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool they need a lot of external support. This support can be obtained from senior programmers, from books, or from websites. The above considerations are taken into account before building the proposed system.

2.1 Cluster ensembles: A knowledge reuse framework for combining multiple

partitions

This project introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge

reuse' framework that we call cluster ensembles. The cluster ensemble problem is then

formalized as a combinatorial optimization problem in terms of shared mutual information. In

addition to a direct maximization approach, we propose three effective and efficient techniques

for obtaining high-quality combiners (consensus functions). The first combiner induces a

similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters

which then compete for each object to determine the combined clustering. Due to the low

computational costs of our techniques, it is quite feasible to use a supra-consensus function that

evaluates all three approaches against the objective function and picks the best solution for a

given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different

application scenarios: (i) where the original clusters were formed based on non-identical sets of

features, (ii) where the original clustering algorithms worked on non-identical sets of objects,

and (iii) where a common data-set is used and the main purpose of combining multiple

clusterings is to improve the quality and robustness of the solution. Promising results are

obtained in all three situations for synthetic as well as real data-sets.


2.2 Evolving clusters in gene-expression data

Clustering is a useful exploratory tool for gene-expression data. Although successful

applications of clustering techniques have been reported in the literature, there is no method of

choice in the gene-expression analysis community. Moreover, there are only a few works that

deal with the problem of automatically estimating the number of clusters in bioinformatics

datasets. Most clustering methods require the number k of clusters to be either specified in

advance or selected a posteriori from a set of clustering solutions over a range of k. In both cases,

the user has to select the number of clusters. This project proposes improvements to a clustering

genetic algorithm that is capable of automatically discovering an optimal number of clusters and

its corresponding optimal partition based upon numeric criteria. The proposed improvements are

mainly designed to enhance the efficiency of the original clustering genetic algorithm, resulting

in two new clustering genetic algorithms and an evolutionary algorithm for clustering (EAC).

The original clustering genetic algorithm and its modified versions are evaluated in several runs

using six gene-expression datasets in which the right clusters are known a priori. The results

illustrate that all the proposed algorithms perform well in gene-expression data, although

statistical comparisons in terms of the computational efficiency of each algorithm point out that

EAC outperforms the others. Statistical evidence also shows that EAC is able to outperform a

traditional method based on multiple runs of k-means over a range of k.

2.3 Exploring data with self-organizing maps

This project discusses the application of a self-organizing map (SOM), an unsupervised

learning neural network model, to support decision making by computer investigators and assist

them in conducting data analysis in a more efficient manner. A SOM is used to search for

patterns in data sets and produce visual displays of the similarities in the data. The project

explores how a SOM can be used as a basis for further analysis. Also, it demonstrates how SOM

visualization can provide investigators with greater abilities to interpret and explore data

generated by computer tools.


2.4 Digital text string searching: Improving information retrieval effectiveness

by thematically clustering search results

Current digital text string search tools use match and/or indexing algorithms to search

digital evidence at the physical level to locate specific text strings. They are designed to achieve

100% query recall (i.e. find all instances of the text strings). Given the nature of the data set, this

leads to an extremely high incidence of hits that are not relevant to investigative objectives.

Although Internet search engines suffer similarly, they employ ranking algorithms to present the

search results in a more effective and efficient manner from the user's perspective. Current

digital forensic text string search tools fail to group and/or order search hits in a manner that

appreciably improves the investigator's ability to get to the relevant hits first (or at least more

quickly). This project proposes and empirically tests the feasibility and utility of post-retrieval

clustering of digital text string search results.

This project is presented as a work-in-progress. A working tool has been developed and

experimentation has begun. Findings regarding the feasibility and utility of the proposed

approach will be presented, as well as suggestions for follow-on research.

2.5 Towards an integrated e-mail analysis framework

Due to its simple and inherently vulnerable nature, e-mail communication is abused for

numerous illegitimate purposes. E-mail spamming, phishing, drug trafficking, cyber bullying,

racial vilification, child pornography, and sexual harassment are some common e-mail mediated

cyber crimes. Presently, there is no adequate proactive mechanism for securing e-mail systems.

In this context, this analysis plays a major role by examining suspected e-mail accounts to gather

evidence to prosecute criminals in a court of law. To accomplish this task, a forensic investigator

needs efficient automated tools and techniques to perform a multi-staged analysis of e-mail

ensembles with a high degree of accuracy, and in a timely fashion. In this article, we present our

e-mail forensic analysis software tool, developed by integrating existing state-of-the-art

statistical and machine-learning techniques complemented with social networking techniques. In

this framework we incorporate our two proposed authorship attribution approaches.


3. SYSTEM REQUIREMENTS

3.1 Requirement Analysis Document

Requirement analysis is the first phase in the software development process. The main objective of this phase is to identify the problem and the system to be developed. The later phases are strictly dependent on this phase, and hence the system analyst must be clear and precise about it. Any inconsistency in this phase will lead to a lot of problems in the phases that follow. Hence there will be several reviews before the final copy of the analysis of the system to be developed is made. After the analysis is completed, the system analyst submits the details of the system to be developed in the form of a document called the requirement specification.

The requirement analysis task is a process of discovery, refinement, modeling, and specification. The software scope, initially established by a system engineer and refined during software project planning, is refined in detail. Models of the required data, information and control flow, and operational behavior are created. Alternative solutions are analyzed and allocated to the various software elements.

Both the developer and the customer take an active role in requirement analysis and specification. The customer attempts to reformulate a sometimes-nebulous concept of software function and performance into concrete detail. The developer acts as interrogator, consultant, and problem solver. The communication content is very high, so chances for misinterpretation and misinformation abound, and ambiguity is probable.

Requirement analysis is a software engineering task that bridges the gap between system-level software allocation and software design. Requirement analysis enables the system engineer to specify software function and performance, indicate the software's interface with other system elements, and establish the constraints that the software must meet. It allows the software engineer, often called the analyst in this role, to refine the software allocation and build models of the data, functional, and behavioral domains that will be treated by the software.

Requirement analysis provides the software designer with models that can be translated into data, architectural, interface, and procedural designs. Finally, the requirement specification provides the developer and the customer with the means to assess quality once the software is built.


3.1.1 Functional Requirements

The functional requirements of the system define the functions of the software system or its components. A function is described as a set of inputs, the behavior of the system, and outputs.

The functional requirements comprise three parts:

1) Input

2) Output

3) Data Storage

1) Input

The following inputs are performed in the application:

a. User selects a text file as the input data set.
b. User clicks the Stop Words button in order to remove the unwanted words (i.e., words other than nouns, verbs, and adverbs).
c. User clicks the Stemming button to remove duplicate attributes.
d. User clicks the Calculation button in order to get the result.
e. User clicks K-means to generate clusters with ids.
f. User clicks the Distance Calculation button in order to calculate the distance between attributes.
g. User clicks the Incremental button to generate clusters.
h. User clicks the Purity button to get the purity values of K-means and incremental clustering.

2) Output

The following outputs are generated as the user proceeds:

a. User gets a message "Data Selected".
b. User gets a message "Stopwords removal completed".
c. User gets a message "File Selected Successfully" after choosing a valid input file.
d. User gets a failure message "not a valid type" after choosing an invalid input file type.
e. User gets the filtered words with no duplication after clicking the Stemming button.
f. User gets the cluster ids and cluster values after clicking K-means.
g. User gets the distance matrix values after clicking Generate Distance Matrix.


h. User gets the processed cosine similarity values after choosing incremental clustering.
i. User gets the graph comparing the purity values of K-means and incremental clustering.

3) Data Storage

Here we use the MySQL database to store all the registration details. We use MySQL as the back end in this project because it has the following advantages:

It is GUI in nature.
It is cross-platform (i.e., it can run and reside on any operating system).
It has a feature called auto-commit.
It takes very little space to install on any system (hardly less than 30 MB).

3.1.2 Non-Functional Requirements

The non-functional requirements are as follows:

1) Reusability: As we developed the application in Java, the application can be reused by anyone without any restrictions on its usage. Hence it is reusable.
2) Portability: As the application is designed with Java as the programming language, and Java can run on any operating system, the application is portable to any operating system.
3) Extensibility: The application can be extended at any level if the user wishes to extend it in the future. This is possible because Java is open source and does not impose any time limits for expiry or renewal.


Requirements

Hardware Requirements

System: Pentium IV, 2.4 GHz
Hard Disk: 40 GB
Floppy Drive: 1.44 MB
Monitor: 15'' VGA colour
Mouse: Logitech
RAM: 512 MB

Table 3.1: Hardware Requirements

Software Requirements

Operating System: Windows XP
Coding Language: Java (Swing)
Database: MySQL

Table 3.2: Software Requirements


4. DESIGNING

4.1 Design Considerations

Software design is a process of problem solving and planning for a software solution. After the purpose and specifications of the software are determined, software developers design, or employ designers to develop, a plan for a solution. It includes low-level component and algorithm implementation issues as well as the architectural view.

4.1.1 Assumptions and Dependencies

It is assumed that the system will be deployed on Windows 7 or a later operating system. A working Visual Studio 2010 or above is necessary.

4.1.2 General Constraints

This project is a desktop-based application developed in Java technology. A major constraint is to provide security for the information. In our project we use a symmetric cryptography algorithm, and the cipher text and the key follow different paths.

4.1.3 Development Methods

A system development methodology refers to the framework that is used to structure, plan, and control the process of developing an information system. The following diagram explains the stages.


Figure 4.1: Waterfall Model

Requirement Analysis and Definition

All possible requirements of the system to be developed are captured in this phase.

Requirements are a set of functions and constraints that the end user (who will be using the

system) expects from the system. The requirements are gathered from the end user at the start of

the software development phase. These requirements are analyzed for their validity and the

possibility of incorporating the requirements in the system to be developed is also studied.

Finally, a requirement specification document is created, which serves as a guideline for the next phase of the model.

System and Software Design

Before starting the actual coding phase, it is highly important to understand the requirements of the end user and also to have an idea of how the end product should look. The requirement

specifications from the first phase are studied in this phase and a system design is prepared.

System design helps in specifying hardware and system requirements and also helps in defining

the overall system architecture. The system design specifications serve as an input for the next

phase of the model.

Implementation and Unit Testing

On receiving system design documents, the work is divided in modules/units and actual

coding is started. The system is first developed in small programs called units, which are

integrated in the next phase. Each unit is developed and tested for its functionality; this is


referred to as unit testing. Unit testing mainly verifies if the modules/units meet their

specifications.

Integration and System Testing

As specified above, the system is first divided into units which are developed and tested

for their functions. These units are integrated into a complete system during integration phase

and tested to check if all modules/units coordinate with each other and the system as a whole

behaves as per the specifications. After successfully testing the software, it is delivered to the

customer.

4.2 System Design

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.

Figure 4.2: Data Flow Diagram

(Stages in the DFD: Documents → Preprocessing → Term Frequency → Similarity Calculation → Cluster Formation → Query Results)

1. The data flow diagram (DFD) is one of the most important modeling tools. It is used to

model the system components. These components are the system process, the data used

by the process, an external entity that interacts with the system and the information flows

in the system.

2. DFD shows how the information moves through the system and how it is modified by a

series of transformations. It is a graphical technique that depicts information flow and the

transformations that are applied as data moves from input to output.

3. DFD is also known as bubble chart. A DFD may be used to represent a system at any

level of abstraction. DFD may be partitioned into levels that represent increasing

information flow and functional detail.

4.2.1 Proposed Architecture

Figure 4.3: Proposed Architecture

The architecture contains four modules. These are listed below

1. Pre-Processing Module

2. Calculating the number of clusters

3. Clustering techniques

4. Removing Outliers


4.2.1.1 Preprocessing Module

Before running clustering algorithms on text datasets, we performed some preprocessing

steps. In particular, stop words (prepositions, pronouns, articles, and irrelevant document

metadata) have been removed. Also, the Snowball stemming algorithm for Portuguese words

has been used. Then, we adopted a traditional statistical approach for text mining, in which

documents are represented in a vector space model. In this model, each document is represented

by a vector containing the frequencies of occurrences of words, which are defined as delimited

alphabetic strings, whose number of characters is between 4 and 25. We also used a

dimensionality reduction technique known as Term Variance (TV) that can increase both the

effectiveness and efficiency of clustering algorithms. TV selects a number of attributes (in our

case 100 words) that have the greatest variances over the documents. In order to compute

distances between documents, two measures have been used, namely: cosine-based distance and

Levenshtein-based distance. The latter has been used to calculate distances between file

(document) names only.
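
As a minimal sketch of this vector-space step (the class and method names here are illustrative, not the project's actual code), documents can be mapped to term-frequency vectors and the highest-variance terms retained as follows:

    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative sketch: term-frequency vectors plus Term Variance (TV) selection.
    public class VectorSpaceModel {

        // Count alphabetic tokens between 4 and 25 characters, as described above.
        static Map<String, Integer> termFrequencies(String document) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : document.toLowerCase().split("[^a-z]+")) {
                if (token.length() >= 4 && token.length() <= 25) {
                    tf.merge(token, 1, Integer::sum);
                }
            }
            return tf;
        }

        // Term Variance: keep the n terms whose frequencies vary most across documents.
        static List<String> topVarianceTerms(List<Map<String, Integer>> docs, int n) {
            Set<String> vocabulary = new HashSet<>();
            docs.forEach(d -> vocabulary.addAll(d.keySet()));
            Map<String, Double> variance = new HashMap<>();
            for (String term : vocabulary) {
                double mean = 0;
                for (Map<String, Integer> d : docs) mean += d.getOrDefault(term, 0);
                mean /= docs.size();
                double var = 0;
                for (Map<String, Integer> d : docs) {
                    double diff = d.getOrDefault(term, 0) - mean;
                    var += diff * diff;
                }
                variance.put(term, var / docs.size());
            }
            return vocabulary.stream()
                    .sorted((a, b) -> Double.compare(variance.get(b), variance.get(a)))
                    .limit(n)
                    .collect(Collectors.toList());
        }
    }

For the setting described above, topVarianceTerms(docs, 100) would select the 100 attributes mentioned.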

4.2.1.2 Calculating the Number of Clusters

In order to estimate the number of clusters, a widely used approach consists of getting a

set of data partitions with different numbers of clusters and then selecting that particular partition

that provides the best result according to a specific quality criterion (e.g., a relative validity

index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or,

alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different

numbers and initial positions of the cluster prototypes.
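
A sketch of this model-selection loop is shown below; the kmeans(...) and silhouette(...) helpers are assumed to exist elsewhere and are named only for illustration:

    // Try several values of k and several restarts, and keep the partition
    // that scores best under a relative validity index (here, the silhouette).
    static int[] bestPartition(double[][] data, int kMin, int kMax, int restarts) {
        int[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = kMin; k <= kMax; k++) {
            for (int r = 0; r < restarts; r++) {          // different initial prototypes
                int[] labels = kmeans(data, k);           // assumed partitional routine
                double score = silhouette(data, labels);  // assumed validity index
                if (score > bestScore) {
                    bestScore = score;
                    best = labels;
                }
            }
        }
        return best;
    }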

4.2.1.3 Clustering Techniques

The clustering algorithms adopted in our study, namely the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster ensemble based algorithm known as CSPA, are popular in the machine learning and data mining fields, and therefore they have been

used in our study. Nevertheless, some of our choices regarding their use deserve further

comments. For instance, K-medoids is similar to K-means. However, instead of computing

centroids, it uses medoids, which are the representative objects of the clusters. This property

makes it particularly interesting for applications in which (i) centroids cannot be computed; and


(ii) distances between pairs of objects are available, as for computing dissimilarities between

names of documents with the Levenshtein distance.
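
For reference, the Levenshtein distance between two file names can be computed with the standard dynamic-programming recurrence (a textbook sketch, not the project's exact code):

    // Classic dynamic-programming Levenshtein (edit) distance between two strings.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[a.length()][b.length()];
    }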

4.2.1.4 Removing Outliers

We assess a simple approach to remove outliers. This approach makes recursive use of the

silhouette. Fundamentally, if the best partition chosen by the silhouette has singletons (i.e.,

clusters formed by a single object only), these are removed. Then, the clustering process is

repeated over and over again—until a partition without singletons is found. At the end of the

process, all singletons are incorporated into the resulting data partition (for evaluation purposes)

as single clusters.
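
The loop described above can be sketched as follows; clusterAndPickBest(...) stands in for the silhouette-based selection of the previous subsection and is an assumed helper, not actual project code:

    // Recursively remove singleton clusters, then re-attach them at the end.
    static List<List<Integer>> removeOutliers(List<Integer> objects) {
        List<List<Integer>> singletons = new ArrayList<>();
        List<List<Integer>> partition = clusterAndPickBest(objects); // assumed helper
        while (true) {
            List<List<Integer>> found = new ArrayList<>();
            for (List<Integer> cluster : partition) {
                if (cluster.size() == 1) found.add(cluster);  // a singleton
            }
            if (found.isEmpty()) break;          // stop once no singleton remains
            singletons.addAll(found);
            for (List<Integer> s : found) objects.remove(s.get(0));
            partition = clusterAndPickBest(objects);  // recluster the remainder
        }
        partition.addAll(singletons);  // singletons return as single clusters
        return partition;
    }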

Input Design

 The input design is the link between the information system and the user. It comprises the

specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a usable form for processing. Data entry can be achieved by having the computer read data from a written or printed document or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling the errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:

• What data should be given as input?

• How the data should be arranged or coded?

• The dialog to guide the operating personnel in providing input.

• Methods for preparing input validations and steps to follow when errors occur.

Objectives:

1. Input Design is the process of converting a user-oriented description of the input into a

computer-based system. This design is important to avoid errors in the data input process and

show the correct direction to the management for getting correct information from the

computerized system.


2. It is achieved by creating user-friendly screens for data entry that handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all data manipulations can be performed. It also provides record-viewing facilities.

3. When the data is entered it is checked for validity. Data can be entered with the help of screens, and appropriate messages are provided as and when needed, so that the user is not left in confusion. Thus the objective of input design is to create an input layout that is easy to follow.

Output Design

A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need, as well as the hard-copy output. It is the most important and most direct source of information for the user. Efficient and intelligent output design improves the system's relationship with the user and supports decision-making.

1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy to use effectively. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.

2. Select methods for presenting information.

3. Create document, report, or other formats that contain information produced by the system.

The output form of an information system should accomplish one or more of the following

objectives.

• Convey information about past activities, current status, or projections of the future.

• Signal important events, opportunities, problems, or warnings.

• Trigger an action.


• Confirm an action.

4.3 Unified Modeling Language

UML stands for Unified Modeling Language. UML is a standardized general-purpose

modeling language in the field of object-oriented software engineering. The standard is

managed, and was created by, the Object Management Group.

The goal is for UML to become a common language for creating models of object

oriented computer software. In its current form UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems.

The UML represents a collection of best engineering practices that have proven

successful in the modeling of large and complex systems.

The UML is a very important part of developing object-oriented software and the software development process.

The UML uses mostly graphical notations to express the design of software projects.

4.3.1 Scenarios

A scenario is "a narrative description of what people do and experience as they try to make use of computer systems and applications". A scenario is a concrete, focused, informal description of a single feature of the system from the viewpoint of a single actor. Scenarios cannot replace use cases, as they focus on specific instances and concrete events. However, scenarios enhance requirements elicitation by providing a tool that is understandable to users and clients.

Scenario 1:

Use Case Name: User Selects a Text Document
Participating Actors: User
Flow of Events: 1) User has to browse a text file as the input data set. 2) User clicks on the Browse button to select the dataset.
Entry Condition: User has to browse for an input dataset.
Exit Condition: The selected dataset is saved onto the output panel; the user clicks on the EXIT button to close the application.

Table 4.1: Scenario 1 table

(Flow: user → BrowseTextFile → check on Browse button for selection → selected data sets are saved & press EXIT button)

Figure 4.4: User selects a text document

Scenario 2

Use Case Name: Preprocessing
Participating Actors: User
Flow of Events: 1) User has to browse a text file as the input data set. 2) User clicks on Stop Words to remove the unwanted words and phrases. 3) User then clicks on the Stemming button in order to remove the duplicates.
Entry Condition: User has to browse for an input dataset.
Exit Condition: Finally, the preprocessed data is saved onto the output panel; the user clicks on the EXIT button to close the application.

Table 4.2: Scenario 2 table

Figure 4.5: Preprocessing

Scenario 3

Use Case Name: Term Frequency Calculation
Participating Actors: User
Flow of Events: 1) After preprocessing the input text file, the user clicks on the calculation button for clusters. 2) The user computes the term frequency between all attributes and documents. 3) The user gets the frequency values for all the documents together with their attributes.
Entry Condition: User has to browse for an input dataset.
Exit Condition: Finally, the term frequency data is saved onto the output panel; the user clicks on the EXIT button to close the application.

Table 4.3: Scenario 3 table

(Flow: user → browse for input data set → preprocessing → click on calculation button for clusters → term frequency calculation → gets the frequency values → term frequency data is saved & click on EXIT button)

Figure 4.6: Term Frequency Calculation

Scenario 4

Use Case Name: Similarity Calculation
Participating Actors: User
Flow of Events: 1) After the term frequency calculation, the user clicks on the Next button. 2) The user clicks on the Similarity button to calculate the cosine similarity values. 3) The sum of all documents' similarity values gives the purity values.
Entry Condition: User has to browse for an input dataset.
Exit Condition: Finally, the similarity calculation between all documents is saved onto the output panel; the user clicks on the EXIT button to close the application.

Table 4.4: Scenario 4 table

(Flow: user → browse for input dataset → term frequency is calculated → click on Next button → click on Similarity button → cosine similarity values are calculated → purity values calculated → similarity calculation is saved & click on EXIT button)

Figure 4.7: Similarity Calculation

Scenario 5

Use Case Name: Cluster Formation and Query Result
Participating Actors: User
Flow of Events: 1) After processing the similarity values, the user clicks on the Next button. 2) The user processes the cluster values of the matched documents with their cluster ids. 3) The user gets the query result at the end, showing the values that were not matched in that list.
Entry Condition: User has to browse for an input dataset.
Exit Condition: Finally, the cluster values for all documents are saved onto the output panel; the user clicks on the EXIT button to close the application.

Table 4.5: Scenario 5 table

(Flow: user → browse for input dataset → preprocessing → similarity calculation → click on Next button → process the cluster values → get query results from unmatched values → cluster values are saved & click on EXIT button)

Figure 4.8: Cluster Formation and Query Results

4.3.2 Use Case Diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical

overview of the functionality provided by a system in terms of actors, their goals (represented as

use cases), and any dependencies between those use cases. The main purpose of a use case

diagram is to show what system functions are performed for which actor. Roles of the actors in

the system can be depicted.

System: A system/system boundary groups use cases together to accomplish a purpose. Each

use case diagram can only have one system.

Actor: An actor represents a coherent set of roles that users of the system play when interacting

with the use cases of the system. An actor participates in use cases to accomplish an overall

purpose. An actor can represent the role of a human, a device or any other systems.

Use case: A use case describes a sequence of actions that provide something of measurable value

to an actor and is drawn as a horizontal ellipse.

Use case relationships: Four relationships among use cases are often used in practice.

Include: In one form of interaction, a given use case may include another. "Include is a directed relationship between two use cases, implying that the behavior of the included use case is inserted into the behavior of the including use case." The first use case often depends on the outcome of the included use case. This is useful for extracting truly common behaviors from multiple use cases into a single description. The notation is a dashed arrow from the including to the included use case, with the label «include».

Extend: This relationship indicates that the behavior of the extension use case may be inserted in

the extended use case under some conditions. The notation is a dashed arrow from the extension to

the extended use case, with the label "«extend»". The notes or constraints may be associated with

this relationship to illustrate the conditions under which this behavior will be executed. Modelers

use the «extend» relationship to indicate use cases that are "optional" to the base use case.

Generalization: A given use case may have common behaviors, requirements, constraints, and

assumptions with a more general use case. In this case, describe them once, and deal with it in the


same way, describing any differences in the specialized cases. The notation is a solid line ending

in a hollow triangle drawn from the specialized to the more general use case.

Association: Associations between actors and use cases are indicated in use case diagrams by solid lines. An

association exists whenever an actor is involved with an interaction described by a use case.

Associations are modeled as lines connecting use cases and actors to one another, with an optional

arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the

initial invocation of the relationship or to indicate the primary actor within the use case.

(Use cases for the User/Computer Examiner actor: ChooseAnInputDataset, Preprocessing, TermFrequency, SimilarityCalculation, ClusterFormation, EvaluatingQueryResults)

Figure 4.9: Use Case Diagram

4.3.3 Class Diagram


A class diagram describes the static structure of the system. It is a graphic presentation of the

static view that shows a collection of declarative (static) model elements, such as classes, types,

and their contents and relationships. Classes are abstractions that specify the common structure

and behavior of a set of objects. Objects are the instances of the classes that are created, modified

and destroyed during the execution of the system. A Class diagram describes the system in terms

of objects, classes, attributes, operations and their associations.

Class: A rectangle is the icon that represents a class. It is divided into three areas: the uppermost contains the name, the middle area holds the attributes, and the bottom area holds the operations.

Package: A package is a mechanism for organizing elements into groups. It is used in the Use

Case, Class, and Component diagrams. Packages may be nested within other packages. A

package may contain both subordinate packages and ordinary model elements. The entire system

description can be thought of as a single high-level subsystem package with everything else in it.

Subsystem: A subsystem groups diagram elements together.

Generalization: Generalization is a relationship between a general element and a more specific

kind of that element. It means that the more specific element can be used whenever the general

element appears.

Usage: Usage is a dependency situation in which one element (the Client) requires the presence

of another element (the supplier) for its correct functioning or implementation.

Realization: Realization is the relationship between a specification and its implementation. It is an indication of the inheritance of behavior without the inheritance of structure. One classifier specifies a contract that another classifier guarantees to carry out. Realization is used in two places: between interfaces and the classes that realize them, and between use cases and the collaborations that realize them.


Association: Association is represented by drawing a line between classes and can be named to facilitate model understanding. If two classes are associated, you can navigate from an object of one class to an object of the other class.

Aggregation: Aggregation is a special kind of association in which one class represents the larger whole that consists of smaller parts. It has the meaning of a "has-a" relationship.

Composition: Composition is a strong form of aggregation association. It has strong ownership

and coincident lifetime of parts by the whole. A part may belong to only one composite. Parts

with non-fixed multiplicity may be created after the composite itself. But once

created, they live and die with it (that is, they share lifetimes). Such parts can also be explicitly

removed before the death of the composite.

N-ary Association: N-ary associations are associations that connect more than two classes.

Dependency: The dependency link is a semantic relationship between two elements. It indicates

that whenever a change occurs in one element, a change may be necessary in the other element.

(Classes in the diagram: Preprocess, with attributes docclust : JLabel, cal : JButton, folder : File, str : String, word : String and operations preprocess(), calcActionPerformed(), stemmingAction(); Process, with members JButton1, JButton2, JButton3 : JButton, JTable1 : JTable and operations process(), JButtonAction(), JButton(); Stemmer, with attributes step1, step2, step3, j, k : int and operations stemmer(), const(), cvc(); Graph, with attributes hsk, kmeans : double and operations draw(), main().)

Figure 4.10: Class Diagram


4.3.4 Sequence Diagram:

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that

shows how processes operate with one another and in what order. It is a construct of a Message

Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and

timing diagrams.

Object: Object can be viewed as an entity at a particular point in time with a specific value and

as a holder of identity that has different values over time. Associations among objects are not

shown. When you place an object tag in the design area, a lifeline is automatically drawn and

attached to that object tag.

Actor: An actor represents a coherent set of roles that users of a system play when interacting

with the use cases of the system. An actor participates in use cases to accomplish an overall

purpose. An actor can represent the role of a human, a device, or any other systems.

Message: A message is a sending of a signal from one sender object to other receiver object(s).

It can also be the call of an operation on receiver object by caller object. The arrow can be

labeled with the name of the message (operation or signal) and its argument values. A sequence

number that shows the sequence of the message in the overall interaction as well as a guard

condition can also be labeled at the arrow.

Lifetime: It is the duration that indicates the completion of an action or a message and it will

cause transition from one state to another state. The life time of an object is represented with a

dotted line.

Self Message: A message that indicates an action performed within the same object, remaining at a particular state.

Create Message: A message that indicates an action performed between two states.


Figure 4.11: Sequence Diagram


4.3.5 Collaboration diagram:

A communication diagram was called a collaboration diagram in UML 1. It is similar to a sequence diagram, but the focus is on the messages passed between objects; the same information can be represented using a sequence diagram and different objects.

Class roles:

Class roles describe how objects behave. Use the UML object symbol to illustrate class roles, but

don't list object attributes.

Association roles:

Association roles describe how an association will behave given a particular situation. You can

draw association roles using simple lines labeled with stereotypes.

Messages: Unlike sequence diagrams, collaboration diagrams do not have an explicit way to

denote time and instead number messages in order of execution. Sequence numbering can

become nested using the Dewey decimal system. The condition for a message is usually placed

in square brackets immediately following the sequence number. Use a * after the sequence

number to indicate a loop.


Figure 4.12: Collaboration Diagram


4.3.6 Activity Diagram:

Activity diagrams are graphical representations of workflows of stepwise activities and actions

with support for choice, iteration and concurrency. In the Unified Modeling Language, activity

diagrams can be used to describe the business and operational step-by-step workflows

of components. An Activity diagram consists of the following behavioral elements:

Action State: It describes the execution of an atomic action.

Sub-Activity: It is an activity that will perform within another activity.

Initial State: A pseudo state to establish the start of the event.

Final State: It signifies when a transition ends.

Horizontal Synchronization: A horizontal synchronization splits a single transition into parallel transitions or merges concurrent transitions to a single target.

Vertical Synchronization: A vertical synchronization splits a single transition into parallel

transitions or merges concurrent transitions to a single target.

Decision Point: A decision point is used to model the conditional flow of

control. It labels each output transition of a decision with a different guard condition

Swim Lane: A Swim lane is a partition on interaction diagram for organizing responsibilities for

activities. Each lane presents the responsibilities of a particular class. To use a Swim lane

activity diagrams are arranged into vertical zones.


(Activity flow: document → preprocessing → term frequency → valid? yes → similarity computation → cluster formation → query results; no → unconsidered)

Figure 4.13: Activity Diagram


5. IMPLEMENTATION

5.1 Preparing the Data Sets

The input to the document clustering algorithm can be any set of documents which have to be

divided into clusters based on their similarity. The individual terms from each of the documents

have to be extracted in order to identify similar items. The data set thus undergoes three pre-

processing steps:

Tokenization

Stop word Removal

Stemming

5.1.1 Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis.

5.1.2 Stop Word Removal

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). There is no single definitive list of stop words which all tools use, and such a filter is not always used. Some tools specifically avoid removing them to support phrase search. Some of the common stop words are: a, be, been, and, as, out, ever, own, he, she, shall, etc.

5.1.3 Stemming

Stemming is the process of reducing derived words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, a kind of query expansion called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
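
A compact sketch of these three steps in Java is given below; the stop-word list is a tiny illustrative subset, and the stem() rule is a crude stand-in for a real Porter/Snowball stemmer:

    import java.util.*;

    // Simplified preprocessing pipeline: tokenize, drop stop words, stem.
    public class Preprocessor {
        private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
                "a", "be", "been", "and", "as", "out", "ever", "own", "he", "she", "shall"));

        // Crude suffix stripping; a real system would use a Porter/Snowball stemmer.
        private static String stem(String word) {
            if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
            if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
            return word;
        }

        public static List<String> preprocess(String text) {
            List<String> tokens = new ArrayList<>();
            for (String token : text.toLowerCase().split("[^a-z]+")) {  // tokenization
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;  // stop-word removal
                tokens.add(stem(token));  // stemming
            }
            return tokens;
        }
    }

Under these simplified rules, for instance, "playing" and "plays" both reduce to the same stem "play".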


5.2 Cluster Analysis

Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. The computational task of classifying a data set into k clusters is often referred to as k-clustering. Clustering is also called data segmentation in some applications, because clustering partitions large datasets into groups according to their similarity. Clustering can also be used for outlier detection.

Cluster analysis aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in many fields, such as mathematics, computer science, statistics, biology, and economics. In different application domains, a variety of clustering techniques have been developed, depending on the methods used to represent the data, the measure of similarity between data objects, and the technique for grouping data objects into clusters. In data mining, hierarchical clustering is a method of cluster analysis which creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). In this project we focus on document clustering using hierarchical clustering.

Types of clustering

There are different clustering methodologies. Data clustering algorithms can be hierarchical: hierarchical algorithms find successive clusters using previously established clusters, and can be agglomerative ("bottom-up") or divisive ("top-down"). Partitioning algorithms, in contrast, typically determine all clusters at once. There are also other clustering methods, such as density-based, grid-based, model-based, and constraint-based clustering.

Figure: clustering


The clustering algorithms used are

K-means Algorithm

K-medoids Algorithm

Hierarchical Algorithm

These algorithms result in minimal latency delays for all the clients.

Partitioning Clustering:

Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n; that is, it classifies the data into k groups. Given k, the number of partitions to construct, the method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

K-means Algorithm: Demonstration of the standard algorithm

1) k initial “means” (in this case k=3) are randomly generated within the data domain.

2) k clusters are created by associating every observation with the nearest mean.

The partitions here represent the Voronoi diagram generated by the means.

3) The centroid of each of the k clusters becomes the new mean.

4) Steps 2 and 3 are repeated until convergence has been reached.

Figure 5.1: Clustering through K-means
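
A direct Java transcription of these four steps for points in a d-dimensional space is sketched below; it is illustrative, and the project's actual implementation may differ:

    import java.util.Random;

    // Standard K-means (Lloyd's algorithm), following steps 1-4 above.
    public class KMeans {
        public static int[] cluster(double[][] points, int k, int maxIter, Random rnd) {
            double[][] means = new double[k][];
            for (int i = 0; i < k; i++) {                  // step 1: random initial means
                means[i] = points[rnd.nextInt(points.length)].clone();
            }
            int[] labels = new int[points.length];
            for (int iter = 0; iter < maxIter; iter++) {
                boolean changed = false;
                for (int p = 0; p < points.length; p++) {  // step 2: assign to nearest mean
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dist = 0;
                        for (int d = 0; d < points[p].length; d++) {
                            double diff = points[p][d] - means[c][d];
                            dist += diff * diff;
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    if (labels[p] != best) { labels[p] = best; changed = true; }
                }
                if (!changed) break;                       // step 4: convergence reached
                for (int c = 0; c < k; c++) {              // step 3: recompute each mean
                    double[] sum = new double[points[0].length];
                    int count = 0;
                    for (int p = 0; p < points.length; p++) {
                        if (labels[p] != c) continue;
                        for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                        count++;
                    }
                    if (count > 0) {
                        for (int d = 0; d < sum.length; d++) means[c][d] = sum[d] / count;
                    }
                }
            }
            return labels;
        }
    }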


K-medoids Algorithm

1. Initialize: randomly select (without replacement) k of the n data points as the medoids

2. Associate each data point to the closest medoid. ("closest" here is defined using any valid

distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski

distance)

3. For each medoid m:
4. For each non-medoid data point o:
5. Swap m and o and compute the total cost of the configuration.
6. Select the configuration with the lowest cost.
7. Repeat steps 2 to 6 until there is no change in the medoids.

Figure: Clustering through K-medoids
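
A PAM-style sketch of these steps is given below; it works on a precomputed distance matrix so that any metric (Euclidean, Manhattan, or Levenshtein) can be plugged in, and it is illustrative rather than the project's exact code:

    import java.util.*;

    public class KMedoids {
        // Steps 1-7 above: greedily swap medoids until the total cost stops improving.
        public static int[] cluster(double[][] dist, int k, Random rnd) {
            int n = dist.length;
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < n; i++) indices.add(i);
            Collections.shuffle(indices, rnd);             // step 1: k random medoids
            int[] medoids = new int[k];
            for (int i = 0; i < k; i++) medoids[i] = indices.get(i);

            boolean changed = true;
            while (changed) {                              // step 7: until no change
                changed = false;
                double cost = totalCost(dist, medoids);    // step 2: current assignment cost
                for (int m = 0; m < k; m++) {              // steps 3-6: try every swap
                    for (int o = 0; o < n; o++) {
                        boolean isMedoid = false;
                        for (int mm : medoids) if (mm == o) { isMedoid = true; break; }
                        if (isMedoid) continue;
                        int old = medoids[m];
                        medoids[m] = o;                    // step 5: swap m and o
                        double newCost = totalCost(dist, medoids);
                        if (newCost < cost) { cost = newCost; changed = true; }  // step 6
                        else medoids[m] = old;             // revert a bad swap
                    }
                }
            }
            return medoids;
        }

        // Total cost: each point contributes its distance to the closest medoid.
        static double totalCost(double[][] dist, int[] medoids) {
            double cost = 0;
            for (int p = 0; p < dist.length; p++) {
                double best = Double.MAX_VALUE;
                for (int m : medoids) best = Math.min(best, dist[p][m]);
                cost += best;
            }
            return cost;
        }
    }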


Hierarchical clustering: Hierarchical clustering works by grouping data objects into a tree of clusters.

There are two types of hierarchical clustering:

Agglomerative hierarchical clustering: This is a bottom-up strategy, which starts by placing each object in its own cluster and then merges these into larger clusters until all the objects are in a single cluster.

Divisive hierarchical clustering: This is a top-down strategy, in which the clusters are subdivided into smaller pieces until each object forms a cluster on its own or until certain termination conditions are satisfied.

Agglomerative vs. divisive approach:

Agglomerative approach:
- We start out with all sample units in n clusters of size 1.
- Then, at each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
- The algorithm stops when all sample units are combined into a single cluster of size n.

Divisive approach:
- We start out with all sample units in a single cluster of size n.
- Then, at each step of the algorithm, clusters are partitioned into a pair of daughter clusters, selected to maximize the distance between the daughters.
- The algorithm stops when the sample units are partitioned into n clusters of size 1.

Table: Comparison of the agglomerative and divisive approaches


Hierarchical Algorithm:

The distance between two clusters A and B can be defined in several ways:

• The maximum distance between elements of each cluster (also called complete-linkage clustering): max{ d(x, y) : x ∈ A, y ∈ B }
• The minimum distance between elements of each cluster (also called single-linkage clustering): min{ d(x, y) : x ∈ A, y ∈ B }
• The mean distance between elements of each cluster (also called average-linkage clustering, used e.g. in UPGMA): (1 / (|A| |B|)) Σ over x ∈ A, y ∈ B of d(x, y)
• The sum of all intra-cluster variance.
• The increase in variance for the cluster being merged (Ward's method).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
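
The first three linkage criteria can be expressed uniformly over a pairwise distance matrix; a minimal sketch (assuming java.util.List and a precomputed dist matrix) is:

    // Distance between clusters a and b (lists of point indices) under one of
    // the linkage criteria listed above.
    static double linkage(double[][] dist, List<Integer> a, List<Integer> b, String mode) {
        double min = Double.MAX_VALUE, max = 0, sum = 0;
        for (int x : a) {
            for (int y : b) {
                double d = dist[x][y];
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
            }
        }
        switch (mode) {
            case "single":   return min;                          // single linkage
            case "complete": return max;                          // complete linkage
            default:         return sum / (a.size() * b.size());  // average linkage (UPGMA)
        }
    }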

Hierarchical vs. partitioning algorithms:

Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining (splitting) two clusters from the next lower (next higher) level. Partitional techniques create a one-level (un-nested) partitioning of the data points. If k is the desired number of clusters, then partitional approaches typically find all k clusters at once. Contrast this with traditional hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one.

Distance Measure

An important step in any clustering is to select a distance measure, which determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. The various distance measures are:


Euclidean Distance

This is probably the most commonly chosen type of distance. It simply gives the geometric distance in the multidimensional space. It is computed as:
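
For two n-dimensional points x and y, the standard definition is:

    d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )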

The Euclidean (and squared Euclidean) distances are usually computed on raw data and not from standardized data.

City Block Distance (Manhattan Distance):

This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. The city-block distance is computed as:
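
For two n-dimensional points x and y, the standard definition is:

    d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|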

Cosine Similarity:

Cosine similarity is one of the most popular similarity measures applied to text documents, used in various information retrieval applications and in clustering too. An important property of the cosine similarity is its independence of document length. For two documents A and B, the similarity between them can be calculated as:
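
Writing A and B as their term-frequency vectors, the standard definition is:

    cos(A, B) = (A · B) / (||A|| ||B||)
              = Σ (Ai × Bi) / ( sqrt(Σ Ai^2) × sqrt(Σ Bi^2) )

A direct Java rendering over two equal-length frequency arrays (an illustrative sketch, not the project's exact code):

    // Cosine similarity of two term-frequency vectors of equal length.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));  // in [0, 1] for non-negative vectors
    }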


5.3 Software Environment and Technologies

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language

The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

Simple

Architecture neutral

Object oriented

Portable

Distributed

High performance

Interpreted

Multithreaded

Robust

Dynamic

Secure

With most programming languages, you either compile or interpret a program so that you can

run it on your computer. The Java programming language is unusual in that a program is both

compiled and interpreted. With the compiler, first you translate a program into an intermediate

language called Java byte codes —the platform-independent codes interpreted by the interpreter

on the Java platform. The interpreter parses and runs each Java byte code instruction on the

computer. Compilation happens just once; interpretation occurs each time the program is

executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine

(Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can

run applets, is an implementation of the Java VM. Java byte codes help make “write once, run

anywhere” possible. You can compile your program into byte codes on any platform that has a

Java compiler. The byte codes can then be run on any implementation of the Java VM. That


means that as long as a computer has a Java VM, the same program written in the Java

programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
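For instance, the same trivial program can be compiled once and the resulting byte codes executed unchanged on each of these systems (a minimal sketch; the compile and run commands appear as comments):

// HelloPortable.java
// Compile once:  javac HelloPortable.java   (produces HelloPortable.class byte codes)
// Run anywhere:  java HelloPortable         (on any platform with a Java VM)
public class HelloPortable {
    public static void main(String[] args) {
        System.out.println("Same byte codes on Windows, Solaris, or Mac");
    }
}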

The Java Platform

A platform is the hardware or software environment in which a program runs. We’ve

already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and

Mac OS. Most platforms can be described as a combination of the operating system and

hardware. The Java platform differs from most other platforms in that it’s a software-only

platform that runs on top of other hardware-based platforms.

The Java platform has two components:

The Java Virtual Machine (JVM)

The Java Application Programming Interface (Java API)

The Java API is a large collection of ready-made software components that provide many

useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into

libraries of related classes and interfaces; these libraries are known as packages. Native code is
code that, once compiled, runs only on a specific hardware platform. As a

platform-independent environment, the Java platform can be a bit slower than native code.

However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can

bring performance close to that of native code without threatening portability.

What Can Java Technology Do?


The most common types of programs written in the Java programming language are applets and

applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet

is a program that adheres to certain conventions that allow it to run within a Java-enabled

browser. However, the Java programming language is not just for writing cute, entertaining

applets for the Web. The general-purpose, high-level Java programming language is also a

powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of

application known as a server serves and supports clients on a network. Examples of servers are

Web servers, proxy servers, mail servers, and print servers. Another specialized program is a

servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets

are a popular choice for building interactive web applications, replacing the use of CGI scripts.

Servlets are similar to applets in that they are runtime extensions of applications. Instead of

working in browsers, though, servlets run within Java Web servers, configuring or tailoring the

server.

Java makes our programs better and requires less effort than other languages. Java technology

will help you do the following:

Get started quickly:

Although the Java programming language is a powerful object-oriented

language, it’s easy to learn, especially for programmers already familiar with C or

C++.

Write less code:

Comparisons of program metrics (class counts, method counts, and so on) suggest

that a program written in the Java programming language can be four times smaller

than the same program in C++.

Write better code:

The Java programming language encourages good coding practices, and its

garbage collection helps you avoid memory leaks. Its object orientation, its

JavaBeans component architecture, and its wide-ranging, easily extendible API let

you reuse other people’s tested code and introduce fewer bugs.

Develop programs more quickly:


Our development time may be as much as twice as fast as writing the same program in C++, because we write fewer lines of code and Java is a simpler programming language than C++.

Avoid platform dependencies with 100% Pure Java:

We can keep our program portable by avoiding the use of libraries written in

other languages. The 100% Pure Java™ Product Certification Program has a

repository of historical process manuals, white papers, brochures, and similar

materials online.

Write once, run anywhere:

Because 100% Pure Java programs are compiled into machine-independent

byte codes, they run consistently on any Java platform.

Distribute software more easily:

We can upgrade applets easily from a central server. Applets take advantage

of the feature of allowing new classes to be loaded “on the fly,” without recompiling

the entire program.

5.2.1 ODBC

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for

application developers and database systems providers. Before ODBC became a de facto

standard for Windows programs to interface with database systems, programmers had to use

proprietary languages for each database they wanted to connect to. Now, ODBC has made the

choice of the database system almost irrelevant from a coding perspective, which is as it should

be. Application developers have much more important things to worry about than the syntax that

is needed to port their program from one database to another when business needs suddenly

change. Through the ODBC Administrator in Control Panel, we can specify the particular

database that is associated with a data source that an ODBC application program is written to

use. Think of an ODBC data source as a door with a name on it. Each door will lead us to a

particular database. For example, the data source named Sales Figures might be a SQL Server

database, whereas the Accounts Payable data source could refer to an Access database. The

physical database referred to by a data source can reside anywhere on the LAN. The ODBC

system files are not installed on your system by Windows 95. Rather, they are installed when you

setup a separate database application, such as SQL Server Client or Visual Basic 4.0.


The advantages of this scheme are so numerous that you are probably thinking there must

be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to

the native database interface. ODBC has had many detractors make the charge that it is too slow.

Microsoft has always claimed that the critical factor in performance is the quality of the driver

software that is used. And anyway, the criticism about performance is somewhat analogous to

those who said that compilers would never match the speed of pure assembly language. Maybe

not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which

means you finish sooner. Meanwhile, computers get faster every year.

5.2.2 JDBC

In an effort to set an independent database standard API for Java; Sun Microsystems

developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access

mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface

is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database

vendor wishes to have JDBC support, he or she must provide the driver for each platform that the

database and Java run on. To gain a wider acceptance of JDBC, Sun based JDBC’s framework

on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety

of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much

faster than developing a completely new connectivity solution. JDBC was announced in March

of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user

input, the final JDBC v1.0 specification was released soon after.
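A minimal sketch of this mechanism is shown below (illustrative only, not the project's source: the data source name cachedb and table cache_table are hypothetical, and the JDBC-ODBC bridge driver used here to reach an ODBC data source shipped with JDKs of that era but has since been removed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative JDBC usage: load a driver, connect to a data source,
// run a query, and iterate over the result set.
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver"); // JDBC-ODBC bridge driver
        Connection con = DriverManager.getConnection("jdbc:odbc:cachedb");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT word, freq FROM cache_table");
        while (rs.next()) {
            System.out.println(rs.getString("word") + " -> " + rs.getInt("freq"));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}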

5.2.2.1 JDBC Goals

Few software packages are designed without goals in mind. JDBC is one that, because of

its many goals, drove the development of the API. The goals that were set for JDBC are

important. They will give you some insight as to why certain classes and functionalities behave

the way they do.

The seven design goals for JDBC are as follows:

1. SQL Level API

The designers felt that their main goal was to define a SQL interface for Java. Although

not the lowest database interface level possible, it is at a low enough level for higher-level


tools and APIs to be created. Conversely, it is at a high enough level for application

programmers to use it confidently. Attaining this goal allows for future tool vendors to

“generate” JDBC code and to hide many of JDBC’s complexities from the end user.

2. SQL Conformance

SQL syntax varies as you move from database vendor to database vendor. In an effort to

support a wide variety of vendors, JDBC will allow any query statement to be passed through

it to the underlying database driver. This allows the connectivity module to handle non-

standard functionality in a manner that is suitable for its users.

3. JDBC must be implemental on top of common database interfaces

The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal

allows JDBC to use existing ODBC level drivers by the use of a software interface. This

interface would translate JDBC calls to ODBC and vice versa.

4. Provide a Java interface that is consistent with the rest of the Java system

Because of Java’s acceptance in the user community thus far, the designers feel that they

should not stray from the current design of the core Java system.

5. Keep it simple

This goal probably appears in all software design goal listings. JDBC is no exception.

Sun felt that the design of JDBC should be very simple, allowing for only one method of

completing a task per mechanism. Allowing duplicate functionality only serves to confuse

the users of the API.

6. Use strong, static typing wherever possible

Strong typing allows for more error checking to be done at compile time; also, fewer errors appear at runtime.

7. Keep the common cases simple

Because more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible. In this project, an MS Access database is used for dynamically updating the cache table.


Java is also unusual in that each Java program is both compiled and interpreted. With a compiler, you translate a Java program into an intermediate language called Java byte codes, the platform-independent codes that are parsed and run by the interpreter on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure below illustrates how this works.

[Figure: Java program → compiler → byte codes → interpreter → running program]

5.2.3 JFreeChart

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart's extensive feature set includes:

• a consistent and well-documented API, supporting a wide range of chart types;

• a flexible design that is easy to extend and that targets both server-side and client-side applications;

• support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).

JFreeChart is "open source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary applications.

1. Map Visualizations

Charts showing values that relate to geographical areas. Some examples include:

(a) population density in each state of the United States, (b) income per capita for each

country in Europe, (c) life expectancy in each country of the world. The tasks in this

project include: Sourcing freely redistributable vector outlines for the countries of the

world, states/provinces in particular countries (USA in particular, but also other areas);

Creating an appropriate dataset interface (plus default implementation), a renderer, and

integrating this with the existing XYPlot class in JFreeChart.

2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts --- to

display a separate control that shows a small version of ALL the time series data, with a

sliding "view" rectangle that allows you to select the subset of the time series data to

display in the main chart.

3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible

dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies,

thermometers, bars, and lines/time series) that can be delivered easily via both Java Web

Start and an applet.

4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the

properties that can be set for charts. Extend (or reimplement) this mechanism to provide

greater end-user control over the appearance of the charts.


6. TESTING

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test. Each test type addresses a specific testing requirement.

TYPES OF TESTS

Unit testing

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
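As an illustrative sketch of such a component-level test (assuming JUnit 4 on the classpath; the cosine helper under test is inlined so the example is self-contained, and is not the project's source):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Illustrative unit test: validates that a cosine-similarity computation
// produces the expected outputs for known inputs.
public class CosineSimilarityTest {

    // Hypothetical unit under test.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    @Test
    public void identicalVectorsHaveSimilarityOne() {
        double[] v = {1, 2, 3};
        assertEquals(1.0, cosine(v, v), 1e-9);
    }

    @Test
    public void orthogonalVectorsHaveSimilarityZero() {
        assertEquals(0.0, cosine(new double[]{1, 0}, new double[]{0, 1}), 1e-9);
    }
}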

Integration testing

Integration tests are designed to test integrated software components to determine if they

actually run as one program. Testing is event driven and is more concerned with the basic

outcome of screens or fields. Integration tests demonstrate that although the components were

individually satisfactory, as shown by successful unit testing, the combination of components is

correct and consistent. Integration testing is specifically aimed at exposing the problems that

arise from the combination of components.

Functional test

Functional tests provide systematic demonstrations that functions tested are available as

specified by the business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.


Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked. Organization and

preparation of functional tests is focused on requirements, key functions, or special test cases. In

addition, systematic coverage pertaining to identified business process flows, data fields,

predefined processes, and successive processes must be considered for testing. Before functional

testing is complete, additional tests are identified and the effective value of current tests is

determined.

System Test

System testing ensures that the entire integrated software system meets requirements. It tests a

configuration to ensure known and predictable results. An example of system testing is the

configuration oriented system integration test. System testing is based on process descriptions

and flows, emphasizing pre-driven process links and integration points.

White Box Testing

White Box Testing is a testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.

Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, as most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is a testing in which the software under test is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without considering how the software works.

6.1 Unit Testing:


Unit testing is usually conducted as part of a combined code and unit test phase of the

software lifecycle, although it is not uncommon for coding and unit testing to be conducted as

two distinct phases.

6.1.1 Test strategy and approach

Field testing will be performed manually and functional tests will be written in detail.

Test objectives

All field entries must work properly. Pages must be activated from the identified link. The entry screen, messages and responses must not be delayed.

Features to be tested

Verify that the entries are of the correct format

No duplicate entries should be allowed

All links should take the user to the correct page.

6.1.2 Test Cases

S No. of test case: 1

Name of test: User browses for a file (success)

Sample input: User selects a file to be clustered.

Expected output: Displays message "File updated successfully"

Actual output: Same as expected

Remarks: This component clearly tells that the file is uploaded successfully.

Table 6.1: Unit Test Case 1

S No. of test case: 2

Name of test: User browses for a file (failure)

Sample input: User selects a file to be clustered that is in a different file format.

Expected output: Displays message "File not updated"

Actual output: Same as expected

Remarks: This component clearly tells that the file is not updated.

Table 6.2: Unit Test Case 2

S No. of test case: 3

Name of test: User clicks on Remove button

Sample input: After uploading a file, the user clicks on the Remove button.

Expected output: Displays message "Stop words removed successfully"

Actual output: Same as expected

Remarks: This component tells that the stop words were removed successfully, so that the text can be forwarded for stemming.

Table 6.3: Unit Test Case 3


S No. of test case: 4

Name of test: User clicks on Stemming button without performing the remove action

Sample input: User forgets to click on Remove and goes directly to the Stemming button.

Expected output: Displays message "Please enter the remove and then click on stemming"

Actual output: Same as expected

Remarks: This component clearly shows that an error message appears if the Remove button is not used before the Stemming button.

Table 6.4: Unit Test Case 4

S No. of test case: 5

Name of test: User clicks on Stemming button

Sample input: After clicking on the Remove button, the user performs the stemming action.

Expected output: Displays message "Stemming is successful" and displays the distinct words

Actual output: Same as expected

Remarks: This component tells that stemming was performed successfully on the text already filtered of stop words.

Table 6.5: Unit Test Case 5

S No. of test case: 6


Name of test: User clicks on Calculation button without performing the stemming action

Sample input: User forgets to click on Stemming and goes directly to the Calculation button.

Expected output: Displays message "Please enter the stemming and then click on calculation"

Actual output: Same as expected

Remarks: This component clearly shows that an error message appears if the Stemming button is not used before the Calculation button.

Table 6.6: Unit Test Case 6

S No. of test case: 7

Name of test: User clicks on Calculation button

Sample input: After clicking on the Stemming button, the user performs the calculation action in order to find the clusters.

Expected output: Displays message "Clustered the input data set successfully"

Actual output: Same as expected

Remarks: This component tells that the user performed stemming and the data is ready for the clustering algorithms.

Table 6.7: Unit Test Case 7


6.2 Integration Testing

Software integration testing is the incremental integration testing of two or more

integrated software components on a single platform to produce failures caused by interface

defects. The task of the integration test is to check that components or software applications, e.g.

components in a software system or – one step up – software applications at the company level –

interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

6.3 Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

7. CONCLUSION

We presented an approach that applies document clustering methods to the analysis of documents seized in computer inspections. We also reported and discussed several practical results that can be very useful for researchers and practitioners of computer data inspection. More specifically, in our

experiments the hierarchical algorithms known as Average Link and Complete Link presented

the best results. Despite their usually high computational costs, we have shown that they are

particularly suitable for the studied application domain because the dendrograms that they

provide offer summarized views of the documents being inspected, thus being helpful tools for

document examiners that analyze textual documents from seized computers. As already observed

in other application domains, dendrograms provide very informative descriptions and

visualization capabilities of data clustering structures.

The partitional K-means and K-medoids algorithms also achieved good results when

properly initialized. Considering the approaches for estimating the number of clusters, the

relative validity criterion known as the silhouette has shown to be more accurate than its simplified version. In addition, some

of our results suggest that using the file names along with the document content information may

be useful for cluster ensemble algorithms. Most importantly, we observed that clustering

algorithms indeed tend to induce clusters formed by either relevant or irrelevant documents, thus

contributing to enhance the expert examiner’s job. Furthermore, our evaluation of the proposed

approach in five real-world applications shows that it has the potential to speed up the computer

inspection process. Aimed at further leveraging the use of data clustering algorithms in similar

applications, a promising avenue for future work involves investigating automatic approaches for

cluster labeling. The assignment of labels to clusters may enable the expert examiner to identify

the semantic content of each cluster more quickly—eventually even before examining their

contents. Finally, the study of algorithms that induce overlapping partitions (e.g., Fuzzy C-

Means and Expectation-Maximization for Gaussian Mixture Models) is worthy of investigation.


REFERENCES

1. Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka, "Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection", IEEE Transactions on Information Forensics and Security, 2013.

2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.

3. Bernd Bruegge and Allen H. Dutoit, Object-Oriented Software Engineering.

4. Herbert Schildt, Java: The Complete Reference.


APPENDIX:

A-Input/Output Screens:

a. Selecting Dataset:

Figure 9.1: Selecting Dataset

Description:

The above page mainly indicates the dataset selection. In this process, the dataset is selected from files on the computer.


b. Removing Stop Words:

Figure 9.2: Removing Stop Words

Description:

The above page mainly indicates the removal of stop words. In this process, unnecessary words are removed from the dataset.


c. Stemming:

Figure 9.3: Stemming Page

Description:

The above page mainly indicates the stemming step. In this process the selected words are reduced to their root forms.


d. Clustering Process:

Figure 9.4: Clustering Process

Description:

The above page mainly indicates the clustering process, in which the clusters are formed.


e. K-means:

f. Computing Term Frequency:

Figure 9.5: K-means Page

Description:

This page describes the repetition of words: it calculates how many times each word is repeated.


g. Cluster Preprocessing:

Figure 9.6: Cluster Preprocessing

Description:

The above page mainly indicates the cluster preprocessing, for which the Snowball stemming technique is used.

h. Distance Calculation:


Figure 9.7: Distance Calculation

Description:

The above page mainly indicates the distance calculation. The Euclidean distance method is used.

i. Incremental or Hierarchical Clustering:


Figure 9.8: Incremental or Hierarchical Clustering

Description:

The above page mainly indicates the incremental or hierarchical clustering. Here we initially find the similarity of the data points.

j. Similarity Measurement


Figure 9.9: Similarity Calculation

Description:

The above page mainly indicates the similarity calculation. At this level the similarity is calculated using the cosine similarity, and the maximum dissimilarity value is also provided.

k. Purity Checking


Figure 9.10: Purity Checking

Description:

The above page mainly indicates the purity checking. At this level the purity levels of K-means and incremental clustering are obtained.

l. Clustering Accuracy


Figure 9.11: Clustering Accuracy

Description:

The above page mainly indicates the clustering accuracy: the accuracy of the clustering techniques is represented in the form of a graph.

B-Source Code


Preprocess.java:

package ncluster;

import com.mysql.jdbc.Connection;

import java.io.*;

import java.sql.*;

import java.util.*;

import javax.swing.JFileChooser;

import ptstemmer.implementations.PorterStemmer;

public class preprocess extends javax.swing.JFrame {

String cont="", line="", path="", filename="", word="", str="", count="", nooffile="";

public static int numofdoc,count1,coun,i, noofterm;

File folder, files[];

PorterStemmer stemmer = new PorterStemmer();

float[] tf=new float[1500];

double[] idf=new double[1500];

double[] result=new double[1500];

int i1=0,j1=0,k1=0;

public preprocess() {

initComponents();

}

@SuppressWarnings("unchecked")

// <editor-fold defaultstate="collapsed" desc="Generated Code">

private void initComponents() {

selfiles = new javax.swing.JLabel();


select = new javax.swing.JButton();

jScrollPane1 = new javax.swing.JScrollPane();

text = new javax.swing.JTextArea();

textbox1 = new javax.swing.JTextField();

removestopword = new javax.swing.JButton();

stemming = new javax.swing.JButton();

title = new javax.swing.JLabel();

pathoffile = new javax.swing.JLabel();

calc = new javax.swing.JButton();

jPanel1 = new javax.swing.JPanel();

DocClust = new javax.swing.JLabel();

jLabel1 = new javax.swing.JLabel();

jLabel2 = new javax.swing.JLabel();

setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);

setTitle("Selecting_Documents");

setMinimumSize(new java.awt.Dimension(599, 601));

getContentPane().setLayout(null);

selfiles.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N

selfiles.setForeground(new java.awt.Color(51, 51, 51));

selfiles.setText("Select Files ");

getContentPane().add(selfiles);

selfiles.setBounds(10, 110, 100, 30);


select.setBackground(java.awt.SystemColor.inactiveCaption);

select.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N

select.setForeground(new java.awt.Color(0, 0, 102));

select.setText("SELECT");

select.addActionListener(new java.awt.event.ActionListener() {

public void actionPerformed(java.awt.event.ActionEvent evt) {

selectActionPerformed(evt);

}

});

getContentPane().add(select);

select.setBounds(120, 110, 100, 30);

text.setBackground(java.awt.SystemColor.inactiveCaption);

text.setColumns(20);

text.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N

text.setForeground(new java.awt.Color(51, 51, 51));

text.setRows(5);

jScrollPane1.setViewportView(text);

getContentPane().add(jScrollPane1);

jScrollPane1.setBounds(70, 240, 440, 320);

textbox1.setFont(new java.awt.Font("Tahoma", 0, 12)); // NOI18N

textbox1.setForeground(new java.awt.Color(0, 0, 102));

getContentPane().add(textbox1);


textbox1.setBounds(170, 170, 360, 30);

removestopword.setBackground(java.awt.SystemColor.inactiveCaption);

removestopword.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N

removestopword.setForeground(new java.awt.Color(0, 0, 102));

removestopword.setText("REMOVE");

removestopword.addActionListener(new java.awt.event.ActionListener() {

public void actionPerformed(java.awt.event.ActionEvent evt) {

removestopwordActionPerformed(evt);

}

});

getContentPane().add(removestopword);

removestopword.setBounds(240, 110, 100, 30);

stemming.setBackground(java.awt.SystemColor.inactiveCaption);

stemming.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N

stemming.setForeground(new java.awt.Color(0, 0, 102));

stemming.setText("STEMMING");

stemming.addActionListener(new java.awt.event.ActionListener() {

public void actionPerformed(java.awt.event.ActionEvent evt) {

stemmingActionPerformed(evt);

}

});

getContentPane().add(stemming);

stemming.setBounds(350, 110, 100, 30);


title.setBackground(new java.awt.Color(255, 0, 0));

title.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N

getContentPane().add(title);

title.setBounds(80, 210, 368, 21);

pathoffile.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N

pathoffile.setText("Path of the File");

getContentPane().add(pathoffile);

pathoffile.setBounds(40, 170, 110, 18);

calc.setBackground(java.awt.SystemColor.inactiveCaption);

calc.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N

calc.setForeground(new java.awt.Color(0, 0, 102));

calc.setText("CALCULATION");

calc.addActionListener(new java.awt.event.ActionListener() {

public void actionPerformed(java.awt.event.ActionEvent evt) {

calcActionPerformed(evt);

}

});

getContentPane().add(calc);

calc.setBounds(460, 110, 130, 30);

jPanel1.setBackground(new java.awt.Color(204, 204, 204));

jPanel1.setLayout(null);


DocClust.setFont(new java.awt.Font("Times New Roman", 1, 20)); // NOI18N

DocClust.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297916834.png"))); // NOI18N

jPanel1.add(DocClust);

DocClust.setBounds(30, 60, 563, 40);

jLabel1.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297931724.png"))); // NOI18N

jPanel1.add(jLabel1);

jLabel1.setBounds(30, 10, 570, 40);

jLabel2.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/deep-blue-sky-background.jpg"))); // NOI18N

jPanel1.add(jLabel2);

jLabel2.setBounds(-50, -20, 680, 660);

getContentPane().add(jPanel1);

jPanel1.setBounds(-10, 0, 610, 630);

java.awt.Dimension screenSize = java.awt.Toolkit.getDefaultToolkit().getScreenSize();

setBounds((screenSize.width-607)/2, (screenSize.height-659)/2, 607, 659);

}// </editor-fold>

private void selectActionPerformed(java.awt.event.ActionEvent evt) {

try{

JFileChooser chooser=new JFileChooser();


int returnVal = chooser.showOpenDialog(this);

if(returnVal == JFileChooser.APPROVE_OPTION) {

folder = chooser.getCurrentDirectory();

path = folder.getPath();

textbox1.setText(path);

files = folder.listFiles();

}

title.setText("Content of the File");

if(files.length>1){

for(i = 0;i<files.length; i++){

if (files[i].isFile())

{

int index = files[i].getName().lastIndexOf('.');

if (index>0&& index <= files[i].getName().length() - 2 ) {

filename = files[i].getName().substring(0, index);

String fname = filename.toUpperCase();

text.append("\n"+fname+"\n\n");

}

}

FileReader fr = new FileReader(files[i]);

BufferedReader br = new BufferedReader(fr);

while((line = br.readLine())!=null){

text.append(line+" ");

}

text.append("\n");


}

}

}

catch (Exception ex) {

System.out.println(ex.getMessage());

}

}

private void removestopwordActionPerformed(java.awt.event.ActionEvent evt) {

try {

// For every selected file: read characters, collect each run of letters

// as a word, lowercase it, stem it, then append the stemmed word to a

// new file and to the text area.

for (File file : files) {

String newfname = file.getPath() + ".stem";

File f = new File(newfname);

f.createNewFile();

FileReader in = new FileReader(file);

char[] w = new char[501];

int ch = in.read();

while (ch >= 0) {

if (Character.isLetter((char) ch)) {

int j = 0;

while (ch >= 0 && Character.isLetter((char) ch)) {

ch = Character.toLowerCase((char) ch);

w[j] = (char) ch;

if (j < 500) j++;

ch = in.read();

}

// Stemmer: Porter stemming class with a char-based

// add/stem/toString API, bundled with the project.

Stemmer s = new Stemmer();

for (int c = 0; c < j; c++) s.add(w[c]);

s.stem();

String u = s.toString();

FileWriter writer = new FileWriter(newfname, true);

writer.write(u + " ");

writer.close();

text.append(u + "\n");

} else {

ch = in.read();

}

}

in.close();

text.append("\n");

}

} catch (Exception ex) {

System.out.println(ex.getMessage());

}

}

private void calcActionPerformed(java.awt.event.ActionEvent evt) {

frame1 form =new frame1();


form.setVisible(true);

}

public static void main(String args[]) {

java.awt.EventQueue.invokeLater(new Runnable() {

public void run() {

new preprocess().setVisible(true);

}});}

// Variables declaration - do not modify

private javax.swing.JLabel DocClust;

private javax.swing.JButton calc;

private javax.swing.JLabel jLabel1;

private javax.swing.JLabel jLabel2;

private javax.swing.JPanel jPanel1;

private javax.swing.JScrollPane jScrollPane1;

private javax.swing.JLabel pathoffile;

private javax.swing.JButton removestopword;

private javax.swing.JButton select;

private javax.swing.JLabel selfiles;

private javax.swing.JButton stemming;

private javax.swing.JTextArea text;

private javax.swing.JTextField textbox1;

private javax.swing.JLabel title;

// End of variables declaration

}

Graph.java:

package ncluster;


import java.awt.*;

import org.jfree.chart.*;

import org.jfree.chart.axis.*;

import org.jfree.chart.plot.*;

import org.jfree.chart.renderer.category.BarRenderer;

import org.jfree.data.category.DefaultCategoryDataset;

public class Graph {

public static double kmeans = Purity.res1;

public static double hsk = Purity.res;

public static void main(String arg[]) {

DefaultCategoryDataset dataset = new DefaultCategoryDataset();

dataset.setValue(kmeans, "Accuracy", "K-MEANS");

dataset.setValue(hsk, "Accuracy", "Incremental Mining");

JFreeChart chart = ChartFactory.createBarChart("", "Text Mining", "Accuracy", dataset, PlotOrientation.VERTICAL, false, true, false);

chart.setBackgroundPaint(Color.white);

final CategoryPlot plot = chart.getCategoryPlot();

plot.setBackgroundPaint(Color.lightGray);

plot.setDomainGridlinePaint(Color.white);

plot.setRangeGridlinePaint(Color.white);

final NumberAxis rangeAxis = (NumberAxis) plot.getRangeAxis();

rangeAxis.setStandardTickUnits(NumberAxis.createIntegerTickUnits());

final BarRenderer renderer = (BarRenderer) plot.getRenderer();

renderer.setDrawBarOutline(false);

final GradientPaint gp0 = new GradientPaint(


0.0f, 0.0f, Color.blue,

0.0f, 0.0f, Color.lightGray);

final GradientPaint gp1 = new GradientPaint(

0.0f, 0.0f, Color.green,

0.0f, 0.0f, Color.lightGray);

final GradientPaint gp2 = new GradientPaint(

0.0f, 0.0f, Color.red,

0.0f, 0.0f, Color.lightGray);

renderer.setSeriesPaint(0, gp0);

renderer.setSeriesPaint(1, gp1);

renderer.setSeriesPaint(2, gp2);

final CategoryAxis domainAxis = plot.getDomainAxis();

domainAxis.setCategoryLabelPositions(

CategoryLabelPositions.createUpRotationLabelPositions(Math.PI / 6.0));

ChartFrame frame1 = new ChartFrame("Clustering Accuracy", chart);

frame1.setVisible(true);

frame1.setSize(500, 500);

}

}