
A Management and Visualisation Tool for

Text Mining Applications

Student Peishan Mao

MSc Computing Science Project Report

School of Computing Science and Information System

Birkbeck College, University of London 2005

Status Draft


1 TABLE OF CONTENTS

1 TABLE OF CONTENTS
2 ACKNOWLEDGEMENT
3 ABSTRACT
4 INTRODUCTION
5 BACKGROUND
  5.1 Written Text
  5.2 Natural Language Text Classification
    5.2.1 Text Classification
    5.2.2 The Classifier
  5.3 Text Classifier Experimentations
6 HIGH-LEVEL APPLICATION DESCRIPTION
  6.1 Description and Rationale
    6.1.1 Build a Classifier
    6.1.2 Evaluate and Refine the Classifier
  6.2 Development and Technologies
7 DESIGN
  7.1 Functional Requirements
  7.2 Non-Functional Requirements
    7.2.1 Usability
    7.2.2 Hardware and Software Constraint
    7.2.3 Documentation
  7.3 System Framework
  7.4 Components in Detail
    7.4.1 The Client - User Interface
    7.4.2 Display Manager
    7.4.3 The Classifier
    7.4.4 Data Manipulation and Cleansing
    7.4.5 Experimentation
    7.4.6 Results Manager
    7.4.7 Error Handling
  7.5 Class Diagram
8 DATABASE
  8.1 Entities
    8.1.1 Score Table
    8.1.2 Source Table
    8.1.3 Configuration Table
    8.1.4 Score Functions Table
    8.1.5 Match Normalisation Functions Table
    8.1.6 Tree Normalisation Functions Table
    8.1.7 Classification Condition Table
    8.1.8 Class Weights Table
    8.1.9 Temporary Max and Min Score Table
  8.2 Views
    8.2.1 Weighted Scores
    8.2.2 Maximum and Minimum Scores
    8.2.3 Misclassified Documents
  8.3 Relation Design for the Main Tables
9 IMPLEMENTATION
  9.1 Main User Interface
  9.2 Display Manager
  9.3 Classifier Classes
  9.4 Results Output Classes
  9.5 Other Controller Classes
  9.6 TreeView Controller Class
  9.7 Error Interface
10 IMPLEMENTATION SPECIFICS
  10.1 Generic Selection Form Class
  10.2 Visualisation of the Suffix Tree
  10.3 Dynamic Sub-String Matching
  10.4 User Interaction Warnings
11 USER GUIDE
  11.1 Getting Started
    11.1.1 Input Data
  11.2 Loading a Resource Corpus
  11.3 Selecting a Sampling Set
  11.4 Performing Pre-processing
  11.5 Running N-Fold Cross-Validation
    11.5.1 Set Up Cross-Validation Set
    11.5.2 Perform Experiments on the Data
      11.5.2.1 Create the Suffix Tree
      11.5.2.2 Display Suffix Tree
      11.5.2.3 Delete Suffix Tree
      11.5.2.4 N-Gram Matching
      11.5.2.5 Score Documents
      11.5.2.6 Classify Documents
      11.5.2.7 Add New Document to Classify
  11.6 Creating a Classifier
12 TESTING
13 CONCLUSION
  13.1 Evaluation
  13.2 Future Work
14 BIBLIOGRAPHY
15 APPENDIX A DATABASE
16 APPENDIX B CLASS DEFINITIONS
17 APPENDIX C SOURCE CODE


2 ACKNOWLEDGEMENT

I would like to thank the following people for their help over the course of this project:

Rajesh Pampapathi: for his spectrum of help on the project, ranging from his patient advice on the whole area of text classification and pointing me in the right direction for information on the topic, to being interviewed as a potential user of the proposed system as part of the requirements collection.

Timothy Yip: for laboriously proofreading the draft of the report despite not having much interest in information technology.


3 ABSTRACT

This report describes the design and implementation of a management and visualisation tool for text classification applications. The system is built as a wrapper for a machine learning classification tool and aims to provide a flexible framework that can accommodate future changes to the system. The system is implemented in C# .NET with a Windows Forms front end and an Access database as an example, but should be flexible enough to accommodate different underlying components.


4 INTRODUCTION

This report describes the project carried out to implement a management and visualisation tool for text classification. It covers background information about the project, the design, implementation and conclusion. The report is organised as follows:

Section 4 is this section; it describes the organisation of the report.

Section 5 takes a look at the background of the project. It covers natural language text classification and the suffix tree data structure used in Pampapathi et al's study.

Section 6 gives a high-level description and rationale of the system.

Section 7 describes the design of the system. It lays out the system requirements and the system framework, and describes the system components and classes.

Section 8 explains the database design and describes the database entities and table relations.

Section 9 discusses how the system was implemented and goes into class definitions.

Section 10 focuses on specific system implementations and looks at the implementation of the generic selection form class, visualisation of the suffix tree, dynamic sub-string matching on documents, and user warnings.

Section 11 is the user guide to the system.

Section 12 describes the testing of the system.

Section 13 concludes the project. This section discusses whether the system built has met the requirements laid out at the beginning of the project. It also looks at future work.

Appendix A Database

Appendix B Class Definitions

Appendix C Source Code


5 BACKGROUND

5.1 Written Text

Writing has long been an important means of exchanging information, ideas and concepts from one individual to another, or to a group. Indeed, it is even thought to be the single most advantageous evolutionary adaptation for species preservation [2]. The written text available contains a vast amount of information. The advent of the internet and on-line documents has contributed to the proliferation of digital textual data readily available for our perusal. Consequently, it is increasingly important to have a systematic method of organising this corpus of information.

Tools for textual data mining are proving to be increasingly important to our growing mass of text based data. The discipline of computing science has provided significant contributions to this area by means of automating the data mining process. To encode unstructured text data into a more structured form is not a straightforward task. Natural language is rich and ambiguous. Working with free text is one of the most challenging areas in computer science.

This project aims to investigate how computer science can help to evaluate some of the vast amounts of textual information available to us, and how to provide a convenient way to access this type of unstructured data. In particular, the focus will be on the data classification aspect of data mining. The next section will explore this topic in more depth.

5.2 Natural Language Text Classification

5.2.1 Text Classification

F Sebastiani [3] described automated text categorisation as

“The task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. The task, that falls at the crossroads of information retrieval, machine learning, and (statistical) natural language processing, has witnessed a booming interest in the last ten years from researchers and developers alike.”

Classification maps data into predefined groups or classes. Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industrial applications, and classifying financial trends. Until the late 1980s, knowledge engineering was the dominant paradigm in automated text categorisation. Knowledge engineering consists of the manual definition, by domain experts, of a set of rules which form part of a classifier. Although this approach has produced results with accuracies as high as 90% [3], it is labour intensive and domain specific. A new paradigm based on machine learning, which addresses many of the limitations of knowledge engineering, has since superseded it.

Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modelling, adaptive control theory, psychology, and artificial intelligence (AI) [11]. Data classification by machine learning is a two-phase process (Figure 1). The first phase involves a general inductive process that automatically builds a model, using a classification algorithm, describing a predetermined set of non-overlapping data classes. This step is referred to as supervised learning because the classes are determined before examining the data, and the set of data used is known as the training data set. Data in text classification comes in the form of files, and each file is often referred to as a document. Classification algorithms require that the classes are defined based purely on the content of the documents. They describe these classes by looking at the characteristics of the documents in the training set already known to belong to each class. The learned model constitutes the classifier and can be used to categorise future corpus samples. In the second phase, the classifier constructed in phase one is used for classification.

The machine learning approach to text classification is less labour intensive and is domain independent. Since the attribution of documents to categories is based purely on the content of the documents, effort is concentrated on constructing an automatic builder of classifiers (also known as the learner), and not the classifier itself [3]. The automatic builder is a tool that extracts the characteristics of the training set, which are represented by a classification model. This means that once a learner is built, new classifiers can be automatically constructed from sets of manually classified documents.

Figure 1. a) Step One in Text Classification; b) Step Two in Text Classification. [Diagram: a) Training Set → Classification Algorithm → Classification Model; b) New Documents / Test Set → Classification Model]

5.2.2 The Classifier

In general a text classifier comprises a number of basic components. As noted in the previous section, the text classifier begins with an inductive stage. A classifier requires some sort of internal text representation of documents. In order to build an internal model, the inductive step involves a set of examples used for training the classifier. This set of examples is known as the training set, and each document in the training set is assigned to a class from a predefined set C = {c1, c2, …, cn}. All the documents used in the training phase are transformed into internal representations.

Currently, a dominant learning method in text classification is based on the vector space model [5]. The Naïve Bayesian classifier is one example and is often used as a benchmark in text classification experiments. Bayesian classifiers are statistical classifiers: classification is based on the probability that a given document belongs to a particular class.


The approach is 'naïve' because it assumes that the contributions of all attributes to a given class are independent and that each contributes equally to the classification problem. By analysing the contribution of each 'independent' attribute, a conditional probability is determined. Attributes in this approach are the words that appear in the documents of the training set. Documents are represented by a vector with dimensions equal to the number of different words within the documents of the training set, and the value of each individual entry in the vector is set to the frequency of the corresponding word. According to this approach, training data are used to estimate the parameters of a probability distribution, and Bayes' theorem is used to estimate the probability of each class; a new document is assigned to the class that yields the highest probability. It is important to perform pre-processing to remove frequent words such as stop words before a training set is used in the inductive phase.

The Naïve Bayesian approach has several advantages. Firstly, it is easy to use; secondly, only one scan of the training data is required. It can also easily handle missing values by simply omitting the corresponding probability when calculating the likelihood of membership in each class. Although the Naïve Bayesian classifier is popular, documents are represented as a 'bag of words' in which the words in a document have no relationships with each other; however, words that appear in a document are usually not independent. Furthermore, the smallest unit of representation is a word.

Research is continuously investigating how the design of text classifiers can be further improved, and Pampapathi et al [1] at Birkbeck College, London recently proposed an innovative new approach to the internal modelling of text classifiers. They used a well-known data structure called a suffix tree [11], which allows the characteristics of documents to be indexed at a more granular level, with documents represented by their substrings. The suffix tree is a compact trie containing all the suffixes of the strings it represents. A trie is a tree structure where each node represents one character and the root represents the null string. Each path from the root represents a string, described by the characters labelling the nodes traversed. All strings sharing a common prefix branch off from a common node. When the strings are words over a to z, a node has at most 26 children, one for each letter (or 27 children, including a terminator). Suffix trees have traditionally been used for complex string matching problems on string sequences (data compression, DNA sequencing); Pampapathi et al's research is the first to apply suffix trees to natural language text classification.

Pampapathi et al's method of constructing the suffix tree varies slightly from the standard way. Firstly, the tree nodes are labelled instead of the edges, in order to associate the frequency directly with the characters and substrings. Secondly, a special terminal character is not used, as the focus is on the substrings and not the suffixes. Each suffix tree has a depth, described by the maximum number of levels in the tree, where a level is defined by the number of nodes away from the root node; for example, the suffix tree illustrated below has a depth of 4. Pampapathi et al set a limit on the tree depth, and each node of the suffix tree stores a frequency and a character. For example, to construct a suffix tree for the string S1 = "COOL", the suffix tree shown below is created. The substrings are COOL; OOL; OL; and L.


Suffix Tree for String 'COOL'

If a second string S2 = "FOOL" is inserted into the suffix tree, it will look like the diagram illustrated in Figure 2. The substrings for S2 are FOOL; OOL; OL; and L. Notice that the last three substrings in S2 are duplicates of substrings already seen in S1, and new nodes are not created for these repeated substrings.

Figure 2. Suffix Tree with String ‘FOOL’ Added
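As a minimal illustration of the construction just described, the C# sketch below builds a node-labelled, depth-limited tree in which every suffix of an inserted string (truncated to the depth limit) is added as a path from the root, and shared nodes are reused rather than recreated. The class and member names are illustrative only; they are not the interfaces of the Birkbeck STClassifier library.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of a node-labelled, depth-limited suffix tree: each node
// stores a character and a frequency, and every suffix of an inserted string
// (truncated to the depth limit) is added as a path from the root.
class SuffixTreeNode
{
    public char Character;
    public int Frequency;
    public Dictionary<char, SuffixTreeNode> Children = new Dictionary<char, SuffixTreeNode>();
}

class SuffixTree
{
    private readonly SuffixTreeNode root = new SuffixTreeNode();
    private readonly int maxDepth;

    public SuffixTree(int maxDepth) { this.maxDepth = maxDepth; }

    public void Insert(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            SuffixTreeNode current = root;
            int end = Math.Min(text.Length, start + maxDepth);   // depth limit
            for (int i = start; i < end; i++)
            {
                SuffixTreeNode child;
                if (!current.Children.TryGetValue(text[i], out child))
                {
                    child = new SuffixTreeNode { Character = text[i] };
                    current.Children.Add(text[i], child);        // new node only when needed
                }
                child.Frequency++;                               // shared nodes just gain frequency
                current = child;
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        var tree = new SuffixTree(4);
        tree.Insert("COOL");   // adds the paths COOL, OOL, OL and L
        tree.Insert("FOOL");   // adds FOOL; the existing OOL, OL and L paths are reused
    }
}
```

The exact frequency counts kept by Pampapathi et al's implementation may be defined differently; the point of the sketch is only the node-labelled, depth-limited structure and the reuse of shared paths.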

Similar to the Naïve Bayesian method, a classifier using the suffix tree for its internal model undergoes supervised learning from a training set which contains documents that have been pre-classified into classes. Unlike the Naïve Bayesian approach, the suffix tree, by capturing the characteristics of documents at the character level, does not require pre-processing of the training set. A suffix tree is built for each class, and a new document is classified by scoring it against each of the trees.


The class of the highest-scoring tree is assigned to the document. Pampapathi et al's study was based on email classification, and the results of the experiment showed that a classifier employing a suffix tree outperformed the Naïve Bayesian method.
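The classification rule itself (score the document against each class model and pick the highest) can be sketched independently of the tree implementation. In the simplified C# sketch below the per-class model is just a table of substring frequencies and the score is a plain frequency sum; the real system uses the suffix tree and a range of scoring functions described later in this report, so this only outlines the control flow, with illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch: one model per class, score a new document against every
// model, and assign the class with the highest score.
class SubstringClassifier
{
    private readonly Dictionary<string, Dictionary<string, int>> classModels =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly int maxLength;

    public SubstringClassifier(int maxLength) { this.maxLength = maxLength; }

    // Supervised learning step: record substring frequencies per class.
    public void Train(string className, IEnumerable<string> documents)
    {
        var model = new Dictionary<string, int>();
        foreach (string doc in documents)
            foreach (string s in Substrings(doc))
                model[s] = model.TryGetValue(s, out int n) ? n + 1 : 1;
        classModels[className] = model;
    }

    // Assign the document to the class whose model yields the highest score.
    public string Classify(string document)
    {
        return classModels.OrderByDescending(kv => Score(kv.Value, document)).First().Key;
    }

    private long Score(Dictionary<string, int> model, string document)
    {
        long score = 0;
        foreach (string s in Substrings(document))
            if (model.TryGetValue(s, out int n)) score += n;
        return score;
    }

    private IEnumerable<string> Substrings(string text)
    {
        for (int start = 0; start < text.Length; start++)
            for (int len = 1; len <= maxLength && start + len <= text.Length; len++)
                yield return text.Substring(start, len);
    }
}
```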

In order to solve a classification problem, not only is the classifier one of the central components, but as seen with the Naïve Bayesian method it is also important to perform pre-processing on data used for training. The next section looks at other processes involved in text classification other than the classifier component itself.

5.3 Text Classifier Experimentations

As described in previous sections, classification is a two-step process:

1. Create a specific model by evaluating the training data. This step has as input the training data (including the category/class labels) and as output a definition of the model developed. The model created, which is the classifier, should classify the training data as accurately as possible.

2. Apply the model developed by classifying new sets of documents.

In the research community, or for those interested in evaluating the performance of a classifier, the second step can be more involved. First, the predictive accuracy of the classifier is estimated. A simple yet popular technique is the holdout method, which uses a test set of class-labelled samples. These samples are usually randomly selected, and it is important that they are independent of the training samples; otherwise the estimate could be optimistic, since the learned model is based on that data and therefore tends to overfit it. The accuracy of a classifier on a given test set is the percentage of test set samples that are correctly classified by the classifier. For each test sample, the known class label is compared with the classifier's class prediction for that sample.

If the accuracy of the classifier model is considered as acceptable, the model can be used to classify new documents.

Figure 3. Estimating Classifier Accuracy with the Holdout Method

The estimate obtained using the holdout method is pessimistic since only a portion of the initial data is used to derive the classifier. Another technique, called N-fold cross-validation, is often used in research. Cross-validation is a statistical technique which can mitigate the bias caused by a particular partition into training and test sets. It is also useful when the amount of data is limited. The method can be used to evaluate and estimate the performance of a classifier, and the aim is to obtain as honest an estimate as possible of the classification accuracy of the system.


N-fold cross-validation involves partitioning the dataset (the initial corpus) randomly into N equally sized, non-overlapping blocks (folds). The training-testing process is then run N times, each time with a different test set. For example, when N = 3, we have the following training and test sets.

Run      Train blocks   Test block
Run 1    1, 2           3
Run 2    1, 3           2
Run 3    2, 3           1

Figure 4. 3-Fold Cross-Validation

For each cross-validation run the user will be able to use a training set to build the classifier.

Stratified N-fold cross-validation is a recommended method for estimating classifier accuracy due to its low bias and variance [13]. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that of the initial training set.
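A minimal sketch of such a stratified split is shown below, assuming the corpus is held as a mapping from class name to the documents in that class; documents of each class are shuffled and dealt round-robin into the N folds, so every fold roughly preserves the class distribution of the whole corpus. The names are illustrative and do not correspond to the project's CrossValidation class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stratified N-fold split: shuffle each class and deal its documents
// round-robin into the folds so the class distribution is preserved.
static class StratifiedCrossValidation
{
    public static List<List<string>> Split(Dictionary<string, List<string>> corpus, int n, Random rng)
    {
        var folds = Enumerable.Range(0, n).Select(_ => new List<string>()).ToList();
        foreach (var cls in corpus)
        {
            var docs = cls.Value.OrderBy(_ => rng.Next()).ToList();   // shuffle within the class
            for (int i = 0; i < docs.Count; i++)
                folds[i % n].Add(docs[i]);                            // deal round-robin into folds
        }
        return folds;
    }
}

// Usage: run i uses fold i as the test set and the remaining folds as training data.
//   var folds = StratifiedCrossValidation.Split(corpus, 3, new Random(1));
//   for (int i = 0; i < folds.Count; i++)
//   {
//       var test = folds[i];
//       var train = folds.Where((f, j) => j != i).SelectMany(f => f).ToList();
//   }
```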

Preparing the training set data for classification using pre-processing can help improve the accuracy, efficiency, and scalability of the evaluation of the classification. Methods include stop word removal, punctuation removal, and stemming.

The use of the above techniques to prepare the data and estimate classifier accuracy increases the overall computational time yet is useful for evaluating a classifier, and selecting among several classifiers.

The current project aims to build a system which is a wrapper for a text classifier, incorporating as an example the suffix tree used in the research by Pampapathi et al. The next section and beyond describe the project in detail.


6 HIGH-LEVEL APPLICATION DESCRIPTION

6.1 Description and Rationale

The aim of this project is to build a management and visualisation tool that will allow researchers to perform data manipulation in support of underlying text classification algorithms. The tool will provide a software infrastructure for a data mining system based on machine learning. The goal is to build a flexible framework that allows changes to the underlying components with relative ease. Functions may be added to the system in the future, and adding new functionality should have minimal effect on the current system.

The system will be built as a wrapper for the two-step process involved in classification. First, a component will be built that automatically constructs a classifier given some training data. Second, the system will provide capabilities to perform classification and to evaluate the performance of a classifier. Additionally, the tool will provide functionality to run data sampling and various pre-processing steps on the data.

It is incumbent on the researcher to clearly define the training set (referred to as the 'resource corpus' in this report) used for training the classifier. When the resource corpus is small, the user can choose to use the entire corpus in the study. If the resource corpus is large, the tool gives the option to select sampling sets to represent it. A number of sampling methodologies are implemented that allow the user to select a sample which reflects the characteristics of the resource corpus from which it is drawn.

Note that a resource corpus is grouped into classes, and this structure needs to be taken into consideration when developing the sampling mechanism. Three popular sampling methods will be developed, although other sampling methods can be added, such as convenience sampling, judgement sampling, quota sampling, and snowball sampling.

Note that the user can choose to evaluate the data used to construct the classifier before actually building the classifier. The tool is designed to be generic enough to analyse a corpus of any categorisation type, e.g. automated indexing of scientific articles, email routing, spam filtering, criminal profiling, and expertise profiling.

6.1.1 Build a Classifier

The tool allows the user to build a classifier. The current framework only implements the suffix tree-based classifier developed at Birkbeck College, but is flexible enough to incorporate other classification models in the future. The research on suffix trees applied to classification is new, and there is currently no such application. The learning process of the classifier follows the machine learning approach to automated text classification, whereby the system automatically builds a classifier for the categories of interest. From the graphical user interface (GUI), the user can select a corpus to use as training data. The application links to .dll files developed by Birkbeck College which allow the user to build a suffix tree from the selected corpus. The internal data representation is constructed by generalising from a training set of pre-classified documents. Once the classifier is built, the user can load new documents into the system to be classified.


6.1.2 Evaluate and Refine the Classifier

In research, once a classifier has been built, it is desirable to evaluate its effectiveness. Even before the construction of the classifier, the tool provides a platform for users to perform a number of experiments and refinements on the source (training) data. Hence, the second focus of the project is to provide a user-friendly front end and a base application for testing classification algorithms.

The user can load in a text-based corpus and perform standard pre-processing functions to remove noise and prepare the data for experimentation. There is also a choice of sampling methods to use in order to reduce the size of the initial corpus, making it more manageable.

Sebastiani [2] notes that any classifier is prone to classification error, whether the classifier is human or machine. This is due to a notion central to text classification: the membership of a document in a class, based on the characteristics of the document and of the class, is inherently subjective, since the characteristics of both the documents and the class cannot be formally specified. As a result, automatic text classifiers are evaluated using a set of pre-classified documents, comparing the classification decisions with the original categories the documents were assigned to. For experimentation and evaluation purposes, this set of pre-classified documents is split into two sets, a training set and a test set, not necessarily of equal sizes.

The tool implements an extra level of experimentation using n-fold cross-validation. When employing cross-validation in classification, it must be taken into account that the data is grouped by class; this project therefore implements stratified cross-validation.

Once a classifier has been constructed, it is possible to perform data classification experiments as well as other tasks such as single-document analysis. For example, with the suffix tree-based classifier the user will be able to view the structure of the suffix tree as well as the documents in the test sets, or load a new document and obtain a full matrix of output data about it. The output data is persisted in an information system which is subsequently used to perform analysis and visualisation tasks.

6.2 Development and Technologies

Development was done in C#, using the .NET framework. The architecture of the system was designed to be an extensible platform, enabling users and developers to leverage the existing framework for future system upgrades. The tool is built from several components and aims to be modular. There are a number of controller components providing the functionality of the tool. A set of libraries provides the functionality for the suffix tree; these libraries were supplied by Birkbeck College, and their interface was developed in close collaboration with the Birkbeck researchers.

The suffix tree data structure is built in memory and can become very large. One solution to better utilise resources is to have the data structure physically stored as one tree, although it is logically represented as individual trees for each class. Further discussion can be found in subsequent sections.


A Windows application was built as the client. This forms the interface that the user interacts with to gain access to the functionalities of the tool. The output data is cached in a database.

The main target users for the tool are researchers in natural language text classification, and other users who want to mine textual data.


7 DESIGN

7.1 Functional Requirements

Requirements for the application were collected from research on natural language text classification and from discussions with targeted users in the research community. Requirements are the capabilities and conditions to which the application must conform. The functional requirements of the system are captured using 'use cases'. Use cases are a useful tool for describing how a user interacts with a system: they are written stories, easy to understand, that describe the interaction between the system and the user. Requirements can often change over the course of development, and for this reason there was no attempt to define and freeze all requirements at the onset of the project. The following use cases were produced; note that some use cases were added throughout the development of the system.

Use Case Name: Load Directory as Source Corpus

Primary Actor: User

Pre-conditions: The application is running

Post-conditions: A source corpus is loaded into the application

Main Success Scenarios:

Actor Action (or Intention)

1. The user selects a valid directory and has at least read access to the directory, and loads it as a corpus into the system

System Responsibility

2. The system checks for directory path validity and access

3. Builds a tree structure of classes based on the sub-folders in the directory and displays the classes in the GUI

Use Case Name: View a Document in Corpus

Primary Actor: User

Pre-conditions: A corpus is successfully loaded

Post-conditions:

Main Success Scenarios:

Actor Action (or Intention)

1. Select the document to view

System Responsibility

2. Display content of document in the GUI

Use Case Name: Create Sampling Set


Primary Actor: User

Preconditions: A source corpus is successfully loaded

Postconditions: A sampling set based on the source corpus is created. New file directory created for the corpus.

Main Success Scenarios:

Actor Action (or Intention)

1. User selects how they want to select the sampling set

2. User specifies location to store the documents/files created for the sampling set

System Responsibility

3. Creates a sampling set based on parameters given by the user

4. Creates the directory structure and document/files in the location specified by the user

5. Displays new corpus created in the GUI

Use Case Name: Run Pre-Processing

Primary Actor: User

Pre-conditions: A training set exists in the system

Post-conditions: A new pre-processed sampling set created. New file directory created for the corpus.

Main Success Scenarios:

Actor Action (or Intention)

1. Select type of pre-processing to perform

2. User specifies location to store the documents/files created for the pre-processed set

3. Run pre-processing

System Responsibility

4. Performs pre-processing

5. Creates a new pre-processed set

6. Stores the directory structure and documents/files at the location specified by the user

7. Displays the corpus as a directory structure in the GUI

Use Case Name: Run N-Fold Cross-Validation

Primary Actor: User

Preconditions: A sampling set is successfully created

Postconditions: N-fold cross-validation set is created virtually

Main Success Scenarios:

Actor Action (or Intention)

1. User selects sampling set to process and the number of folds

System Responsibility

2. Builds an n-fold cross-validation set based on the parameters given by the user, which includes the n runs, each run containing a training set and a test set.

3. Displays new cross-validation set created in the GUI

Use Case Name: Create Classifier (Suffix Tree)

Primary Actor: User

Preconditions: A cross-validation set or classification set exists

Postconditions: Classifier created in memory

Main Success Scenarios:

Actor Action (or Intention)

1. User activates an event to build the classifier for a cross-validation set or classification set

2. User chooses any additional conditions to apply

System Responsibility

3. Builds classifier in memory, based on the corpus set selected

4. Indicates in the GUI that the classifier for the corpus has been created

Use Case Name: Score Documents

Primary Actor: User

Preconditions: An n-fold cross-validation set is created. Classifier for the corpus set is created

Postconditions: Documents in the cross-validation set are scored and the data is stored in the database

Main Success Scenarios:

Actor Action (or Intention)

1. User selects the cross-validation run to score

System Responsibility

2. Scores all documents under the selected corpus set

3. Inserts score data into database

Use Case Name: Classify Documents

Primary Actor: User

Preconditions: An n-fold cross-validation set is created. Classifier for the set is created and the documents have been scored

Postconditions: Misclassified documents in the cross-validation set are flagged

Main Success Scenarios:

Actor Action (or Intention) System Responsibility


1. User selects the cross-validation run to classify

2. Classifies all documents under the selected cross-validation set

3. Flags all misclassified documents in the GUI

Use Case Name: Create Classification Set

Primary Actor: User

Preconditions: A source corpus is successfully loaded

Postconditions: A classification set is created virtually

Main Success Scenarios:

Actor Action (or Intention)

1. User selects the corpus set they want to use to create a classifier

System Responsibility

2. Display new corpus created in the GUI as a classification corpus set

Use Case Name: Load New Document to Classify

Primary Actor: User

Preconditions: A cross-validation set or classification set exists

Postconditions: Substring matches and related output data are stored in the database

Main Success Scenarios:

Actor Action (or Intention)

1. User decides which suffix tree to use for classification and loads in a valid textual document as an item to be classified and analysed

System Responsibility

2. Document name and relevant information is displayed in the GUI ready to be analysed

3. Scores and classifies the document

4. Stores output data in the database

Use Case Name: View a Document

Primary Actor: User

Pre-conditions: Document loaded into the system

Post-conditions:

Main Success Scenarios:

Actor Action (or Intention)

1. Select the document to view

System Responsibility

2. Display content of document on GUI


Use Case Name: View n-Gram Matches in Document

Primary Actor: User

Preconditions: The document in question is successfully loaded and the suffix tree classifier has been created

Postconditions:

Main Success Scenarios:

Actor Action (or Intention)

1. User selects a string/substring in a document to match

System Responsibility

2. Queries the classifier to retrieve the n length substring matches

3. Displays to user the frequency for the string/substring selected

Use Case Name: View Statistics on Matches

Primary Actor: User

Preconditions: Document successfully loaded, scored and output exists in database

Postconditions: Displays information in GUI

Main Success Scenarios:

Actor Action (or Intention)

1. User selects to view output

System Responsibility

2. System queries and retrieves relevant data in the database

3. Displays the output in table form in the GUI

Use Case Name: Visualise Representation of Classifier (View Suffix Tree)

Primary Actor: User

Preconditions: Classifier was successfully built

Postconditions: Classifier visual representation displayed on GUI

Main Success Scenarios:

Actor Action (or Intention)

1. User selects option to display suffix tree

System Responsibility

2. Builds visual representation of the classifier and displays in GUI


Use Case Name: Delete Classifier

Primary Actor: User

Preconditions: Classifier was successfully built

Postconditions: Classifier is deleted

Main Success Scenarios:

Actor Action (or Intention)

1. User selects classifier to delete

System Responsibility

2. Removes the classifier and clears the displayed tree in the GUI

7.2 Non-Functional Requirements

The non-functional requirements for the use cases are as follows.

7.2.1 Usability

The user should have a single main user interface to interact with the system. The user interface should be user friendly, and the complexity of computation (e.g. building an n-fold cross-validation set, or scoring documents against a classification model) should be hidden from the user.

An experimental run of the suffix tree classifier could involve as many as 126 scoring configurations, all of which could together take considerable time to calculate. It therefore makes sense to keep a store of all calculated scores, rather than calculate them on the fly whenever they are requested. The results will be cached in a data store, which is implemented as a database in this project, thus optimising system responsiveness.

Some system requests can only be activated once a pre-condition has been satisfied e.g. the user can only score documents when the suffix tree has been created. The system should give informative warning messages if the user attempts to perform a task without pre-conditions being satisfied. Where appropriate, upon a task being performed, the system may automatically carry out pre-conditions before performing the requested task.

7.2.2 Hardware and Software Constraint

The application should be easily extensible and scalable. Developers should be able to add both extra functionality and expand the workload the application can handle with relative ease.

The design should consider future enhancement of the system and should be reasonably easy to maintain and upgrade. Code should also be well documented.

The system should use an RDBMS to manage its data layer, but be independent of the RDBMS it uses to manage its data.


7.2.3 Documentation

Help menus and tool tips will be available to help users interact with the system. The application will also come with a user manual, including screen shots. The application will be available along with written documentation for its installation and configuration.

7.3 System Framework

It was decided to build the system from a number of components, each with a specialised function in the system. Figure 5 illustrates the main components and the system boundary. The next section describes the functions of each component in more detail, and section 7.5 contains the class diagram. By isolating system responsibilities, the following main components were identified.

User interface

Display Manager

Classifier (Central Manager, STClassifier Manager, STClassifier)

Sampling Set Generator

Pre-processor

Cross-validation

Results Manager (Database Manager, OLEDB, Database)

Figure 6 shows how the system is divided into a client/server architecture. The advantage of this setup is its ease of maintenance, as the server implementation can be an abstraction to the client. All the functionality of the system is accessed through the graphical user interface (GUI). The implementation resides in the server, isolating users from system complexities that are not relevant to them.

One of the main aims of the design of the system was to create a flexible framework.

The green boxes in Figure 7 represent new or alternative components that can be added to the system in the future with relative ease.


Figure 5. System Components and Boundary

Figure 6. Client Server Division


Figure 7. Additional or Alternative Components

7.4 Components in Detail

7.4.1 The Client - User Interface

The user interacts with the system via a single graphical user interface which is also the client. In this project the client is implemented as a set of Windows forms and controls in .NET. There is one main form where users can access all the functionalities of the system. There are a number of other dialog boxes and forms to help with the navigation and interaction with the system. For example there is a Select Scoring Method form, used to request from the user the scoring methodology to use when scoring a new document. Other more generic forms such as the Select Dialog form are employed for a number of uses and do not display specific types of information (see section 10 Implementation Specifics for further discussion).

The client is simply an event handler for each of the GUI controls, which calls the Central Manager via the Display Manager for the actual data processing. The GUI contains no implementation, but delegates to the Display Manager, thus decoupling the interface from the implementation. There is two-way communication between the client and the Display Manager: a user invokes an event and the related messages are passed on. The Display Manager passes the messages to the Central Manager, which subsequently either delegates the task to other, more specialised controllers or resolves the request itself.

The design of the screens was done in consultation with potential users. The user should be able to perform all the tasks described by the use cases seen earlier in the Functional Requirements section (the functions will not be reiterated here).


For this project Windows Forms were chosen for the implementation because most users are familiar with the Windows Forms interface. This creates a familiar interface on initial interaction with the system and facilitates its use. In particular, the .NET framework provides a wealth of controls and functionality which help to build a user-friendly interface and hide the complexity of the underlying workings from the user. The different components are built as separate classes, so the user interface (the client) could be implemented using a different technology from Windows Forms, such as a command line, as illustrated.

Figure 8. Client interface and Its Collaborating Components

7.4.2 Display Manager

The Display Manager is a layer between the User Interface on one side and the Central Manager and the rest of the system on the other. It essentially passes messages between these two components. The Display Manager is responsible for the information displayed back to the user, and it also manages the input data.

7.4.3 The Classifier

It was mentioned in the previous section that the Central Manager is part of the classifier. Figure 9 illustrates the classifier, which is enclosed by the red box, and its connecting components. The classifier comprises the Central Manager, a controller that manages the underlying model of the classifier, and the underlying model itself.

The Central Manager is a controller that handles the communication between all the main components in the system that communicate with the classifier. The Central Manager should provide the following functionality:

Select Sampling Set for a corpus

Pre-process all documents in a corpus

Run cross-validation on a corpus

Create a classifier for a given corpus

Score all documents in a corpus

Classify all documents in a corpus

Obtain classification results for a corpus

There are further controller classes called by the Central Manager to provide more specialised functionalities, these are the Output Manager, Suffix Tree Manager, Sampling Set Generator, Pre-processor, and Cross-validation.

When a user loads a corpus into the system, it is managed by the Central Manager. If there is a request to create a sampling set, for example, the Central Manager knows where the corpus is located and delegates to the Sampling Set Generator the task of creating a sampling set based on parameters set by the user. Similarly, a request from the user to perform pre-processing on the corpus is delegated by the Central Manager to the Pre-processor.

The various components are designed to have specialised tasks; they do not need to know where the data is located, as this information is passed to them when the Central Manager invokes a request. The Sampling Set Generator does not need to know how the Pre-processor carries out its task, nor does it need to know about the Cross-validation component. The three components receive data and requests from the Central Manager, perform their tasks, and return any information back to the Central Manager.

The classifier has to be connected to an internal model. In this project the suffix tree data structure is employed to model the representation of document characteristics. As seen in Figure 9, the classifier can be implemented with different types of models, such as Naïve Bayesian or neural networks. There is two-way communication between the Central Manager and the STClassifier via the STClassifier Manager. The STClassifier is a DLL library built by Birkbeck researchers. It provides public interfaces to:

Build the representation of documents using the suffix tree data structure

Train the classifier

Score a document

Return classification results

The STClassifier Manager controls the flow of messages between the Central Manager and the STClassifier. Its responsibilities involve converting data into the format accepted by the STClassifier, and converting the output passed back from the STClassifier. It is essentially a wrapper class for the STClassifier.

The suffix tree is built using the contents of the documents in a training set. Once a suffix tree is built, it is cached in an ArrayList managed by the STClassifier Manager (an ArrayList is a C# collection class implemented in .NET). The suffix tree remains stored in memory until the user activates an event to delete it. As a result the system does not need to create a suffix tree for every subsequent action that references it; only methods in the STClassifier Manager are called, and it is not necessary to call methods in the STClassifier again.
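A minimal sketch of this caching idea is given below, assuming each built tree is stored together with the name of the corpus it was built from. The real STClassifierManager members are defined by the project source; the names and the use of object[] pairs here are purely illustrative.

```csharp
using System.Collections;

// Sketch: cache built suffix trees so they are not rebuilt on every request.
class SuffixTreeCache
{
    private readonly ArrayList trees = new ArrayList();   // (corpusName, tree) entries

    public void Add(string corpusName, object tree)
    {
        trees.Add(new object[] { corpusName, tree });
    }

    // Returns the cached tree, or null if it has not been built yet.
    public object Find(string corpusName)
    {
        foreach (object[] entry in trees)
            if ((string)entry[0] == corpusName) return entry[1];
        return null;
    }

    // Called when the user activates the delete event for a suffix tree.
    public void Remove(string corpusName)
    {
        for (int i = trees.Count - 1; i >= 0; i--)
            if ((string)((object[])trees[i])[0] == corpusName) trees.RemoveAt(i);
    }
}
```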

The classifier generates output data when a request is invoked to classify and score documents. These two actions can be time-consuming activities. The Central Manager decides what type of output data needs to be saved and passes the data from the classifier to the Results Manager to handle. Section 7.4.6 describes the design of the Results Manager.

Figure 9. The Classifier and Its Collaborating Components

7.4.4 Data Manipulation and Cleansing


When a corpus is loaded into the system as input data, the user can create sampling sets from the initial corpus and also prepare the data for experimentation by performing various types of pre-processing on it. The input data is given to the classifier, which sends it to the Sampling Set Generator to handle the generation of sampling sets. Various sampling methodologies can be plugged into the Sampling Set Generator; for this project the system implements random sampling and systematic sampling. The Pre-processor provides the functionality for pre-processing the data passed to it. Similarly, various methods of pre-processing can be plugged into the system with relative ease. Currently, the system provides stemming, stop word removal, and punctuation removal.
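For the sampling side of this component, a minimal sketch of systematic sampling over the documents of one class is shown below (census sampling simply keeps every document, and random sampling draws documents at random). To respect the class structure of the corpus noted earlier, the method would be applied to each class in turn. The names are illustrative and are not those of the SampleSetGenerator.

```csharp
using System.Collections.Generic;

// Systematic sampling: choose a sampling interval k from the desired sample
// size, then take every k-th document starting from an offset.
static class SystematicSampling
{
    public static List<string> Sample(IList<string> documents, int sampleSize, int startOffset)
    {
        var sample = new List<string>();
        if (sampleSize <= 0 || documents.Count == 0) return sample;
        int step = System.Math.Max(1, documents.Count / sampleSize);       // sampling interval k
        for (int i = startOffset % step; i < documents.Count && sample.Count < sampleSize; i += step)
            sample.Add(documents[i]);
        return sample;
    }
}
```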

In order for a method to plug into the system, a method class must implement an IMethod interface so that it guarantees the following:

A method class must have a name property to return the name of the method. This is necessary so that, if new methods are added to the system, each can be identified by its name.

A method class must have a Run method. This method is where all the work is done.
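A hedged sketch of such a plug-in contract and of one method class is shown below. The actual IMethod member signatures are defined in the project source (and shown in the class definition figures in the Implementation section); the Name property and a Run method that takes and returns one document's text are assumptions made purely for illustration.

```csharp
using System;
using System.Linq;

// Sketch of the plug-in contract: a name plus a Run method.
public interface IMethod
{
    string Name { get; }        // identifies the method when new ones are added
    string Run(string text);    // where all the work is done
}

// Example plug-in: stop word removal (word list shortened for illustration).
public class StopWordMethod : IMethod
{
    private static readonly string[] stopWords = { "the", "a", "an", "and", "of", "to", "in", "is" };

    public string Name { get { return "Stop Word Removal"; } }

    public string Run(string text)
    {
        var kept = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                       .Where(w => !stopWords.Contains(w.ToLowerInvariant()));
        return string.Join(" ", kept.ToArray());
    }
}
```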

A set of utility classes provides helper functionality such as random number generation, common divisor calculation, and file system access.

Figure 10. Data Manipulation and Cleansing Components and Its Collaborating Components

7.4.5 Experimentation

Setting up data for experimentation is the main responsibility of the Cross-validation class. The Central Manager passes a corpus to the Cross-validation component, which uses the data to build the N-fold cross-validation sets. It divides the given corpus into N blocks and builds a training set and a test set for each of the N runs. The data is stored as an array that is passed back to the Central Manager.


The methods the Cross-Validation class is expected to perform are:

Set the number of N-folds

Run N-fold cross-validation on the given source data

Return the cross-validation sets in an array data structure

Figure 11. Cross-validation and Its Collaborating Components

7.4.6 Results Manager

The Results Manager handles the output of the classifier and the repository for that output. The underlying RDBMS in this project is an Access database, which is used to cache the data generated by the classifier. The OLEDB component is responsible for direct communication with the database; this class needs to provide the basic database functionality, such as read/write/delete, in a generic fashion. It is through the Database Manager object that all communication with the OLEDB library occurs and through which the data flows within the Results Manager; the Database Manager manages the OLEDB component. The green boxes illustrate that the information system does not necessarily have to be an Access database: the system is designed to be able to store the data by different means with relative ease, e.g. XML files, SQL Server, etc.


Figure 12. Results Manager and Its Collaborating Components

7.4.7 Error Handling

Adequate error handling is essential for an end-user application. The display of warnings and errors should be handled at the higher level of the system, namely by the Display Manager, and then presented to the user in a reasonable fashion. Errors that occur in the other classes should be propagated to the Display Manager. All classes apart from the User Interface and the Display Manager are expected to implement an IErrorRecord interface. A class that implements this interface guarantees that it has a property called error which returns the error message.


7.5 Class Diagram

Figure 13 shows a class diagram of the main components of the system discussed above.

Figure 13. Class Diagram


8 DATABASE

8.1 Entities

All the data in the system is stored in an Access database. The following describes the organisation of the data that the system will store.

8.1.1 Score Table

When a user requests the scoring of a new document or a set of documents, each document is scored against the 126 configurations for each class. The data is cached in the Score table.

8.1.2 Source Table

The source table stores the location properties of documents. This includes the physical pathname of the document and where it is logically located in the display tree.

8.1.3 Configuration Table

The configuration table stores the 126 combinations of scoring methods used in Pampapathi et al's study. Each configuration consists of a scoring function, a match normalisation function, and a tree normalisation function.

8.1.4 Score Functions Table


This table contains the names and descriptions of the score functions.

8.1.5 Match Normalisation Functions Table

This table contains the names and descriptions of the match normalisation functions.

8.1.6 Tree Normalisation Functions Table

This table contains the names and descriptions of the tree normalisation functions.

8.1.7 Classification Condition Table

This table stores any classification conditions to be considered when classifying a document from a particular corpus.

8.1.8 Class Weights Table

This table stores the class weights when classifying documents.

8.1.9 Temporary Max and Min Score Table


This is a temporary table used to cache the maximum and minimum scores for a class, grouped by document and configuration.

8.2 Views

The following are some of the main views to assist in querying the main tables for data displayed in the user interface.

8.2.1 Weighted Scores

This view obtains the weighted scores by document and scoring configuration.

8.2.2 Maximum and Minimum Scores

This view obtains the maximum and minimum score by document and scoring configuration.

8.2.3 Misclassified Documents

This view obtains the misclassified documents and related data.

8.3 Relation Design for the Main Tables

The main table of the database is the Scores table. This table contains the scores for each document under the different configuration combinations (see the Implementation section for a description of the scoring configurations). Figure 14 shows the relationships between the main tables.


Figure 14. Table Relations

[Diagram summary. Main tables and columns: tScoreFunction (PK Index; Name); tMatchNormalisation (PK Index; Name); tTreeNormalisation (PK Index; Name); Config (PK ConfigId; FKs SF, MN, TN; SF Name, MN Name, TN Name); Source (PK SourceId; Node Parent Path, Node Path, File Path); Scores (PK ScoreId; FKs SourceId, ConfigId; Score, Class, True Class); tempMaxMinWScores (FKs SourceId, ConfigId; True Class, MaxOfWScore, MinOfWScore). Scores and tempMaxMinWScores reference Source and Config; Config references the three function tables.]

9 IMPLEMENTATION

Due to the large size of the program, this report will not cover all the implementation details; instead the discussion will focus on the main classes and highlight some specific implementations. See Appendix B Class Definitions.

9.1 Main User Interface

The main form of the user interface is divided into four resizable panes which each display a different type of information to the user (see Figure 15):

tvExplorer

rtxtView/sTreeView

lblSTreeDetail/listView

rtxtInfo

The tvExplorer is a Windows Forms TreeView control which displays the different corpuses available in the system. The information is presented as a hierarchy of nodes, like the way files and folders are displayed in the left pane of Windows Explorer.

The rtxtView is implemented as a Windows Forms RichTextBox control. When the user selects a child node in tvExplorer that represents a document, rtxtView displays the content of the document. The rtxtView also allows users to perform dynamic n-gram (sub-string) matching on a document (see section 10.3 Dynamic Sub-String Matching).

The sTreeView is implemented as a TreeView control. It shares the same pane as the rtxtView control and is only made visible on the main form (with the rtxtView becoming invisible) when the user requests the display of a suffix tree that has been created. At the same time the lblSTreeDetail control, implemented as a Windows Forms Label control, displays a description of the suffix tree currently shown in the sTreeView control. The listView is a Windows Forms ListView control which provides information related to the current content of the rtxtView control.

The rtxtInfo is a RichTextBox control and displays a classification summary for a document.


Figure 15. Main User Interface

The main form is implemented as a .NET class called MainForm. Figure 16 shows the class members and class interface.

Note that there are other Windows Forms control classes which were implemented to control the flow of user-system interaction. Section 10 Implementation Specifics describes one of them in detail; see Appendix B for all the user interface classes.


Figure 16. MainForm Class Definition

9.2 Display Manager

The DisplayManager contains methods which collaborate with the CentralManager class to obtain information from the classifier. It also contains methods to display the relevant information in the user interface (i.e. the MainForm class). Figure 17 shows the class definition.


Figure 17. DisplayManager Class Definition

9.3 Classifier Classes

The classifier components implemented are:

Central Manager class

IClassifierModel interface

STClassifierManager class

STClassifier class

At the lowest level of the classifier classes is the STClassifier class, which performs generic suffix tree operations such as creating a suffix tree, training a suffix tree, adding classes, and scoring against a class. The STClassifierManager is a controller, or can also be seen as a wrapper, for the STClassifier class; it contains methods to perform tasks that are more specific to the system. In order to plug a classifier model into the Central Manager, the model must implement the IClassifierModel interface. The figures below show the members and class interfaces for these classes.
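The actual IClassifierModel members are those shown in the figures below; as a rough indication of what such a plug-in contract could look like, a hedged sketch follows, with method names and signatures that are assumptions made for illustration only.

```csharp
// Hedged sketch of a classifier-model plug-in contract; every member name
// and signature here is an assumption, not the project's actual interface.
public interface IClassifierModel
{
    string Name { get; }                                   // e.g. "Suffix Tree", "Naive Bayes"
    void Train(string corpusPath);                         // build the internal model from a training set
    double Score(string documentPath, string className);   // score one document against one class
    string Classify(string documentPath);                  // return the highest-scoring class
    void Delete();                                         // discard the in-memory model
}
```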


Figure 18. CentralManager Class Definition

Figure 19. IClassifierModel Interface Definition

Figure 20. SuffixTreeManager Class Definition

Figure 21. EMSTreeClassifier Class Definition

9.4 Results Output Classes


The output components implemented are:

IOutput interface

DatabaseManager class

OLEDB class

This project employed an Access database for its data storage component. At the lowest level of the results output classes is the OLEDB class. This class has direct access to the database and has methods to perform generic database commands such as connecting to the database, closing the connection, and executing SQL insert, delete, update, and select commands. The DatabaseManager class is a controller, or can also be seen as a wrapper, for the OLEDB class, calling its methods to perform tasks more specific to the system. Notice that the IOutput interface has replaced the previously proposed Output Manager class: it was found that there was no real need for another class between the Database Manager and the rest of the system, but instead a contract was implemented to ensure that the Database Manager provides a minimum set of functionality, such as opening a data store, closing a data store, select, update, and delete. The figures below illustrate the definitions of the components.
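As an indication of the kind of generic access the OLEDB class provides, the sketch below uses the standard System.Data.OleDb API against an Access (.mdb) file. The member names and the exact split of responsibilities between the DatabaseManager and OLEDB classes in the project may differ, so this is illustrative only.

```csharp
using System.Data;
using System.Data.OleDb;

// Generic database access against an Access (.mdb) file via OLE DB.
class OleDbStore
{
    private readonly OleDbConnection connection;

    public OleDbStore(string mdbPath)
    {
        connection = new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + mdbPath);
    }

    public void Open()  { connection.Open(); }
    public void Close() { connection.Close(); }

    // Execute an INSERT, UPDATE or DELETE statement; returns the affected row count.
    public int Execute(string sql)
    {
        using (var command = new OleDbCommand(sql, connection))
            return command.ExecuteNonQuery();
    }

    // Execute a SELECT statement and return the results as a DataTable.
    public DataTable Select(string sql)
    {
        var table = new DataTable();
        using (var adapter = new OleDbDataAdapter(sql, connection))
            adapter.Fill(table);
        return table;
    }
}
```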

Figure 22. IOutput Interface Definition

Figure 23. DatabaseManager Class Definition


Figure 24. OLEDB Class Definition

9.5 Other Controller Classes

The SampleSetGenerator, Preprocessor, and CrossValidation classes have fairly simple class interfaces.

The most important method of each class executes the main task it is responsible for: creating a sampling set/corpus for the SampleSetGenerator class, performing pre-processing for the Preprocessor class, and running cross-validation for the CrossValidation class.

Figure 25. SampleSetGenerator

Figure 26. Preprocessor Class Definition


Figure 27. CrossValidation Class Definition

The SampleSetGenerator class and the Preprocessor class have additional methodology classes plugged into them. As can be seen in each respective class, they have a class member called methodNames. This is an array that stores the name of each method implemented in the system.

The Preprocessor class implements three pre-processing methodologies: punctuation removal, stop word removal, and stemming. The SampleSetGenerator class similarly implements three sampling methodologies: census sampling, random sampling, and systematic sampling. Each methodology class has to implement the IMethod interface, and additional methodologies can be plugged into either class by building them as new classes that implement the same interface.

See Appendix B for all class definitions. Below are the IMethod interface definition and, as an example, the StopWord method class that is plugged into the Preprocessor class.

Figure 28. IMethod Interface Definition

Figure 29. StopWord Class Definition
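For illustration only, a C# sketch of what such a pluggable method might look like is given below; the member names are assumptions, and the actual definitions are those shown in Figures 28 and 29.

// Hypothetical sketch of a pluggable methodology; not the actual code.
public interface IMethod
{
    string Name { get; }        // name stored in the methodNames array
    string Apply(string text);  // transform the document text
}

public class StopWord : IMethod
{
    private static readonly string[] stopWords = { "the", "a", "an", "and", "of", "to" };

    public string Name { get { return "Stop Word Removal"; } }

    public string Apply(string text)
    {
        System.Text.StringBuilder result = new System.Text.StringBuilder();
        foreach (string token in text.Split(' '))
        {
            // Keep only tokens that are not in the stop word list.
            if (System.Array.IndexOf(stopWords, token.ToLower()) < 0)
                result.Append(token).Append(' ');
        }
        return result.ToString().TrimEnd();
    }
}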

9.6 TreeView Controller Class

During development it was discovered that it made sense to implement a separate controller class to manage the nodes displayed in the tvExplorer and sTreeView controls. The TreeViewNodeManager class was implemented to handle TreeView node operations. The class includes methods to perform the following tasks:

Create a new TreeNode in a TreeView control

Add TreeNode to a TreeView control

Search for a TreeNode in a TreeView control

Get a child TreeNode

Figure 30. TreeViewNodeManager Class Definition

9.7 Error Interface

The IErrorRecord interface simply returns an error message. It is implemented by all the classes in the system apart from the MainForm class and the DisplayManager class.

Figure 31. IErrorRecord Interface Definition
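A contract of this kind can be as small as a single read-only property; the sketch below is illustrative only (the member name is an assumption, the actual definition is in Figure 31).

// Hypothetical sketch of the error-reporting contract.
public interface IErrorRecord
{
    string ErrorMessage { get; }  // last error recorded by the implementing class
}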


10 IMPLEMENTATION SPECIFICS

The system is very much an end-user application, and this section discusses a number of specific user interface implementations developed to satisfy some of the requirements.

10.1 Generic Selection Form Class

When a user invokes the application to select a sampling set, the application (or more specifically, the SampleSetGenerator class) needs to know the following parameter settings in order to perform the task:

Source corpus to select sample from

Type of sampling methodology to use

Destination of new sampling corpus created

It was decided to use a pop-up Windows Form, also called a dialog box, to collect the information from the user. Originally, a prototype for this dialog box was built that looked like the form shown in Figure 32. As illustrated, the top combo box lets the user select the corpus to use, and the destination to save the sampling sets is specified in the destination text box. The available sampling methods are each represented as a separate check box. This implementation made the form static: it could only be used for selecting sampling methodologies, and if a new sampling method was added to or removed from the system it would have been necessary to change the interface as well.

Figure 32. Pre-Processing Methods Dialog Box

A rethink of how to make the form more flexible and accommodate future changes led to an alternative design, illustrated in Figure 33. The check boxes were replaced with two list boxes. The list box on the left contains all the available pre-processing methods the user can use, and the list box on the right contains the methods which the user has selected to run.

Figure 33. Generic Selection Form Class Used for Pre-processing

The form is implemented as a class called SelectDialog. The class was designed to be generic enough to be reusable for similar data requests, such as selecting a sampling method, or selecting class frequencies to display along with a suffix tree (Figure 34).

When the class is instantiated, the class constructor lets the developer customise properties such as the form name and label names, and populate the left-hand list box.

Figure 34. Other Examples of the Generic Selection Form Class
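As a purely illustrative sketch (the real constructor parameters and properties may differ), reusing the dialog for pre-processing might look like this in C#:

// Hypothetical usage of the generic selection dialog; names are assumptions.
string[] preProcessingMethods = { "Punctuation Removal", "Stop Word Removal", "Stemming" };

SelectDialog dialog = new SelectDialog(
    "Pre-Processing",        // form title
    "Available methods",     // label for the left-hand list box
    "Selected methods",      // label for the right-hand list box
    preProcessingMethods);   // items used to populate the left-hand list box

if (dialog.ShowDialog() == System.Windows.Forms.DialogResult.OK)
{
    // dialog.SelectedItems would hold the methods moved to the right-hand list box.
}

Populating the dialog through the constructor is what allows the same form to be reused for sampling methods and class frequency selection without any change to the form itself.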


10.2 Visualisation of the Suffix Tree

One of the requirements of the system was to be able to visualise the suffix tree. Initially, a prototype was built that experimented with creating a custom class library to draw the suffix tree as a tree-like structure, shown in Figure 35.

A red node represented an expanded node, and a blue node represented a non-expanded node. Each node label, apart from the root node's, displayed the character held in the suffix tree node and the class frequency information.

With this implementation, it was necessary to keep track of the layout of the nodes to make sure everything fitted on the page. It can be seen that if the frequencies of each node are included in the display, the visual representation becomes more convoluted. Suffix trees used to represent text documents are expected to be large, and it would prove a problem to visually represent a suffix tree for a whole training corpus with this technique. Various ways to improve this method of visual representation were considered, but in the end it was concluded that a different approach was needed.

Figure 35. Suffix Tree Visualisation

The final implementation choice was inspired by the Windows Explorer directory tree structure. The TreeView control of the .NET Windows Forms library was used. As seen in the example in Figure 36, the suffix tree visual representation is much clearer. Each node can accommodate the display of a number of class frequencies without hindering the display clarity. Additionally, this approach is consistent with the display structure used in the tvExplorer control.


Figure 36. Suffix Tree Visualisation Implementation

10.3 Dynamic Sub-String Matching

Another requirement was to be able to perform n-gram matching on documents: that is, to select a sub-string S1, verify whether S1 exists in the related suffix tree, and retrieve its frequency of occurrence in the tree. In the application built for this project, users are able to perform sub-string matching on the content of documents that belong to a corpus with an associated suffix tree, such as a corpus belonging to a cross-validation set or a corpus that is a classification set.

The chosen method to implement this functionality was aimed at maximising interactivity with the user. Once the related suffix tree has been created, the user is able to view the content of the document in the rtxtView control (a RichTextBox control of Windows Forms that forms one of the four panels on the main form). By selecting or highlighting a sub-string S1 in rtxtView, the system will automatically query the associated suffix tree and display on screen the frequencies of S1 found in each class (see Figure 37).

With this user interface design, sub-string matching becomes a dynamic interaction and improves the user experience.
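A minimal sketch of how such an interaction could be wired up is shown below. It assumes a SelectionChanged handler on the rtxtView control and a hypothetical QueryFrequencies helper on the classifier side; neither name is taken from the actual code.

// Illustrative sketch only; field and method names are assumptions.
private void rtxtView_SelectionChanged(object sender, System.EventArgs e)
{
    string s1 = rtxtView.SelectedText;
    if (s1.Length == 0 || !suffixTreeCreated)
        return;

    // Ask the suffix tree for the occurrences of s1 in each class and show
    // the result in the information panel.
    rtxtInfo.Clear();
    foreach (System.Collections.Generic.KeyValuePair<string, int> entry
             in classifierManager.QueryFrequencies(s1))
    {
        rtxtInfo.AppendText(entry.Key + ": " + entry.Value + "\n");
    }
}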


Figure 37. N-Gram Matching Example

10.4 User Interaction Warnings

Some system events can only be activated once a pre-condition has been satisfied. The system uses different methods to give informative warning messages if the user attempts to perform a task without pre-conditions being fulfilled.

For example, the user can only score documents when the associated suffix tree has been created. If the user attempts to score documents before the tree has been constructed, the system will show a message box to warn the user (see Figure 38).

Not all warnings are displayed as a message box. Message boxes require a response from the user before the next action can be performed: the user has to close the message box first. In some situations this user response is not necessary. One such situation is when the user wants to perform an n-gram match on a document, which can only be done when the associated suffix tree has been created. As seen in the previous section, the user can perform dynamic n-gram matching by selecting text displayed in the rtxtView control. If the associated suffix tree has not been created, a ToolTip control will notify the user that the suffix tree needs to be created first (see Figure 39).


Figure 38. Message box Warning Example

Figure 39. ToolTip Warning Example


Alternatively, the system could simply disable the menu control for an action that is not available at a given point in time, but then it would not be intuitive for the user to know what is required to activate the functionality. Using different ways of displaying informative warnings to the user facilitates continuous user-system interaction.

Effort has not only been made to develop warning messages for the user; other general informative messages are also shown depending on the user-system interaction. For instance, when it is possible to perform n-gram matching on a document that is currently selected and viewed in the rtxtView control, the system will notify the user of this functionality with a ToolTip control when the mouse cursor is moved over the rtxtView control (Figure 40).

Other more subtle indications of system state are also used. For example, a red coloured tree icon (Figure 39) is used to indicate that a suffix tree has not been created, and a green coloured tree icon is used to indicate that a suffix tree has been created (Figure 40).
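The two warning styles can be illustrated with a short sketch; the control and field names here (toolTip1, suffixTreeCreated) are assumptions rather than the actual identifiers.

// Illustrative sketch of a blocking and a non-blocking warning.
if (!suffixTreeCreated)
{
    // Blocking warning: the user must dismiss the box before continuing.
    System.Windows.Forms.MessageBox.Show(
        "The suffix tree must be created before documents can be scored.",
        "Suffix tree not created",
        System.Windows.Forms.MessageBoxButtons.OK,
        System.Windows.Forms.MessageBoxIcon.Warning);

    // Non-blocking alternative: a ToolTip shown over the document view.
    toolTip1.SetToolTip(rtxtView, "Create the suffix tree first to enable n-gram matching.");
}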

Figure 40. ToolTip Informative Use Example


11 USER GUIDE

11.1 Getting Started

At application start-up there are five base nodes displayed in the top left panel: Resource Sets, Sampling Sets, Pre-Processed Sets, Cross-Validation Sets, and Classification Sets. These five nodes represent the five types of corpuses that the system differentiates. As you interact with the system and perform various tasks, new nodes are added to these base nodes as child nodes.

Actions are requested using the main menu and tree node sensitive pop-up menus in the top left panel.

Figure 1. Main User Form at Application Start Up

11.1.1 Input Data

The data loaded into the system as a corpus must follow a standard structure. The documents have to be in text format, represented as text files. The documents have to be stored in a location accessible by the system, in one main folder directory which contains subfolders representing the classes. Each class folder should contain the documents which have been pre-labelled as belonging to the class the folder represents. See Figure 41 and Figure 42 for an example of ham and spam email corpus data.


Figure 41. Folder Directory Structure Email Example

Figure 42. Content of Class Directory Example

11.2 Loading a Resource Corpus

To start you can load resource data into the system. You can load more than one set of data. To load an initial corpus into the system follow the steps described below.

Select Actions | 1. Add Resource Corpus on the main menu


Figure 43.

Then select the directory where your data is located and click [OK]. Note that the input data has to be in the standard structure explained in section 11.1.1.

Figure 44.

Once you have selected the data, it will be displayed as two levels of child nodes under the Resource Sets node. The system uses the same names for the child nodes as the folder directory names used in the input data.


Figure 45.

Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node.

Figure 46.


11.3 Selecting a Sampling Set

You can select a sampling set from the resource sets. The methods currently available are census sampling, random sampling, and systematic sampling.

Select Actions | 2. Select Sampling Set on the main menu

Figure 47.

From the combo box, select the resource set that you want to sample from.

Figure 48.


Define the output location where you would like to store the data for the sampling set(s). You can either directly input the directory name in the destination text box, or you can click the browse command button and select the directory from the Browse for folder dialog box.

Figure 49.

You can choose to run three different sampling methodologies on the resource set. The left-hand list box contains the available sampling methods. Use the arrow command buttons to select or un-select the methods you wish to run. Each method you choose to run will generate a separate sampling set at your chosen destination.

Census sampling takes the whole resource set as your sampling set

Random sampling is the purest form of probability sampling. Each member of the population has an equal and known chance of being selected. When there are very large populations, it is often difficult or impossible to identify every member of the population, so the pool of available subjects becomes biased.

You will need to select the sample size ratio you wish to use. The combo box will give you the ratios available for the resource corpus you have selected.


Figure 50.

Systematic sampling is also called an Nth name selection technique. After the required sample size has been calculated, every Nth record is selected from a list of population members. As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method. Its only advantage over the random sampling technique is simplicity.

Select the Nth number to use for your systematic sampling selection. The combo box will give you the numbers available for the resource corpus you have selected.


Figure 51.

All the sampling methods are stratified and take into consideration the classes within a corpus when performing sampling.
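A minimal C# sketch of the idea, assuming one sub-folder per class as described in section 11.1.1, is given below; it is not the actual SampleSetGenerator code.

// Illustrative sketch of stratified systematic sampling for one class folder.
static string[] SystematicSample(string classFolder, int n)
{
    string[] files = System.IO.Directory.GetFiles(classFolder);
    System.Collections.Generic.List<string> sample =
        new System.Collections.Generic.List<string>();

    // Take every Nth document within this class, so each class is sampled
    // separately (i.e. the sampling is stratified by class).
    for (int i = 0; i < files.Length; i += n)
        sample.Add(files[i]);

    return sample.ToArray();
}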

Once you have selected the data, it will be displayed as two levels of child nodes under the Sampling Sets node.

Figure 52.


Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node.

Figure 53.

11.4 Performing Pre-processing

Once you have selected your sampling set(s), you can perform pre-processing on them. There are currently three methods available: stop word removal, punctuation removal, and stemming.

Select Actions | 3. Run Pre-processing on the main menu


Figure 54.

Select the sampling set and the destination where you would like the pre-processed corpus to be saved. Then select the types of text pre-processing you wish to perform.

Figure 55.


Once you have selected the data, it will be displayed as two levels of child nodes under the Pre-Processed Sets node.

Figure 56.

Subsequently, you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node.


Figure 57.

11.5 Running N-Fold Cross-Validation

Once you have selected your sample and performed pre-processing, you can analyse the data in terms of using it as a training set for a classifier by running n-fold cross-validation. N-fold cross-validation splits the corpus into N blocks of data. Each block has training set data and test set data. The former is used to train the classifier. Once a classifier has been built, you can test it with the test set data.
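The C# sketch below illustrates one way such a split could be produced for a single class's documents; it is indicative only, and the actual CrossValidation class may partition the data differently.

// Illustrative sketch of an N-fold split over one class's document paths.
static void SplitFold(string[] documents, int nFolds, int fold,
                      System.Collections.Generic.List<string> trainingSet,
                      System.Collections.Generic.List<string> testSet)
{
    for (int i = 0; i < documents.Length; i++)
    {
        // Document i goes to the test set of exactly one fold and to the
        // training sets of all the other folds.
        if (i % nFolds == fold)
            testSet.Add(documents[i]);
        else
            trainingSet.Add(documents[i]);
    }
}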

11.5.1 Set Up Cross-Validation Set

Select Actions | 4. Run N-Fold Cross-Validation on the main menu

Figure 58.

Select the pre-processed corpus to use and the number of folds, N.


Figure 59.

Each fold, also known as a run, contains a Training Set and a Test Set. There is also an empty node for loading in new documents, and a node representing the status of the corresponding suffix tree.

Figure 60.


For both the Training Set and Test Set nodes you can navigate to the document nodes by expanding the nodes until you reach the leaf nodes, which are also the document nodes. You can view the content of a document by selecting the document node.

Figure 61.

11.5.2 Perform experiments on the data

11.5.2.1 Create the Suffix Tree

You can create a suffix tree using the Training Set of the cross-validation run as the training data. To do this, follow the steps below.

Select the suffix tree node and right click the mouse. Select Create Suffix Tree… menu item


Figure 62.

Select the suffix tree depth. Once the tree has successfully been created, the suffix tree icon changes into a green colour.

Figure 63.


11.5.2.2 Display Suffix Tree

The suffix tree can only be displayed if the tree has been created. A red coloured tree icon indicates that a suffix tree has not been created, and a green coloured tree icon means the tree has been created. If you attempt to display the suffix tree without it being created, a message box will notify you that the suffix tree needs to be created first. See Section 11.5.2.1 on how to create a suffix tree. Below describes how to display the suffix tree.

Select the suffix tree node and right click the mouse. Select Display Suffix Tree… menu item

Figure 64.

Select the class frequencies you would like to be displayed with the suffix tree


Figure 65.

The top right panel shows information about the suffix tree. The bottom right panel displays the visualisation of the suffix tree. Expand the nodes to see each level.


Figure 66.

11.5.2.3 Delete Suffix Tree

Select the suffix tree node and right click the mouse. Select Delete Suffix Tree… menu item

Figure 67.

11.5.2.4 N-Gram Matching

When the suffix tree icon is green, i.e. when the suffix tree has been created, you can perform n-gram matching on the documents in the Test Sets. N-gram matching lets you select a sub-string, match it against the suffix tree, and query the frequency of occurrences of the sub-string for each class.

Select a document under the Test Set node. The content of the document will be displayed in the bottom right pane.


Figure 68.

Select a sub-string within the text that you want to match against the suffix tree. Note that the maximum length of string that can exist in the suffix tree is the same as the depth of the tree you specified when you created it. For example, if you created a suffix tree with a depth of 5, there will be no occurrences in the suffix tree for a string that is 6 characters in length.

Figure 69.


11.5.2.5 Score Documents

You can score documents against each class once the suffix tree has been built. The system will calculate 126 different configurations of scoring methodologies. All scores are normalised.

Select a Test Set node and right click the mouse. Select Score All Documents menu item

Figure 70.

Once the documents have been scored, you can view the results for each document by simply selecting the document nodes.


Figure 71.

11.5.2.6 Classify documents

You can classify documents under the Test Set node once they have been scored. The system will flag any misclassified documents.

Select a Test Set node and right click the mouse. Select Classify All Documents… menu item


Figure 72.

Specify the minimum score lead value. A document is given a score for each class, and the minimum score lead value is the amount by which you want the highest class score to lead all the other class scores before the document is classified under the class with the highest score.

The scores for each class can be weighted. Specify the weights for each class.

Then specify the scoring configuration you want to use.
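The decision rule can be sketched in C# as follows; the helper and variable names are illustrative and are not taken from the actual code.

// Illustrative sketch: weight each class score, then classify only if the top
// class leads the runner-up by at least the minimum score lead value.
static string Classify(System.Collections.Generic.Dictionary<string, double> scores,
                       System.Collections.Generic.Dictionary<string, double> weights,
                       double minimumScoreLead)
{
    string bestClass = null;
    double best = double.MinValue;
    double secondBest = double.MinValue;

    foreach (System.Collections.Generic.KeyValuePair<string, double> entry in scores)
    {
        double weighted = entry.Value * weights[entry.Key];
        if (weighted > best)
        {
            secondBest = best;
            best = weighted;
            bestClass = entry.Key;
        }
        else if (weighted > secondBest)
        {
            secondBest = weighted;
        }
    }

    // Only classify when the lead over all other classes is large enough.
    return (best - secondBest >= minimumScoreLead) ? bestClass : "unclassified";
}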

Figure 73.

When the documents are classified, any misclassified documents will be flagged with a red document icon. You can select the files and view the scores in more detail. You can also drill down and do n-gram matching against the documents to analyse the reason for any misclassification.


Figure 74.

11.5.2.7 Add New Document to Classify

Select a New Documents node and right click the mouse. Select Add… menu item.

Figure 75.

Select the document you want to add. You can select multiple documents if you wish.


Figure 76.

Specify the minimum score lead value. A document is given a score for each class, and the minimum score lead value is the amount by which you want the highest class score to lead all the other class scores before the document is classified under the class with the highest score.

The scores for each class can be weighted. Specify the weights for each class.

Then specify the scoring configuration you want to use.

The system will automatically score and classify the document(s).


Figure 77.

Once the new document(s) have been added you can view their scores and content by clicking on the document node. The bottom left pane will display information about the selected document and its classification details. You can also perform n-gram matching.


Figure 78.

11.6 Creating a Classifier

You can create a new classifier using corpuses with the icon as training data. These are the corpuses under the Sampling Sets node and Pre-processed Sets node.

Select a corpus with the icon and right click the mouse. Select Add to Classification Set menu item. A new set of nodes is displayed under the Classification base node. The set contains the training data used to build the classifier, a node for loading new documents, and a node to represent the associated suffix tree.

Figure 79.

Similar to the functionalities available to perform on a cross-validation set, you can create, display, and delete the suffix tree. See section 11.5.2 Perform experiments on the data for detail.


Figure 80.

As for cross-validation sets, you can add new documents, then score and classify them. You can then view their content and scores and perform n-gram matching on the documents. See section 11.5 Running N-Fold Cross-Validation for details.

Figure 81.


12 TESTING

The main input data used to test the system was sourced from the data used in Pampapathi et al.'s study [1]. The data consists of emails grouped into two classes: ham (legitimate emails) and spam (unsolicited emails).

Testing was carried out throughout the development of the tool. Every functionality implementation was followed by tests on that functionality to make sure it worked and that new developments did not break previously implemented code.

The initial design of the system included storing an internal representation of the data loaded into the system. There was a Corpus class, which could contain any number of classes called Class. A Class object contained an array which stored the pathnames of the files/documents contained in each class of the corpus. As a result of testing it was found that keeping this internal representation of the data took up significant computational time. The user interface was designed to show this information already, in the TreeView structure (Figure 82). Therefore, to save computational time, the information for sets of data was retrieved from the TreeView control instead, and the classes were dropped from the design. This design decision had other implications which are discussed in the next section.

Figure 82. TreeView Control at the Top Left Panel of the User Interface

Actually using the system also helped to identify where system warnings and messages should be shown. Different icons were used in the GUI to represent different types of data sets. A menu item to close the application was also added (Figure 83).

Figure 83. Application Close

The system is not limited to classifying the ham and spam email documents used for testing in this project; it extends to data with more than two classes.


13 CONCLUSION

13.1 Evaluation

One of the first and most important questions after a software development project is: 'Does the system fulfil the original requirements?'

The aim was to build a management and visualisation tool for text mining applications. Apart from providing the functionalities required for such a tool, the system had to employ a flexible framework that would allow additions or substitution of the underlying components with relative ease.

The system built for this project has managed to fulfil the core requirements and functions well. It provides a software infrastructure for a data mining system based on machine learning that automatically manages and refines the knowledge discovery and data mining process. The system hides the complexity involved in document categorisation and provides a single platform design for users in the research community to test and tune a classifier for their research domain.

It has been built to be a wrapper providing the two-step process involved in classification, and a platform to carry out classifier validation and use. Each system component is built as a separate class and can easily be replaced, or additions made, to provide different functionalities. For some components to be added to the system they have to satisfy a contract that is defined by an interface. For example, to add a new classifier model alongside the existing suffix tree classifier, the new class that is plugged into the system has to implement the IClassifierModel interface.

There have been changes in the design of the system during the course of the project. Firstly, a new class, the TreeViewNodeManager, was added to handle the TreeView controls used in the GUI. Secondly, a corpus class and a category class were dropped. These two classes were intended to represent the data the system used; instead, the data logic was kept within the TreeView control structure used in the GUI. This change increased system processing speed and reduced duplication of data management. However, it also means that the current GUI is tightly integrated with the system through the Windows Forms implementation, which does not fully satisfy the flexible framework that the system aims to provide. The design change was made because, since another aim of the project was to provide a visualisation tool, a graphical interface was felt to be an appropriate choice, and increasing system responsiveness in an end-user application is important. The Windows Forms library in .NET provides a wealth of powerful existing capabilities for implementing sophisticated GUIs. Other forms of visualisation of the classification model, such as visually representing the suffix tree, were experimented with by building a custom DLL; this approach took up time and was essentially reinventing a wheel that was already available in .NET. It is unlikely that future work on the system would implement another user interface such as a command line; future requirements would more likely concentrate on ways to improve the current GUI. If there were indeed a need to change the user interface, only the user interface and the DisplayManager would need to be changed significantly, and the TreeViewNodeManager class discarded.

The requirements outlined in the Requirements section have been fulfilled, though there is scope for improving the application. The next section lists some suggested future work on the tool.


13.2 Future Work

A number of improvements and additional functionalities are suggested below.

Database connection settings such as the database name, location, and user ID are currently hard-coded in the DisplayManager class, which sets the class members in the DatabaseManager class. These could be requested from the user through the user interface. This would not only remove the hard coding but, if another type of output method is added to the system, would also let the user select which output store to use.

Evaluating the performance of a classifier usually means evaluating the accuracy of the classification. Other evaluation approaches may involve determining the space and time overhead used. Although these are usually secondary, they are useful, especially when new classification models are added. Future work could incorporate these techniques into the tool.

The system could employ a configuration file (e.g. an XML file) to save the settings of the last executed session, so that when the application is closed and next opened it will automatically set up the interface with the last session.

Going a step further, the configuration file could be used to store the current settings/system state during runtime. If this is implemented using a standard format such as XML, the same standard could be used as the format of messages exchanged between classes.

The ability to remove documents from a corpus. A user may find that a document that has been pre-classified to belong to a particular class is actually not a good representation for that class and should be removed from the training corpus.

Following on from the previous item, it may be useful not only to remove documents from a corpus but also to move them to another corpus.

With a generic design, the previous functionality should also make it straightforward to implement the ability for users to add a document to a corpus.

Future work could develop the system to handle training data that has N levels of hierarchy. A group of data is often found to belong to another, higher-level group: for example, the classification automobile contains cars, vans, etc. Cars can be further broken down by type or make, and then by model.

Implementing a distributed system, running on several servers at once, would provide support for high-throughput data processing. At the onset of the project, research into distributed architectures was carried out, and it was found that the .NET remoting technology was best suited to this project. However, it is complex to set up, and the time constraints did not allow this form of architecture to be implemented in this project.

The presentation of the graphical user interface could be improved. For example, the main form could be further divided by corpus type so that each screen shows grouped information and is also clearer.


The user guide could be integrated in the user interface, with search and index options.

The classifier is based on machine learning and the training data to such a classifier has to be pre-labelled in order to use it. The system could incorporate a whole new set of functionalities which uses clustering to find undetected group structures. As a result the input data loaded into the system does not necessarily have to be pre-classified.

Some of these changes can be made without too much trouble and would constitute an extra few months of work in total. Other suggestions are more involved and require more time to implement.

The project was a good challenge and through the course of the project I have learnt a lot about software development.


14 BIBLIOGRAPHY

[1] Rajesh M. Pampapathi, Boris Mirkin, Mark Levene. A suffix Tree Approach to Text Categorisation Applied to Spam Filtering. Available online: http://arxiv.org/abs/cs.AI/0503030, February 2005.

[2] Donald P. Ryan. Ancient Languages and Scripts. Webpage (last accessed 10 August 2005): http://www.plu.edu/~ryandp/texts.html

[3] Fabrizio Sebastiani. Text Categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, p109-129.

[4] Fabrizio Sebastiani. A Tutorial on Automated Text Categorisation. Istituto di Elaborazione dell’Informazione. Consiglio Nazionale delle Ricerche. Via S. Maria, 46-56126 Pisa (Italy), 1999.

[5] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Nedellec C & Rouveirol C (eds.) Proceedings of ECML-98, 10th European Conference on Machine Learning. Lecture Notes in Computer Science series, no. 1398 Heidelberg: Springer Verlag 1998. p137-142.

[6] Gill Bejerano and Golan Yona. Variations on Probabilistic Suffix Trees: Statistical Modelling and Prediction of Protein Families. School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel, and Department of Structural Biology, Fairchild Bldg. D-109, Stanford University, CA 94305, USA.

[7] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), pp. 1-47, 2002.

[8] Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, vol. 1, no. 1/2, Kluwer Academic Publishers, pp. 69-90, 1999.

[9] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing MIT Press, 1999.

[10] M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pp 313-316. Morgan Kaufmann Publishers Inc, 1997.


Note: the algorithm was originally described in Porter, M. F., 1980, An algorithm for suffix stripping, Program, 14(3) : 130-137. It has since been reprinted in Sparck Jones, Karen, and Peter Willet, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4

[11] Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press. 1997.

[12] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann, Academic Press, London, 2001.

[13] Margaret H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, London, 2003.

[14] Craig Larman. Applying UML and Patterns, An Introduction to Object-oriented Analysis and Design and the Unified Process (2nd Ed). Prentice Hall PTR, US, 2002.

[15] Martin Fowler. UML Distilled, A Brief Guide to the Standard Object Modeling Language (3rd ed). Pearson Education Inc, Boston, 2004.

[16] Dave Thomas, Agile Programming: design to accommodate Change. IEE Software www.computer.org/software, vol. 22, No 3 May/June 2005.

[17] Peter Drayton, Ben Albahari and Ted Neward. C# in a nutshell: a desktop quick reference. O’Reilly & Associates. California. 2002.

[18] Jesse Liberty. Programming C#. O’Reilly. Beijing Cambridge. 2003.

[19] C


15 APPENDIX A DATABASE

Last Score ID:

SELECT Max(Scores.ScoreId) AS MaxOfScoreId

FROM Scores;

Last Source ID:

SELECT Max(Source.SourceId) AS MaxOfSourceId

FROM Source;

Weighted Scores View Query:

SELECT Scores.ScoreId, Scores.SourceId, Scores.ConfigId, Scores.[Score Class], Scores.[True Class], Scores.Score, [Score]*ClassWeights.Weight AS WScore
FROM (Source INNER JOIN ClassWeights ON Source.[Node Parent Path] = ClassWeights.[Node Path]) INNER JOIN Scores ON Source.SourceId = Scores.SourceId
WHERE (((Scores.[Score Class])=[Class]));

Maximum and Minimum Scores View Query:

SELECT MaxWTable.SourceId, MaxWTable.ConfigId, MaxWTable.[True Class], MaxWTable.MaxOfWScore, MaxWTable.MinOfWScore
FROM (
    SELECT ws2.SourceId, ws2.ConfigId, ws2.[True Class], Max(ws2.WScore) AS MaxOfWScore, Min(ws2.WScore) AS MinOfWScore
    FROM WeightedScores AS ws2
    GROUP BY ws2.SourceId, ws2.ConfigId, ws2.[True Class]
) AS MaxWTable
WHERE (((MaxWTable.MaxOfWScore) Not In (
    SELECT Count(ws3.WScore) AS CountOfWScore
    FROM WeightedScores AS ws3
    GROUP BY ws3.SourceId, ws3.ConfigId, ws3.[True Class], ws3.WScore
    HAVING (Count(ws3.WScore) > 1)
)));

Misclassified Documents:

SELECT Source.[Node Path], Source.[File Path], Config.SF, Config.MN, Config.TN, t2.[True Class], t2.[Score Class], t2.WScore
FROM ((qry2a_MaxMinLeadWScoresByFile AS t1
    INNER JOIN qry2a_MaxMinLeadWScoresByFile AS t2 ON (t1.SourceId = t2.SourceId) AND (t1.ConfigId = t2.ConfigId))
    INNER JOIN Source ON t1.SourceId = Source.SourceId)
    INNER JOIN Config ON t1.ConfigId = Config.ConfigId
WHERE (((t2.[True Class])<>[t2].[Score Class]) AND ((t1.[True Class])=[t1].[Score Class]) AND ((t1.NScore)<[t2].[LeadDiff]));


16 APPENDIX B CLASS DEFINITIONS

Interfaces:

Data Types:


User Interfaces:

Methods:


Utility:


17 APPENDIX C SOURCE CODE
