
Table of Contents

Chapter 6. Applications to Text Mining
  6.1. Centroid-based Text Classification
    6.1.1. Formulation of centroid-based text classification
    6.1.2. Effect of term distributions
    6.1.3. Experimental settings and results
  6.2. Document Relation Extraction
    6.2.1. Document relation discovery using frequent itemset mining
    6.2.2. Empirical evaluation using citation information
    6.2.3. Experimental settings and results
  6.3. Application to Automatic Thai Unknown Word Detection
    6.3.1. Thai unknown words as a word segmentation problem
    6.3.2. The proposed method
    6.3.3. Experimental settings and results

Sponsored by AIAT.or.th and KINDML, SIIT

CC: BY NC ND


Chapter 6. Applications to Text Mining

As one application of data mining, text mining is a knowledge-intensive process that deals with a document collection over time using a set of analysis and natural language processing tools. Text mining

seeks to extract useful information from a large pile of textual data sources through the identification

and exploration of interesting patterns. The data sources can be electronic documents, email, web

documents or any textual collections, and interesting patterns are found not in formalized database

records but, instead, in the unstructured textual data in the documents in these collections. It is quite

common that text mining and data mining share many high-level architectural similarities, including

preprocessing routines, pattern-discovery algorithms, and visualization tools for presenting mining

results. While data mining assumes that data have already been stored in a structured format with

preprocessing of data cleansing and transformation, text mining deals with preprocessing of feature

extraction, i.e., usually keywords from natural language documents. The number of features in text

mining tends to be much larger than that in data mining since the features in text mining involve words, which are highly varied. Most text mining work exploits techniques and methodologies from the

areas of information retrieval, information extraction, and corpus-based computational linguistics.

This chapter presents three examples of text mining applications: text classification, document

relation extraction, and unknown word detection in the Thai language. The original literature related to these three applications can be found in (Lertnattee and Theeramunkong, 2004a), (Sriphaew and Theeramunkong, 2007a) and (TeCho et al., 2009b).

Before explanation of the applications, some basic concepts of text processing are provided as

follows. Towards text mining, several preprocessing techniques have been proposed to transform raw textual data into structured document representations. Most techniques aim to use and produce domain-independent linguistic features with natural language processing (NLP) techniques.

There are also text categorization and information extraction (IE) techniques, which directly deal

with the domain-specific knowledge. Note that a document is an abstract object. Therefore, we can

have a variety of possible actual representations for it. To exploit information in documents, we need

a so-called document structuring process which transforms raw representation to some kinds of

structured representation. To solve this task, at least three subtasks need to be solved; (1) text pre-

processing task, (2) problem-independent task, and (3) problem-dependent task. As the first subtask,

text pre-processing converts raw representation into a structure suitable for further linguistic

processing. For example, when the raw input is a document image or a recorded speech, pre-

processing is to convert the raw input into a stream of text, sometimes with text structures such as

paragraphs, columns and tables, as well as some document-level fields, such as author, title, and abstract, identified from the visual presentation. To convert document images to texts, optical character recognition

(OCR) is used while speech recognition can be applied to transform audio speeches into texts. As the

second subtask, the problem-independent tasks process text documents using general knowledge on

natural language. The tasks may include word segmentation or tokenization, morphological analysis,

POS tagging, and syntactic parsing in either shallow or deep processing. The output of these tasks is

not specific for any particular problem, but typically employed for further problem-dependent

processing. The domain-related knowledge, however, can often enhance performance of general-

purpose NLP tasks and is often used at different levels of processing. As the last step, the problem-

dependent tasks attempt to output final representation suitable for the concerned task for text

categorization, information extraction, etc. However, it has been shown that different analysis levels, including phonetic, morphological, syntactic, semantic, and pragmatic, occur simultaneously and depend on each other. Even now, how humans process a language remains unrevealed. Some works have tried to combine such levels into one single process but have not yet achieved a satisfactory level. Therefore most text understanding methods use the divide-


and-conquer strategy, separating the whole problem into several subtasks and solving them

independently as follows.

Tokenization and Word Segmentation

An important first step towards text analysis is to break down a continuous character stream into

meaningful constituents, such as chapters, sections, paragraphs, sentences, words, and even syllables

or phonemes. Tokenization is a process to break the text into sentences and words. In English, the

main challenge is to identify sentence boundaries since a period can mark the end of a sentence or be part of a preceding token like Dr., Mr., Ms., Prof., St., No. and so on. In general a tokenizer may

extract token features, such as types of capitalization, inclusion of digits, punctuation, special

characters, and so on. These features usually describe some superficial property of the sequence of

characters that make up the token. For languages without explicit word boundaries such as Thai,

Japanese, Korean and Chinese, word segmentation is necessary. This processing is very important to

construct the fundamental units for processing such languages.
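As a concrete illustration of these ideas (not taken from any particular system), the following minimal Python sketch splits text into sentences while skipping periods that belong to a small abbreviation list, and attaches superficial token features such as capitalization and digit inclusion; the abbreviation list and regular expressions are placeholder assumptions.

    import re

    # A small, non-exhaustive list of abbreviations whose trailing period
    # should not be treated as a sentence boundary.
    ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Prof.", "St.", "No."}

    def tokenize(text):
        """Split text into word/number/punctuation tokens with superficial features."""
        tokens = []
        for match in re.finditer(r"[A-Za-z]+\.?|\d+(?:\.\d+)?|[^\w\s]", text):
            tok = match.group()
            tokens.append({
                "token": tok,
                "capitalized": tok[:1].isupper(),
                "has_digit": any(ch.isdigit() for ch in tok),
                "is_punct": not any(ch.isalnum() for ch in tok),
            })
        return tokens

    def split_sentences(text):
        """Naively split on . ! ? while skipping periods of known abbreviations."""
        sentences, start = [], 0
        for match in re.finditer(r"[.!?]", text):
            end = match.end()
            words = text[start:end].split()
            if words and words[-1] in ABBREVIATIONS:
                continue  # the period belongs to an abbreviation, not a sentence boundary
            sentences.append(text[start:end].strip())
            start = end
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences

    print(split_sentences("Dr. Smith arrived. He met Prof. Lee."))
    # ['Dr. Smith arrived.', 'He met Prof. Lee.']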

Part-of-Speech (POS) Tagging

POS tagging is the process of assigning each word in a sentence an appropriate word type (category), i.e., a POS tag, based on the context in which it appears. The POS tag of a word specifies the role the word plays in the sentence where it appears. It also provides initial information related to the

semantic content of a word. Among several works, the most common set of tags includes seven

different tags, i.e., article, noun, verb, adjective, preposition, number, and proper noun. Some systems

contain a much more elaborate set of tags. For instance, there have been at least 87 basic tags in the

complete Brown Corpus. More tag types allow a more detailed analysis.
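As a toy illustration only (real taggers are learned from annotated corpora and use the surrounding context), the sketch below assigns tags from the seven-tag set mentioned above using a small hand-made lexicon and crude fallback rules; the lexicon and rules are hypothetical.

    # A tiny hand-made lexicon mapping words to tags from the seven-tag set
    # (article, noun, verb, adjective, preposition, number, proper noun).
    LEXICON = {
        "the": "article", "a": "article",
        "cat": "noun", "mat": "noun",
        "sat": "verb", "is": "verb",
        "black": "adjective",
        "on": "preposition", "in": "preposition",
    }

    def simple_pos_tag(tokens):
        """Assign a POS tag to each token: lexicon lookup, then fallback rules."""
        tagged = []
        for tok in tokens:
            if tok.lower() in LEXICON:
                tag = LEXICON[tok.lower()]
            elif tok.isdigit():
                tag = "number"
            elif tok[:1].isupper():
                tag = "proper noun"   # crude guess for capitalized unknown words
            else:
                tag = "noun"          # default tag for unknown lower-case words
            tagged.append((tok, tag))
        return tagged

    print(simple_pos_tag("The black cat sat on the mat".split()))
    # [('The', 'article'), ('black', 'adjective'), ('cat', 'noun'), ...]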

Syntactical Parsing

Syntactical parsing is a process that applies a grammar to detect the structure of a sentence. In the

sentence structure, common constituents in grammars include noun phrases, verb phrases,

prepositional phrases, adjective phrases, and subordinate clauses. Following grammar rules, each

phrase or clause may consist of smaller phrases or words. For deeper analysis, the syntactical

structure of sentences may also elaborate the roles of different phrases, such as a noun phrase as a

subject, an object, or a complement. In the grammar, it is also possible to specify dependency among

phrases or clauses at several different levels. After analyzing a sentence, the output can be

represented as a sentence graph with connected components.

Shallow Parsing

In real situations, it is not easy to fully analyze the structure of a sentence since language usage is

sometimes complicated and flexible. Therefore it is almost impossible to construct a grammar that

covers all cases. Moreover, while we try to revise a grammar to cover special cases, as a by-product a

lot of ambiguity will be triggered in the grammar. Such ambiguity needs to be resolved by higher-level processes, such as semantic or pragmatic processing. In this situation, traditional algorithms are normally too computationally expensive to process a large number of sentences in a very large corpus. They

are also not robust enough. Instead of full analysis, shallow parsing is a practical alternative since it

will not perform a complete analysis of a whole sentence but only treat some parts in the sentence

that are simple and unambiguous. For example, shallow parsing finds only small and simple noun

and verb phrases, but not complex clauses. Therefore we can gain speed and robustness of processing by sacrificing depth of analysis. The most prominent dependencies might be formed, but

unclear and ambiguous ones are left unresolved.

For the purposes of information extraction, shallow parsing is usually sufficient and therefore

preferable to full analysis because of its far greater speed and robustness.


Problem-Dependent Tasks

After text preprocessing and problem-independent processing, the final stage is to create meaningful representations for later, more sophisticated processing. Normally, to process a text,

documents are expected to be represented as sets of features. Two common applications for text

mining are text categorization (TC) and information extraction (IE). Both of these applications need a

tagging (and sometimes parsing) process. TC and IE enable users to move from a machine-readable

representation of the documents to a machine-understandable form of the documents.

Text categorization or text classification is a task to assign a category (also called a class) to each

document, such as giving the class ‘political’ to a political news article. The number of groups depends on

the user preference. The set of all possible categories is usually manually predefined beforehand. All

categories are usually unrelated. Recently, however, multidimensional text classification and multi-class text classification have been explored intensively. Information Extraction (IE) is a task to

discover important constituents in a text, such as what, where, who, whom, when (5W). Without IE

techniques, we would have much more limited knowledge discovery capabilities. IE is different to

information retrieval (IR) which perform search. Information retrieval just discover documents

relavant to a given query and let the user read the whole document. IE, on the other hand, aims to

extract the relevant information and present it in a structured format, such as a table. IE can help us

save time for reading the whole document by providing essential information in a structured form.

6.1. Centroid-based Text Classification

With the fast growth of online text information, there has been a pressing need to find and organize

relevant information in text documents. For this purpose, it is known that automatic text

categorization (also known as text classification) becomes a significant tool to utilize text

documents efficiently and effectively. As an application, it can improve text retrieval as it allows

class-based retrieval instead of full retrieval. Given statistics acquired from a training set of

labeled documents, text categorization is a method to use these statistics to assign a class label to

a new document. In the past, a variety of classification models were developed in different

schemes, such as probabilistic models (i.e., Bayesian classification), decision trees and rules,

regression models, example-based models (e.g., k-nearest neighbor or k-NN), linear models,

support vector machine, neural networks and so on. Among these methods, a variant of linear

models called a centroid-based or linear discriminant model is attractive since it has relatively

less computation than other methods in both the learning and classification stages. The

traditional centroid-based method can be viewed as a specialization of the so-called Rocchio method

proposed by Rocchio (1971) and used in several works on text categorization (Joachims, 1997).

Based on the vector space model, a centroid-based method computes beforehand, for each class

(category), an explicit profile (or class prototype), which is a centroid vector for all positive

training documents of that category. The classification task is to find the most similar class to the

vector of the document we would like to classify, for example by the means of cosine similarity.

Despite requiring less computation time, centroid-based methods were shown to achieve relatively

high classification accuracy. In a centroid-based model, an individual class is modeled by

weighting terms appearing in training documents assigned to the class. This makes classification

performance of the model strongly depend on the weighting method applied in the model. Most

previous works of centroid-based classification focused on weighting factors related to frequency

patterns of words or documents in the class. Moreover, they are often obtained from statistics

within a class (i.e., positive examples of the class). The most popular factors are term frequency

(tf) and inverse document frequency (idf).


Text categorization or text classification (TC) is a task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of T (i.e., true) is assigned to $\langle d_j, c_i \rangle$ when the document $d_j$ is determined to belong to the category $c_i$. On the other hand, a value of F (i.e., false) is assigned to $\langle d_j, c_i \rangle$ when the document $d_j$ is determined not to belong to the category $c_i$. In general, text classification is composed of two main phases, called the model training phase and the classification phase. In the training phase, the task is to approximate the unknown target function $\Phi: D \times C \rightarrow \{T, F\}$ that describes how documents should be classified. Based on a training set, a function $\hat{\Phi}: D \times C \rightarrow \{T, F\}$ called the classifier (also called rule, hypothesis, or model) is acquired as the result of approximation. A good classifier is a model that coincides with the target function as much as possible.

The TC task discussed above is general. However, there are some additional factors or constraints possible for this task. They include single-label vs. multi-label, category-pivoted vs. document-pivoted, and hard vs. ranking classification. Single-label classification assigns exactly one category to each $d_j \in D$ while multi-label classification may give more than one category to the same $d_j$. A special case of single-label TC is binary TC, where each $d_j$ must be assigned either to a category $c_i$ or to its complement $\bar{c_i}$. From the pivot aspect, there are two different ways of using a text classifier. Given $d_j \in D$, the task of finding all $c_i \in C$ that the document $d_j$ belongs to is called document-pivoted classification. Alternatively, given $c_i \in C$, the task of finding all $d_j \in D$ that belong to the category $c_i$ is named category-pivoted classification. This distinction is more pragmatic than conceptual and it matters when the sets $C$ and $D$ might not be available in their entirety right from the start. Lastly, hard categorization is to assign a T or F decision to each pair $\langle d_j, c_i \rangle$ while ranking categorization is to rank the categories in $C$ according to their estimated appropriateness to $d_j$, without taking any hard decision on any of them. The task of ranking categorization is to approximate the unknown target function $\Phi: D \times C \rightarrow [0, 1]$ by generating a classifier $\hat{\Phi}: D \times C \rightarrow [0, 1]$ that matches the target function as much as possible. The result is to assign a number between 0 and 1 to each pair $\langle d_j, c_i \rangle$. This value represents the likelihood that the document $d_j$ is classified into the category $c_i$. Finally, for each $d_j$, a ranked list of categories is obtained. This list would be of great help to a human expert in making the final categorization decision. By these definitions, the task focused on in this work is characterized as single-label, category-pivoted and hard classification.

6.1.1. Formulation of centroid-based text classification

In centroid-based text categorization, an explicit profile of a class (also called a class prototype)

is calculated and used as the representative of all positive documents of the class. The

classification task is to find the most similar class to the document we would like to classify, by

way of comparing the document with the class prototype of the focused class. This approach is

characterized by at least three factors; (1) representation basics, (2) class prototype

construction: term weighting and normalization, and (3) classification execution: query

weighting and similarity definition. Their details are described in the rest of this section.

Representation Basics

The frequently used document representation in IR and TC is the so-called bag of words (BOW)

where words in a document are used as basics for representing that document. There are also

some works that use additional information such as word position and word sequence in the

representation. In the centroid-based text categorization, a document (or a class) is represented


by a vector using a vector space model with BOW. In this representation, each element (or

feature) in the vector is equivalent to a unique word with a weight. The method to give a weight

to a word varies from work to work, as described in the following section. In a more general

framework, the concept of n-gram can be applied. Instead of a single isolated word, a sequence of

n words will be used as representation basics. In several applications, not specific for

classification, the most popular n-grams are 1-gram (unigram), 2-gram (bigram) and 3-gram

(trigram). Alternatively, the combination of different n-grams, for instance the combination of

unigram and bigram, can also be applied. The n-grams or their combinations form a set of so-

called terms that are used for representing a document. Although a higher-order n-gram provides more information, which may help improve classification accuracy, more training data and computational power are required.
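For illustration, the following small Python sketch (not from the cited works) builds unigram, bigram, or combined unigram+bigram term lists from a tokenized document.

    def extract_ngrams(tokens, n):
        """Return the list of n-grams (as space-joined strings) in a token sequence."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def build_terms(tokens, orders=(1,)):
        """Build the term list for a document from one or more n-gram orders,
        e.g. orders=(1,) for unigrams only, orders=(1, 2) for unigram + bigram."""
        terms = []
        for n in orders:
            terms.extend(extract_ngrams(tokens, n))
        return terms

    tokens = ["text", "mining", "extracts", "useful", "patterns"]
    print(build_terms(tokens, orders=(1, 2)))
    # ['text', 'mining', ..., 'text mining', 'mining extracts', ...]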

Class Prototype Construction: Term Weighting and Normalization

Once we obtain a set of terms in a document, it is necessary to represent them numerically. Towards this, term weighting is applied to set the level of contribution of a term to a document. In the past, most existing works applied term frequency (tf) and inverse document frequency (idf) in the form of $tf \times idf$ for representing a document.

In the vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_n\}$, a document $d_j$ is represented by a vector $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{mj})$, where $w_{ij}$ is a weight assigned to a term $t_i$ in the document. Here, assume that there are $m$ unique terms in the universe. The representation of the document is defined as follows.

$w_{ij} = tf_{ij} \times idf_i$

In this definition, $tf_{ij}$ is the term frequency of a term $t_i$ in a document $d_j$ and $idf_i$ is defined as $\log(N / df_i)$, where $N$ is the total number of documents in the collection and $df_i$ is the number of documents which contain the term $t_i$. Three alternative types of term frequency are (1) occurrence frequency, (2) augmented normalized term frequency and (3) binary term frequency. The occurrence frequency, the simplest and most intuitive one, corresponds to the number of occurrences of the term in a document. The augmented normalized term frequency is defined by $0.5 + 0.5 \times (tf / tf_{max})$, where $tf$ is the occurrence frequency and $tf_{max}$ is the maximum term frequency in a document.

This compensates for relatively high term frequency in the case of long documents. It

works well when there are many technically meaningful terms in documents. The binary

term frequency is nothing more than 1 for presence and 0 for absence of the term in the

document. Term frequency alone may not be enough to represent the contribution of a

term in a document. To achieve a better performance, the well-known inverse

document frequency can be applied to eliminate the impact of frequent terms that exist

in almost all documents.
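Under the definitions above, the three term frequency variants, the idf factor, and the resulting tf x idf weights can be sketched as follows; this is a minimal illustration with a toy corpus, not the exact implementation used in the cited experiments.

    import math
    from collections import Counter

    def term_frequencies(doc_terms, variant="occurrence"):
        """Compute tf for one document under the three variants described above."""
        counts = Counter(doc_terms)
        if variant == "occurrence":
            return dict(counts)
        if variant == "binary":
            return {t: 1 for t in counts}
        if variant == "augmented":
            max_tf = max(counts.values())
            return {t: 0.5 + 0.5 * c / max_tf for t, c in counts.items()}
        raise ValueError(f"unknown variant: {variant}")

    def inverse_document_frequencies(docs):
        """idf_i = log(N / df_i), where df_i is the number of documents containing term t_i."""
        n_docs = len(docs)
        df = Counter(t for doc in docs for t in set(doc))
        return {t: math.log(n_docs / d) for t, d in df.items()}

    docs = [["data", "mining"], ["text", "mining", "mining"], ["text", "classification"]]
    idf = inverse_document_frequencies(docs)
    tf = term_frequencies(docs[1], variant="augmented")
    weights = {t: tf[t] * idf[t] for t in tf}        # w_ij = tf_ij * idf_i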

Besides term weighting, normalization is another important factor to represent a

document or a class. Without normalization, the classification result will strongly

depend on the document length. A long document is likely to be selected, compared to a

short document since it usually includes higher term frequencies and more unique

terms in document representation. The higher term frequency of a long document will

increase the average contribution of its terms to the similarity between the document

and the query. More unique terms also increase the similarity and chances of retrieval


of longer documents in preference over shorter documents. To solve this issue,

normally all relevant documents should be treated as equally important for

classification or retrieval. Normalization by document length is incorporated into term

weighting formula to equalize the length of document vectors. Although there are

several normalization techniques including cosine normalization and byte length

normalization, the cosine normalization is the most commonly used. It can solve the

problem of overweighting due to both higher term frequency and more unique terms.

The cosine normalization is done by dividing all elements in a vector by the length of the vector, that is

$\hat{w}_{ij} = \dfrac{w_{ij}}{\sqrt{\sum_{k=1}^{m} w_{kj}^2}}$

where $w_{ij}$ is the weight of the term $t_i$ in the document $d_j$ before normalization.

Given a class with a set of its assigned documents, there are two possible alternatives to

create a class prototype. One is to normalize each document vector in a class before summing up

all document vectors to form a class prototype vector (normalization then merging). The other is

to sum up all document vectors before normalizing the result vector (merging then

normalization). The latter one is also called a prototype vector, which is invariant to the number

of documents per class. However, both methods obtain high classification accuracy with small

time complexity. The class prototype can be derived as follows. Let $D_k = \{\vec{d} \mid \vec{d}$ is a document vector belonging to the class $c_k\}$ be the set of document vectors assigned to the class $c_k$. Here, a class prototype $\vec{c_k}$ is obtained by summing up all document vectors in $D_k$ and then normalizing the result by its size as follows.

$\vec{c_k} = \dfrac{\sum_{\vec{d} \in D_k} \vec{d}}{\left\lVert \sum_{\vec{d} \in D_k} \vec{d} \right\rVert}$
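The two prototype-construction alternatives can be sketched as follows, with document vectors represented as sparse dictionaries of term weights; this is an illustrative fragment, not the authors' code.

    import math

    def l2_normalize(vec):
        """Cosine (L2) normalization: divide each element by the length of the vector."""
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

    def vector_sum(vectors):
        total = {}
        for vec in vectors:
            for t, w in vec.items():
                total[t] = total.get(t, 0.0) + w
        return total

    def prototype_merge_then_normalize(class_vectors):
        """Sum all document vectors of a class, then normalize the resulting vector."""
        return l2_normalize(vector_sum(class_vectors))

    def prototype_normalize_then_merge(class_vectors):
        """Normalize each document vector first, then sum the normalized vectors."""
        return vector_sum(l2_normalize(v) for v in class_vectors)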

Classification execution: query weighting and similarity definition

The last but not least important factors are query weighting and similarity definition.

For query weighting, term weighting described above can also be applied to a query or

a test document (i.e., a document to be classified). The simple term weighting for a

query is $tf \times idf$. In the same way as in class prototype construction, there are three

possible types of term frequency; occurrence frequency, augmented normalized term

frequency and binary term frequency. Once a class prototype vector and a query vector

have been constructed, the similarity between these two vectors can be calculated. The most popular measure is the cosine similarity, which can be calculated as the dot product between the two (normalized) vectors. Therefore, the test document $\vec{q}$ will be assigned to the class $c^*$ whose class prototype vector is the most similar to the query vector of the test document.

$c^* = \arg\max_{c_k} \dfrac{\vec{q} \cdot \vec{c_k}}{\lVert\vec{q}\rVert \, \lVert\vec{c_k}\rVert} = \arg\max_{c_k} \; \vec{q} \cdot \vec{c_k}$

Here, as stated before, $\lVert\vec{c_k}\rVert$ is equal to 1 since the class prototype vector has been normalized. Moreover, the normalization of the test document has no effect on the ranking. Therefore, the test document is assigned to the class whose class prototype vector achieves the highest dot product with the test document vector.
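Classification then reduces to a dot product between the query vector and each (already normalized) class prototype, as in the sketch below, which assumes the prototype dictionaries from the previous fragment.

    def dot(u, v):
        """Dot product of two sparse vectors represented as dicts."""
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    def classify(query_vec, prototypes):
        """Assign the query to the class whose prototype yields the highest dot product.
        Since the prototypes are unit vectors and the query norm is constant over classes,
        this ranking is equivalent to ranking by cosine similarity."""
        return max(prototypes, key=lambda cls: dot(query_vec, prototypes[cls]))

    # prototypes = {"course": ..., "faculty": ...}   # built as in the previous sketch
    # predicted = classify(query_vec, prototypes)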


6.1.2. Effect of Term distributions

Originally, Lertnattee and Theeramunkong (2004a, 2006a, 2006b, 2007b) have done a series of

research works to investigate the effect of term distributions on classification accuracy.

The reader can find the full description of this work in those publications. In this

section, the summary of this work is given. Here, three types of term distributions, called inter-

class, intra-class and in-collection distributions, are introduced. These distributions are expected

to increase classification accuracy by exploiting information of (1) term distribution among

classes, (2) term distribution within a class and (3) term distribution in the whole collection of

training data. They are used to represent importance or classification power to weight that term

in a document. Another objective of this work is to investigate the pattern of how these term

distributions contribute to weighting a term in documents. For example, should a high term distribution of a word (or term) promote or demote the importance of that word? Here, it is also possible to

consider unigram or bigram as document representation.

Term distributions

The first question is what are the characteristics of terms that are significant for representing a

document or a class. In general, we can observe that (1) a significant term should appear

frequently in a certain class and (2) it should appear in few documents. These two properties can

be handled by the conventional term frequency and inverse document frequency, respectively.

However, we can further observe that (1) a significant term should not distribute very differently

among documents in the whole collection, (2) it should distribute very differently among classes,

and (3) it should not distribute very differently among documents in a class. These three

characteristics cannot be represented by conventional tf and idf. It is necessary to use

distribution (relative information) instead of frequency (absolute information). Distribution

related information that we can exploit includes distributions of terms among classes, within a

class and in the whole collection. Three kinds of this information can be defined as inter-class

standard deviation (icsd), class standard deviation (csd) and standard deviation (sd). Let $tf_{ijk}$ be the term frequency of the term $t_i$ in the document $d_j$ of the class $c_k$. The formal definitions of icsd, csd and sd are given below.

$icsd_i = \sqrt{\dfrac{\sum_{k=1}^{|C|} \left( \overline{tf_{ik}} - \overline{tf_i} \right)^2}{|C|}}$

$csd_{ik} = \sqrt{\dfrac{\sum_{d_j \in c_k} \left( tf_{ijk} - \overline{tf_{ik}} \right)^2}{n_k}}$

$sd_i = \sqrt{\dfrac{\sum_{j=1}^{n} \left( tf_{ij} - \overline{tf_i} \right)^2}{n}}$

where $\overline{tf_{ik}}$ is the average term frequency of the term $t_i$ in all documents within the class $c_k$, $\overline{tf_i}$ is the corresponding overall average frequency of the term $t_i$, $|C|$ is the number of classes, $n$ is the total number of documents, and $n_k$ is the number of documents in the class $c_k$.


1. Inter-class standard deviation:

The inter-class standard deviation of a term $t_i$ is calculated from the set of average frequencies $\overline{tf_{ik}}$, each of which is gathered from a class $c_k$. This deviation is an inter-class factor. Therefore, icsd for a term is independent of classes. A term with a high icsd distributes

differently among classes and should have higher discriminating power for classification

than the others. This factor promotes a term that exists in almost all classes but its

frequencies for those classes are quite different. In this situation, the conventional factors tf

and idf are not helpful.

2. Class standard deviation:

The class standard deviation of a term $t_i$ in a class $c_k$ is calculated from the set of term frequencies $tf_{ijk}$, each of which comes from the frequency of that term in a document in the class. This deviation is an intra-class factor. Therefore, csds for a term vary class by class.

Different terms may appear with quite different frequencies among documents in the class.

This difference can be alleviated by the way of this deviation. A term with a high csd will

appear in most documents in the class with quite different frequencies and should not be a

good representative term of the class. A low csd of a term may be triggered by either of the

following two reasons. The occurrences of the term are nearly equal for all documents in the

class or the term rarely occurs in the class.

3. Standard deviation:

The standard deviation of a term $t_i$ is calculated from the set of term frequencies $tf_{ij}$, each of which comes from the frequency of that term in a document in the collection. The deviation

is a collection factor. Therefore, sd for a term is independent of classes. Different terms may

appear with quite different frequencies among documents in the collection. This difference

can be also alleviated by the way of this deviation. A term with a high sd will appear in most

documents in the collection with quite different frequencies. A low sd of a term may be

caused by either of the following two reasons. The occurrences of the term are nearly equal

for all documents in the collection or the term rarely occurs in the collection.
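A direct reading of these definitions can be sketched as follows, using population standard deviations over raw term frequencies; the exact normalization in the original work may differ slightly, so treat this as an illustration rather than a reference implementation.

    import math
    from collections import Counter

    def mean(xs):
        return sum(xs) / len(xs)

    def pstdev(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    def term_distributions(term, classes):
        """classes: dict mapping class name -> list of documents, each a list of terms.
        Returns (icsd, csd-per-class, sd) for the given term."""
        per_class_tfs = {c: [Counter(doc)[term] for doc in docs]
                         for c, docs in classes.items()}
        class_means = [mean(tfs) for tfs in per_class_tfs.values()]
        all_tfs = [tf for tfs in per_class_tfs.values() for tf in tfs]

        icsd = pstdev(class_means)                                   # deviation among class averages
        csd = {c: pstdev(tfs) for c, tfs in per_class_tfs.items()}   # deviation within each class
        sd = pstdev(all_tfs)                                         # deviation over the whole collection
        return icsd, csd, sd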

Enhancement of term weighting using term distributions

The second question is how the above-mentioned term distributions contribute to term

weighting. The term distributions, i.e., icsd, csd and sd, can enhance the performance of a

centroid-based classifier with the standard weighting $tf \times idf$. Two issues of consideration are

whether these distributions should act as a promoter (multiplier) or a demoter (divisor) and

how strong they affect the weight. To grasp these characteristics, term weighting can be designed

using the following skeleton. Here, $w_{ik}$ is a weight given to the term $t_i$ of the class $c_k$.

$w_{ik} = tf_{ik} \times idf_i \times icsd_i^{\alpha} \times csd_{ik}^{\beta} \times sd_i^{\gamma}$

The weighting includes the factors of term distributions. The parameters $\alpha$, $\beta$ and $\gamma$ are numeric values used for setting the contribution levels of icsd, csd and sd to term weighting, respectively.

For each parameter, a positive number means the factor acts as a promoter while a negative one

means the factor acts as a demoter. Moreover, the larger a parameter is, the more the parameter

contributes to term weighting as either a promoter or a demoter.
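Given the three distributions, the skeleton can be sketched as a multiplicative adjustment of the tf x idf class weight; the default powers below correspond to the best combination reported later (0.5, -0.5, -0.5), and the small epsilon is an added safeguard not present in the original formulation.

    def distribution_weight(tf_ik, idf_i, icsd_i, csd_ik, sd_i,
                            alpha=0.5, beta=-0.5, gamma=-0.5, eps=1e-9):
        """w_ik = tf_ik * idf_i * icsd^alpha * csd^beta * sd^gamma.
        A positive power lets a factor act as a promoter, a negative power as a demoter.
        eps guards against zero-valued distributions when a power is negative."""
        return (tf_ik * idf_i
                * (icsd_i + eps) ** alpha
                * (csd_ik + eps) ** beta
                * (sd_i + eps) ** gamma)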


Data sets and experimental settings

The following shows a set of experiments to investigate the effect of term distribution on

classification accuracy. Four data sets are used in the experiments: (1) Drug Information (DI), (2)

Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set, DI is a set of web pages

collected from www.rxlist.com. It includes 4480 English web pages with seven classes: adverse

drug reaction, clinical pharmacology, description, indications, overdose, patient information, and

warning. Each web page in this data set consists of informative content with a few links. Its

structure is well organized. The second data set, Newsgroups contains 19,997 documents. The

articles are grouped into 20 different UseNet discussion groups. In this data set, some groups are

very similar. The third and fourth data sets are constructed from WebKB containing 8145 web

pages. These web pages were collected from departments of computer science from four

universities with some additional pages from some other universities. The collection can be

arranged to seven classes. In our experiment, we use the four most popular classes: student,

faculty, course and project as our third data set called WebKB1. The total number of web pages is

4199. Alternatively, this reduced collection can be rearranged into five classes by university

(WebKB2): cornell, texas, washington, wisconsin and misc (collected from some other

universities). The pages in WebKB are varied in their styles, ranging from quite informative

pages to link pages. Table 6-1 indicates the major characteristics of the data sets. More detail

about the document distribution of each class in WebKB is shown in Table 6-2.

Table 6-1: Characteristics of the four data sets

  Data sets              DI      News        WebKB1   WebKB2
  1. Type of docs        HTML    Plain Text  HTML     HTML
  2. No. of docs         4480    19,997      4199     4199
  3. No. of classes      7       20          4        5
  4. No. of docs/class   640     1000        Varied   Varied

Table 6-2: The distribution of the documents in WebKB1 and WebKB2

  WebKB1 \ WebKB2   Cornell   Texas   Washington   Wisconsin   Misc.   Subtotal
  Course            44        38      77           85          686     930
  Faculty           34        46      31           42          971     1124
  Project           20        20      21           25          418     504
  Student           128       148     126          156         1083    1641
  Subtotal          226       252     255          308         3158    4199

For the HTML-based data sets (i.e., DI and WebKB), all HTML tags are eliminated from the

documents in order to make the classification process depend not on tag sets but on the content

of web documents. For a similar reason, all headers are omitted from the Newsgroups documents, the e-mail-based data set. For all data sets, a stop word list is applied to take away some common

words, such as a, for, the and so on, from the documents. This means that when a unigram model is employed, a vector is constructed from all features (words) except stop words. In the case of a bigram model, after eliminating stop words, any two contiguous words are combined into a term for the representation basis. Moreover, terms occurring fewer than three times are ignored.
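The preprocessing described above (after tag and header removal) can be sketched as follows: stop word removal, optional bigram construction, and pruning of terms occurring fewer than three times; the stop word list here is a tiny placeholder.

    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "for", "of", "and", "to", "in"}   # placeholder list

    def document_terms(tokens, use_bigrams=False):
        """Remove stop words and optionally form bigram terms from contiguous remaining words."""
        words = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
        if use_bigrams:
            return [" ".join(pair) for pair in zip(words, words[1:])]
        return words

    def prune_rare_terms(docs_terms, min_count=3):
        """Ignore terms occurring fewer than min_count times over the whole collection."""
        counts = Counter(t for terms in docs_terms for t in terms)
        return [[t for t in terms if counts[t] >= min_count] for terms in docs_terms]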

6.1.3. Experimental Settings and Results

This section presents two experimental results investigating the effect of term distributions. In

the first experiment, term distribution factors are combined in different manners, and the

efficiencies of these combinations are evaluated. From now, let us call the classifiers that


incorporate term distribution factors in their weighting, term-distribution-based centroid-based

classifiers (later called TCBs). As the second experiment, top 10 TCBs obtained from the previous

experiment are selected for investigating the effect of term distribution factors with different types of frequency-based query weighting in both unigram and bigram models. Three types of

query weighting are investigated: term frequency, binary and augmented normalized term

frequency. The TCBs will be compared to a number of well-known methods as baselines: a standard centroid-based classifier (for short, SCB), a centroid-based classifier whose term weighting is modified with information gain (for short, SCBIG), k-NN and naïve Bayes (for

short, NB). In both experiments, a data set is split into two parts: 90% for the training set and

10% for the test set. In an additional experiment whose objective is to investigate the effect of training set size, we fix the size of the test set to 10% of the whole data set but vary the size of the training set from 10% to 90%. All experiments perform 10-fold cross validation.

One of the most important factors towards the meaningful evaluation is the way to set

classifier parameters. Parameters that are applied to these classifiers are determined by some

preliminary experiments. For SCB, we apply the standard term weighting $tf \times idf$. For SCBIG, a term goodness criterion called information gain (IG) is applied for adjusting the weight in SCB, resulting in $tf \times idf \times IG$. The k values in k-NN are set to 20 for DI, 30 for Newsgroups and 50 for WebKB1, WebKB2 and WebKB12. Moreover, the term weighting used in k-NN normalizes the term frequency by the maximum term frequency in a document. These k values and this

term weighting performed well in our pretests. For NB, two possible alternative methods to

calculate the posterior probability are binary frequency and occurrence frequency. The

occurrence frequency is selected for comparison since it outperforms the binary frequency. The

query weighting for TCBs is $tf \times idf$ by default. As the performance indicator, classification

accuracy is applied. It is defined as the ratio of the number of documents assigned with their

correct classes to the total number of documents in the test set.

Effect of term distribution factors

This experiment investigates the combination of term distribution factors in improving the

classification accuracy. Although the previous experiment suggests the role of each term

distribution factor, all possible combinations are explored in this experiment. Two following

issues are taken into account: (1) which factors are suitable to work together and (2) what is the

appropriate combination of these factors. To this end, we perform all combinations of icsd, csd

and sd by varying the power of each factor between -1 and 1 with a step of 0.5 and using it to

modify the standard weighting of . At this point, a positive number means the factor

acts as a promoter while a negative one means the factor acts as a demoter. The total number of

combinations is 125 (=5 5 5). These combinations include and six single-factor

term weightings. By the result, we find out that there are only 19 patterns giving better

performance than . The 20 best (top 20) and the 20 worst classifiers, according to

average accuracy on the four data sets, are selected for evaluation. Table 6-3 shows the number

of the best (worst) classifiers for each power of icsd, csd and sd. Moreover, the numbers in

parentheses show the numbers of the top 10 classifiers for each power. For more detail, the

characteristics and performances of the top 20 term weightings are shown in Table 6-4. Both

results are originally provided in (Lertnattee and Theeramunkong, 2004a).
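For reference, the grid of 125 power combinations can be enumerated as in the small sketch below, assuming a weighting function such as the one sketched earlier.

    from itertools import product

    powers = [-1.0, -0.5, 0.0, 0.5, 1.0]
    combos = list(product(powers, repeat=3))   # (alpha, beta, gamma) triples for icsd, csd, sd
    assert len(combos) == 125                  # 5 * 5 * 5, including (0, 0, 0), i.e. plain tf*idf

    # Each triple can then be plugged into a weighting such as distribution_weight()
    # from the earlier sketch and the resulting classifier evaluated on each data set.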


Table 6-3: Descriptive analysis of term distribution factors (TDF) with different powers of each factor. Part A: the best 20 and Part B: the worst 20 (best 10 and worst 10 in parentheses) (source: Lertnattee and Theeramunkong, 2004a)

  TDF     Power of the factor                              Number of methods
          -1       -0.5     0        0.5      1
  Part A
  icsd    0(0)     0(0)     6(2)     9(5)     5(3)         20(10)
  csd     5(4)     7(4)     6(2)     2(0)     0(0)         20(10)
  sd      9(4)     7(4)     4(2)     0(0)     0(0)         20(10)
  Part B
  icsd    6(1)     3(1)     3(2)     3(3)     5(3)         20(10)
  csd     4(0)     0(0)     1(0)     6(2)     9(8)         20(10)
  sd      1(0)     1(0)     1(0)     6(3)     11(7)        20(10)

Table 6-3 (part A) provides the same conclusion as the result obtained from the first

experiment. That is, sd and csd are suitable to be demoters rather than promoters while icsd performs the opposite. There are almost no negative results, except for csd, and this is more obvious in the

case of the top 10. On the other hand, Table 6-3 (part B) shows that the performance is low if sd

and csd are applied as a promoter. However, it is not clear whether using icsd as a demoter

harms the performance.

Table 6-4: Classification accuracy of the 20 best term weightings (source: Lertnattee and Theeramunkong, 2004a)

  Methods   icsd   csd    sd     DI      News    WebKB1   WebKB2   Avg.
  TCB1*     0.5    -0.5   -0.5   96.81   79.52   82.45    92.67    87.86
  TCB2*     0.5    -1     0      95.16   79.73   81.90    93.17    87.49
  TCB3*     0.5    -1     -0.5   92.25   83.17   78.88    93.71    87.00
  TCB4*     1      -0.5   -1     96.65   77.70   82.90    90.21    86.87
  TCB5*     0.5    0      -1     96.14   77.67   81.50    91.24    86.63
  TCB6*     0.5    -0.5   -1     92.57   83.13   78.64    91.62    86.49
  TCB7      1      -1     -1     91.07   82.17   80.09    92.28    86.40
  TCB8*     1      -1     -0.5   94.80   78.79   80.16    91.14    86.22
  TCB9*     0      -0.5   0      93.75   80.70   80.90    89.19    86.13
  TCB10*    0      0      -0.5   92.90   78.97   79.11    92.86    85.96
  TCB11     0.5    -0.5   0      96.45   74.40   80.28    92.02    85.79
  TCB12     0      -0.5   -0.5   90.56   83.08   76.28    92.40    85.58
  TCB13     1      -0.5   -0.5   96.18   71.49   80.81    89.76    84.56
  TCB14     0.5    0      -0.5   95.58   72.69   78.64    90.93    84.46
  TCB15     0      0.5    -1     90.92   78.21   77.02    91.40    84.39
  TCB16     0      0      -1     88.55   82.70   73.45    90.71    83.85
  TCB17     1      0      -1     96.00   69.76   79.95    89.50    83.80
  TCB18     0.5    0.5    -1     93.95   71.73   78.45    90.24    83.59
  TCB19     0.5    -1     -1     90.67   82.09   70.64    90.64    83.51
  SCB       0      0      0      91.67   74.76   77.71    88.76    83.23

  (The icsd, csd and sd columns give the power of each term distribution factor in the term weighting.)

Table 6-4 also emphasizes the classifiers that outperform the standard $tf \times idf$ in all four data sets, marked with *. Here, there are nine such classifiers. This fact shows that

there are some common term distributions that are useful generally in all data sets. Here, the

best term weighting in this experiment is $tf \times idf \times icsd^{0.5} \times csd^{-0.5} \times sd^{-0.5}$ (TCB1). That is, the powers are

0.5 for icsd, and -0.5 for both csd and sd. However, it is observed that the appropriate powers of


term distribution factors depend on some characteristics of data sets. For instance, when the

power of csd changes from -0.5 to -1.0 (TCB1 to TCB3 in Table 6-4), the performances for DI and

WebKB1 decrease but those for Newsgroups and WebKB2 increase. This suggests that csd, a

class dependency factor, is more important in Newsgroups and WebKB2 than DI and WebKB1.

Experiments with different query weightings, and unigram/bigram models

In this experiment, top 10 TCBs obtained from the previous experiment are selected for

exploring the effect of term distribution factors with different types of query weighting in both

unigram and bigram models. In this experiment, the TCBs are compared to SCB, SCBIG, k-NN and

NB. Three types of query weighting are investigated: term frequency (n), binary (b) and

augmented normalized term frequency (a). The simple query weighting (n) sets term frequency

(or occurrence frequency) tf as the weight for a term in a query. The binary query weighting (b)

sets either 0 or 1 for terms in a query. The augmented normalized term frequency (a) defines $0.5 + 0.5 \times (tf / tf_{max})$ as a weight for a term in a query. This query term weighting is applied for

all centroid-based classifiers, i.e., TCBs, SCB and SCBIG. Furthermore, the query term weighting is

modified by multiplying the original weight with inverse document frequency (idf). The results

for unigram and bigram models are shown in Table 6-5 (panels A and B), respectively.

Table 6-5: Accuracy of the top 10 TCBs with different types of query weight compared to SCB,

SCBIG, k-NN and NB for unigram and bigram models (source: Lertnattee and Theeramunkong,

2004a)

  Part A (Unigram)
  Method    DI                      News                    WebKB1                  WebKB2
            n      b      a         n      b      a         n      b      a         n      b      a
  TCB1      96.81  97.86  97.81     79.52  79.66  79.78     82.45  84.66  84.59     92.67  90.83  91.43
  TCB2      95.16  95.96  95.87     79.73  80.93  80.95     81.90  85.33  85.12     93.17  93.05  93.21
  TCB3      92.25  92.90  92.90     83.17  83.44  83.64     78.88  82.62  82.14     93.71  92.47  92.95
  TCB4      96.65  97.46  97.39     77.70  77.72  77.88     82.90  85.12  85.02     90.21  88.02  88.83
  TCB5      96.14  97.25  97.21     77.67  78.16  78.14     81.50  83.54  83.26     91.24  89.16  89.76
  TCB6      92.57  93.06  93.01     83.13  83.30  83.46     78.64  81.07  80.61     91.62  89.19  89.76
  TCB7      91.07  92.10  92.08     82.17  82.95  82.95     80.09  83.83  83.54     92.28  90.52  90.93
  TCB8      94.80  96.36  96.32     78.79  79.53  79.62     80.16  84.12  84.02     91.14  89.69  90.12
  TCB9      93.75  94.62  94.55     80.70  80.83  80.91     80.90  83.19  82.95     89.19  91.21  91.07
  TCB10     92.90  94.51  94.38     78.97  79.38  79.57     79.11  81.47  81.21     92.86  91.71  92.19
  SCB       91.67  92.99  93.01     74.76  75.29  75.37     77.71  78.66  78.73     88.76  91.12  91.07
  SCBIG     96.19  97.43  97.39     60.83  59.31  60.40     75.02  78.78  78.26     90.26  89.59  89.95
  k-NN      94.60                   82.69                   68.33                   89.16
  NB        95.00                   80.82                   81.40                   87.45

  Part B (Bigram)
  Method    DI                      News                    WebKB1                  WebKB2
            n      b      a         n      b      a         n      b      a         n      b      a
  TCB1      98.73  99.35  99.35     81.83  82.37  82.36     84.19  86.71  86.35     93.88  93.36  93.43
  TCB2      90.33  94.75  94.53     82.27  83.00  83.04     83.66  87.88  87.47     94.67  94.71  94.81
  TCB3      97.90  99.24  99.22     85.15  85.20  85.24     82.47  85.69  85.26     95.52  94.98  95.19
  TCB4      98.64  99.38  99.33     80.32  80.94  80.88     84.57  87.43  87.09     92.02  91.19  91.45
  TCB5      98.04  98.68  98.68     81.22  81.94  81.91     83.40  85.76  85.43     92.74  92.17  92.36
  TCB6      98.95  99.31  99.31     85.58  85.66  85.71     82.78  85.45  85.12     94.14  93.36  93.52
  TCB7      85.25  90.13  89.71     84.80  84.98  84.92     82.88  86.78  86.31     94.05  93.47  93.57
  TCB8      80.36  85.98  85.60     81.43  82.31  82.25     81.88  86.88  86.19     92.76  92.33  92.45
  TCB9      98.42  98.93  98.91     82.77  83.05  83.01     82.47  85.16  84.76     93.81  94.88  94.88
  TCB10     97.54  98.17  98.15     82.37  82.71  82.75     81.09  84.14  83.71     94.43  94.17  94.26
  SCB       96.07  97.41  97.37     77.40  78.44  78.37     79.14  81.71  81.31     92.31  93.62  93.74
  SCBIG     97.83  99.00  98.88     62.50  61.82  62.50     76.07  80.50  80.23     92.24  92.97  93.02
  k-NN      97.48                   82.75                   70.16                   91.62
  NB        96.76                   82.83                   82.21                   94.02

  (n = term frequency, b = binary, a = augmented normalized term frequency; k-NN and NB are reported with a single accuracy per data set.)


According to the results, we found out that the TCBs outperformed SCB, SCBIG, k-NN and NB

in almost all cases, in both unigram and bigram models, independently of query weighting. Normally the bigram model gains better performance than the unigram model. In the bigram model, the

term distributions are still useful to improve classification accuracy. However, it is hard to

determine which query weighting performs better than the others but term distributions are

helpful for all types of query weighting. For SCBIG, the accuracy on DI improves significantly; however, its performance is a little lower than SCB on average. TCB1, TCB2 and TCB3 seem to achieve higher accuracy than the others, even though TCB4 and TCB6 perform better in the bigram model for DI and News, respectively.

Related works on centroid-based classification

Term weighting plays an important role to achieve high performance in text classification. In the

past, most approaches (Salton and Buckley, 1988; Skalak, 1994; Ittner et al., 1995; Chuang et al.,

2000; Singhal et al., 1996; Sebastiani, 2002) were proposed using frequency-based factors, such

as term frequency and inverse document frequency, for setting weights for terms. In these

approaches, the way to solve the problem that a long document may suppress a short document is to perform normalization on document vectors or class prototype vectors. That is, a vector representing any document or any class is transformed into a unit vector whose length equals 1. In spite of this, it is doubtful whether such frequency-based term weighting is enough to reflect the importance of terms in the representation of a document or a class. There were some works on adjusting weights using the relevance

feedback approach. Among them, two popular schemes are the vector space model and the probabilistic networks. For the vector space model, the Rocchio feedback model (Rocchio, 1971; Salton, 1989; Joachims, 1997) is the most commonly used method. The method attempts to use

both positive and negative instances in term weighting. One can expect more effective profile

representation generated from relevance feedback. For probabilistic networks approach, a query

can be modified by the addition of the first m terms taken from a list where all terms present in

documents deemed relevant are ranked (Robertson and Sparck-Jones, 1976).

The probabilistic indexing technique was suggested by Fuhr (1989), and Joachims (1997) analysed a probabilistic interpretation of this technique with respect to the Rocchio classifier with $tf \times idf$ term weighting. Deng et al. (2002) introduced an approach to use statistics in a class, called the ‘‘category relevance factor’’, to improve classification accuracy. Recently, Debole and Sebastiani

(2003) have evaluated some feature selection methods such as chi-square, information gain and

gain ratio. These feature selection methods were applied into term weighting for substituting idf

on three classifiers: k-NN, NB and Rocchio. From the result, these methods might be useful for k-

NN and support vector machines but seem useless for Rocchio. Recently, centroid-based classifiers with the consideration of term distributions have been explored by Han and Karypis (2000),

Lertnattee and Theeramunkong, (2004a; 2004b; 2005; 2006b; 2007a; 2007b; 2009) and

Theeramunkong and Lertnattee (2007). As a kind of term distribution, normalization is also an

important factor towards better accuracies as investigated by Singhal et al. (1995; 1996) and

Lertnattee and Theeramunkong (2003;2006a). A survey on statistical approaches for text

categorization was done by Yang (1999) and Yang and Liu (1999). Text classification with semi-

supervised learning can be found in (Nigam et al., 2000).

Conclusions

Section 6.1 shows that term distributions are useful for improving accuracy in centroid-based

classification. Three types of term distributions: interclass standard deviation (icsd), class


standard deviation (csd) and standard deviation (sd), were introduced to exploit information

outside/inside a class and that of the collection. The distributions were used to represent

discriminating power of each term and then to weight that term. To investigate the pattern of

how these term distributions contribute to weighting each term in documents, we varied term

distributions in their contribution to term weighting and then constructed a number of centroid-

based classifiers with different term weightings. The effectiveness of term distributions was

explored using various data sets. As baselines, a standard centroid-based classifier with $tf \times idf$, a centroid-based classifier with $tf \times idf \times IG$, and two well-known methods, k-NN and naïve Bayes, were employed. Furthermore, both unigram and bigram models were investigated.

The experimental results showed the benefits of term distributions in classification. It was shown

that there was a certain pattern in how term distributions contribute to the term weighting. It can be claimed that terms with a low sd and a low csd should be emphasized while terms with a high icsd should get more importance. For more detail, the reader is referred to (Lertnattee and Theeramunkong, 2004a).

6.2. Document Relation Extraction

Nowadays, it has become difficult for researchers to follow the state of the art in their area of

interest since the number of research publications has increased continuously and quickly. Such

a large volume of information brings about a serious hindrance for researchers to position their

own works against existing works, or to find useful relations between them. Some research

works including (Kessler, 1963; Small, 1973; Ganiz, 2006), have been done towards the solution.

Although the publication of each work may include a list of related articles (documents) as its

reference, it is still impossible to include all related works due to either intentional reasons (e.g.,

limitation of paper length) or unintentional reasons (e.g., simply being unknown to the authors). Enormous numbers of meaningful connections that permeate the literature may remain hidden.

Growing from different fields and known as literature-based discovery, led by Swanson (1986;

1990), the approach of discovering hidden and significant relations within a bibliographic

database has become popular in medical-related fields. As a content-based approach with

manual and/or semi-automatic processes, a set of topical words or terms are extracted as

concepts and then utilized to find connections between two bodies of literature. Due to the simplicity and

practicality of this approach, it was used in several areas by its succeeding works (Gordon and

Dumais, 1998; Lindsay and Gordon, 1999; Pratt et al., 1999). Some works proposed citation

analysis based on so-called bibliographic coupling (Kessler, 1963) and co-citation (Small, 1973).

While they were successfully applied in several works (Nanba et al., 2000; White and McCain,

1989; Rousseau and Zuccala, 2004) to obtain topically related documents, they are not fully automated and involve a lot of labor-intensive tasks. Based on association rule mining, an

automated approach to discover relations among documents in a research publication database

was introduced in Sriphaew and Theeramunkong (2005; 2007a; 2007b). Mapping a term (a word

or a pair of words) to a transaction in a transactional database, the topic-based relations among

scientific publications are revealed under various document representations. Although the work

expressed the first attempt to find document relations automatically by exploiting terms in

documents, it utilized only simple evaluation without elaborate consideration.

There has been little exploration of how to evaluate document relations discovered from text

collections. Most works in text mining utilized a dataset, which includes both queries and their

corresponding correct answers, as a test collection. They usually defined certain measures and

used them for performance assessment on the test collection. For instance, classification

accuracy is applied for assessing the class to which a document is assigned in text categorization


(TC) (Rosch, 1978) while recall and precision are used to evaluate retrieved documents with

regard to given query keywords in information retrieval (IR) (Salton and McGill, 1983). As a

more naive evaluation method, human judgment has been used in more recent works on mining web documents, such as HITS (Kleinberg, 1999) and PageRank (Page et al., 1998), where there is no standard dataset. However, this manual evaluation is a labor-intensive task and quite

subjective. Moreover, there is a lack of standard criteria for evaluating document relations. So far,

while there have been several benchmark datasets, e.g., UCI Repository

(www.ics.uci.edu/~mlearn/MLRepository.html), WebKB (www.webkb.org), TREC data

(trec.nist.gov/data.thml), for TC and IR tasks, there is no standard dataset that is used for the

task of document relation discovery.

Toward resolving these issues, this section gives a brief introduction to a research work

that uses citation information in research publications as a source for evaluating the discovered

document relations. The full description of this work can be found in (Sriphaew and

Theeramunkong, 2007a).

Conceptually, the relations among documents can be formulated as a subgraph where each

node represents a document and each arc represents a relation between two documents. Based

on this formulation, a number of scoring methods are introduced for evaluating the discovered

document relations in order to reflect their quality. Moreover, this work also introduces a generative probability, derived from probability theory, and uses it to compute an expected score that objectively captures how good the evaluation results are.

6.2.1. Document Relation Discovery using Frequent Itemset Mining

A formulation of the ARM task on document relation discovery can be summarized as follows. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents (items) and $T = \{t_1, t_2, \ldots, t_m\}$ be a set of terms (transactions). Also let $e(t_j, d_i) \in \{0, 1\}$ represent the existence of a term $t_j$ in a document $d_i$. A subset of $D$ is called a docset whereas a subset of $T$ is called a termset. Furthermore, a docset $X$ with $k$ documents is called a $k$-docset (or a docset with the length of $k$). The support of $X$ is defined as follows.

$$supp(X) = \frac{|\{t_j \in T : \forall d_i \in X,\; e(t_j, d_i) = 1\}|}{|T|}$$

Here, a docset with a support greater than a predefined minimum support is called a frequent $k$-docset. We will use the term ``docset'' in the meaning of ``frequent docset'' and ``document relation'' interchangeably. Some kind of evaluation is then needed to assess which document relations are better, as shown below.
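To make the formulation concrete, the following minimal Python sketch (illustrative only; the function names, the toy document-term matrix and the minimum-support value are our own assumptions, and a real miner such as the FP-tree-based one used later in this section would be far more efficient) enumerates frequent docsets directly from the definition of support.

    from itertools import combinations

    def docset_support(docset, term_docs, n_terms):
        # Support of a docset X: fraction of terms (transactions) whose
        # document sets contain every document in X.
        count = sum(1 for docs in term_docs.values() if docset <= docs)
        return count / n_terms

    def frequent_docsets(term_docs, min_sup, max_k=3):
        # Naive level-wise enumeration of frequent k-docsets (k >= 2),
        # purely to illustrate the definitions above.
        n_terms = len(term_docs)
        all_docs = sorted(set().union(*term_docs.values()))
        frequent = {}
        for k in range(2, max_k + 1):
            for cand in combinations(all_docs, k):
                sup = docset_support(set(cand), term_docs, n_terms)
                if sup >= min_sup:
                    frequent[cand] = sup
        return frequent

    # Toy example: four terms as transactions, three documents as items.
    term_docs = {"t1": {"d1", "d2"}, "t2": {"d1", "d2", "d3"},
                 "t3": {"d2", "d3"}, "t4": {"d1", "d2"}}
    print(frequent_docsets(term_docs, min_sup=0.5))
    # {('d1', 'd2'): 0.75, ('d2', 'd3'): 0.5}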

6.2.2. Empirical Evaluation using Citation Information

This subsection presents a method to use citations (references) among technical documents in a

scientific publication collection to evaluate the quality of the discovered document relations.

Intuitively, two documents are expected to be related under one of the three basic situations: (1)

one document cites the other (direct citation), (2) both documents cite the same document (bibliographic coupling) (Kessler, 1963) and (3) both documents are cited by the same document (co-citation) (Small, 1973). Citation analysis has been applied in several interesting

applications (Nanba et al., 2000; White and McCain, 1989; Rousseau and Zuccala, 2004).

Besides these basic situations, two documents may be related to each other via a more

complicated concept called transitivity. For example, if a document A cites a document B, and the document B in turn cites a document C, then one could assume a relation between A


and C. In this work, based on the transitivity property, the concept of the order citation is proposed to express an indirect connection between two documents. With the assumption that a direct or indirect connection between two documents implies a topical relation between them, such a connection can be used for evaluating the results of document relation discovery.

In the rest of this section, introductions of the u-th order citation and v-th order accumulative

citation matrix are given. Then, the so-called validity is proposed as a measure for evaluating

discovered docsets using information in the citation matrix. Finally, the expected validity is

mathematically defined by exploiting the concept of generative probability and estimation.

The Citation Graph and Its Matrix Representation

Conceptually citations among documents in a scientific publication collection form a citation

graph, where a node corresponds to a document and an arc corresponds to a direct citation of a

document to another document. Based on this citation graph, an indirect citation can be defined

using the concept of transitivity. The formulation of direct and indirect citations can be given in

the terms of the u-th order citation and the v-th order accumulative citation matrix as follows.

Definition 1 (the u-th order citation): Let $D$ be a set of documents (items) in the database. For $x, y \in D$, $y$ is the u-th order citation of $x$ iff the number of arcs in the shortest path between $x$ and $y$ in the citation graph is $u$ ($u \geq 1$). Conversely, $x$ is also called the u-th order citation of $y$.

Figure 6-1: An Example of a citation graph. (source: Sriphaew and Theeramunkong, 2007a)

For example, given a set of six documents d1, d2, ..., d6 and a set of six citations, d1 to d2, d2 to d3 and d5, d3 to d4 and d5, and d4 to d6, the citation graph can be depicted in Figure 6-1. In the figure, d1, d3 and d5 are the first, d4 is the second, and d6 is the third order citation of the document d2. Note that although there is a direction for each citation, it is not taken into

account since the task is to detect a document relation where the citation direction is not

concerned. Moreover, using only textual information without explicit citation or temporal

information, it is difficult to find the direction of the citation between any two documents.
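As an illustration of this notion, the short Python sketch below (our own, not part of the original work) computes the order citation of every document with respect to d2 by breadth-first search over the undirected citation graph; the edge list reproduces the six citations of Figure 6-1.

    from collections import deque

    def order_citations(edges, source):
        # The citation order of y with respect to source is the length of the
        # shortest path between them; the citation direction is ignored.
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
        order = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in graph.get(node, ()):
                if nb not in order:
                    order[nb] = order[node] + 1
                    queue.append(nb)
        return order

    edges = [("d1", "d2"), ("d2", "d3"), ("d2", "d5"),
             ("d3", "d4"), ("d3", "d5"), ("d4", "d6")]
    print(order_citations(edges, "d2"))
    # d1, d3 and d5 -> 1 (first order), d4 -> 2 (second), d6 -> 3 (third)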

Based on the concept of the u-th order citation, the v-th order accumulative citation matrix is

introduced to express a set of citation relations stating whether any two documents can be

transitively reached by the shortest path shorter than v+1.

Definition 2 (the v-th order accumulative citation matrix): Given a set of n distinct documents, the v-th order accumulative citation matrix (for short, v-OACM) is an $n \times n$ matrix, each element $s^{v}_{xy}$ of which represents the citation relation between two documents x, y, where $s^{v}_{xy} = 1$ when x is the u-th order citation of y and $u \leq v$, and $s^{v}_{xy} = 0$ otherwise. Note that $s^{v}_{xx} = 1$ and $s^{v}_{xy} = s^{v}_{yx}$.

For the previous example, the 1-, 2- and 3-


OACMs can be created as shown in Table 6-6. Here, the 1-, 2- and 3-OACMs are represented by a set of values [$s^{1}_{xy}$, $s^{2}_{xy}$, $s^{3}_{xy}$].

Table 6-6: the 1-, 2- and 3-OACMs represented by a set of values [$s^{1}_{xy}$, $s^{2}_{xy}$, $s^{3}_{xy}$].

Document   d1         d2         d3         d4         d5         d6
d1         [1,1,1]    [1,1,1]    [0,1,1]    [0,0,1]    [0,1,1]    [0,0,0]
d2         [1,1,1]    [1,1,1]    [1,1,1]    [0,1,1]    [1,1,1]    [0,0,1]
d3         [0,1,1]    [1,1,1]    [1,1,1]    [1,1,1]    [1,1,1]    [0,1,1]
d4         [0,0,1]    [0,1,1]    [1,1,1]    [1,1,1]    [0,1,1]    [1,1,1]
d5         [0,1,1]    [1,1,1]    [1,1,1]    [0,1,1]    [1,1,1]    [0,0,1]
d6         [0,0,0]    [0,0,1]    [0,1,1]    [1,1,1]    [0,0,1]    [1,1,1]

The 1-OACM can be straightforwardly constructed from the set of the first-order citation (direct

citation). The (v+1)-OACM (mathematically denoted by a matrix $S^{v+1}$) can be recursively created from the operation between the v-OACM ($S^{v}$) and the 1-OACM ($S^{1}$) according to the following formula.

$$s^{v+1}_{ij} = \bigvee_{k=1}^{n} \left( s^{v}_{ik} \wedge s^{1}_{kj} \right)$$

where $\vee$ is an OR operator, $\wedge$ is an AND operator, $s^{v}_{ik}$ is the element at the i-th row and the k-th column of the matrix $S^{v}$, and $s^{1}_{kj}$ is the element at the k-th row and the j-th column of the matrix $S^{1}$. Note that any v-OACM is a symmetric matrix.
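The recursion above can be realized directly with boolean matrix operations. The following Python sketch (illustrative; the nested-list representation and the function name are our own) builds the 2- and 3-OACMs of the running example from the 1-OACM of Table 6-6.

    def next_oacm(s_v, s_1):
        # s[v+1][i][j] = OR over k of (s[v][i][k] AND s[1][k][j])
        n = len(s_v)
        return [[int(any(s_v[i][k] and s_1[k][j] for k in range(n)))
                 for j in range(n)] for i in range(n)]

    # 1-OACM of the six-document example, rows/columns ordered d1..d6.
    s1 = [[1, 1, 0, 0, 0, 0],
          [1, 1, 1, 0, 1, 0],
          [0, 1, 1, 1, 1, 0],
          [0, 0, 1, 1, 0, 1],
          [0, 1, 1, 0, 1, 0],
          [0, 0, 0, 1, 0, 1]]
    s2 = next_oacm(s1, s1)   # 2-OACM: pairs reachable within 2 arcs
    s3 = next_oacm(s2, s1)   # 3-OACM: pairs reachable within 3 arcs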

Validity: Quality of Document Relation

This section defines the validity which is used as a measure for evaluating the quality of the

discovered docsets. The concept of validity calculation is to investigate how documents in a

discovered docset are related to each other according to the citation graph. Based on this concept,

the most preferable situation is that all documents in a docset directly cite and/or are cited by at least one document in that docset, so that they form one connected group. Since in practice only a few references are given in a document, it is quite rare and unrealistic that all related documents cite each other. As a generalization, we can assume that all documents in a docset should cite and/or be cited by one another within a specific range in the citation graph.

Here, the shorter the specific range is, the more restrictive the evaluation is. With the concept of

v-OACM stated in the previous section, we can realize this generalized evaluation by a so-called v-

th order validity (for short, v-validity), where v corresponds to the range mentioned above.

Regarding the criteria of evaluation, two alternative scoring methods can be employed for

defining the validity of a docset. As the first method, a score is computed as the ratio of the number of citation relations that the most popular document in a docset has, to its maximum possible number. The most popular document is a document that has the most relations with the other documents in the docset. Note that it is possible to have more than one popular document in a

docset. The score calculated by this method is called soft validity.

In the second method, a stricter criterion for scoring is applied. The score is set to 1 only

when the most popular document connects to all documents in the docset. Otherwise, the score is

set to 0. This score is called hard validity. The soft v-validity and hard v-validity of a docset X ($|X| \geq 2$), denoted by $V^{v}_{s}(X)$ and $V^{v}_{h}(X)$ respectively, are defined as follows.

$$V^{v}_{s}(X) = \frac{\max_{x \in X} \sum_{y \in X \setminus \{x\}} s^{v}_{xy}}{|X| - 1}$$

For simplicity, we denote the numerator in the above equation by $\delta^{v}(X)$. Then,

$$V^{v}_{h}(X) = \begin{cases} 1 & \text{if } \delta^{v}(X) = |X| - 1 \\ 0 & \text{otherwise} \end{cases}$$


Here, $s^{v}_{xy}$ is the citation relation defined by Definition 2. It can be observed that the soft v-validity of a docset ranges from 0 to 1, i.e., $0 \leq V^{v}_{s}(X) \leq 1$, while the hard v-validity is a binary value of 0 or 1, i.e., $V^{v}_{h}(X) \in \{0, 1\}$. In both cases, the v-validity achieves the minimum (i.e., 0) when there is no citation relation among any documents in the docset. On the other hand, it achieves the maximum (i.e., 1) when there is at least one document that has a citation relation with all other documents in the docset. Intuitively, the validity of a bigger docset tends to be lower than that of a smaller docset since the probability that one document will cite and/or be cited by all other documents in the same docset becomes lower.

In practice, instead of an individual docset, the whole set of discovered docsets needs to be

evaluated. The easiest method is to exploit an arithmetic mean. However, it is not fair to directly

use the arithmetic mean since a bigger docset tends to have lower validity than a smaller one. We

need an aggregation method that reflects docset size in the summation of validities. One reasonable method is to use the concept of a weighted mean, where each weight reflects the docset size. Therefore, the soft v-validity and hard v-validity for a set of discovered docsets F, denoted by $V^{v}_{s}(F)$ and $V^{v}_{h}(F)$ respectively, can be defined as follows.

$$V^{v}_{s}(F) = \frac{\sum_{X \in F} w(X)\, V^{v}_{s}(X)}{\sum_{X \in F} w(X)}, \qquad V^{v}_{h}(F) = \frac{\sum_{X \in F} w(X)\, V^{v}_{h}(X)}{\sum_{X \in F} w(X)}$$

where $w(X)$ is the weight of a docset X. In this work, $w(X)$ is set to $|X| - 1$, the maximum value that the numerator $\delta^{v}(X)$ of a docset X can gain. As an example, given the 1-OACM in Table 6-6 and a set F of discovered docsets, the set soft 1-validity $V^{1}_{s}(F)$ and the set hard 1-validity $V^{1}_{h}(F)$ can be computed accordingly.
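A small Python sketch of these definitions follows (illustrative only; the set F of docsets below is invented for the example, while the 1-OACM is the one of Table 6-6).

    def soft_validity(docset, oacm, idx):
        # Relations of the most popular document, divided by |X| - 1.
        best = max(sum(oacm[idx[x]][idx[y]] for y in docset if y != x)
                   for x in docset)
        return best / (len(docset) - 1)

    def hard_validity(docset, oacm, idx):
        # 1 only if some document is related to all other documents.
        return 1.0 if soft_validity(docset, oacm, idx) == 1.0 else 0.0

    def set_validity(docsets, oacm, idx, validity):
        # Weighted mean over docsets, each weighted by |X| - 1.
        num = sum((len(x) - 1) * validity(x, oacm, idx) for x in docsets)
        return num / sum(len(x) - 1 for x in docsets)

    oacm1 = [[1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0], [0, 1, 1, 1, 1, 0],
             [0, 0, 1, 1, 0, 1], [0, 1, 1, 0, 1, 0], [0, 0, 0, 1, 0, 1]]
    idx = {"d%d" % (i + 1): i for i in range(6)}
    F = [{"d1", "d2"}, {"d2", "d3", "d5"}, {"d1", "d4", "d6"}]   # invented example
    print(set_validity(F, oacm1, idx, soft_validity))   # 0.8
    print(set_validity(F, oacm1, idx, hard_validity))   # 0.6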

The Expected Validity

The evaluation of discovered docsets will depend on the citation relation $s^{v}_{xy}$, which is

represented by v-OACMs. As stated in the previous section, the lower v is, the more restrictive the

evaluation becomes. Therefore to compare the evaluation based on different v-OACMs, we need

to declare a value, regardless of the restriction of evaluation, to represent the expected validity of

a given set of docsets under each individual v-OACM. This section describes the method to

estimate the theoretical validity of the set of docsets based on probability theory. Towards this

estimation, the probability that two documents are related to each other under a v-OACM (later called the base probability) needs to be calculated. This probability is derived as the ratio of the number of existing citation relations to the number of all possible citation relations (i.e., $n(n-1)$), as shown in the following equation.

$$p_v = \frac{\sum_{x \neq y} s^{v}_{xy}}{n(n-1)}$$

For example, using the citation relation in Table 6-6, the base probabilities for 1-, 2-, and 3-

OACMs are 0.40 (12/30), 0.73 (22/30) and 0.93 (28/30), respectively. Note that the base

probability of a higher-OACM is always higher than or equal to that of a lower-OACM. Using the

concept of expectation, the expected v-validity ($E^{v}(X)$) can be formulated as follows.

$$E^{v}(X) = \sum_{g \in G_X} IV(g)\, p^{v}(g)$$


where $E^{v}(X)$ is the expected v-validity of a docset $X$, $G_X$ is the set of all possible citation patterns for $X$, $IV(g)$ is the invariant validity of a pattern $g$, and $p^{v}(g)$ is the generative probability of the pattern $g$ estimated from the base probability under the v-OACM ($p_v$). Theoretically, finding the possible patterns of a docset can be transformed into a set enumeration problem. Given a docset with the length of k (a k-docset), there are $2^{k(k-1)/2}$ possible citation patterns.

With different scoring methods, an invariant validity is individually defined for each criterion regardless of the v-OACM. To simplify this, the notation $IV(g)$ is replaced by $IV_s(g)$ and $IV_h(g)$ for the invariant validity calculated from soft validity and hard validity, respectively. Similar to $V^{v}_{s}(X)$, the invariant validity of a pattern $g$ for soft validity is defined as follows:

$$IV_s(g) = \frac{\max_{x \in X} \sum_{y \in X \setminus \{x\}} r^{g}_{xy}}{|X| - 1}$$

For simplicity, we denote the numerator in the above equation by $\delta(g)$. The invariant validity of $g$ based on hard validity is given by:

$$IV_h(g) = \begin{cases} 1 & \text{if } \delta(g) = |X| - 1 \\ 0 & \text{otherwise} \end{cases}$$

In the above equations, $r^{g}_{xy}$ is the citation relation between two documents x, y in the citation pattern $g$, where $r^{g}_{xy} = 1$ when the citation relation exists and $r^{g}_{xy} = 0$ otherwise. Note that all $g$'s have the same docset but represent different citation patterns. The following shows two

examples of how to calculate the expected v-validity for 2-docsets and 3-docsets. For simplicity,

the expected v-validity based on soft validity is firstly described, and the one based on hard

validity is discussed later.

In the simplest case, there are only two possible citation patterns for a 2-docset. Therefore, the expected v-validity based on soft validity of any 2-docset X can be calculated as follows.

$$E^{v}_{s}(X) = 1 \cdot p_v + 0 \cdot (1 - p_v) = p_v$$

Figure 6-2: All possible citation patterns for a 3-docset. (source: Sriphaew and Theeramunkong, 2007a)


In the case of a 3-docset, there are eight possible patterns as shown in Figure 6-2. Here, we can

calculate the invariant validity based on soft validity ($IV_s$) of each pattern as follows. The first to fourth patterns have an invariant validity of 1 (i.e., 2/2), the fifth to seventh patterns have an invariant validity of 0.5 (i.e., 1/2), while the last pattern has an invariant validity of 0 (i.e., 0/2). The generative probability of the first pattern is $p_v^3$ since there are three citation relations, and that of the second to the fourth patterns equals $p_v^2 (1 - p_v)$ since there are two citation relations and one missing citation relation. Regarding the citation pattern, the generative probabilities of the other patterns can be calculated in the same manner. From the generative probabilities shown in Figure 6-2, the expected v-validity based on soft validity can be calculated as follows.

$$E^{v}_{s}(X) = 1 \cdot p_v^3 + 1 \cdot 3 p_v^2 (1 - p_v) + 0.5 \cdot 3 p_v (1 - p_v)^2 + 0 \cdot (1 - p_v)^3$$

Here, the first term comes from the first pattern, the second term is derived from the second to

the fourth patterns, the third term is obtained by the fifth to the seventh patterns and the last

term is for the eighth pattern.

With another criterion of hard validity, the expected v-validity for a 2-docset is still the same

but a difference occurs for a 3-docset. The invariant validity based on hard validity ($IV_h$) equals 1 for the first to fourth patterns and becomes 0 for the other patterns. The expected v-validity for a 3-docset based on hard validity is then reduced to

$$E^{v}_{h}(X) = p_v^3 + 3 p_v^2 (1 - p_v)$$

All above examples illustrate the calculation of the expected validity of only one docset. To

calculate the expected v-validity of several docsets in a given set, the weighted mean of their

validities can be derived. The outcome will be used as the expected value for evaluating the

results obtained from our method for discovering document relations.
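The pattern enumeration described above can be coded compactly. The sketch below (our own illustration, not the original implementation) computes the expected v-validity of a k-docset by enumerating all 2^(k(k-1)/2) citation patterns; with the base probability p = 0.40 of the 1-OACM example it reproduces the closed forms given above for 2-docsets and 3-docsets.

    from itertools import combinations, product

    def expected_validity(k, p, hard=False):
        pairs = list(combinations(range(k), 2))
        expected = 0.0
        for pattern in product([0, 1], repeat=len(pairs)):
            # generative probability of this citation pattern
            prob = 1.0
            for bit in pattern:
                prob *= p if bit else (1.0 - p)
            # invariant validity: relations of the most connected document
            degree = [0] * k
            for (i, j), bit in zip(pairs, pattern):
                if bit:
                    degree[i] += 1
                    degree[j] += 1
            iv = max(degree) / (k - 1)
            if hard:
                iv = 1.0 if iv == 1.0 else 0.0
            expected += iv * prob
        return expected

    p = 0.40                                   # base probability of the 1-OACM
    print(expected_validity(2, p))             # = p
    print(expected_validity(3, p))             # = p^3 + 3p^2(1-p) + 1.5p(1-p)^2
    print(expected_validity(3, p, hard=True))  # = p^3 + 3p^2(1-p)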

6.2.3. Experimental Settings and Results

This subsection presents three experimental results when the quality of discovered docsets is

investigated under several empirical evaluation criteria. The three experiments are (1) to investigate the characteristics of the evaluation by soft validity and hard validity on docsets discovered from different document representations, including their minimum support thresholds and mining time, (2) to study the quality of discovered relations when using either direct citation or indirect citation as the evaluation criterion, and (3) to compare the quality of the discovered relations with the expected validity. More complete results can be found

in (Sriphaew and Theeramunkong, 2007a).

Towards the first objective, several term definitions are explored in the process of encoding

the documents. To define terms in a document, techniques of n-gram, stemming and stopword

removal can be applied. The discovered docsets are ranked by their supports, and then the top-N

ranked relations are evaluated using both soft validity and hard validity. Here, the value of N can

be varied to observe the characteristic of the discovered docsets. For the second objective, the

evaluation is performed based on various v-OACMs, where the 1-OACM considers only direct

citation while a higher-OACM also includes indirect citation. Intuitively, the evaluation becomes

less restrictive when a higher-OACM is applied as the criterion. To fulfill the third objective, the

expected set validity for each set of discovered relations is calculated. Compared to this expected

validity, the significance of discovered docsets is investigated.

To implement a mining engine for document relation discovery, the FP-tree algorithm,

originally introduced by Han et al. (2000), is modified to mine docsets in a document-term


database. In this work, instead of association rules, frequent itemsets are considered. Since a 1-

docset contains no relation, it is negligible and then omitted from our evaluation. That is, only the

discovered docsets with at least two documents are considered. The experiments were

performed on a Pentium IV 2.4GHz Hyper-Threading with 1GB physical memory and 2GB virtual

memory running Linux TLE 5.0 as an operating system. The preprocessing steps, i.e., n-gram

construction, stemming and stopword removal, consume trivial computational time.

Evaluation Material

There is no gold standard dataset that can be used for evaluating the results of document relation

discovery. To solve this problem, an evaluation material is constructed from the scientific

research publications in the ACM Digital Library (www.portal.acm.org). As a seed of constructing

the citation graph, 200 publications are retrieved from each of the three computer-related

classes, coded by B (Hardware), E (Data) and J (Computer). In the PDF format, each publication is accompanied by an information page in which citation (i.e., reference) information is

provided. The reference publications appearing in these 600 publications are further collected

and added into the evaluation dataset. In the same way, the publications referred to by these

newly collected publications are also gathered and appended into the dataset. Finally, in total

there are 10,817 research publications collected as the evaluation material. After converting

these collected publications to ASCII text format, the reference (normally found at the end of each

publication text) is removed by a semi-automatic process, e.g., using clue words such as ``References'' and ``Bibliography''. With the use of the information page attached to each publication, the 1-OACM can be constructed and used for evaluating the discovered docsets. The v-OACM can be

constructed from (v-1)-OACM and 1-OACM. In our dataset, the average number of citation

relations per document is 8 for 1-OACM, 148 for 2-OACM, and 1008 for 3-OACM. It takes 1.14

seconds for generating 2-OACM from 1-OACM while it takes 15.83 seconds to generate 3-OACM

from 2-OACM. For text preprocessing, the BOW library by McCallum (1996) is used as

a tool for constructing a document-term database. Using a list of 524 stopwords provided by

Salton and McGill (1986), common words, such as ‘a,’ ‘an,’ ‘is,’ and ‘for’, are discarded. Besides

these stopwords, terms with very low frequency are also omitted. These terms are numerous and

usually negligible.

Experimental Results

As stated at the beginning of this section, several term definitions can be used as factors to obtain

various patterns of document representation. In our experiment, eight distinct patterns are

explored. Each pattern is denoted by a 3-letter code. The first letter represents the usage of n-grams, where `U' stands for unigram and `B' means bigram. The second letter has a value of either `O' or `X', expressing whether the stemming scheme is applied or not. Also the last letter is either `O' or `X', telling us whether the stopword removal scheme is applied or not. For example, `UXO' means

document representation generated by unigram, non-stemming and stopword removal.

Table 6-7 and Table 6-8 express the set 1-validity (soft validity/hard validity) of the

discovered docsets when various document representations are applied, for bigram and unigram, respectively. The minimum support and the execution time of mining for each document

representation to discover a specified number of top-N ranked docsets are also given in the table.


Table 6-7: Set 1-validity for various top-N rankings of discovered docsets, their supports and mining time: soft validity/hard validity for the case of bigram. Here, minsup: minimum support; time: mining time (seconds) (source: Sriphaew and Theeramunkong, 2007a)

N         BXO                        BOO                        BXX                        BOX
1000      45.47/43.95                46.14/44.33                6.29/6.29                  7.09/7.09
          minsup=0.53, time=174.49   minsup=0.67, time=155.92   minsup=3.94, time=442.95   minsup=4.76, time=402.14
5000      29.31/23.88                29.13/27.24                3.83/3.33                  3.88/3.59
          minsup=0.35, time=188.88   minsup=0.47, time=166.96   minsup=3.15, time=612.82   minsup=3.79, time=570.65
10000     24.49/19.33                24.40/20.50                3.13/2.33                  3.20/2.63
          minsup=0.32, time=189.52   minsup=0.39, time=170.17   minsup=2.84, time=681.40   minsup=3.42, time=627.61
50000     19.29/6.36                 18.88/8.62                 2.46/0.98                  2.36/1.19
          minsup=0.25, time=195.39   minsup=0.29, time=176.48   minsup=2.31, time=816.43   minsup=2.71, time=767.25
100000    19.51/3.67                 18.40/4.11                 2.30/0.63                  2.18/0.77
          minsup=0.21, time=212.14   minsup=0.28, time=176.57   minsup=2.13, time=862.84   minsup=2.48, time=832.77
Average   27.61/19.64                27.39/20.96                3.60/2.71                  3.74/3.05
          minsup=0.33, time=192.08   minsup=0.42, time=169.22   minsup=2.87, time=683.29   minsup=3.43, time=640.08

Table 6-8: Set 1-validity for various top-N rankings of discovered docsets, their supports and mining time: soft validity/hard validity for the case of unigram. Here, minsup: minimum support; time: mining time (seconds) (source: Sriphaew and Theeramunkong, 2007a)

N         UXO                         UOO                         UXX                          UOX
1000      3.88/3.78                   2.36/2.26                   2.79/2.79                    1.76/1.76
          minsup=32.72, time=122.49   minsup=46.35, time=74.77    minsup=55.61, time=160.98    minsup=74.78, time=89.39
5000      3.77/3.35                   2.38/1.99                   2.37/2.28                    1.55/1.48
          minsup=26.98, time=240.57   minsup=40.04, time=175.72   minsup=48.46, time=359.18    minsup=66.84, time=198.16
10000     3.47/2.63                   2.16/1.53                   2.09/1.75                    1.35/1.11
          minsup=24.68, time=312.69   minsup=37.63, time=231.41   minsup=45.66, time=466.00    minsup=63.76, time=277.67
50000     2.78/1.44                   1.75/0.74                   1.68/0.84                    1.12/0.49
          minsup=19.95, time=478.97   minsup=32.26, time=412.79   minsup=39.64, time=808.61    minsup=57.08, time=539.55
100000    2.71/1.02                   1.68/0.48                   1.66/0.57                    1.14/0.32
          minsup=18.37, time=564.65   minsup=30.40, time=531.10   minsup=37.40, time=1008.38   minsup=54.55, time=691.02
Average   3.32/2.44                   2.06/1.40                   2.12/1.64                    1.38/1.03
          minsup=24.54, time=343.87   minsup=37.34, time=285.16   minsup=45.35, time=560.63    minsup=63.40, time=359.16

From the table, some interesting observations can be made as follows. First, with the same

document representation, soft validity is always higher than or equal to hard validity since the

former is obtained by a less restrictive evaluation than the latter. Both validities involve valid

relations between any pair of documents in a discovered docset. A relation between two

documents is called valid when there is a link between those two documents under the v-OACM

(v=1 in this experiment). The evaluation based on soft validity focuses on the probability that

any two documents in a docset will occupy a valid relation. On the other hand, the evaluation

based on hard validity concentrates on the probability that at least one document must have valid

relations with all of the other documents. For example, in the case of top-100000 ranking with

the `BXO' representation (as shown in Table 6-7), 19.51% of the relations in the discovered docsets

are valid while only 3.67% of the discovered docsets are perfect, i.e., there is at least one

document that contains valid relations with all of the other documents in the certain docset.

Second, in every document representation, both soft validity and hard validity become lower

when more ranks (i.e., top-N ranking with a larger N) are considered. As an implication of this

result, our proposed evaluation method indicates that better docsets are located at higher ranks.

Third, given two representations, say A and B, if the soft validity of A is better than that of B, then

the hard validity of A tends to be higher than that of B. Fourth, the results of the bigram cases

(`B**') are much better than those of the unigram cases (`U**'). One reason is that the bigrams

are quite superior to the unigrams in representing the content of a document. Fifth, in the cases


of bigram, the stopword removal process is helpful while the stemming process does not help

much. Sixth, in the cases of unigram, non-stemming is preferable while the stopword removal

process is not useful. Finally, the performance of `BXO' and `BOO' is comparable and much higher

than `BOX' and `BXX', while the performance of `UXO' is much higher than the other unigram

cases. However, on average, the `UXX' seems to be the second best case for the unigram. Since the

soft validity is more flexible than the hard validity, a higher soft validity is preferable. Although

the performance of `BOO' seems to be slightly better than `BXO' in the higher ranks, `BXO' performs better on average. In our task, the performance ranking for bigram is `BXO' > `BOO' > `BOX' > `BXX' and the performance ranking for unigram is `UXO' > `UXX' > `UOO' > `UOX'.

In terms of minimum support and computation time, we can conclude as follows. First, since

a docset discovered from the bigram cases tends to have a lower support than the unigram cases,

it is necessary to set a smaller minimum support in order to obtain the same number of docsets.

Second, the cases with stopword removal run faster than ones without stopword removal since

they consider fewer words. Moreover, they tend to have a lower minimum support.

Besides 1-OACM, the discovered docsets can be evaluated with the criteria of 2-OACM and 3-

OACM. In this assessment, only four best representations, two from the unigram cases (`UXO' and

`UXX') and two from the bigram cases (`BXO' and `BOO'), are taken into consideration. Figure 6-3

displays the soft validity (the left graph) and the hard validity (the right graph) under 1-, 2-, and

3-OACMs. Since the minimum support and mining time in each case is the same as shown in

Table 6-7 and Table 6-8, they are omitted from the figure. In the figure, we use the notation `v:XYZ' to represent the evaluation, under the v-OACM, of docsets that are

discovered from a specific document representation. For example, `3:BXO' means the evaluation

of docsets under 3-OACM where the docsets are discovered by encoding document

representation using the BXO scheme (bigram, non-stemming and stopword removal). Being

consistent for both soft validity and hard validity, the set 3-validity (one calculated under the 3-

OACM) of discovered docsets is higher than the set 2-validity (one calculated under the 2-OACM),

and in the same way the set 2-validity is much higher than the set 1-validity (one calculated

under the 1-OACM). Compared to the evaluation using only direct citation (1-OACM), more

relations in the discovered docsets are valid when both direct and indirect citations (2- and 3-

OACMs) are taken into consideration.

Similar to 1-OACM, `BXO' and `BOO' are comparable and perform as the best cases for both

soft validity and hard validity under the same OACM. Moreover, in the cases of bigram evaluated

under the 1- and 2-OACMs, the set validity drops remarkably when top-N rankings with a larger

N are considered. The quality of docsets in the higher ranks (smaller N) outperforms that of the lower

rank. This outcome implies that our evaluation based on direct/indirect citations seems to be a

reasonable method for assessing docsets. For all types of document representation, the bigram

cases perform better than the unigram cases when they are evaluated under the same v-OACM.

In particular, under the 3-OACM, the two bigram cases (`3:BXO' and `3:BOO') are almost 100% valid while the two unigram cases (`3:UXO' and `3:UXX') are approximately 50% valid.

This phenomenon shows the advantage of bigram in being a good document representation for

document relation discovery, and that documents in each docset cite each other within the specified range of the citation graph. Furthermore, the performance gap between bigram and

unigram becomes smaller when top-N rankings with a larger N are considered. For a top-N

ranking with a larger N, the bigram cases tend to have bigger docsets than the unigram cases and

then obtain lower validity since naturally a bigger docset is likely to have lower validity.


Figure 6-3: Set validity based on the 1-, 2- and 3-OACMs when various top-N rankings of discovered docsets are considered: soft validity (left) and hard validity (right). (source: Sriphaew and Theeramunkong, 2007a)

Conclusions

Section 6.2 shows a method to use citation information in research publications as a source for

evaluating the discovered document relations. Three main contributions of this work are as

follows. First, soft validity and hard validity are developed to express the quality of docsets

(document relations), where the former focuses on the probability that any two documents in a

docset have a valid relation while the latter concentrates on the probability that at least one

document in a docset has valid relations with all of the other documents in that docset. Second, a

method to use direct and indirect citations as comparison criteria is proposed to assess the

quality of docsets. Third, the so-called expected validity is introduced, using probability theory,

to relatively evaluate the quality of discovered docsets. By comparing the result to the expected

validity, the evaluation becomes impartial, even under different comparison criteria. The manual

evaluation was also done for performance comparison. Using more than 10,000 documents

obtained from a research publication database and frequent itemset mining as a process to

discover document relations, the proposed method was shown to be a powerful way to evaluate

the relations in four aspects: soft/hard scoring, direct/indirect citation and relative quality over

the expected validity. For more detail, the reader is referred to (Sriphaew and

2007a).


6.3. Application to Automatic Thai Unknown Detection

Unknown word recognition plays an important role in natural language processing (NLP) since

words, fundamental units of a language, may be newly developed and invented. Most NLP

applications need to identify words in sentences before further manipulation. Word recognition

can be basically done by using a predefined lexicon, designed to include as many words as

possible. However, in practice, it is impossible to have a complete lexicon that includes all words in a language. Therefore, it is necessary to develop techniques to handle words not present in the lexicon, so-called unknown words. In languages with explicit word boundaries, it is straightforward to identify an unknown word and its boundary. This simplicity does not hold for languages without word boundaries (later called unsegmented languages, such as Thai, Japanese and Chinese), where words run together without any explicit space or punctuation mark (Cheng et al.,

1999; Charoenpornsawat et al., 1998; Ling et al., 2003; Ando and Lee, 2000). Whereas analyzing

such languages requires word segmentation, the existence of unknown words makes segmentation

(or word recognition) accuracy lower (Theeramunkong and Tanhermhong, 2004; Asahara and

Matsumoto, 2004; Jung-Shin and Su, 1997). Accurate detection of unknown words and their

boundaries is mandatory for high-performance word segmentation. As a similar task, word

extraction in unsegmented languages has also been explored in several studies (Su et al., 1994;

Chang and Su, 1995; Ge et al., 1999; Zhang et al., 2000; Zhang et al., 2008). Instead of segmenting

a running text into words, word extraction methods directly detect a set of unknown words from

the text without determining boundaries of all words in the text. In Thai, our target language,

major sources of unknown words are (1) Thai transliteration of foreign words, (2) invention of

new Thai technical words, and (3) emergence of Thai proper names. For example, Thai medical

texts often abound in transliterated words/terms or technical words/terms, related to diseases,

organs, medicines, instruments or herbs, which may not be in any dictionary. Thai news articles

usually include a lot of proper names related to persons, organizations, locations and so forth.

Indirectly related to unknown word recognition, Thai compound word extraction and word

segmentation without dictionaries were explored in (Sornlertlamvanich and Tanaka, 1996;

Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004). Without any

dictionary, these methods applied pure statistics combined with machine learning techniques to detect compound words by observing frequently occurring substrings in texts. However, it seems

natural to utilize a dictionary for segmentation and simultaneously recognize unknown words

when substrings do not exist in the dictionary.

In the past, several works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) have been

proposed to recognize both explicit and implicit unknown words. Forming from multiple

contiguous words, an implicit unknown word could be detected by observing its co-occurrence

frequency. On the other hand, an explicit unknown word was triggered by an undefined substring,

and its boundary could be found by first generating boundary candidates with respect to a set of

predefined rules and applying statistical techniques to select the most probable one. However,

one of the shortcomings in most previous approaches is that they required a set of manually constructed rules to restrict the generation of candidates for an unknown word boundary. To get rid of this limitation, this work proposes a method to generate the set of all possible candidates without being constrained by any handcrafted rule. However, with this relaxation, a large set of candidates

may be generated, inducing the problem of unbalanced class sizes where the number of positive

unknown word candidates is dominantly smaller than that of negative candidates. To solve the

problem, a technique called group-based ranking evaluation (GRE) is incorporated into ensemble

learning, namely boosting, in order to generate a sequence of classification models that later

collaborate to select the most probable unknown word from multiple candidates. As the boosting


step, given a classification model, the GRE technique is applied to build a dataset for training the

succeeding model, by weighting each of its candidates according to its rank and correctness when the candidates of an unknown word are considered as one group. In the experiments, the proposed method, named V-GRE, is evaluated using a large Thai medical text.

Although research on unknown word recognition in the Thai language has not been as widely conducted as in other languages, two approaches have been proposed for detecting

unknown words from a large corpus of Thai texts, later called Machine Learning-based (ML-

based) approach and dictionary-based approach (Theeramunkong et al., 2000; Theeramunkong

and Tanhermhong, 2004). In the ML-based approach, unknown word recognition can be viewed as a process to detect new compound words in a text without using a dictionary to

segment the text into words. The dictionary-based approach attempts to identify the boundary of

an unknown word when a system encounters a character sequence which is not registered in a dictionary while segmenting a text into a sequence of words. As an early work of the first approach, Sornlertlamvanich and Tanaka (1996) presented a method to use frequency

difference between the occurrences of two adjoining sorted n-grams (a special case of sorted

sistrings) to extract open compounds (uninterrupted sequences of words) from text corpora.

Moreover, competitive and unified selections are applied to discriminate between an illegible

string and a potential unknown word. By specifying a different threshold of frequency

differences, the method can detect varying numbers of extracted strings (unknown words) with

an inherent trade-off between the quantity and the quality of the extracted strings. As two

limitations, the method requires manual setting of the threshold, and it applies only frequency

difference that may not be enough to express the distinction between an unknown word and a

common prefix of words. To solve these shortcomings, some works (Kawtrakul et al., 1997;

Theeramunkong et al., 2000; Sornlertlamvanich et al., 2000; Theeramunkong and Tanhermhong,

2004; Sornil and Chaiwanarom, 2004) applied machine learning (ML) techniques to detect an

unknown word by using statistical information of contexts surrounding that potential unknown

word. Sornlertlamvanich et al. (2000) presented a corpus-based method to learn a decision tree

for the purpose of extracting compound words from corpora. In the same period, a similar

approach was proposed in (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong,

2004) to construct a decision tree that enables us to segment a text without making use of a

dictionary. It was shown that, even without a dictionary, the ML-based methods could achieve up to

85%-95% of word segmentation accuracy or word extraction rate. As the second approach,

Kawtrakul et al. (1997) used the combination of a statistical semantic segmentation model and a

set of context sensitive rules to detect unknown words in the context of a running text. The

context sensitive rules were applied to extract information related to such an unknown word,

mostly representing a name of an entity, such as person, animal, plant, place, document, disease,

organization, equipment, and activity. Charoenpornsawat et al. (1998) considered unknown

word recognition as a classification problem and proposed a feature-based approach to identify

Thai unknown word boundaries. Features used in the approach are built from the specific

information in context surrounding the target unknown words. Winnow proposed by Blum

(1997) is an ML algorithm used to automatically extract features from the training corpus.

As a more recent work, Haruechaiyasak et al. (2006) proposed a semi-automated framework

that utilized statistical and corpus-based concepts for detecting unknown words and then

introduced a collaborative framework among a group of corpus builders to refine the obtained

results. In the automated process, unknown word boundaries are identified using frequencies of

strings. In (Haruechaiyasak et al., 2008), a comparison of dictionary-based approach and ML-

based approach for word segmentation was presented where unknown word detection is

implicitly handled. Since either of the dictionary-based and ML-based approaches has its


advantages, most previous works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998;

Theeramunkong et al., 2000) combined them to handle unknown words. Although several works

have been done in both approaches, they have some shortcomings: 1) most works dominantly

separated learning process from word segmentation process; 2) they used only local information

to learn a set of rules for word segmentation/unknown word detection by a single-level learning

process (a single classifier); and 3) they required a set of handcrafted rules to restrict generating

candidates of an unknown word boundary. To overcome these disadvantages, this work provides

a framework that combines the word segmentation process with a learning process that utilizes long-distance context in learning a set of rules for unknown word detection during word segmentation, where no manual rules are required. Moreover, the learning process also employs boosting techniques to improve classification accuracy.

6.3.1. Thai Unknown Words as Word Segmentation Problem

Most word segmentation algorithms use a lexicon (or a dictionary) to parse a text at the

character level. In general, when a system meets an unknown word, three possible segmented

results can be expected as an output. The first one is to obtain one or more sequences of known

words from an unknown (out-of-dictionary) word, especially for the case of a compound word.

For example, มะม่วงอกร่อง (meaning: a kind of mango) can be segmented into มะม่วง (meaning:

mango), อก (meaning: breast), and ร่อง (meaning: crack). All of these sub words are found in the

lexicon. The second one is to gain a sequence of unknown segments which are undefined in the

lexicon. For example, we cannot detect any sub word from an out-of-dictionary word วิสัญญี (meaning: anesthetic) since none of its substrings exists in the dictionary. The last one is to

get a sequence of known words mixed with unknown segments. For instance, an unknown word

ลูคีเมีย (meaning: Leukemia) can be segmented into two portions: an unknown segment (ลูคี,

meaning: unknown) and a known word (เมีย, meaning: wife).

In terms of processing, these three different results can be interpreted as follows. When we

get a result of the first type, it is hard for us to know whether the result is an unknown word

since it may be misunderstood to be multiple words existing in a dictionary. This type of

unknown words is known as a hidden unknown word. Called an explicit unknown word, the second type is easily recognized since the whole word is composed of only unknown segments. Named a mixed unknown word, the third type is also hard to recognize since the

boundary of the unknown word is unclear.

Furthermore, it is also difficult for us to distinguish between the second and the third type.

However, the second and third types will have unknown segments, later called unregistered

portions, that signal the existence of an unknown word. This work focuses on

recognition of an unknown word of the second and third types, a detectable unknown word.

6.3.2. The Proposed Method

This section describes the proposed method in short. The reader can find the full description in

(TeCho et al., 2009b). The proposed method consists of three processes: (1) unregistered portion

detection, (2) unknown word candidate generation and reduction, and (3) unknown word

identification, as shown in Figure 6-4.


Figure 6-4: Overview of the proposed method (source: TeCho et al., 2009b)

Unregistered Portions Detection

Normally when we apply word segmentation on a Thai running text with some unknown words,

we may face a number of unrecognizable units due to out-of-vocabulary words. Moreover,

without any additional constraints, an existing algorithm may place segmentation boundaries at

obviously incorrect positions. For example, the system may place an impossible word boundary

between a consonant and a vowel. To resolve such obvious mistakes, recently several works

(Sornil and Chaiwanarom, 2004; Haruechaiyasak et al., 2006; Theeramunkong and Usanavasin,

2001; Viriyayudhakorn et al., 2007; Limcharoen, 2008) have applied a useful concept, namely a

Thai Character Cluster (TCC) (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong,

2004), which is defined as an inseparable group of Thai characters based on the Thai writing

system. Unlike word segmentation, segmenting a text into TCCs can be completely done without

error and ambiguity by a small set of predefined rules. The result from TCC segmentation can be

used to guide word segmentation not to segment at unallowable positions. To detect unknown

words, TCCs can be used as basic units of processing.

Using techniques originally proposed by TeCho et al. (2008a; 2008b; 2009a; 2009b), this

work employs the combination of TCCs and the LEXiTRON dictionary (2008) to facilitate word

segmentation. In this work, the longest word segmentation (Poowarawan, 1986) is applied to

segment the text in either a left-to-right (LtoR) or right-to-left (RtoL) manner and then the

results are compared to select one with the minimum number of unregistered portions. If the

number of unregistered portions from LtoR longest matching equals to that of RtoL, the result of

the LtoR longest matching will be selected.
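A simplified sketch of this segmentation step is given below (our own illustration, not the original implementation): TCCs are represented as plain strings, the lexicon as a set of TCC tuples, and the direction with fewer unregistered portions wins, with ties resolved in favour of left-to-right matching.

    def longest_match(tccs, lexicon, max_len=10, right_to_left=False):
        # Greedy longest matching over TCC units; a TCC that starts no
        # known word becomes (part of) an unregistered portion ('?').
        seq = list(reversed(tccs)) if right_to_left else list(tccs)
        lex = {tuple(reversed(w)) for w in lexicon} if right_to_left else set(lexicon)
        out, i = [], 0
        while i < len(seq):
            for j in range(min(len(seq), i + max_len), i, -1):
                if tuple(seq[i:j]) in lex:
                    out.append(("word", seq[i:j]))
                    i = j
                    break
            else:
                out.append(("?", seq[i:i + 1]))
                i += 1
        if right_to_left:
            out = [(tag, unit[::-1]) for tag, unit in reversed(out)]
        return out

    def segment(tccs, lexicon):
        # Keep the direction yielding fewer unregistered portions (ties -> LtoR).
        ltor = longest_match(tccs, lexicon)
        rtol = longest_match(tccs, lexicon, right_to_left=True)
        count = lambda seg: sum(1 for tag, _ in seg if tag == "?")
        return ltor if count(ltor) <= count(rtol) else rtol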


Unknown Word Candidate Generation and Reduction

For candidate generation, ±h TCCs surrounding an unregistered portion are merged to form an

unknown word candidate. By this setting, (h + 1)^2 possible candidates can be generated for each unregistered portion. Since a word in Thai cannot contain any special character, it is possible to reduce the number of candidates using surface constraints, such as spaces or

punctuation. To filter out unrealistic candidates, two sets of separation markers are considered.

The first set contains four types of marker words: (1) conjunctive words, e.g., ก็ต่อเมื่อ (meaning: when), นอกจากนี้ (meaning: besides this), etc., (2) preposition words, e.g., ตั้งแต่ (meaning: since), สำหรับ (meaning: for), etc., (3) adverb words, e.g., เดี๋ยวนี้ (meaning: at this moment), มากกว่า (meaning: more than), etc., and (4) special verbal words, e.g., หมายความว่า (meaning: means), ประกอบด้วย (meaning: comprises), etc. The second set includes five types of special characters as follows: (1) interword separations, i.e., a white space, (2) punctuation marks, e.g., ?, -, (…), etc., (3) general typography signs, e.g., %, ฿, etc., (4) numbers, including both Arabic (0, …, 9) and Thai (๐, …, ๙) numerals, and (5) foreign characters, i.e., English alphabets (including capital letters).
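The candidate generation and the separator-based filtering can be sketched as follows (illustrative only; the TCC units shown are hand-made for the example rather than actual TCC output, and the function name is our own).

    def generate_candidates(tccs, start, end, h, separators):
        # Merge up to h TCCs on each side of the unregistered portion
        # tccs[start:end], giving at most (h + 1)^2 candidates, and drop
        # any candidate that would span a separation marker.
        candidates = []
        for left in range(h + 1):
            for right in range(h + 1):
                lo, hi = start - left, end + right
                if lo < 0 or hi > len(tccs):
                    continue
                cand = tccs[lo:hi]
                if any(unit in separators for unit in cand):
                    continue
                candidates.append("".join(cand))
        return candidates

    # "คี" (index 3) is the unregistered portion; " " acts as a separator.
    tccs = ["ใช้", "แชม", "พู", "คี", "โต", "โค", "นา", "โซล", " ", "ทา"]
    print(generate_candidates(tccs, start=3, end=4, h=3, separators={" "}))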

Unknown Word Identification

In the past, most previous works on Thai unknown word recognition (Sornlertlamvanich and

Tanaka, 1996; Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004;

Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) treated unknown word candidates

independently. However, in the real situation, a set of candidates generated from an unregistered

portion should be considered dependently and treated as a group. In the learning process, each

candidate in a group is labeled as a positive or negative instance. Although several candidates can

be generated from an unregistered portion, typically only a few (just one or two) candidates are

the potential unknown words. This phenomenon forms an unbalanced dataset problem. For

example, Table 6-9 shows the rank of each candidate, where only two out of forty-two candidates are eligible unknown words, i.e., ranks 1 and 32. After the ranking process, the most probable

candidate is selected as a suggested unknown word.

Table 6-9: Example output of predicted unknown word candidates ranked in a group by the

proposed method (source: TeCho et al., 2009b)

Rank   Unknown Word Candidate (c)   P(+|c)            Actual Class   Predicted Class
1      คีโตโคนาโซล                     9.99885×10^-01    Y              Y
2      ใช้แชมพูคี                        9.98995×10^-01    N              Y
3      ใช้แชมพูคีโต                      9.95521×10^-01    N              Y
…      …                            …                 …              …
30     มพูคีโตโคนาโซล                   4.33515×10^-04    N              N
31     แชมพูคีโตโคนาโซ                  1.06612×10^-04    N              N
32     แชมพูคีโตโคนาโซล                 8.53279×10^-05    Y              N
…      …                            …                 …              …
40     คีโตโค                          2.63289×10^-22    N              N
41     คีโต                            8.88017×10^-53    N              N
42     คี                              4.59288×10^-97    N              N


Feature Extraction

As stated in the previous section, TCCs are used as processing units. We therefore use a sequence

of TCCs instead of a sequence of characters to denote an unknown word candidate. To specify

whether a candidate is the most probable unknown word or not, a set of suitable features need to

be considered. In this work, several statistics collected from context around an unknown word

can be considered as features. Next, in order to speed up the process of collecting statistics from a text,

we apply the algorithm proposed by Nagao and Mori (1994) that utilizes sorted sistrings. For

each sistring (i.e., unknown word candidate), eight types of features, (f1)-(f8), are extracted. To

explain these features, the following description is first given.

Let A be a set of possible Thai characters, B be a set of possible TCCs, E be a set of possible special characters, and C = c_1 c_2 c_3 … c_|C| (c_i ∈ A ∪ E) be a corpus. Let D_i ∈ D be the i-th document, which is a substring of C, D_i = C[b_i:e_i], where b_i is the position of the first character in the document D_i, e_i − 1 is the position of the last character in the document D_i, C[e_i:e_i] is a special character specifying the end of the document D_i, and b_i = e_{i-1} + 1. Let T = t_1 t_2 t_3 … t_|T| (t_i ∈ B) be the segmented corpus of C as a TCC sequence, with t_1 = C[1:u], t_i = C[v:w], t_{i+1} = C[(w+1):x], and t_|T| = C[y:|C|], and let W be the set of all possible words in the dictionary. An unknown word candidate S can be defined by a substring of T, S_T = T[p:q] (= t_p … t_q), where p and q are the starting and ending TCC positions of S, respectively. Also, the candidate S can be expressed by a substring of C, S_C = C[r:s] (= c_r … c_s), where r and s are the starting and ending character positions of S, respectively. As one restriction, no special character is allowed in S. With the above description, the eight features, (f1)-(f8), can be formally defined as follows.

(f1) Number of TCCs (Nt)

The number of TCCs can be used as a clue to detect unknown words. Intuitively, several unknown

words are technical words each of which is a transliteration of an English technical term, and

many of them are very long. Formally, the number of TCCs in an unknown word candidate S,

Nt(S), can be defined as follows.

Nt(S) = |ST|

(f2) Number of Characters (Nc)

Similar to the number of TCCs, the number of characters in a sequence is another factor to

determine whether the sequence is a potential word or not. Concretely, an unknown word tends

to be long. The number of characters in an unknown word candidate S can be defined as follows.

Nc(S) = |SC|

(f3) Number of known words (Nw)

As in several languages, some unknown words in Thai can be viewed as compound words that contain a number of known words. Therefore, when we recognize a sistring as an unknown word, the number of known words in such a sistring can be used as a clue to identify whether the sistring is an unknown word. The number of known words can be defined as follows.

Nw(S) = |{w | w = S[a:b] ∧ w ∈ W}|

where S[a:b] is a substring of S starting from a to b.


(f4) Sistring Frequency (Nf )

The sistring frequency is useful information for determining whether the sistring is a word. The

number of occurrences of a sistring which is an unknown word tends to be higher than that of a

sistring which cannot be a word. The definition of the sistring frequency is as follows.

Nf(S) = |{C[c:d] | C[c:d] = S 1 c d |C|}|

where C[c:d] is a substring of C starting from c to d and c, d range from 1 to |C|.
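As an illustration of how sorted sistrings (Nagao and Mori, 1994) make such frequency counts cheap, the following Python sketch builds a sorted sistring index over a TCC sequence and counts occurrences of a candidate by binary search. It is a minimal sketch under simplified assumptions (the corpus fits in memory and TCCs are given as strings); the function names are hypothetical and not from the original system.

```python
import bisect

def build_sistring_index(tccs):
    """Sort the starting positions of all sistrings (suffixes) of the TCC sequence."""
    positions = list(range(len(tccs)))
    positions.sort(key=lambda p: tccs[p:])          # lexicographic order of suffixes
    return positions

def sistring_frequency(tccs, index, candidate):
    """Count occurrences of `candidate` (a list of TCCs) via binary search over the
    sorted sistrings, i.e., the quantity used for the feature Nf(S)."""
    prefixes = [tuple(tccs[p:p + len(candidate)]) for p in index]
    lo = bisect.bisect_left(prefixes, tuple(candidate))
    hi = bisect.bisect_right(prefixes, tuple(candidate))
    return hi - lo

# Toy example with made-up TCC tokens
tccs = ["ka", "ro", "ke", "ka", "ro", "na"]
index = build_sistring_index(tccs)
print(sistring_frequency(tccs, index, ["ka", "ro"]))   # prints 2
```

Because all sistrings sharing a prefix are adjacent in the sorted index, the same index also yields the left/right context statistics used by the features below without rescanning the corpus.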

(f5) Left and Right TCCs variety (Lv,Rv)

The variety expresses the range of TCCs that can come before or after a string; it reflects the impurity or uncertainty at the string's boundary. The left (right) variety is defined as the number of distinct TCCs actually occurring before (after) an unknown word candidate. A high variety of distinct TCCs on the left-hand side (right-hand side) is one of the indicators that the candidate should be detected as an unknown word. We therefore use the numbers of distinct TCCs on the left- and right-hand sides as features. The definitions of the left and right TCC variety are as follows.

Lv(S) = |d({T[a:a] | T[(a+1):b] = S ∧ 1 ≤ a ≤ b ≤ |T|})|
Rv(S) = |d({T[b:b] | T[a:(b−1)] = S ∧ 1 ≤ a ≤ b ≤ |T|})|

where d(L) returns the set of distinct elements in L, T[a:b] is a substring of T starting from a to b, and a, b range from 1 to |T|. T[a:a] and T[b:b] are the TCCs that occur immediately on the left-hand side and right-hand side of S in the corpus, respectively.

(f6) Probability of a special character on left and right (Ls,Rs)

The probability that a special character co-occurs on the left-hand side or the right-hand side of the candidate under consideration indicates that the candidate is located near delimiters and should be detected as an unknown word. We therefore use these probabilities as features. The definitions of the probability of a special character on the left and right (Ls and Rs) are as follows.

Ls(S) = |{(d,e) | C[d:e] = S ∧ C[(d−1):(d−1)] ∈ E}| / Nf(S)
Rs(S) = |{(d,e) | C[d:e] = S ∧ C[(e+1):(e+1)] ∈ E}| / Nf(S)

where C[d:e] is a substring of C starting from d to e, d and e range from 1 to |C|, E is the set of special characters, and Nf(S) returns the number of occurrences of the unknown word candidate S in the corpus.

(f7) Inverse Document Frequency (IDF)

The inverse document frequency is a good measurement of the importance of a sistring. An unknown word typically does not occur in many different documents but appears frequently in a few specific documents; hence, a high IDF suggests that the sistring is likely to be an unknown word. It is obtained by dividing the total number of documents by the number of documents containing the sistring and then taking the logarithm of that quotient. The formal definition of IDF(S) (the inverse document frequency of S) is as follows.

IDF(S) = log(|D| / |DS|)


where log is the natural logarithm, |D| is the total number of documents in the corpus, and |DS| is

the number of documents where S appears.

(f8) Term Frequency with Inverse Document Frequency (TFIDF)

The TFIDF is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. The definition of TFIDF(S) is as follows.

TFIDF(S) = TF(S) × IDF(S)

where TF(S) is the term frequency of S, i.e., the number of times S occurs.
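To make the eight features concrete, the following rough Python sketch assembles (f1)-(f8) for a single candidate. It assumes a small in-memory setting (the corpus as one string, documents as a list of strings, the dictionary as a set of known words) and uses a naive scan where the real system would use the sorted-sistring index; for (f5) and (f6), neighbouring characters stand in for neighbouring TCCs as a simplification, and all names are hypothetical.

```python
import math

def extract_features(cand_tccs, cand_chars, dictionary, documents, corpus, specials):
    """Assemble features (f1)-(f8) for one unknown word candidate.
    cand_tccs : list of TCC strings forming the candidate (S_T)
    cand_chars: the candidate as a character string (S_C)
    dictionary: set of known words (W)
    documents : list of document strings (D)
    corpus    : the whole corpus as one string (C)
    specials  : set of special/delimiter characters (E)
    """
    s = cand_chars
    nt = len(cand_tccs)                                    # (f1) number of TCCs
    nc = len(s)                                            # (f2) number of characters
    nw = sum(1 for a in range(len(s))                      # (f3) known-word substrings
             for b in range(a + 1, len(s) + 1) if s[a:b] in dictionary)
    nf = 0                                                 # (f4) sistring frequency
    lefts, rights, left_sp, right_sp = set(), set(), 0, 0
    pos = corpus.find(s)
    while pos != -1:
        nf += 1
        if pos > 0:
            lefts.add(corpus[pos - 1])
            left_sp += corpus[pos - 1] in specials
        end = pos + len(s)
        if end < len(corpus):
            rights.add(corpus[end])
            right_sp += corpus[end] in specials
        pos = corpus.find(s, pos + 1)
    lv, rv = len(lefts), len(rights)                       # (f5) left / right variety
    ls = left_sp / nf if nf else 0.0                       # (f6) special char on the left
    rs = right_sp / nf if nf else 0.0                      #      and on the right
    df = sum(1 for d in documents if s in d)
    idf = math.log(len(documents) / df) if df else 0.0     # (f7) inverse document frequency
    tfidf = nf * idf                                       # (f8) TF x IDF
    return [nt, nc, nw, nf, lv, rv, ls, rs, idf, tfidf]
```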

Ensemble Classification with Group-based Ranking Evaluation Technique

This section describes four main aspects of our proposed approach in learning an ensemble

classifier for identifying unknown words. As the first aspect, exploiting the features extracted

from the training corpus, naïve Bayesian is applied to learn a base classifier to assign a

probability to each unknown word candidate, representing how likely the candidate is a suitable

unknown word for an unregistered portion. Second, a mechanism namely Group-based Ranking

Evaluation (GRE) is introduced to select the most probable unknown word for an unregistered

portion with the consideration of ranking in a group of unknown word candidates generated

from the same unregistered portion at a specific location. Third, a GRE-based boosting is

employed to generate a sequence of classifiers, where each consecutive classifier in the sequence

works as an expert in classifying instances that were not classified correctly by its preceding

classifier and a confidence weight is given to each generated classifier based on its GRE-based

performance. Fourth, a so-called Voting Group-based Ranking Evaluation (V-GRE) technique is

implemented to combine the results obtained from a sequence of classifiers in classifying a test

instance, with the consideration of the confidence weight of each classifier. The details of these

aspects are illustrated in order as follows.

Naïve Bayesian Classification

Based on the naïve Bayesian method, the probability that a generated candidate c (characterized by a set of features F = {f1, f2, . . . , f|F|}) is an unknown word can be defined as follows.

P(+|c) = P'(+|c) / (P'(+|c) + P'(−|c)),  with  P'(+|c) = P(+) ∏i P(fi|+)  and  P'(−|c) = P(−) ∏i P(fi|−)

where P(+|c) is the probability that the candidate c is an unknown word, P'(+|c) is the unnormalized probability that the candidate c is an unknown word (positive class), P'(−|c) is the unnormalized probability that the candidate c is not an unknown word (negative class), P(+) (or P(−)) is the prior probability that the class is positive (or negative), and P(fi|+) (or P(fi|−)) is the probability that the i-th feature takes the value fi when the class is positive (or negative).


Here, both P'(+|c) and P'(−|c) are derived from the independence assumption of the naïve Bayesian method. For continuous attributes (fi), a Gaussian distribution with smoothing can be applied as follows.

P(fi|+) = (1 / (√(2π) σ+)) exp(−(fi − μ+)² / (2σ+²)) + ϵ,   P(fi|−) = (1 / (√(2π) σ−)) exp(−(fi − μ−)² / (2σ−²)) + ϵ

where μ+ (or μ−) is the mean of the positive (or negative) class, σ+ (or σ−) is the standard deviation of the positive (or negative) class, and ϵ is a small positive constant used for smoothing to resolve the sparseness problem. It is set to 0.000001 in our experiments.
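A minimal Python sketch of this Gaussian naïve Bayes with ϵ-smoothing is given below. It assumes the class labels are exactly +1 and −1 and that all features are numeric; it is illustrative code, not the authors' implementation.

```python
import math

class GaussianNB:
    """Gaussian naive Bayes with additive smoothing (epsilon), as described above."""

    def __init__(self, epsilon=1e-6):
        self.epsilon = epsilon

    def fit(self, X, y):
        # X: list of numeric feature vectors, y: list of +1 / -1 labels
        self.classes = sorted(set(y))
        self.prior, self.mean, self.std = {}, {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.prior[c] = len(rows) / len(X)
            cols = list(zip(*rows))
            self.mean[c] = [sum(col) / len(col) for col in cols]
            self.std[c] = [max(math.sqrt(sum((v - m) ** 2 for v in col) / len(col)), 1e-12)
                           for col, m in zip(cols, self.mean[c])]
        return self

    def _unnormalized(self, x, c):
        # prior times the product of per-feature Gaussian densities, each smoothed by epsilon
        p = self.prior[c]
        for v, m, s in zip(x, self.mean[c], self.std[c]):
            gauss = math.exp(-((v - m) ** 2) / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
            p *= gauss + self.epsilon
        return p

    def prob_positive(self, x):
        pos = self._unnormalized(x, +1)
        neg = self._unnormalized(x, -1)
        return pos / (pos + neg) if pos + neg > 0 else 0.5
```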

Group-based Ranking Evaluation

Unlike the evaluation model in a traditional classifier, our proposed technique Group-based

Ranking Evaluation (GRE) categorizes all candidates produced by the same unregistered portion

location into the same group. This technique ranks all candidates with respect to their group

based on their probabilities of being an unknown word, and then selects the candidate with the highest probability within that group as the potential prediction for the unknown word.

ĉi = argmax c∈Gi P(+|c)

where ĉi is the most probable candidate of the i-th group, Gi is the group of candidates generated from the i-th unregistered portion, and P(+|c) is the probability that c is an unknown word. To be more flexible, it is also possible to relax this selection to accept the top-t candidates as the potential unknown words.
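The selection rule above can be sketched in a few lines of Python; here `prob_positive` stands for the base classifier's estimate of P(+|c) (for example, the method of the Gaussian naïve Bayes sketch above) and the names are hypothetical.

```python
def gre_select(groups, prob_positive, t=1):
    """Group-based Ranking Evaluation: for each group of candidates generated from
    the same unregistered portion, rank by the probability of being an unknown word
    and keep the top-t candidates.
    groups        : list of lists of feature vectors (one inner list per portion)
    prob_positive : callable mapping a feature vector to P(+|c)
    """
    predictions = []
    for group in groups:
        ranked = sorted(group, key=prob_positive, reverse=True)
        predictions.append(ranked[:t])
    return predictions

# Usage sketch with hypothetical data:
# nb = GaussianNB().fit(train_X, train_y)
# best = gre_select(test_groups, nb.prob_positive, t=1)
```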

GRE-based Boosting

AdaBoost (Freund and Schapire, 1999) is a technique to repeatedly construct a sequence of

classifiers based on a base learning method. In this technique, each instance in the training set is

attached with a weight (initially set to 1.0). In each iteration, the base learning method constructs

a classifier using all instances in the training set, with their weights indicating their importance.

After evaluating the obtained classifier, the weights of the misclassified examples are increased

to make the learning method focus more on the misclassified examples in the next iteration.

Originally, AdaBoost evaluates each instance and updates its weight individually. This is not suitable for the unknown word data, which we treat as groups of unknown word candidates. We therefore propose a new technique, called GRE-based Boosting, to apply the AdaBoost technique efficiently to the unknown word data. In this technique, a weight is assigned to each group of candidates. After constructing a base classifier, each group is evaluated based on the GRE technique explained in the previous section. The classifier is considered to misclassify a group when the top-ranked candidate in the group is not a correct unknown word. The weight of that group is then increased so that the group receives more attention in the next iteration.

Figure 6-5 shows the overall process of the proposed GRE-based boosting technique. Initially, D1, a training set with all groups weighted by 1.0, is fed to INDUCER, a base learning method, in order to generate a classifier m1. The obtained model is passed to GRE-INCOR to evaluate it and obtain the misclassified groups. Then αk, a confidence weight of the classifier, and βk, a ratio of the success rate to the failure rate, are calculated from the misclassification rate (as explained in Algorithm 1).


The confidence weight (αk) represents the performance of the classifier. It is later used to represent the strength of the classifier when the results from several classifiers are combined in the evaluation step. The ratio of the success rate to the failure rate (βk) is used as the new weight of the misclassified groups in the next iteration. Basically, βk is larger than 1. Hence, the classifier constructed in the next iteration will be specialized to the previously misclassified instances.

Figure 6-5: GRE-based Boosting (source: TeCho et al., 2009b)

Algorithm 1: GRE-based Boosting

Input: D1 is an initial training set with all group weights set to 1.0
       K is the number of iterations.
Output: M is a set of base classifiers

1:  M ← ∅;
2:  k ← 1;
3:  for k = 1 to K do
4:      mk ← INDUCER(Dk);
5:      MGk ← GRE-INCOR(mk, Dk);
6:      εk ← (Σ gi∈MGk wi) / (Σ gi∈Dk wi);
7:      βk ← (1 − εk) / εk;
8:      αk ← log(βk);
9:      M ← M ∪ {(mk, αk)};
10:     foreach gi ∈ Dk do
11:         if gi ∈ MGk then
12:             wi ← βk;
13:         else
14:             wi ← 1;
15:         end
16:     end
17:     Dk+1 ← Dk with the updated group weights;
18: end


Algorithm 1 shows the GRE-based boosting technique in detail. The algorithm starts with the initial training set D1 = {(g1, w1), . . . , (gn, wn)} with gi = {ci1, ci2, . . . , ci|gi|} and cij = (Fij, yij), where gi is the group of unknown word candidates generated for the i-th unregistered portion, cij is the j-th candidate of the i-th unregistered portion, wi is an initial weight (set to 1 at the first iteration) given to gi, n is the number of unregistered portions, |gi| is the number of unknown word candidates generated for the i-th unregistered portion, Fij is the set of feature values representing cij, and yij is the target attribute of cij (designated as the class label), stating whether cij is the correct unknown word (+1) or not (−1). K iterations are conducted to construct a sequence of base classifiers. At the k-th iteration, a training set Dk is fed to INDUCER to construct a base classifier mk. The classifier is then evaluated by GRE-INCOR, yielding MGk, a set of misclassified groups. εk, the error rate of the classifier mk, can be calculated from MGk. It is used to calculate αk and βk, which are the parameters showing the confidence level of the classifier and the group weight for the next iteration, respectively. Finally, the weight of each misclassified group is set to βk; otherwise, it is set to 1.
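A compact Python sketch of this boosting loop is given below; the inducer could be the Gaussian naïve Bayes shown earlier, extended to accept instance weights. The exact forms of the error rate, βk, and αk used here are assumptions made for the sketch (labelled in the comments), following the description in the text rather than the authors' original code.

```python
import math

def gre_boost(groups, labels, inducer, K=10):
    """GRE-based boosting sketch.
    groups : list of candidate groups (each a list of feature vectors)
    labels : parallel structure of +1/-1 labels (the correct candidate is +1)
    inducer: base learner taking (X, y, sample_weights) and returning an object
             with a prob_positive(x) method
    """
    weights = [1.0] * len(groups)              # one weight per group
    ensemble = []
    for _ in range(K):
        # Flatten the groups into weighted training instances for the base learner.
        X, y, w = [], [], []
        for g, ls, gw in zip(groups, labels, weights):
            X.extend(g); y.extend(ls); w.extend([gw] * len(g))
        model = inducer(X, y, w)
        # GRE evaluation: a group is misclassified when its top-ranked candidate
        # is not a correct unknown word.
        mis = set()
        for i, (g, ls) in enumerate(zip(groups, labels)):
            best = max(range(len(g)), key=lambda j: model.prob_positive(g[j]))
            if ls[best] != +1:
                mis.add(i)
        err = sum(weights[i] for i in mis) / sum(weights)
        err = min(max(err, 1e-6), 1 - 1e-6)    # keep the ratio finite (assumed safeguard)
        beta = (1 - err) / err                 # success-to-failure ratio (assumed form)
        alpha = math.log(beta)                 # confidence weight (assumed form)
        ensemble.append((model, alpha))
        weights = [beta if i in mis else 1.0 for i in range(len(groups))]
    return ensemble
```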

Voting Group-based Ranking Evaluation

From the previous step, we obtain a sequence of base classifiers, each attached with its confidence weight (αk). In this section, we propose a technique called Voting Group-based Ranking Evaluation (V-GRE) to evaluate a group of unknown word candidates and predict the unknown word by combining the votes from all base classifiers. Figure 6-6 shows the process of evaluating a given group of unknown word candidates. Each candidate in the group is fed to all the classifiers to obtain the probabilities that the candidate is a correct unknown word. Each probability is weighted by the confidence weight of the corresponding classifier. These weighted probabilities are summed for each candidate. Finally, the candidate with the highest summed value is chosen as the unknown word.

Figure 6-6: Voting Group-based Ranking Evaluation (source: TeCho et al., 2009b)


Algorithm 2: Voting GRE (V-GRE)

Input: M = {(m1, α1), . . . , (mK, αK)} is a set of base classifiers
       G = {g1, . . . , gn} is a set of unknown word groups.
Output: U is a set, each member of which is the set of the top-t suggested unknown words for each unregistered portion.

1:  U ← ∅;
2:  foreach gi ∈ G do
3:      si ← (0, . . . , 0);
4:      foreach cij ∈ gi do
5:          sij ← 0
6:          foreach (mk, αk) ∈ M do
7:              pijk ← CLASSIFIER(mk, cij);
8:              sij ← sij + αk × pijk;
9:          end
10:         si[j] ← sij;
11:     end
12:     Ui ← TOP-t-CANDIDATE(gi, si);
13:     U ← U ∪ {Ui};
14: end

Algorithm 2 shows the evaluation process in detail. This algorithm uses as inputs a set of classification models M = {(m1, α1), . . . , (mK, αK)} and a testing set G = {g1, . . . , gn} with gi = {ci1, . . . , ci|gi|}, where mk is the model generated at the k-th iteration, αk is the confidence weight of mk, gi is the group of unknown word candidates generated for the i-th unregistered portion, cij is the j-th candidate of the i-th unregistered portion, n is the number of unregistered portions, and |gi| is the number of unknown word candidates generated for the i-th unregistered portion. Then, each base classifier and each candidate are fed to the function CLASSIFIER to get the probability that the candidate is an unknown word according to that model. This probability is weighted by αk and added into the corresponding summation sij. Finally, the top-t candidates are chosen and returned as the set of predicted unknown words by TOP-t-CANDIDATE.
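The voting step can then be sketched as follows, reusing the (model, αk) pairs produced by the boosting sketch above; again, this is an illustration of the described combination rule rather than the authors' code.

```python
def v_gre_predict(ensemble, groups, t=1):
    """Voting GRE: combine the confidence-weighted votes of all base classifiers
    and return the top-t candidates per group.
    ensemble : list of (model, alpha) pairs from GRE-based boosting
    groups   : list of groups of candidate feature vectors
    """
    results = []
    for group in groups:
        scores = []
        for cand in group:
            # Sum the confidence-weighted probabilities from all classifiers.
            s = sum(alpha * model.prob_positive(cand) for model, alpha in ensemble)
            scores.append(s)
        order = sorted(range(len(group)), key=lambda j: scores[j], reverse=True)
        results.append([group[j] for j in order[:t]])
    return results
```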

6.3.3. Experimental Settings and Results

In the experiment, we used a corpus of 16,703 medical-related documents (8.4 MB) gathered from the Web, taken from (Theeramunkong et al., 2007), for evaluation. The corpus is first preprocessed by removing HTML tags and all undesirable punctuation marks. To construct a set of

features, we apply TCCs and the sorted sistring technique. After applying word segmentation on

the running text, we have detected 55,158 unregistered portions. Based on these unregistered

portions, 3,209,306 unknown word candidates are generated according to the process described

previously. Moreover, these 55,158 unregistered portions came from only 3,763 distinct words.

In practice, each group of candidates may contain one or two positive labels. Therefore, 62,489

unknown candidates were assigned as positive and 3,146,819 unknown candidates were

assigned as negative. The average number of unknown candidates in a group is around 58. Based


on a preliminary statistical analysis of the Thai lexicon, we found that the average number of TCCs in a word is around 4.5.

In this work, to limit the number of generated unknown word candidates, the maximum number of TCCs surrounding an unregistered portion (h) is set to nine. This number is twice the average number of TCCs in a word. With h=9, the number of unknown word candidates generated for each unregistered portion becomes 100. Moreover, it is possible to use the two sets of separation markers in Sect. 4.2 to reduce the number of candidates. Table 6-10 shows the numbers of candidates generated with and without applying the two sets of separation markers. The second and fourth columns indicate the distinct and total numbers, respectively. The third and fifth columns show the ratios over the number of candidates generated without considering any separation markers, for the distinct and total numbers, respectively.

Table 6-10: Numbers of candidates generated with/without applying two sets of separation

markers and their portions compared to ‘None’ (source: TeCho et al., 2009b)

Marker Set # Distinct % Portion # Total % Portion

None 2,567,463 100.00 7,632,300 100.00

First Set 2,363,829 92.07 7,158,875 93.80

Second Set 1,295,737 50.47 4,241,097 55.57

First + Second Set 1,153,867 44.94 3,891,845 50.99

Exploiting a naïve Bayes classifier as the base classifier, the proposed methods, GRE-based boosting (GRE for short) and V-GRE, are used to learn ensemble classifiers and to identify unknown words. For V-GRE, the number of boosting iterations is set to ten. That is, ten classifiers are generated sequentially and used as a classification committee. Moreover, to evaluate our proposed method in detail, we conducted experiments to examine the effect of the eight features, (f1)-(f8), on the classification result by comparing the performance of each possible feature combination with the others.

In the experiments, 10-fold cross validation is employed to compare the proposed methods

(GRE and V-GRE) to the record-based naive Bayesian method (R-NB). The R-NB is a traditional naive Bayesian method, where all instances in the training/testing set are assumed to be independent of each other.

We investigate the performance of GRE, V-GRE, and R-NB when the top-t candidates, with t ranging from 1 to 10, are considered as correct answers. Table 6-11 displays the performance of the two group-based evaluations, GRE and V-GRE, as well as R-NB, for the all-feature set (f1-f8) and the best-5 feature sets ((f3,f4,f7), (f3,f4,f5), (f3,f4,f8), (f2,f4,f6,f7), (f4,f6,f8)). More precisely, the all-feature set ranks 12th among all 255 possible feature combinations. According to the results, a number of conclusions can be made as follows.

Firstly, V-GRE outperformed GRE for both the all-feature set and the best-5 feature sets at all top-t ranks. For the top-1 rank of the all-feature case, V-GRE achieved an accuracy of 90.93%±0.50 while GRE gained 84.14%±0.19. For higher ranks, V-GRE still outperformed GRE even though the gap becomes smaller; e.g., at rank 10 V-GRE gains 97.90%±0.26 while GRE gains 97.25%±0.17. V-GRE outperforms GRE with a gap of 6.79 (90.93%−84.14%) for the top-1 rank, while this gap is very small for the top-10 rank, i.e., 0.01 (97.26%−97.25%). In the case of the best feature set (f3,f4,f7), V-GRE achieves up to 93.93%±0.22 and 98.85%±0.15 accuracy for the top-1 and top-10 ranks, respectively, while GRE obtains 84.15%±0.64 and 97.24%±0.27, respectively. The result indicates that V-GRE is superior to GRE, with gaps of 9.78 and 1.61 for the top-1 and top-10 ranks, respectively.


Secondly, V-GRE obtains higher accuracy than the record-based naive Bayesian method (R-NB) in most cases. GRE, however, may not be superior to R-NB at the top-1 rank, although it outperforms R-NB in the case of the top-2 rank. Thirdly, our proposed V-GRE and GRE can find the correct unknown words within rank 10 (top-10) with a relatively high accuracy of 97%-98%.

Table 6-11: Accuracy comparison among GRE, V-GRE and a naïve Bayes classifier. Here, h is set to

nine. (source: TeCho et al., 2009b)

Feature set     Technique   top-1        top-2        top-3        top-4        top-5        top-6        top-7        top-8        top-9        top-10
(f3,f4,f7)      GRE         84.15±0.64   91.18±0.43   93.49±0.40   94.85±0.36   95.74±0.33   96.28±0.33   96.59±0.32   96.82±0.30   97.04±0.28   97.24±0.27
                V-GRE       93.93±0.22   95.44±0.26   96.30±1.78   97.15±0.23   97.81±0.23   98.15±0.20   98.41±0.20   98.59±0.19   98.72±0.20   98.85±0.15
                R-NB        89.96±0.21
(f3,f4,f5)      GRE         89.48±0.46   90.43±0.41   92.57±0.29   94.98±0.41   95.36±0.37   95.67±0.34   95.82±0.33   96.07±0.29   96.18±0.28   96.42±0.26
                V-GRE       93.48±0.46   94.43±0.41   94.94±1.18   95.36±0.37   95.67±0.34   95.82±0.33   95.07±0.29   96.18±0.28   96.42±0.26   96.57±0.29
                R-NB        81.96±0.11
(f3,f4,f8)      GRE         90.63±0.32   95.28±0.21   95.42±0.32   95.89±0.26   96.12±0.27   96.43±0.25   96.57±0.22   96.79±0.23   96.94±0.22   97.15±0.19
                V-GRE       92.63±0.32   95.42±0.32   95.88±0.27   96.12±0.27   96.43±0.25   96.57±0.22   96.79±0.23   96.94±0.22   97.15±0.19   97.28±0.21
                R-NB        89.70±0.05
(f2,f4,f6,f7)   GRE         86.03±0.59   98.90±0.18   93.50±0.40   95.35±0.36   96.65±0.34   97.57±0.27   98.08±0.19   98.52±0.20   98.77±0.19   99.06±0.15
                V-GRE       91.95±0.39   96.52±0.20   97.44±0.18   98.01±0.13   98.45±0.10   98.66±0.11   98.85±0.11   98.98±0.14   99.06±0.15   99.12±0.16
                R-NB        89.70±0.08
(f4,f6,f8)      GRE         78.87±0.49   94.62±0.26   87.01±0.46   89.25±0.51   90.55±0.43   91.65±0.38   92.60±0.30   93.43±0.26   94.04±0.29   95.13±0.27
                V-GRE       90.78±0.76   93.99±0.42   94.83±0.34   95.31±0.33   95.73±0.30   96.08±0.34   96.41±0.30   96.69±0.29   96.85±0.28   97.03±0.25
                R-NB        87.89±0.07
(f1-f8)         GRE         84.14±0.19   91.71±0.22   93.52±0.33   94.86±0.25   95.74±0.24   96.29±0.20   96.60±0.17   96.83±0.19   97.04±0.22   97.25±0.17
                V-GRE       90.93±0.50   94.92±0.43   96.05±0.42   96.63±0.43   97.04±0.43   97.27±0.40   97.49±0.36   97.65±0.34   97.79±0.31   97.26±0.26
                R-NB        82.48±0.12

Conclusions

Section 6.3 presents an automated method to recognize unknown words from Thai running text. We described how to map the problem to a classification task. A naïve Bayes classifier with a smoothing technique is investigated using eight features: the number of TCCs, the number of characters, the number of known words, the sistring frequency, the left and right TCC variety, the probability of a special character on the left and right, the inverse document frequency, and the TFIDF score. In practice, the unknown word candidates have relationships among them. To reduce the complexity of unknown word boundary identification, reduction approaches are employed to decrease the number of generated unknown word candidates to about 51% of the original number (a reduction of roughly 49%). This section also proposed the group-based ranking evaluation technique, which treats the unknown word candidates as groups and thereby alleviates the unbalanced dataset problem. To further improve the prediction of a classifier, we apply a boosting technique with voting under group-based ranking evaluation (V-GRE). We conducted a set of experiments on real-world data to evaluate the performance of the proposed approach. From the experimental results, the proposed technique achieves accuracies of 90.93%±0.50 and 97.90%±0.26 at the first and tenth ranks, respectively. Our proposed ensemble method achieves an increase in classification accuracy of 6.79% over the group-based ranking evaluation (GRE) technique and 8.45% over the ordinary record-based evaluation at the first rank. More details can be found in (TeCho et al., 2009b).
