Bar Ilan University
The Department of
Computer Science
Keyword based Text Categorization
by
Libby Barak
Submitted in partial fulfillment of the requirements for the Master's Degree in the Department of Computer Science, Bar Ilan University.
Ramat Gan, Israel June 2008, Sivan 5768
This work was carried out under the supervision of Dr. Ido Dagan,
Department of Computer Science,
Bar-Ilan University.
Acknowledgements
I would like to take this opportunity to thank the people whose joint efforts assisted
me in writing this thesis.
First and foremost, my greatest thanks go to Dr. Ido Dagan for introducing me to the
wonderful world of Natural Language Processing, and for supervising this research.
His constant support, thorough guidance, and great patience enabled this work.
My gratitude goes also to all my NLP lab members for sharing with me their time and
moral support. I especially want to express my appreciation to Idan Szpektor, Roy
Bar-Haim and Shachar Mirkin, for sharing with me their words of wisdom,
experience and advice when needed.
I would like to thank Michael Gutkin and Eyal Shnarch for their assistance beyond the
research processes throughout various academic tasks required along the way.
I am grateful to our Italian colleagues, Alfio Gliozzo and Carlo Strapparava from
ITC-Irst, for laying the groundwork for this research. I wish to thank them for their
help in acquiring data structures and results and for their assistance in implementing
some of the methods.
I want to thank my parents and my brothers for encouraging me to pursue my
academic goals and dreams, and for giving me the special kind of support only family
can provide. I would also like to thank my husband, Oren, for his unique support,
understanding and faith in me, which encouraged me greatly throughout this work.
This thesis was partly supported by the Negev Consortium (www.negev-initiative.org),
funded by the Israeli Ministry of Industry, Trade and Labor.
Table of Contents
List of tables and figures
Abstract
1. Introduction
2. Background
2.1. Supervised text categorization
2.2. Keyword based TC
2.3. Categorization based on category name
2.4. Lexical entailment
2.4.1. Lexical Entailment Resources
3. Text Categorization based on category name
3.1. Research goals
3.2. Categorization tasks
3.3. Scoring methods
3.3.1. Vector Space Model based on category seed terms
3.3.2. Entailment expansion methods
3.3.3. Context Similarity methods
3.3.4. Gaussian Mixtures model
3.3.5. Combination of knowledge and context
3.4. Binary classification methods
4. Evaluation
4.1. Data set and Pre processing
4.1.1. Experimental settings
4.2. Ranking
4.2.1. Ranking measure
4.2.2. Ranking results
4.2.3. Analysis
4.3. Classification
4.3.1. Classification measure
4.3.2. Classification results
4.3.3. Analysis
4.4. Reuters-10 results
5. Conclusion and future work
References
Appendix A – Latent Semantic Analysis
Appendix B – Gaussian Mixtures
Appendix C – Support Vector Machines
Abstract (Hebrew)
List of tables and figures
List of tables:
Table 1 - WordNet synsets
Table 2 - Initial seeds for the 20 Newsgroups collection
Table 3 - Initial seeds for the Reuters-10 collection
Table 4 - MAP values for the 20 Newsgroups collection
Table 5 - Document samples for the passing reference phenomenon
Table 6 - Document samples for missing annotations
Table 7 - Micro average classification results for the 20 Newsgroups collection
Table 8 - Classification results per category for the 20 Newsgroups collection
Table 9 - Confusion matrix for the Simcombined score based method
Table 10 - Confusion matrix for the Simcontext score based method
Table 11 - MAP values for the Reuters-10 collection
Table 12 - Micro average classification results for the Reuters-10 collection
Table 13 - Classification results per category for the Reuters-10 collection
List of figures:
Figure 1 - Recall-Precision curves for overall baselines
Figure 2 - Recall-Precision curves for entailment baselines
Figure 3 - Recall-Precision curves for specific categories
Figure 4 - Context Scoring influence
Abstract
This thesis investigates keyword-based Text Categorization (TC) using only a topical
taxonomy as input. The TC task is mostly approached via supervised or semi-supervised
methods. Supervised TC requires extensive manual labor to annotate text samples as
training data. Although several legacy categorization systems have already acquired
labeled text samples, this solution is not currently feasible for most systems. New
taxonomies, new TC collections which require classification, and the rapid growth of
unlabeled text documents are only some of the reasons to seek a more automated TC
method.
Keyword-based semi-supervised TC methods have made the first step towards
a more automated TC framework. These methods recognize the great potential
which lies in the vast amount of unlabeled data currently available for
various domains and applications. The basic idea of these methods is to represent the
categories by a set of characteristic keywords, and to define a similarity measure
between texts and categories. Those keywords should capture
the meaning of the category topic. The supervised aspect of these methods lies in the
specification of the characteristic keywords instead of manual classification of a large
number of documents. This step is considered to require less work than the one
required by the fully supervised methods. Nevertheless, it still requires specific
manual annotation for each category, which requires certain expertise. Therefore, new
taxonomies require specific manual effort by domain experts once more.
Our research is based on a new approach, first proposed in (Gliozzo et al.,
2005), which does not require specific manual annotation for each category. This
approach relies on the assumption, also used in previous works, that the category name
itself should be highly informative for the TC goal. Each category name is selected by
domain experts to represent the category topic as accurately as possible. It therefore
encompasses useful information for the TC purpose. To obtain a set of characteristic
terms for each category, the method applies an automated expansion of the
initial category name. In (Gliozzo et al., 2005) the expansion is based on
co-occurrence information extracted from the TC collection used. Using Latent Semantic
Analysis (LSA) and a standard similarity measure, they obtain an initial set of
automatically labeled documents, which are then used to train a supervised classifier
that produces the final classification.
Basing the similarity measure on co-occurrence data has several
disadvantages. First and foremost, co-occurrence data does not capture the exact
semantic relation needed to make a classification decision. Co-occurrence data
typically models the broader context of the text and not the specific topic it discusses.
High similarity according to co-occurrence data assures that the text is relevant in a
general context sense to the category topic. It does not assure that the topic itself is
mentioned in the text. For example, a text which discusses certain computer software
is relevant to the general computers context; however, it may not be directly
related to any specific computer branch.
In this research we offer a novel taxonomy based approach for keyword-based
TC, which bases its similarity measure on a Lexical Entailment (LE) measure instead
of a context measure only. LE defines a more accurate semantic relation, which aims
to identify whether the meaning of a certain text is referenced by another text. This
measure aims at a more appropriate relation to base the TC assumption on, since it
requires the actual reference to the category topic in the text, rather than general
context similarity.
In order to identify whether the topic is addressed by the text as the main topic
and not as one of the text's minor topics, we integrate the context model in our overall
framework. Once a reference to the category topic, i.e. entailment evidence, is
recognized in a text, we also measure its context similarity to the category topic.
Using this novel integrated framework we achieve a complementary semantic
measure which quantifies the topics mentioned and the contextual relevance at the
same time.
We utilize two preliminary resources for the LE methods. The first LE
knowledge base is based on the WordNet (Fellbaum, 1998) semantic relation
ontology. This resource enables us to extract semantic relations from a dictionary-style
knowledge base. It supplies necessary morphological variations and useful
entailing terms. As a complementary resource, we utilize a Wikipedia LE knowledge
base. This encyclopedia-oriented knowledge base supplies us with entity names,
commercial products and general knowledge terms. The two resources are
complementary by nature and, as expected, they contribute to different types of
categories and relations addressed in this research.
Our context-based method is based on the co-occurrence-based method used in
(Gliozzo et al., 2005). We utilize a Latent Semantic Analysis (LSA) method to
represent the context similarity of documents and categories. LSA is a dimensionality
reduction method which maps similar terms, by means of co-occurrence data, to a
lower dimensional space in which terms and documents are represented by
"concepts". Those "concepts" aim to capture the context similarity of the data. LSA
has the advantage of modeling both first-order and second-order similarity, and thereby
offers a powerful context-similarity measure. It measures not only the likelihood of
terms to appear in the same document, as standard co-occurrence-based methods do, but
also the likelihood of terms to co-occur through their joint mapping to the same
LSA "concepts".
We applied the similarity measure described above to two TC goals. The first
is ranking of documents for each category according to the similarity score, and the
second is binary classification of documents to one or more categories. We compared
our results to those of (Gliozzo et al., 2005), and also compared
the component methods implemented as part of our full system. Ranking evaluation
enables the analysis of the accuracy gained for each category, since it creates a
separate list for each category. It also enables a more concrete analysis of the
methods' precision since it inspects the relative scoring of the documents inside the
category. Classification, on the other hand, enables the analysis of the comparative
scores obtained for a document across categories, and of the inter-category
relations. It also enables the analysis of false negative misclassifications, and thereby
points to necessary improvements.
Positive empirical results are presented for our complete method. It indeed
achieves higher precision, which supports the hypothesis that the LE-based
approach is more accurate than the context-based approach. The results are
accompanied by a comprehensive analysis of the expansion types and of the various
mechanisms needed to further improve the results.
1. Introduction
Text categorization (TC) is the task of classifying textual documents into preset
topical categories. The majority of research works in the TC area use supervised
methods, which rely on a vast, sometimes prohibitive, amount of labeled training data
to achieve accurate classification. The lack of sufficient available resources of labeled
training data for those kinds of tasks requires manual annotation of textual data, which
is often a long and very expensive task. On the other hand, there are large available
resources of unlabeled data which can be of aid for TC tasks. In recent
years there have been several research efforts which tried to base their methods on
unlabeled data rather than resorting to manual annotation. One group of these methods
is Keyword-based Text Categorization.
Keyword-based topical TC relies on keyword representation of categories and
documents, which requires only manual specification of the keyword representation
for the topical categories. The documents can be naturally represented by a processed
collection of the keywords contained in them. However, since the quality of the
classification is highly dependent on the accuracy of the keywords representing the
topical category, it may require careful manual specification of those keywords.
Methods such as (Liu et al. 2004) tried to make the manual specification process more
efficient by partially automating it, using a clustering method that creates a candidate
list of representative keywords for each category. Nevertheless, the method still
requires manual specification as part of the classification process.
To acquire the categories' characteristic keywords in a fully unsupervised manner, the
TC algorithm should use unsupervised methods to find relevant keywords in
unlabeled data. Therefore the only input available to such a method is
the names of the topical categories and the unlabeled data. (Ko and Seo, 2004;
Gliozzo et al., 2005) recognized the category names as a significant part of the input
since they should capture the meaning of the topic. For that reason the category names
could be used as seeds for the collection of category keywords in unsupervised TC
methods.
Collecting category-representing keywords by using the category name as a seed
relies on context models which use co-occurrence information to extract related
keywords from the text. (Ko and Seo, 2004) used a co-occurrence metric to employ a
feature-projection method in order to extract keywords for each category. Another
context model, Latent Semantic Analysis (LSA), which also relies on co-occurrence
data, was used in (Gliozzo et al. 2005) to measure similarity based on LSA topics.
However, basing the classification decisions on co-occurrence data has the major
drawback of relying on a weaker semantic relation than needed for this type of task.
Term co-occurrence only indicates that words tend to appear together and therefore
have a high probability of being related to the same context. LSA captures a stronger
aspect of co-occurrence, since it captures the tendencies of terms both to appear together
in the same document and to appear in similar contexts of other words. However, it
still captures only co-occurrence similarity and not necessarily a reference to the
category topic. For that reason, documents which share co-occurrence data with
a category's characteristic terms might be topically related to the category, but do not
necessarily discuss the category's topic specifically.
In this thesis we propose to base the classification method on a different lexical
relation, namely Lexical Entailment (LE) inference. Lexical Entailment inference
(Glickman et al., 2006) aims to define a more concrete criterion for word similarity than
the criterion defined by distributional similarity, to enable deduction of whether a certain
textual meaning can be inferred from a specific text. Generally speaking, a word w is
lexically referenced by a text t if there is an explicit or implied reference from a set of
words in t to a possible meaning of w. For instance, the word "Mercedes" entails the
word "car", since instead of "Danny drives his Mercedes", one may say "Danny has a
car". Entailment models should help the text categorization process
find texts for a specific category which do not contain the exact keyword from the
initial characteristic terms. Texts about cars might contain only words like
automobile, vehicle or even car manufacturers' names, and not the exact word "car".
Lexical entailment can help us enrich the seed keyword of a category name with
entailing words, as shown above.
We propose a TC method which uses only the category names as seeds for an
expansion step with virtually no manual processing requirement during the
classification process. The method relies on external resources to acquire knowledge
needed for the TC, instead of requiring manual analysis per category. Our method
relies on LE expansions which are integrated with co-occurrence data. The
combination of LE, which indicates that the topic is indeed referred to in the
document, with a context model, which indicates that the topic was addressed
prominently within the document context, is more likely to capture the meaning
needed for accurate classification. In addition, we use the automatic expansions to
create an initial set of classified documents which are then used as input for a
supervised learner in a bootstrapping procedure in order to acquire a final
classification.
In section 2 we provide some background on recent works and the resources used
for our method. We describe the entailment and context models used in our method in
sections 3.3.2 and 3.3.3. Section 4 discusses the evaluation of the proposed method
and analyzes the results of each step of the method. We show that using an initial
entailment method as the basis for the classification decision provides promising
preliminary results, which are restricted mostly by the recall of the LE resource in use.
The proposed method reaches higher precision, which implies that the
entailment assumption is indeed more suitable to the needs of the TC task. The analysis of
the method in section 4 describes the aspects in which the entailment-based method
outperforms the context-based method. With the ongoing development of promising LE
resources, TC methods based on the LE approach are expected to reach
further improved results.
2. Background
Text categorization (TC) is the task of classifying textual documents into a given set
of categories. In this thesis we focus on keyword-based TC, which represents
documents and categories as sets of keywords. This section describes related work and
provides motivation for our method. Supervised text categorization is presented first (2.1),
and then keyword-based text categorization methods are described (2.2). Next,
unsupervised methods for TC are presented, along with the framework and motivation of the
method we employ (2.3). Finally, we provide background on the lexical entailment
framework and resources and explain the motivation for using them (2.4).
2.1. Supervised text categorization
The supervised approach for TC uses a set of labeled documents to train a supervised
classifier (learner). Most work in text categorization has used a "bag of words"
representation, in which each feature corresponds to a single word, which is then used
to train the supervised classifier. (Tan et al., 2002) added bigram features to the
standard use of unigrams by selecting bigrams according to their Information Gain for
the category, and showed improvement of F1 and break-even point measures. Other
works tried to improve the accuracy of supervised TC by means that are
independent of the amount of labeled documents the method requires as input.
Several works, for example, tried to exploit lexical relations, such as hypernyms and
synonyms, to enrich the "bag of words" representation.
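As a minimal sketch of the Information Gain criterion mentioned above, the following computes the information gain of a binary feature (for instance, the presence of a bigram) with respect to membership in a single category. The function names and the count-based interface are illustrative assumptions and are not taken from (Tan et al., 2002).

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # a zero count contributes nothing (0 * log 0 is taken as 0)
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(with_pos, with_neg, without_pos, without_neg):
    """Information gain of a binary feature for a binary category.

    with_pos / with_neg: documents containing the feature that are in / not in
    the category. without_pos / without_neg: the same counts for documents
    that lack the feature. (Hypothetical interface, for illustration only.)
    """
    n_with = with_pos + with_neg
    n_without = without_pos + without_neg
    total = n_with + n_without
    # Entropy of the category before observing the feature.
    prior = entropy(with_pos + without_pos, with_neg + without_neg)
    # Expected entropy after splitting the documents on the feature.
    conditional = (n_with / total) * entropy(with_pos, with_neg) \
        + (n_without / total) * entropy(without_pos, without_neg)
    return prior - conditional
```

A feature that perfectly separates the category from the rest receives the maximal gain, while a feature distributed identically inside and outside the category receives a gain of zero; candidate features can then be ranked by this value.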
(Cai and Hofmann, 2003) used context models to enhance the feature
representation of documents. They used Probabilistic Latent Semantic Analysis to
automatically extract semantic concepts in order to achieve robustness with respect to
linguistic variations such as vocabulary and word choice. They used AdaBoost, a
boosting algorithm proposed by (Freund et al., 1998), to combine hypotheses
based both on the semantic concepts and on word features from the documents. The
combination of the two types of hypotheses showed an overall improvement of about
5% in accuracy.
WordNet was used as a source of synonyms and hypernyms to enhance
feature data for TC methods in several works. (de Buenaga Rodriguez et al. 1997)
utilized WordNet as a source of synonyms based on the assumption that the name of
the category can be a good predictor of its occurrence. They used WordNet synsets to
perform a category expansion, similar to query expansion, using the category
synonyms. This information was added to labeled training examples as the input of
supervised learning algorithms. The integrated algorithm achieved an improvement of
20 points in precision and was found to be extremely helpful for low-frequency
categories, which have fewer training examples. Another study that
combined WordNet information with labeled training data is that of (Scott and Matwin,
1999), who used WordNet as a source of synonyms and hypernyms which were
added to the representation of each document.
2.2. Keyword based TC
This research focuses on keyword-based TC, which represents both categories
and documents by a set of keywords. One approach to supervised keyword-based TC
is to ask the user to create a list of representative keywords for each category
(McCallum & Nigam, 1999), which is then used to identify an initial set of
documents for each category, instead of labeling training documents manually. (Liu et
al. 2004) recognized this step as a difficult part of the overall procedure, since the user can
only provide a limited set of words, which might be insufficient for accurate learning.
They proposed a keyword-based method consisting of the following steps: (a)
Cluster the unlabeled texts with the k-means algorithm, using the cosine similarity metric
from information retrieval (Salton & McGill, 1983) as the distance measure. The
words from each cluster are ranked by their information gain value and are given to
the user as candidate words for the initial feature vector. The user chooses the most
descriptive words from the list for each category and is given the option to add more
keywords which do not appear in the top-ranked list. (b) Create an initial set of
labeled documents using the cosine similarity metric (Salton & McGill, 1983) to
measure the similarity between documents and the representative keywords. (c)
Train a Naïve Bayes (NB) classifier using the Expectation Maximization (EM) algorithm. In
each iteration the EM algorithm trains an NB classifier and re-estimates the probability
of a document being classified to a category.
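Step (b) above can be sketched as follows. This is only an illustration under simplifying assumptions: documents are tokenized on whitespace, raw term counts are used as weights (a real system would more likely use tf-idf weights and proper preprocessing), and the names `cosine` and `initial_labels` are hypothetical rather than taken from (Liu et al. 2004).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def initial_labels(documents, category_keywords):
    """Assign each document to its most similar category, or None if no keyword matches.

    documents: list of raw text strings.
    category_keywords: dict mapping a category name to its list of keywords.
    """
    keyword_vectors = {c: Counter(kws) for c, kws in category_keywords.items()}
    labels = []
    for text in documents:
        doc_vector = Counter(text.lower().split())  # naive whitespace tokenization
        scores = {c: cosine(doc_vector, v) for c, v in keyword_vectors.items()}
        best = max(scores, key=scores.get)
        # Documents sharing no keyword with any category stay unlabeled.
        labels.append(best if scores[best] > 0 else None)
    return labels
```

The documents labeled in this way would then serve as the starting point for the iterative training in step (c); documents left unlabeled receive their category only during that later stage.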
(Liu et al. 2004) performed their evaluation on four of the easier subsets of the
20 Newsgroups dataset, each composed of 4 to 5 categories. The
evaluation compared classification based on seed keywords selected by the user
with classification based on the original k-means keywords. It shows that classification
based on the selected keywords obtains better results by a large margin.
A more recent attempt to approach keyword-based TC by categorizing
unlabeled text examples has been reported by (Ko and Seo, 2004). The first step of
their method used a bootstrapping algorithm on the co-occurrence
information of the unlabeled data to extract keyword lists. The lists were based on
frequent co-occurrences with seed keywords constructed from the category
names. In the second step the authors used an NB classifier to create an initial set
of labeled documents. The third step trained a classifier (Ko and Seo, 2002) based on
a feature projection technique which is robust to noisy data.
For evaluation, they compared the results to a semi-supervised and a
supervised method, and showed comparable results to the supervised categorization
method. The semi-supervised approach used experts to choose initial keywords as
seeds for the procedure described above. Using experts' knowledge was found to be
an expensive but worthwhile task, as it achieved significant improvement in one of
the data sets used for evaluation.
2.3. Categorization based on category name
TC approaches which do not require manual effort over the course of the
classification procedure have rarely been attempted in the literature. One
approach to performing TC without any manual effort during the classification process
is to use keyword-based methods with automated creation of category representations.
Building on the semi-supervised keyword-based TC methods described in section 2.2,
(Gliozzo et al. 2005) introduced an unsupervised bootstrapping keyword-based
method for text categorization, which uses only the category name as the input for the
bootstrapping algorithm. Their method consisted of the following three
steps:
(i) Initial creation of representing vectors for each category - the category
name was generalized using Latent Semantic Analysis (LSA) in which
documents and categories are represented in a latent semantic space. LSA is a
dimension reduction method which decreases the number of dimensions in the
document-by-term matrix. It converts the co-occurrence data represented in
the matrix to a representation of implicit semantic concepts in the latent space.
(ii) Initialization of labeled documents set for the supervised learning - A
Gaussian Mixtures (GM) algorithm was employed to obtain uniform
classification using the similarity scores in the latent semantic space as input.
The GM algorithm outputs new values for the probability of each document
being classified to a specific category given the similarity score. An initial set
of labeled documents is classified according to those probabilities, where each
document is classified to the best scoring category.
(iii) Supervised classifier training - An SVM classifier was trained to categorize
the unlabeled text based on the initial categorized set.
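Step (i) can be illustrated with a small sketch of similarity in a latent semantic space. It assumes NumPy, a dense term-by-document count matrix, a category represented as a pseudo-document containing only its seed terms, and one common folding-in variant (projection onto the top k left singular vectors). The function name `lsa_similarity` and these modeling choices are assumptions made for illustration; they are not the exact formulation of (Gliozzo et al. 2005), and steps (ii) and (iii) are not covered here.

```python
import numpy as np


def lsa_similarity(term_doc, seed_rows, k):
    """Cosine similarity between every document and a category in a k-dimensional latent space.

    term_doc: array of shape (n_terms, n_docs) holding term counts or weights.
    seed_rows: row indices of the category's seed terms (hypothetical interface).
    k: number of latent dimensions to keep.
    """
    # Truncated SVD: term_doc is approximated by U_k diag(s_k) V_k^T.
    u, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    u_k = u[:, :k]
    # Project documents and the category pseudo-document: x -> U_k^T x.
    docs_latent = (u_k.T @ term_doc).T          # shape (n_docs, k)
    category = np.zeros(term_doc.shape[0])
    category[seed_rows] = 1.0
    cat_latent = u_k.T @ category               # shape (k,)
    norms = np.linalg.norm(docs_latent, axis=1) * np.linalg.norm(cat_latent)
    norms[norms == 0] = 1.0                     # avoid division by zero
    return docs_latent @ cat_latent / norms
```

In a toy collection where one document contains "car" and "engine" and a second contains only "engine", the second document still obtains a high similarity to a category seeded with "car", because both terms map to the same latent dimension. This is the second-order co-occurrence effect that motivates the use of LSA.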
The authors reported results on two data sets, 20 Newsgroups¹ and
Reuters-10 (the 10 most frequent categories² in Reuters-21578³), showing improvement
relative to earlier keyword-based methods. A more detailed description of each of the
three methods used in this research can be found in Appendices A, B and C
respectively.
Another TC approach which does not require manual effort over the course of
the classification procedure is topical TC via clustering. This approach uses
unsupervised clustering methods to split the data into a preset number of clusters
which are then matched to the predefined categories. (Sahami et al., 1996; El-Yaniv
and Souroujon, 2001) used clustering methods in both a semi-supervised setting, in
which a small set of training examples was used, and in an unsupervised setting,
where no training examples were used. Both methods were evaluated over small
datasets which were obtained from subsets of the Reuters collection and the
20 Newsgroups collection.
We base our method on the automated keyword-based approach, and in
particular the approach described in (Gliozzo et al. 2005), by creating a two-phase
method: (1) automatically create category representations to acquire an initial set of
labeled documents based on a similarity score between the category and
document representations; (2) classify the unlabeled documents based on the initial
categorized set using an SVM-based classifier. As opposed to creating category
representations based on context models such as LSA, we utilize an integrated model
based on an entailment requirement instead of just co-occurrence data. We will next
describe the lexical entailment framework and the lexical semantic relation resources
which were used to acquire lexical entailment rules.
¹ The collection is available at www.ai.mit.edu/people/jrennie/20Newsgroups.
² The first 10 categories are: Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat
and Corn.
³ Available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
2.4. Lexical entailment
The ability to identify semantically equivalent pieces of text is important to various
NLP tasks. The Textual Entailment (TE) framework addresses this task by trying to
formulate the degree of semantic matching between snippets of text (Dagan et al.,
2006), that is to decide whether the meaning of one text, termed hypothesis, can be
inferred (entailed) from another text. This framework was identified as a core
semantic inference paradigm in the NLP field (Giampiccolo et al., 2007). It can
contribute to various tasks such as Information Retrieval (IR), Query Expansion (in
IR) and Question Answering (QA). For example, to answer a question
such as "Whom did SCO sue?", a QA system should be able to deduce that "SCO sued
IBM" can be inferred from "SCO won a lawsuit against IBM".
This thesis focuses on a subtask of the TE framework, proposed in (Glickman
et al., 2006), which is termed Lexical Entailment (LE). LE aims to recognize whether
each lexical meaning in the hypothesis text is referenced by some meaning in the
entailing text. More concretely, a word w is lexically referenced by a text t if there is
an explicit or implied reference from a set of words in t to a possible meaning of w.
LE relations may be represented by rules, denoted LHS ⇒ RHS, where the term on
the right hand side is entailed by the term on the left hand side. For instance, the rule
"Toyota ⇒ car" which is equivalent to a hyponym (is-a) relation can be useful to help
classify documents related to a "Cars" category. Another example is the rule
"Prisoner's dilemma ⇒ game theory", which can help classify a document
discussing the "Prisoner's dilemma" into a "Game theory" category. We refer to
rules of this type as entailment rules when they are used for the TC task.
In this thesis we aim to use entailment rules to expand the seed terms of the
category name in order to improve the accuracy of the TC task. Our application of LE
rules for the TC task is similar to the application of the LE framework for the Query
Expansion task, in which the entailing words are considered as expansions for the
query as suggested in (Clinchant et al., 2005). To integrate the LE rules in the TC
scheme described above, the initial seeds based on the category name are expanded
with entailing terms extracted from the LE rules. For each rule in which the RHS of
the rule is one of the seed terms for a specific category, the LHS term of this rule is
added to the seed terms of this category to create the set of representing keywords for
the category. Below we describe the external resources used by our method to extract
LE rules.
2.4.1. Lexical Entailment Resources
In the absence of dedicated comprehensive knowledge bases for lexical entailment
rules we based our entailment rules extraction methods on external resources
available online. The resources utilized for this purpose are a lexical resource, the
WordNet lexical ontology, and a textual resource, Wikipedia the online encyclopedia.
Given the different nature of the two resources, the methods applied to each of them are
quite different. Below we give a short description of each resource and its
characteristics.
#   Terms         Gloss
1   Infinite      the unlimited expanse in which everything is located
2   -             an empty area (usually bounded in some way between things)
3   -             an area reserved for some particular purpose
4   Outer Space   any location outside the Earth's atmosphere
Table 1 - The top 4 synsets for the noun "space". Each row presents the other terms in the synset (if any) and the gloss for this sense.
WordNet. WordNet⁴ is a fundamental computational lexical resource (Fellbaum,
1998) developed by a group of lexicographers led by Miller, Fellbaum and others at
Princeton University. It is a lexical ontology of semantic relations, available online and
widely used in natural language processing systems, which has been updated and
growing over the last fifteen years. It provides a large repository of English lexical
items, consisting of nouns, verbs, adverbs and adjectives which are organized into
synsets.
WordNet synsets are synonym sets, each of which represents a single sense of an
English term. A sense is defined to be the meaning of a single term for a given Part
Of Speech (POS). A gloss defining the concept is given for each
sense. The senses of a term are ordered by decreasing
frequency in the SemCor corpus (Miller et al., 1993). Since the frequency is estimated
according to corpus statistics, synsets with no measured estimation appear at the
4 We used version 3.0 of WordNet available at http://WordNet.princeton.edu/obtain
bottom of the synset list. Table 1 presents an example of the top four synsets for the
term "space", their glosses and their order according to the sense frequency.
WordNet contains two kinds of relations which link the different synsets,
lexical and semantic. Lexical relations hold between morphologically related word
forms such as derivations (formation of words from bases/words), while semantic
relations hold between word meanings. The semantic relations in WordNet
include hyponymy (is-a relation) and meronymy (is-part-of relation), which are
used in our method. For instance, some of the hyponyms of the word "auto" are "cab"
and "minivan", its meronyms include the words "bumper" and "window", and its
derivations are "automobile" and "automobilist". Nouns and verbs are organized in a
hyponymy/meronymy hierarchy, while adjectives are organized in clusters with a head
synset to which all related synsets have a similar meaning. Adverbs usually point to
the adjective from which they are derived.
Various NLP tasks have exploited WordNet as a source for lexical expansion.
Among them are the TC tasks described earlier in section 2.1, such as that of Scott
and Matwin (1999), who used WordNet as a source for synonyms and hyponyms.
Automatic indexing has been improved by adding the synsets of query words and
their hypernyms to the query (Mihalcea and Moldovan, 2000). Our method exploits
derivations, synonyms, hyponyms and meronyms of the seed term to acquire LE rules
based on the WordNet knowledge, as explained in section 3.3.2.
Wikipedia. Wikipedia⁵ is a collaborative online encyclopedia which covers a wide
variety of domains. Wikipedia is constantly growing and evolving based on the
contributions of online users, and had more than 1,700,000 articles in the English
version as of March 2007 (Kazama and Torisawa, 2007). Giles (2005) shows that the
quality of Wikipedia articles is comparable to that of the Britannica internet
encyclopedia.
Each Wikipedia article describes a unique concept, mostly named entities. The
article's text is a promising source for knowledge regarding the article's title as an
encyclopedia definition of it. Wikipedia also contains sources for knowledge
regarding the article titles which are common to encyclopedias or Web knowledge
bases. Like most encyclopedias, Wikipedia uses term canonization so that a single
5 We used the English version from February 2007 available at
www.ukp.tudarmstadt.de/software/JWPL
article would represent a group of analogous terms, and denotes this connection as
redirection. All terms contained in this type of equivalence group are redirected to the
same article with a single title, and the redirection relation between them can be
extracted. Another typical type of connection, which is common to Web based pages,
is hyperlinks, which connect terms in an article's text to the article defining them in
Wikipedia. The common structure employed for all Wikipedia articles contains
several data fields and specific structures.
The Wikipedia-based LE resource, developed by Eyal Shnarch for his MSc thesis,
aims to exploit the knowledge available in Wikipedia as a general online resource, but
without committing the method to Wikipedia-specific structure. Therefore, it uses
only the article's text, and the common Web encyclopedia relations, redirection and
hyperlinks. The evaluation of our method in section 4 shows that, as expected,
WordNet and Wikipedia are complementary resources, providing the typical
knowledge which can be found in each of them as a dictionary and an encyclopedia.
We describe the details of this method in section 3.3.2.
3. Text Categorization based on category name
3.1. Research goals
As described in section 1, keyword based TC consists of two steps: (1) setup, in
which a set of characteristic terms for each category is assembled, constituting the
category's feature vector, and (2) classification, in which the term-based feature vector
of the classified document is compared with the feature vectors of all categories.
The framework for keyword based TC which omits the manual annotation in
the setup phase must face the challenge of assembling the characteristic terms for
each category automatically. The most natural seed for the representing terms is the
category name itself, as was suggested by several semi-supervised and unsupervised
methods (McCallum & Nigam, 1999; Ko and Seo, 2004; Gliozzo et al. 2005). The
category name is selected by domain experts to represent the category's topic as
precisely as possible, and therefore is likely to be the most appropriate seed for this
purpose.
Based on an analysis of labeled data, we identify two requirements that
a document which belongs to a category should satisfy:
(i) Entailment requirement - the category name should be entailed by the
document text.
(ii) Context requirement – the category's general context should be
matched by the document.
The first requirement is that the category topic should be referred to at some
semantic level in the text. This can often be identified by the appearance of the terms
in the document which entail the topic name. We will refer to terms which entail the
category name as entailing terms for that category. This group of terms consists of
terms which entail at least one of the terms denoting the category's name, that is,
terms which appear on the left hand side of lexical entailment rules whose right hand
side is identical to the category name. For example, the category "Autos" is entailed
by the term "Car" according to the rule "car ⇒ auto" and by the term "Ford Escort"
extracted from the rule "Ford Escort ⇒ auto", etc. Therefore, the entailment
requirement is implemented in the setup phase by expanding the category seed (its
name) with terms which entail it.
The second requirement is that the overall context of the document should be
typical for the category topic. This is needed to ensure that the entailing terms for
that category (i) appear as part of the main topic of the text, and (ii) do not appear in a different sense
than the one entailing the category name. This requirement can be captured by a
group of terms which describe typical category contexts, even though they do not
necessarily entail the category. Such terms frequently appear in the category context
and therefore tend to co-occur with the category's entailing terms. Occurrence of such
terms implies that the text might be related to the category. For example, the word
"wheel" does not entail the category "Autos", as it can appear within the context of
several other vehicle categories. However, the presence of a significant amount of
such context words in a document increases the likelihood that this document may be
related to the category's topic. On the other hand, the lack of any context word in a
document decreases the likelihood that this document is relevant to the category's
topic. For that purpose, we will use context models based on co-occurrence data of
terms.
Following this idea, the goal of our approach is to combine the likelihood that
a certain document entails the category name and the likelihood that its context is
relevant to the category. Each measure will be based on the seed category names and
the results will be combined to obtain a unified score. We aim to improve the
precision of the classification by requiring entailment evidence for each
classification decision.
Overall, our method consists of the following steps:
(i) Initialization and scoring
a. Seeds: initiate each category vector by the category seed terms, which
correspond to the category name.
b. Entailment: represent each category by its seed terms along with the
entailing terms for the seeds, which together form the category's
entailment terms feature vector; and obtain an entailment similarity
score between the vectors of each document-category pair.
c. Context: represent each category and document by a co-occurrence-
based vector, and compute a context similarity score for each
document-category pair.
d. Combine the entailment score and context score to a single
categorization score for each document-category pair.
(ii) Classification
a. Initial labeled set creation: use the scores obtained in step (i).d to
classify each document to the category with the best score.
b. Bootstrapping: use the initial labeled set to train a supervised classifier.
3.2. Categorization tasks
The categorization scoring of documents may be applied in two different task settings.
The first is ranking, where a ranked list of documents is created for each category.
The documents are sorted in a descending order according to their categorization
score. The second is classification, where each document is
classified to the best category according to its similarity score.
Ranking the documents aims at achieving better precision at the top of the
sorted list, which means ranking true category documents at the top of the list while
ranking irrelevant documents at the bottom. Ranking allows evaluation of the scoring
method quality per category.
On the other hand, binary document classification provides a complementary
view of the scoring method. It makes a binary
decision for each document-category pair: whether or not the document belongs to the
category, according to the similarity score between them. It reflects the method's
ability to differentiate between categories and classify documents to the right
category. Section 4 evaluates and analyzes the methods investigated in this research
from the perspective of each of these two tasks.
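The difference between the two task settings can be illustrated with a minimal Python sketch. It assumes that a categorization score is already available for every document-category pair; the nested dictionary `scores` and the function names are hypothetical and serve illustration only, they are not part of the method's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: scores[category][document_id] -> categorization score.
Scores = Dict[str, Dict[str, float]]


def rank_documents(scores: Scores) -> Dict[str, List[Tuple[str, float]]]:
    """Ranking task: per category, sort the documents by descending score."""
    return {
        category: sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
        for category, doc_scores in scores.items()
    }


def classify_documents(scores: Scores) -> Dict[str, str]:
    """Classification task: assign each document to its best-scoring category."""
    best: Dict[str, Tuple[str, float]] = {}
    for category, doc_scores in scores.items():
        for doc, score in doc_scores.items():
            if doc not in best or score > best[doc][1]:
                best[doc] = (category, score)
    return {doc: category for doc, (category, _) in best.items()}


# Toy scores, invented for the example.
scores = {
    "autos": {"d1": 0.9, "d2": 0.1, "d3": 0.4},
    "space": {"d1": 0.2, "d2": 0.7, "d3": 0.3},
}
ranking = rank_documents(scores)
labels = classify_documents(scores)
```

Ranking is computed independently for each category, whereas classification compares the scores of one document across all categories.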
3.3. Scoring methods
This section describes the scoring methods utilized as part of our TC method. The
scoring methods use a similarity measure to provide a similarity score for each
document-category pair. We first describe the basic similarity method which relies on
the categories' seed terms derived from the category names. We then describe the
entailment similarity method (3.3.2) and context similarity method (3.3.3) used to
calculate the total similarity score provided by our method (3.3.5).
3.3.1. Vector Space Model based on category seed terms
Keyword based TC is often approached by exploiting a Vector Space Model scheme
as in (Liu et al. 2004; Gliozzo et al. 2005). Documents and categories are represented
as vectors and their similarity is measured in the vector space. For each category a
ranked list of documents can be created using the similarity score, and the
classification of each document is set to the most suitable category according to the
similarity score.
The most natural representation for the documents' feature vectors is as
vectors in the term space, similar to what was suggested in (Salton & McGill, 1983). The
vectors are made of feature-value pairs, where in most cases each word in the
vocabulary corresponds to a single feature and the value corresponds to its frequency in the
document or to a standard Term Frequency Inverse Document Frequency (TF-IDF) score.
We extended the features to be unigrams and bigrams of POS-tagged lemmas, with the
square root of their frequency in the document as the feature value, that is, $\sqrt{tf}$. The
reason for using bigrams as well as unigrams is to capture expanding multi-word
phrases as well as expanding words in the lexical expansion step of the method (for a
more detailed explanation of this, see 3.3.2). The vectors are filtered using a common
feature selection which removes the most common and the least common features in
the corpus.
Vector space models use a vector representation, similar to the document
feature-value representation, for the categories as well. The categories are represented
as feature vectors in the word space, containing characteristic words most suitable to
describe the category and a weight to signify their importance to the category. Since
the creation of a taxonomy of topics for TC task requires manual specification done
by domain experts, it can be assumed that the terms chosen to represent each topic as
the topic title in this process are selected carefully to encapsulate the common topic
the documents share. Therefore, those manually selected terms should be the best seed
terms to represent each category in its feature vector. Thus, as a baseline, we use a
category vector that includes only the seed category name or immediate variations of
it (without expansions). For example, the seed for the category "comp.graphics" was
simply taken to be the noun "Graphics". For "talk.politics.mideast",
on the other hand, the term "Middle East" was chosen as the most suitable seed term,
since the use of "mideast" is not common.
Then, we use a standard cosine similarity measure to measure the similarity
between document vectors and category vectors. Using cosine similarity as the
similarity measure, and (the square root of) word frequency as the value of the
features, the similarity between the vectors described above is analogous to measuring
the frequency of the category name in each document. The highest scores will be
given to documents with the highest frequency of the seeds. Since we used the square
root of the term frequency of each term w, denoted $\sqrt{tf(w)}$, instead of simple
term frequency (tf), the impact of high term frequency of a single term on the
classification decision regarding a certain document is decreased, giving a higher
weight to documents that contain several of the seed terms.
More formally, let $t \in T$ be a document in the text collection $T$, described by a
vector $\vec{t} \in \mathbb{R}^{|V|}$, and let $seed(c) \subset V$ be the seed of category $c \in C$, where $V$ is the
lexicon and $C$ is the set of categories in the collection taxonomy; let $\vec{c} \in \mathbb{R}^{|V|}$ be the
vector representing the category $c \in C$, which at this stage contains the terms in
$seed(c)$. The similarity function, $Sim_{seed}$, between the document vector and the
category vector is defined to be
$$sim(\vec{c}, \vec{t}) = \cos(\vec{c}, \vec{t}) = \frac{\vec{c} \cdot \vec{t}}{\|\vec{c}\| \, \|\vec{t}\|}$$
where for each term $t_i \in \vec{t}$ its value is set to be the square root of its term
frequency, $t_i = \sqrt{tf(t_i)}$, and for each category term $c_i \in \vec{c}$ its value is set
to be $c_i = 1$, meaning that all seed terms in the category's vector are
weighted equally, with weight 1.
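As an illustration of this definition, the following Python sketch computes the seed-based cosine similarity for a tokenized document. It is a simplified sketch under stated assumptions: the function name `seed_similarity` is hypothetical, the document is given as a plain list of terms, and the POS tagging, lemmatization and feature selection described above are omitted.

```python
import math
from collections import Counter
from typing import Iterable, Set


def seed_similarity(doc_terms: Iterable[str], seeds: Set[str]) -> float:
    """Cosine similarity between a sqrt(tf) document vector and a
    category vector in which every seed term has weight 1."""
    tf = Counter(doc_terms)
    doc_vec = {term: math.sqrt(count) for term, count in tf.items()}

    # Dot product: category weights are 1, so only matching seeds contribute.
    dot = sum(doc_vec.get(seed, 0.0) for seed in seeds)
    doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
    cat_norm = math.sqrt(len(seeds))

    if doc_norm == 0.0 or cat_norm == 0.0:
        return 0.0
    return dot / (doc_norm * cat_norm)


# Toy document: "graphics" occurs four times, "card" once.
doc = ["graphics", "graphics", "graphics", "graphics", "card"]
sim = seed_similarity(doc, {"graphics"})  # equals 2 / sqrt(5)
```

In the toy document the weight of "graphics" is $\sqrt{4} = 2$ rather than 4, which shows how the square root dampens the effect of a single highly frequent term.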
3.3.2. Entailment expansion methods
Entailment expansion of the category vectors is done by expanding the seeds of
each category, $seed(c)$, using entailment rules. The seed expansion is similar to the
notion of query expansion in Information Retrieval (IR), where the seed is analogous
to the query expanded. Each category is expanded by the left hand side (LHS) of all
lexical entailment rules (defined in 2.4) whose right hand side (RHS) is one of the
category seeds. We will refer to the set of terms which appear on the LHS of the
entailment rules extracted for a category as the entailing terms for this category. The
entailing terms are added to the feature vector representing the category, which was
described in the previous section. For example, the vector representing the category
"Autos" will be expanded with the term "automobile" according to the rule "automobile ⇒ auto".
In the absence of a generic knowledge base of lexical entailment rules, we used two
preliminary lexical entailment methods to obtain rules. The first method obtains
entailment rules from WordNet, exploiting the lexical semantic relations it contains.
The second method extracts entailment rules from Wikipedia, exploiting the vast
amount of general definitions and information it holds. Both methods are applied for
each of the categories, and all the LHS terms in the corresponding rules are then
merged into the category feature vector.
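The expansion step can be sketched in Python as follows. The sketch assumes that each resource has already produced its rules as (LHS, RHS) pairs; the rule lists below are toy examples, and the names `entailing_terms` and `expand_category` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Rule = Tuple[str, str]  # (lhs, rhs), read as "lhs => rhs"


def entailing_terms(seeds: Set[str], rules: Iterable[Rule]) -> Set[str]:
    """LHS terms of all rules whose RHS is one of the category seeds."""
    return {lhs for lhs, rhs in rules if rhs in seeds}


def expand_category(seeds: Set[str], *rule_sets: Iterable[Rule]) -> Set[str]:
    """Expand the seeds with each resource's rules and merge the results."""
    expanded = set(seeds)
    for rules in rule_sets:
        expanded |= entailing_terms(seeds, rules)
    return expanded


# Toy rule sets standing in for the two resources.
wordnet_rules = [("automobile", "auto"), ("car", "auto"), ("wheel", "vehicle")]
wikipedia_rules = [("ford escort", "auto"), ("yamaha sr500", "motorcycle")]

autos = expand_category({"auto"}, wordnet_rules, wikipedia_rules)
```

Only rules whose right hand side matches a seed contribute terms, so "wheel" and "yamaha sr500" are not added to the "Autos" category in this toy example.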
WordNet. Matching only the category name in a document, although very
precise, overlooks occurrences of entailing terms such as synonyms and derivations.
The need for this type of relation motivates us to use WordNet as a large lexical
resource which includes information on various lexical relations. For example, terms
such as 'car' and 'automobile' can be extracted using the synonym lexical relation
which associates them with the category name "Autos". Similarly, terms such as
'medical' and 'medication', which are morphological derivations of the category name
"Medicine", may be extracted as well.
Ambiguity is a common problem when using WordNet lexical relations for
expansions. Ambiguity of seed terms constructed from category names is rare in the
TC framework, since the required sense is mostly the dominant sense in the TC
corpus corresponding to the given taxonomy. However, when an irrelevant sense is
expanded via an external resource, the expansions of the irrelevant senses may be
frequent in the corpus. For example, for the category "Space" which is part of the
science hierarchy the "any location outside the Earth's atmosphere" sense can be
considered as relevant while the "an empty area" sense is clearly irrelevant as a source
of expansions for this category. Using entailment rules expanded from the irrelevant
sense might add a significant number of frequent words irrelevant to the category
being expanded. Since WordNet knowledge is organized by synsets, synonym sets which
represent one underlying sense, we can utilize this mechanism to guarantee that only
the required senses are used for the expansions, so that irrelevant expansions
are not used.
We base our utilization of WordNet on the assumption that during the manual
specification of the corpus taxonomy it is possible to provide the information of the
relevant WordNet synset(s) for each category, since the taxonomy creator addresses
difficulties such as disambiguation in the taxonomy creation process. Moreover, the
relatively small number of categories in an average corpus makes it worthwhile to add
this information in order to overcome ambiguity difficulties. Thus, in our
experiments, we have manually indicated the appropriate WordNet sense(s) for each
category seed word (see Table 2 and Table 3).
Our approach differentiates between several possible types of topics, which
were identified by manual analysis of sample taxonomies for TC tasks. The manual
analysis raised the different needs of expansion for different types of topics such as
topics which describe a general subject and topics which refer to a collection of
component parts. For example, the topic "Middle East" relates to a geographic region
and therefore requires expansion to its geographic parts. On the other hand a topic
such as "Medicine", which relates to a general scientific subject, requires expansions
such as branches of medicine. The differentiation between the topic types revealed
that some types of categories can be described as class topics and therefore require
expansion based on the hyponymy relation (is-a), while other types can be described
as ensembles of components and therefore require expansions based on the meronymy
relation (is-part-of). The hyponymy relation enables us to expand seeds to the
members of their class type, for example "Autos" can be expanded with "ambulance"
and "taxi" which are types of cars. The meronymy relation expands the topic to
members of its group, such as "Iran" and "Israel" for the "Middle East" topic.
Meronyms of "class type" topics were found to be less
useful overall for entailment rule extraction in many categories, since they mostly describe
common parts of the class and are therefore also meronyms of most of the class's
co-hyponyms (for "Autos", means of transportation in general). For example, the meronyms "wheel",
"door" and "window" of the topic "Autos" describe technical components of that
concept, while the meronyms described above for the "Middle East" category are parts
of that specific entity. For that reason, we automated the extraction of rules for
these two groups of categories by extracting rules based on the meronymy relation
only when the topic seed has no hyponyms in WordNet. The WordNet expansion
method starts its expansion process from the seeds based on the category names,
which are then expanded in an iterative manner. For each term expanded in one of the
expansion steps, the method checks whether it has hyponyms to be expanded to or
should instead be expanded to its meronyms.
The core of the WordNet extraction method is expanding terms to their
hyponyms, or to their meronyms if no hyponym exists for this term. This core step of
the method is augmented by basic steps of derivation and synonym expansion. The 4
steps of our WordNet expansion procedure are as follows:
(i) Expand each $seed(c)$ representing the category $c \in C$ to its
derivations, $der(seed(c))$, and synonyms, $syn(seed(c))$.
(ii) For each term $w \in der(seed(c)) \cup syn(seed(c))$:
a. If $w$ has any hyponyms, expand $w$ to its hyponyms and add them to form the group of terms
$core(c) = der(seed(c)) \cup syn(seed(c)) \cup hyponym(w)$
b. Otherwise, expand $w$ to its meronyms and add them to form the group of
terms $core(c) = der(seed(c)) \cup syn(seed(c)) \cup meronym(w)$
(iii) Expand each term $w \in core(c)$ to its derivations and synonyms to form
the group $wn(c)$.
(iv) For each $w \in wn(c)$, add $w$ to the expanded category vector $\vec{c}$, denoted
$\vec{wn}(c)$.
Accordingly, the likelihood of category $c$, represented by the expanded vector
$\vec{wn}(c)$, to be entailed by each document represented by the term vector $\vec{t}$ is measured
by the same similarity function used to measure the similarity of seeds and documents.
The WordNet similarity function, $Sim_{wn}$, is defined to be
$$Sim_{wn} = sim(\vec{wn}(c), \vec{t}) = \cos(\vec{wn}(c), \vec{t})$$
The weights of the expanded terms in the new category vector remain equal and set to
one as before, that is, for each $wn(c)_i \in \vec{wn}(c)$, $weight(wn(c)_i) = 1$.
As described above, our utilization of WordNet addresses the ambiguity of the
seed terms by specifying the required senses for them. In addition, our method partly
addresses the potential ambiguity of the expanding terms extracted. The expanding
terms are extracted from synsets of different sense frequencies, as they are extracted
from the sense which entails the seed term's sense. Infrequent senses of the expanding
term might cause misclassification due to their ambiguity, while they rarely increase
the recall since they seldom appear in the required sense. Our method uses the
WordNet synset order to filter infrequent synsets. WordNet synsets are ordered by
their frequency in the SemCor corpus (Miller et al., 1993), where the most frequent
sense is listed first. For example, the verb "steal" entails the seed term "Baseball" in
its fifth sense, while its frequent sense in the collection is its first sense "take without
the owner's consent", which is equally likely to appear in the context of any of the
other categories. Our method filters infrequent synsets using a configuration
parameter maxSense, which specifies the maximum ordinal number of a synset to be
used as a source for entailment rule extraction.
Another parameter which influences the accuracy of the expanding terms
extracted from WordNet is the depth within the hierarchy used to expand terms. The
WordNet structure enables recursive expansion, for example by expanding a term w
to all the hyponyms of all its descendants in the hyponym hierarchy. We will refer to
this parameter, denoted by maxDepth, within the parameter setting discussed in 4.1.1.
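The effect of the two parameters can be sketched in Python as follows. The sketch is a simplified illustration under stated assumptions: the hyponym hierarchy is a toy dictionary, each candidate term is paired with the ordinal number of the sense in which it entails the seed, and the function names are hypothetical. Apart from "steal", the sense ordinals below are invented for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def hyponyms_up_to_depth(term: str, hypo: Dict[str, List[str]],
                         max_depth: int) -> Set[str]:
    """Collect hyponyms of `term` down to `max_depth` levels of the hierarchy."""
    found: Set[str] = set()
    frontier = {term}
    for _ in range(max_depth):
        # Move one level down; skip terms already collected (guards against cycles).
        frontier = {h for t in frontier for h in hypo.get(t, [])} - found
        if not frontier:
            break
        found |= frontier
    return found


def filter_by_sense(candidates: Iterable[Tuple[str, int]],
                    max_sense: int) -> Set[str]:
    """Keep terms whose entailing sense is among the first `max_sense` senses."""
    return {term for term, sense_rank in candidates if sense_rank <= max_sense}


# Toy hierarchy and toy sense ordinals.
hypo = {"auto": ["cab", "minivan"], "cab": ["gypsy cab"]}
depth_one = hyponyms_up_to_depth("auto", hypo, max_depth=1)
depth_two = hyponyms_up_to_depth("auto", hypo, max_depth=2)
kept = filter_by_sense([("steal", 5), ("pitcher", 1)], max_sense=3)
```

With `max_sense=3`, the term "steal", which entails "Baseball" only in its fifth sense, is filtered out.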
Wikipedia. Wikipedia is a collaborative online encyclopedia which covers a wide
variety of domains. Entailment rules can be extracted from an online encyclopedia
not only from the definitions themselves but also
from HTML links and references. The Wikipedia resource, being an encyclopedic
resource containing cultural and day-to-day terms, is complementary by nature to the
type of rules extracted from the WordNet resource, which provides language oriented
terms similar to terms which can be found in a dictionary. We used a lexical
entailment resource extracted from Wikipedia which has been developed and utilized
by Eyal Shnarch for his MSc thesis, and integrated it into the general scheme of our
TC method to expand the seed terms for each category.
Each Wikipedia article describes a specific subject which is denoted by the title
of this Wikipedia entry. Following the notion suggested in (Kazama and Torisawa,
2007), the extraction method assumes that the best source for that definition lies in the
article's opening sentence. The motivation for this assumption is that the definition of
the topic mostly appears at the beginning of an encyclopedia article. It is noted that
extending the source of the definition to be the first paragraph as a whole was found
to be less accurate, and not beneficial in terms of lexical entailment rules recall.
The subject of an encyclopedia entry is mostly generalized by its definition.
Therefore, the rules extracted from each entry are entailment rules in which the title of
the article appears on the left hand side of the rule and the terms extracted from the
definition sentence appear on the right hand side, that is, the article's title
is assumed to entail the terms from the definition. The terms extracted are nouns and noun
phrases from the title and the definition, since the majority of Wikipedia titles are
noun phrases. For example, the rule "Yamaha SR500 ⇒ motorcycle" can be extracted
from the article defining "Yamaha SR500", to expand the seed name for the category
"Motorcycles".
Several extraction methods from the definition have been explored. We chose
to use only the extraction types that were found to be the most precise. Prior to all
rule extraction the Wikipedia-based method parses the definition sentence to facilitate
extraction of the preferred noun phrases. This method extracts the noun phrases which
appear as a nominal complement of the 'be' verb in the definition.
In addition to extraction from the definition, the encyclopedia structure was
also utilized to extract rules. Similar to traditional encyclopedias and dictionaries,
Wikipedia authors also provide manual canonization of the terms defined in it. All
terms contained in the same canonized group are redirected to the same Wikipedia
article. Since all terms in such a group are semantically equivalent, the rules extracted
from this relation are considered bi-directional. For example, a search for the term "mac"
is redirected to the article titled "Macintosh", which is also a category name. Based on
this knowledge the Wikipedia-based method can extract the bi-directional rule "mac
⇔ Macintosh", which results in the expansion of the seed term "Macintosh" to the
term "mac".
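The two rule types can be illustrated with a short Python sketch. It assumes that the noun phrases of the definition sentence have already been extracted by a parser and that the redirection table is available as a mapping from redirected term to article title; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (lhs, rhs), read as "lhs => rhs"


def definition_rules(title: str, definition_nouns: List[str]) -> List[Rule]:
    """One rule per definition noun phrase: title => noun phrase."""
    return [(title, noun) for noun in definition_nouns]


def redirect_rules(redirects: Dict[str, str]) -> List[Rule]:
    """Redirected terms are equivalent to the article title, so a rule
    is emitted in each direction."""
    rules: List[Rule] = []
    for source, target in redirects.items():
        rules.append((source, target))
        rules.append((target, source))
    return rules


# Toy inputs; the noun phrase "motorcycle" is assumed to come from the parser.
rules = definition_rules("Yamaha SR500", ["motorcycle"])
rules += redirect_rules({"mac": "Macintosh"})

# Expanding a seed: collect every LHS whose RHS is the seed.
macintosh_terms = {lhs for lhs, rhs in rules if rhs == "Macintosh"}
```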
Unsurprisingly, a significant part of the terms extracted from Wikipedia-based
rules are noun phrases (NP) longer than one word. For example, the expanding terms
extracted from Wikipedia include names of car types for the category "Autos" and
names of baseball players for the category "Baseball". The complex NPs extracted
from this resource motivated us to incorporate word bigrams as part of the features
extracted from the corpus of documents to be classified. The addition of bigrams to
the set of features included in the corpus vocabulary does not overload the
classification in terms of noise or overhead in calculations, since the cosine similarity
measure only considers terms which appear in the category's representing vector.
Thus, the vast majority of bigrams extracted from the corpus will be ignored.
Moreover, only NP bigrams are included by our method in the corpus vocabulary.
Consequently, the bigrams influence the classification measure only if they are
extracted by the entailment expansion method and also appear in the corpus, and
therefore are assumed to be of high importance.
Due to the detailed nature of an encyclopedia definition, we employed feature
selection based on frequency statistics to filter common English words extracted from
Wikipedia, in addition to the feature selection done on the corpus itself. The feature
selection performed on the corpus data was based on corpus statistics, whereas the
feature selection performed on the Wikipedia output rules was based on general
English statistics: the efreqFilter most frequent words in the English language
according to the Brown corpus (footnote 6) were filtered out. In contrast,
terms extracted from WordNet did not require the same type of filtering since they
were more precise than the terms extracted from Wikipedia.
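The frequency-based filter can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `filter_frequent` and the toy frequency list are hypothetical, and only whole expansion terms (not the words inside a multi-word term) are compared against the list.

```python
def filter_frequent(expansions, frequency_ranked_words, efreq_filter):
    """Drop expansion terms that are among the `efreq_filter` most frequent words."""
    stop = {w.lower() for w in frequency_ranked_words[:efreq_filter]}
    return [t for t in expansions if t.lower() not in stop]

# Toy frequency list, most frequent first. A real run would load a
# general-English list such as one derived from the Brown corpus.
ranked = ["the", "of", "and", "to", "a", "in", "that", "is", "was", "he"]

kept = filter_frequent(["motorcycle", "the", "Yamaha SR500", "is"], ranked, 8)
print(kept)
```

With the toy list and a cut-off of 8, the common words "the" and "is" are removed while the two noun phrases are kept.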
We tried to enhance the terms extracted by the Wikipedia resource using
several methods. First, we used the Wikipedia method to expand the characteristic
terms of each category as they were extracted from WordNet, that is, expanding the
seeds with their WordNet expansions and then expanding those terms by Wikipedia
expansion. For example, the category seed "Auto" is first expanded by
WordNet to acquire terms such as the seed's synonyms ("Car", "Automobile") and
hyponyms ("Convertible", "Minivan"), and those entailing terms extracted from WordNet
are then used as input for the Wikipedia method. Although this addition increased the
recall, it decreased the overall precision of the classification more substantially.
Accordingly, the seed terms based on the category names are expanded
independently from each resource and then the extracted terms are merged to generate
a single vector, $\overrightarrow{entail}(c)$, for each category c. The entailment similarity
score, $Sim_{entail}$, is obtained by
$$Sim_{entail} = sim(\overrightarrow{entail}(c), \vec{t}) = \cos(\overrightarrow{entail}(c), \vec{t})$$
Our system allows applying each of the expansion methods for each of the two
resources separately, to evaluate partial configurations. We denote as $Sim_{wn}$ the
similarity score between the vector $\overrightarrow{wn}(c)$, which contains the expansions from
WordNet, and each document vector $\vec{t}$, that is,
$Sim_{wn} = sim(\overrightarrow{wn}(c), \vec{t}) = \cos(\overrightarrow{wn}(c), \vec{t})$. Similarly, the similarity
score between the vector $\overrightarrow{wiki}(c)$, containing the expansions from Wikipedia, and
each document vector $\vec{t}$ is referred to as $Sim_{wiki}$, where
$Sim_{wiki} = sim(\overrightarrow{wiki}(c), \vec{t}) = \cos(\overrightarrow{wiki}(c), \vec{t})$.
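The three entailment scores can be illustrated with a small sketch over sparse vectors. It rests on assumptions that are not taken from this work: category vectors use binary weights, the term lists and the document vector are toy data, and the helper names `cosine` and `merge_expansions` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def merge_expansions(*term_lists):
    """Union of independently extracted expansions, as a binary category vector."""
    return {term: 1.0 for terms in term_lists for term in terms}

wn_terms = ["car", "automobile", "convertible", "minivan"]  # toy WordNet output
wiki_terms = ["car", "ford mustang"]                         # toy Wikipedia output

doc = {"car": 2.0, "ford mustang": 1.0, "price": 3.0}        # toy document vector

sim_entail = cosine(merge_expansions(wn_terms, wiki_terms), doc)
sim_wn = cosine(merge_expansions(wn_terms), doc)
sim_wiki = cosine(merge_expansions(wiki_terms), doc)
print(round(sim_entail, 3), round(sim_wn, 3), round(sim_wiki, 3))
```

Only terms shared by the category vector and the document contribute to the dot product, which is why corpus features that never appear in a category vector (for instance most bigrams) do not affect the score.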
3.3.3. Context Similarity methods
The occurrence of entailing words in a document suggests that the topic they entail
was mentioned in the document. However, it does not guarantee that this is one of the
main topics discussed. The category's entailing terms may sometimes appear as a passing
reference, that is, when the term is mentioned in the correct sense as part of the
context of another topic, or due to ambiguity of the entailing term.
For example, the term "car", which entails the category "Autos", may
also appear in a "Politics" context, as in a document which includes a political discussion
of environmental pollution: "...equal to the combined formic acid contributions of
automobiles...". A different example, of ambiguity of an entailing term, is the word
"race", which entails the category "Autos" in the sense of "a contest of speed". This
word may also appear in its other sense, "people who are believed to belong to the
same genetic stock", in a discussion regarding racism as part of the "Politics"
category, which may result in scoring errors.
6 Available at http://www.edict.com.hk/lexiconindex/frequencylists/words2000.htm
Hence, the general context of the category topic should be prominent in the
document. We define context words as words which are likely to appear in the context
of a certain topic, although they do not entail the topic directly. For instance, words
such as driver and wheel do not entail the topic "Autos", although they tend to co-
occur with this topic. Based on that assumption, we aim to measure the likelihood of a
document to belong to a certain context to complement the entailment measure
described above.
For this purpose our method uses the Latent Semantic Analysis (LSA) method,
which is described below.
Latent Semantic Analysis To measure context similarity we employed a Latent
Semantic Analysis (LSA) method (Deerwester et al., 1990), on the documents'
vectors. LSA is a dimensionality reduction method for co-occurrence data. The main
idea of LSA is to map each document vector into a lower dimensional space in which
the vector will be represented by "concepts" instead of terms. The dimension
reduction is performed using Singular Value Decomposition (SVD) to map the
document by term matrix into lower dimensional latent space.
The latent semantic vectors for terms and documents were calculated by a
variation of the fold-in documents methodology suggested by (Berry, 1992; Gliozzo
and Strapparava, 2005). As explained in Appendix A, LSA reduces the dimensions of
the term-by-document matrix to obtain a new representation of each term in a lower
dimensional space. The matrix contains the co-occurrence data of the terms at the
document level. That is, given the term-by-document matrix $\mathbf{M}$, which
represents the terms in the original space, each entry $m_{i,j} \in \mathbf{M}$ is
$m_{i,j} = tf(w_i, d_j)$, where tf is the term frequency of $w_i$, the ith term in the
lexicon, in $d_j$, the jth document. The matrix $\mathbf{M}$ is of size $V \times N$,
where V is the number of distinct terms
in the corpus and N is the number of documents in the corpus. Similarity between
terms is measured as the similarity between their representing vectors. It therefore
considers both first order similarity, in which terms tend to appear together in the same
document, and second order similarity, in which terms tend to appear in similar
contexts of other words. Since terms which tend to appear together are likely to be
mapped to the same "concept" in their representation in the latent space, similarity
between vectors will measure closeness of those concepts, as expected by second order
similarity.
The terms in the reduced space are represented as vectors in the latent
space, denoted $\overrightarrow{LSA}(w_i)$. We follow the scheme used in (Gliozzo and Strapparava,
2005) in which an Inverse Document Frequency (idf) scheme is used on top of the
LSA representation for each term. The scheme multiplies each term vector by its corpus
idf value and normalizes the vector:
$$\overrightarrow{LSA}_{norm}(w_i) = \frac{idf(w_i) \cdot \overrightarrow{LSA}(w_i)}{\sqrt{\langle \overrightarrow{LSA}(w_i), \overrightarrow{LSA}(w_i) \rangle}}$$
The representing vectors of the documents in the latent space, referred to by us as
$\overrightarrow{LSA}(t)$, are obtained by averaging the latent space vectors
$\overrightarrow{LSA}_{norm}(w_i)$ of each term $w_i$ which appears in the document, weighted by
term frequency:
$$\overrightarrow{LSA}(t) = \sum_{\forall w_i \in t} tf(w_i, t) \cdot \overrightarrow{LSA}_{norm}(w_i)$$
where $tf(w_i, t)$ is the term frequency of the term $w_i$ in t. Similarly, the representing
vectors for the categories, $\overrightarrow{LSA}(c)$, are obtained by averaging the LSA vectors of the
seed terms of each category constructed from the category name. The $\overrightarrow{LSA}(w_i)$ for
each unigram term was supplied to us by A. Gliozzo and C. Strapparava. Hence, the
documents were represented by unigram features in the LSA implementation. Bigram
terms, which represent a category, were taken as unigram terms and treated as
separate words for the LSA implementation.
The LSA similarity score between documents and categories, $Sim_{LSA}$, is
obtained by calculating the cosine similarity between the representing LSA vectors:
$$Sim_{LSA} = sim(\overrightarrow{LSA}(c), \overrightarrow{LSA}(t)) = \cos(\overrightarrow{LSA}(c), \overrightarrow{LSA}(t))$$
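The steps from term vectors to the LSA similarity score can be sketched as follows. This is a toy illustration rather than the implementation used here: the two-dimensional latent vectors and idf values are invented, the SVD that would produce real latent vectors is not shown, and the helper names are hypothetical.

```python
import math

def normalize_term(lsa_vec, idf):
    """LSA_norm(w) = idf(w) * LSA(w) / sqrt(<LSA(w), LSA(w)>)."""
    norm = math.sqrt(sum(x * x for x in lsa_vec))
    return [idf * x / norm for x in lsa_vec] if norm else list(lsa_vec)

def text_vector(term_freqs, lsa_norm):
    """LSA(t) = sum over terms of tf(w, t) * LSA_norm(w); unknown terms are skipped."""
    dims = len(next(iter(lsa_norm.values())))
    out = [0.0] * dims
    for term, tf in term_freqs.items():
        if term in lsa_norm:
            out = [o + tf * x for o, x in zip(out, lsa_norm[term])]
    return out

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy latent vectors and idf values, made up for illustration only.
lsa = {"car": [0.9, 0.1], "driver": [0.8, 0.2], "church": [0.1, 0.9]}
idf = {"car": 1.0, "driver": 1.5, "church": 2.0}
lsa_norm = {w: normalize_term(v, idf[w]) for w, v in lsa.items()}

category = text_vector({"car": 1}, lsa_norm)            # seed of "Autos"
doc_autos = text_vector({"driver": 3, "car": 1}, lsa_norm)
doc_religion = text_vector({"church": 2}, lsa_norm)

print(cosine(category, doc_autos) > cosine(category, doc_religion))
```

Because cosine similarity ignores vector length, the weighted sum and the weighted average of the term vectors give the same score.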
3.3.4. Gaussian Mixtures model
The final similarity score is obtained by employing a Gaussian Mixture (GM) model,
which rescales the scores obtained for each document-category pair, to obtain scores
on a common scale for all categories. The importance of the GM utilization stems
from the need to classify each document to the best scoring category to acquire a final
classification. Comparison between the scores of a document for each of the
categories is more reliable when the scores are normalized through the GM
estimation. The GM algorithm is described in detail in Appendix B.
In essence, the GM algorithm aims to estimate the probability of classification
of a document to a category given the similarity score of the document and the
category, i.e. $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$ (Appendix B uses a different notation
for this quantity). The
estimation of this probability is based on the assumption that the similarity scores are
derived from two Gaussian Probability Density Functions (PDF), for category and
non-category documents. The GM uses the similarity scores achieved by the methods
described above to obtain the parameters of the two Gaussians using an EM
algorithm.
In our implementation the similarity scores are taken from the cosine
similarity results between the LSA vectors of documents and category seeds. For each
$Sim_{LSA}$ score, measured by the cosine similarity function, the algorithm performs the
following steps:
(i) It assumes that the similarity scores of the positive and negative
documents in the (unlabeled) training set can be described by two
"hypothetic" Gaussian distributions, for $C$ and $\bar{C}$, which compose the
empirical distribution as a Gaussian mixture.
(ii) The conditional probability $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$ is estimated by applying
Bayes' theorem to the distributions of $C$ and $\bar{C}$.
The GM probabilities are then used as the scoring method for the context step
in our method, denoted as $Sim_{context}$. Each document is classified to the category with
the maximum value of $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$:
$$Sim_{context} = P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$$
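The two steps above can be sketched with a generic one-dimensional, two-component EM procedure. This is not the algorithm of Appendix B: the initialisation on the minimum and maximum score, the variance floor `min_var`, the fixed number of iterations and the toy scores are all assumptions made for the illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(scores, iterations=50, min_var=1e-4):
    """EM for a 1-D mixture of two Gaussians; returns [(prior, mean, var)] * 2.

    Component 0 starts on the lowest score (non-category documents) and
    component 1 on the highest score (category documents).
    """
    lo, hi = min(scores), max(scores)
    var0 = max((hi - lo) ** 2 / 4, min_var)
    params = [(0.5, lo, var0), (0.5, hi, var0)]
    for _ in range(iterations):
        # E-step: responsibility of the category component for each score.
        resp = []
        for x in scores:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in params]
            total = p[0] + p[1]
            resp.append(p[1] / total if total > 0 else 0.5)
        # M-step: re-estimate prior, mean and variance of both components.
        new_params = []
        for k in (0, 1):
            r = [ri if k == 1 else 1 - ri for ri in resp]
            mass = sum(r)
            if mass == 0:
                new_params.append(params[k])
                continue
            mean = sum(ri * x for ri, x in zip(r, scores)) / mass
            var = sum(ri * (x - mean) ** 2 for ri, x in zip(r, scores)) / mass
            new_params.append((mass / len(scores), mean, max(var, min_var)))
        params = new_params
    return params

def posterior_category(x, params):
    """Bayes' theorem: P(C | score) from the two fitted components."""
    p = [w * gaussian_pdf(x, m, v) for w, m, v in params]
    total = p[0] + p[1]
    return p[1] / total if total > 0 else 0.5

# Toy similarity scores of one category: many low scores, a few high ones.
scores = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.80, 0.85, 0.90]
params = fit_two_gaussians(scores)
print(posterior_category(0.85, params) > posterior_category(0.08, params))
```

The posterior maps raw similarity scores of every category onto the same probability scale, which is what makes scores of different categories comparable.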
Unfortunately, it was impossible to use the GM model to rescale the
entailment similarity score as well, since an entailment similarity score is obtained
only for documents which include at least one of the category's entailing terms. Due to
the lack of sufficiently many entailment rules at this point of our research, most of the
documents obtain a zero score and therefore do not provide the input data needed for the
GM algorithm. The zero scores could not be used as input for the GM estimation,
since they cause the GM model parameters (mean and variance) to tend to zero, and
the algorithm reaches computational underflow. We also tried to disregard the zero
scores, under the assumption that these documents were not actually classified as
negative but were left unclassified due to data sparseness. However, disregarding only the
documents which were not classified to any category is still insufficient. Most of the
documents obtain a positive score for one or two categories, while they obtain a zero
score for all other categories. Consequently, omitting
the documents which obtained a zero score for all categories still leaves a set of mostly
zero scores for each category, and the original problem holds for this set as
well.
GM models could be employed on the LSA similarity results since for LSA
vectors a positive score is obtained for each category and document pair. The set
obtained from the similarity score for each document is sufficient as input for the GM
algorithm to obtain the two groups of negative and positive classifications for each of
the categories.
While in our system the scoring method which uses the GM probabilities as
the similarity scores of documents and categories is one of the steps within the
complete method, it was used as the sole scoring method in (Gliozzo et al. 2005). We
denote the scoring method based on the GM probabilities as $Sim_{context}$, as defined
above, to reflect the probability that a document would be classified to the category.
We refer to $Sim_{context}$ as a baseline for evaluation purposes in section 4.
3.3.5. Combination of knowledge and context
As mentioned in section 2, basing TC methods on context similarity alone overlooks
the importance of the actual appearance of the category topic within a document and
evaluates only the general context for the category. We aim at a more accurate
measure of relevance for a category by basing our method on entailment of the
category name instead of mere evidence of contextually related text, and combining it
with a context model.
To combine the scores obtained by these two components of our scoring
method, we examined integration of the two scores both by addition and by
multiplication. The latter was found to be the more suitable method and produced better
accuracy. Moreover, a multiplication integration scheme combines
the scores without having to address the different scaling of the two
methods. Therefore, the combined similarity score, denoted as $Sim_{combined}$, is obtained
as follows:
$$Sim_{combined} = Sim_{entail} \cdot Sim_{context}$$
Using multiplication as the integration method of the scoring methods reduces
the score of documents which contain entailing terms but relate to irrelevant context.
Moreover, when the score obtained by the entailment scoring method is equal to zero
the integrated score would also be zero. Ideally, given perfect entailment knowledge it
means that when the text does not entail the category topic, it would not be classified
to it even if it involves related context.
For that reason, we find the combination of each entailment scoring method
with the context scoring method by multiplication useful to increase precision and
gain a more accurate measure of classification. The results obtained by using the
$Sim_{combined}$ similarity measure and all other possible combinations of the entailment
scores available in our system with the context score will be evaluated and compared in
section 4.
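The effect of the multiplicative combination can be seen on a toy example. The score values and category names below are invented for illustration; only the product itself follows the definition of the combined score.

```python
def combined_score(sim_entail, sim_context):
    """Sim_combined = Sim_entail * Sim_context."""
    return sim_entail * sim_context

# Toy scores of one document for three categories: (Sim_entail, Sim_context).
scores = {
    "autos": (0.40, 0.90),     # entailing terms in a matching context
    "politics": (0.40, 0.10),  # entailing terms, but irrelevant context
    "baseball": (0.00, 0.80),  # related context, no entailing term
}

combined = {c: combined_score(e, x) for c, (e, x) in scores.items()}
best = max(combined, key=combined.get)
print(best, combined)
```

The second category is penalised for its weak context score, and the third one receives a zero score because no entailing term occurs, regardless of how strong its context score is.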
3.4. Binary classification methods
One of the goals of TC is to obtain a binary classification of each document, deciding
whether it belongs to a given category or not. Categorization of documents can be
done in two different approaches: classifying each document to a single category,
referred to here as single-class, or classifying each document to several categories,
referred to here as multi-class. Using the similarity score obtained by one of the scoring
models described in sections 3.2 and 3.3, our method can initialize a preliminary set
of labeled documents by classifying each document to the category for which it
obtained the best score. This set of labeled documents can then be used to train a
supervised classifier and obtain classification results, allowing also multi-class
classification.
Classification bootstrapping The similarity scores obtained by the $Sim_{combined}$
measure were used to produce an initial labeled set of documents to train a supervised
classifier. We used the initial labeled set, in which each document is considered as
classified only to its best category, to train an SVM classifier for each category. For
this purpose, we used SVMlight (Joachims, 1999), a state-of-the-art SVM classifier,
with the documents' feature-value vectors as input. The features used were the same
terms used for the scoring methods described above, meaning the unigrams and noun
phrase bigrams, with the square root of the term frequency as the value. The vectors
were fed to the SVM learning module, whose goal is to find an optimal separating
hyperplane between the positive and negative classifications of the documents. We
used the default settings for most of the parameters, excluding the parameters
j and c, which were manually tuned to obtain optimal classification. The SVM scores
for each document-category pair could be used for either of the classification
approaches, depending on user choice.
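The construction of the initial labeled set can be sketched as follows. The sketch only prepares training examples in the sparse `<target> <feature>:<value>` line format read by SVMlight, with ascending feature numbers; it does not run the learner. The toy scores, documents, feature numbering and function names are assumptions made for the illustration.

```python
import math

def initial_labels(score_table):
    """Assign each document to its best-scoring category (single label)."""
    return {doc: max(cats, key=cats.get) for doc, cats in score_table.items()}

def svmlight_line(label, term_freqs, feature_ids):
    """One training example: '<target> <id>:<value> ...' with ascending ids.

    The feature value is the square root of the term frequency.
    """
    feats = sorted((feature_ids[t], math.sqrt(tf))
                   for t, tf in term_freqs.items() if t in feature_ids)
    return " ".join([label] + [f"{i}:{v:.4f}" for i, v in feats])

def training_file(category, labels, docs, feature_ids):
    """Binary training set of one category: +1 for its documents, -1 otherwise."""
    return [svmlight_line("+1" if labels[d] == category else "-1",
                          docs[d], feature_ids)
            for d in sorted(docs)]

# Toy combined scores and term frequencies, made up for illustration.
score_table = {"d1": {"autos": 0.36, "politics": 0.04},
               "d2": {"autos": 0.01, "politics": 0.20}}
docs = {"d1": {"car": 4, "engine": 1}, "d2": {"election": 9, "car": 1}}
feature_ids = {"car": 1, "election": 2, "engine": 3}

labels = initial_labels(score_table)
autos_train = training_file("autos", labels, docs, feature_ids)
print("\n".join(autos_train))
```

One such file would be written per category, so that a separate binary classifier can be trained for each of them.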
Appendix C includes more technical detail of the training procedure and the
SVMlight parameters. More details about the parameter settings can be found in
Section 4.1.1 below.
4. Evaluation
The evaluation of the categorization method was performed over datasets dedicated
to topical text categorization. Those datasets supply a pre-defined set of categories
and consist of training and test sets with gold standard annotation. The evaluation
analyzes the performance of our method compared with a baseline of the work of
(Gliozzo et al. 2005). For this purpose we replicated their work and compared our
scoring method, which uses the $Sim_{combined}$ similarity score, with their scoring method
based on the $Sim_{context}$ similarity score. We also applied the bootstrapping phase on
both scoring methods for comparison. We used the components of our scoring
method, $Sim_{seed}$, $Sim_{wn}$, $Sim_{wiki}$ and $Sim_{entail}$, as additional baselines, to evaluate the
contribution of each. We first describe the data sets and settings of our evaluation
(4.1), followed by an evaluation of our method in a ranking task (4.2), an evaluation
of our method in a classification task (4.3) and then an evaluation on the Reuters-10
collection (4.4).
4.1. Data set and Pre processing
In this section we describe the datasets used for the evaluation of our method and the
data pre-processing steps.
20 Newsgroups The 20 Newsgroups corpus is a collection of newsgroup
documents gathered from 20 different categories from the Usenet Newsgroups
hierarchy, which are detailed in Table 2. We used the "bydate" version (footnote 7) of the
corpus, which is the recommended version for TC tasks. This version contains approximately
20,000 documents partitioned (nearly) evenly across the 20 categories and divided in
advance into train (60%) and test (40%) parts.
The categories are taken from the Usenet hierarchy, originally consisting of
eight major topics at the top level of the hierarchy. This hierarchy also contains a
non-topical category, which is more difficult to classify since the documents that belong to
it do not "discuss" this topic in the common manner. For example, the "Forsale"
category, which is a non-topical category, belongs to the Miscellaneous branch,
containing topics irrelevant to the other seven major topics. Moreover, some of the
categories are topically closer than others, such as the three religion categories:
"Atheism", "Religion" and "Christianity". Since those categories originate from
different major topics at the top level of the hierarchy, they are not organized as
sub-categories or in any other formal relation to each other. Therefore, the semantic
relations between those categories are not reflected in the taxonomy hierarchy, which
creates difficulties in the categorization, as described in section 4.2.3.
7 The collection is available at www.ai.mit.edu/people/jrennie/20Newsgroups
Category | Seed
alt.atheism | atheism#n#1_2
comp.graphics | graphic#n#1
comp.os.ms-windows.misc | microsoft windows#n#-1
comp.sys.ibm.pc.hardware | ibm#n#-1;pc#n#1
comp.sys.mac.hardware | mac#n#-1;macintosh#n#-1
comp.windows.x | x11#n#-1;x-windows#n#-1
misc.forsale | sale#n#1
rec.autos | car#n#1
rec.motorcycles | motorcycle#n#1
rec.sport.baseball | baseball#n#1
rec.sport.hockey | hockey#n#2
sci.crypt | cryptography#n#1_2
sci.electronics | electronics#n#1
sci.med | medicine#n#1_3
sci.space | outer space#n#1
soc.religion.christian | christian#n#1;christian#a#2
talk.politics.guns | gun#n#1_2
talk.politics.mideast | mideast#n#1;middle east#n#1
talk.politics.misc | politics#n#3_4_5
talk.religion.misc | religion#n#1
Table 2 - Initial seeds for the 20 Newsgroups collection; each seed is described in the
structure lemma#pos#wordnet-sense
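The seed notation of Table 2 can be read mechanically, as the following sketch shows. It assumes, from the table entries alone, that ";" separates alternative seeds of a category and "_" separates several sense numbers of one seed; the value -1 is kept as it appears, since its meaning is not defined in this section. The function name `parse_seeds` is hypothetical.

```python
def parse_seeds(spec):
    """Parse e.g. 'christian#n#1;christian#a#2' into (lemma, pos, senses) tuples."""
    seeds = []
    for item in spec.split(";"):
        # Split from the right so that lemmas containing spaces stay intact.
        lemma, pos, senses = item.rsplit("#", 2)
        seeds.append((lemma, pos, [int(s) for s in senses.split("_")]))
    return seeds

print(parse_seeds("politics#n#3_4_5"))
print(parse_seeds("ibm#n#-1;pc#n#1"))
print(parse_seeds("microsoft windows#n#-1"))
```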
Overall, although the categories are represented in a flat hierarchy in the 20
Newsgroups collection, they can be divided into six main topically related subjects:
scientific, computers, religion, politics, recreation and miscellaneous, which contains
only the topic "Forsale". Another example of the complexity of this collection is that
it also contains miscellaneous topics within the six subjects described above. For
example, the politics branch contains categories such as "Guns" and "Middle East", in
addition to a miscellaneous "Politics" category. Finally, it should be noted that manual
examination of documents raised the hypothesis that some of the documents belong to
more than the single category they belong to according to the gold standard. Given
that the collection is based on online collaborative posting of opinions and discussions,
it is highly possible that some of the postings should have been classified to several
categories.
Despite its limitations, the 20 Newsgroups is a widely used corpus for TC
tasks, which enables comparison and further development of methods. In addition,
the collection does not include cross-posts (duplicates) or non-textual headers
(such as Xref, Newsgroups, Follow-up-To and Date), which often help classification.
The seeds used for category expansion and the corresponding WordNet senses
chosen for them are listed in Table 2. The seeds were derived from the topic name
itself and they are given in a term#pos#WordNet_sense structure.
Reuters-10 The Reuters-10 is a sub-corpus of the Reuters-21578 collection (footnote 8),
constructed from the 10 most frequent categories in the Reuters taxonomy, which are
detailed in Table 3. We used the Apté split of the Reuters-21578 collection, which is
often used in TC tasks. The complete collection contains 12,902 documents for 90
categories, where the 10 top categories include 9,296 documents. The documents are
divided unevenly over the 10 categories, where most of the documents belong to the
"Acquisition" and "Earn" categories. The documents are divided in advance into train
(70%) and test (30%) parts. The collection's gold standard is multi-class classified,
hence each document is classified to one or more categories.
The Reuters categories are domain specific, and are all relevant to economic
topics. The categories are organized in a flat taxonomy structure, which means there
is no defined hierarchy for the categories in the corpus. Moreover, some of the
categories are non-topical, such as the "Money-fx" category, which contains
documents discussing foreign-exchange transactions. These documents are mostly in a
specific table structure, with no significant textual data contained in them.
For these reasons we chose to focus on the 20 Newsgroups collection for our
evaluation. We found it more appropriate to the type of topical categorization
addressed by our research. Furthermore, our research was conducted as part of a
media research project, and it aims to improve TC accuracy for topical taxonomies.
The 20 Newsgroups collection contains various topical categories, common interest
categories and general subjects, which are more suited to this type of research. Given
that, our analyses are based on the 20 Newsgroups collection, while the results for the
Reuters-10 dataset are briefly described in section 4.4.
8 Available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Category | Seed
acquisition | acquisition#n#1_2
corn | corn#n#1
crude | crude#n#1
earn | earn#v#1;earnings#n#1
grain | grain#n#2
interest | interest#n#4
money-fx | money#n#3;foreign exchange#n#1
ship | ship#n#1
trade | trade#n#1
wheat | wheat#n#1
Table 3 - Initial seeds for the Reuters-10 collection; each seed is described in the
structure lemma#pos#wordnet-sense
Pre-processing The textual documents were split into sentences and tagged for
Part-Of-Speech (POS) using the Opennlp toolkit (footnote 9), and then the tagged terms
were lemmatized using the Gate toolkit (footnote 10). The features considered for TC are
noun, verb, adjective and adverb unigrams, and noun bigrams. We used standard feature
selection to remove highly frequent features, which appeared in more than 20% of the
documents, and infrequent features, which occurred fewer than three times in the corpus.
All features were taken from the text part of the documents, omitting titles and
headers.
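The two feature selection thresholds can be sketched as follows. The sketch assumes that documents are already tokenised into lemmatized features; the function name `select_features` and the toy corpus are hypothetical.

```python
from collections import Counter

def select_features(docs, max_doc_ratio=0.20, min_count=3):
    """Keep features that occur in at most `max_doc_ratio` of the documents
    and at least `min_count` times in the whole corpus."""
    doc_freq, corpus_count = Counter(), Counter()
    for tokens in docs:
        corpus_count.update(tokens)    # total occurrences in the corpus
        doc_freq.update(set(tokens))   # number of documents containing the feature
    limit = max_doc_ratio * len(docs)
    return {f for f in corpus_count
            if doc_freq[f] <= limit and corpus_count[f] >= min_count}

# Toy corpus of 10 documents, already tokenised.
docs = [["say", "engine", "engine"], ["say", "engine"], ["say", "carburetor"]]
docs += [["say"] for _ in range(7)]

selected = select_features(docs)
print(selected)
```

In the toy corpus "say" is dropped because it occurs in every document, "carburetor" is dropped because it occurs only once, and "engine" is kept.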
4.1.1. Experimental settings
The settings for the evaluation of our method require the tuning of several
parameters for the different methods. For that purpose, the training documents of the
20 Newsgroups collection were split into two groups, training (60%) and test (40%),
keeping the original proportion of the collection.
For the entailment method, described in section 3.3.2, three parameters were
tuned:
(i) maxSense - the number of senses considered to expand a term, which
controls the level of infrequent synsets taken, was set to 4 (inclusive), for
which the method obtained the most accurate results.
(ii) maxDepth - the depth of the hyponymy/meronymy hierarchy considered
for the semantic relation expansions from WordNet was set to 1.
Preliminary experiments showed that when the entire hierarchy was
considered, many irrelevant terms were added, creating noisy classification
results.
(iii) efreqFilter - the number of most frequent English words which were
filtered from the final expanding list (mostly from the Wikipedia
expansions, since the intersection of WordNet expansions with this list was
empty). This parameter was set to the 500 most frequent words according
to the Brown corpus (footnote 11), for which the method obtained the most accurate
results.
9 Opennlp tools available at http://opennlp.sourceforge.net/ version 1.3.0
10 The Gate toolkit available at http://gate.ac.uk/ version 3.1
The context model used required parameter settings for the dimNum parameter
which controls the number of dimensions in the latent semantic space. We adopted the
setting suggested by (Gliozzo et al. 2005) which set the number of dimensions to be
400.
The final step of our method is applying a bootstrapping method on the initial
labeled set of documents. As described in section 3.4.2 we used SVMlight for this
purpose with its default parameters, excluding the parameters j and c which were
manually tuned. The parameter j was set to correspond to the number of categories,
which hypothetically corresponds to the proportion between positive and negative
training instances in each category, as suggested by (Morik et al., 1999). Therefore,
the j parameter was set to 20 for the 20 Newsgroups collection, corresponding to the
20 categories, and to 10 for the Reuters-10 collection, corresponding to its 10
categories. The second parameter c was manually tuned and was set to 0.01, which
gave the best classification results on the development set as described earlier.
4.2. Ranking
4.2.1. Ranking measure
As described in section 3, the first categorization goal which we examine is ranking,
which orders the documents for each category according to their score. A ranking
evaluation measure quantifies the system's ability to rank documents for a given
category, preferring a ranking which places documents that truly belong to the
category before the others. Average precision is a common evaluation measure for
system rankings, and is computed as the average of the system's precision values at all
points in the ranked list where recall increases (Voorhees and Harman 1999). In our
case, a point where recall increases corresponds to the rank of a category document, as
annotated in the gold standard. More formally, it can be written as follows:
$$AP(c) = \frac{1}{R} \cdot \sum_{i=1}^{n} \frac{E(i) \cdot CorrectUpToRank(i)}{i}$$
where n is the number of documents classified to a specific category c in the test set,
R is the total number of correct classifications in the test set for this category, $E(i)$ is
1 if the ith document is classified to this category according to the gold standard and 0
otherwise, $CorrectUpToRank(i)$ is the number of documents among the first i ranked
documents which belong to the category according to the gold standard, and i ranges
over the documents, ordered by their ranking.
The score calculated by the average precision measure ranges between 0 and 1, where 1
stands for perfect ranking, which places all the category documents before the
non-category ones. This value corresponds to the area under the non-interpolated
recall-precision curve for the category. Mean Average Precision (MAP) is defined as the
mean of the average precision values for all the categories.
11 The Brown corpus available at http://www.edict.com.hk/lexiconindex/frequencylists/words2000.htm
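The average precision and MAP measures of section 4.2.1 can be computed with a few lines of code. The sketch below is illustrative: a ranking is given as a list of 0/1 gold-standard indicators E(i) in ranked order, and the optional argument `total_relevant` (a hypothetical name) allows R to be set explicitly instead of being counted from the list.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """AP(c) = (1/R) * sum_i E(i) * CorrectUpToRank(i) / i over the ranked list."""
    r = sum(ranked_relevance) if total_relevant is None else total_relevant
    if r == 0:
        return 0.0
    correct_up_to_rank, total = 0, 0.0
    for i, e in enumerate(ranked_relevance, start=1):
        if e:
            correct_up_to_rank += 1
            total += correct_up_to_rank / i   # precision at a point where recall increases
    return total / r

def mean_average_precision(per_category):
    """Mean of the average precision values over all categories."""
    values = [average_precision(r) for r in per_category.values()]
    return sum(values) / len(values)

# Toy rankings: 1 marks a document that belongs to the category.
rankings = {"autos": [1, 1, 0], "politics": [0, 1]}
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision(rankings))
```

For the ranking [1, 0, 1, 0] the two relevant documents are found at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.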
4.2.2. Ranking results
We evaluated the ranking quality of the scoring method $Sim_{combined}$ described in
section 3.3, using the MAP measure explained above. The scoring method ranks
documents which contain at least a single occurrence of an entailing term, as
explained in section 3. Given that, documents which do not contain any entailing term
are not ranked using our method. We therefore considered the parameter R in the
Average Precision measure to be the number of documents which contain at least one
entailing term, which corresponds to the maximal coverage that can be obtained by
the knowledge available to our system.
We compared our scoring method, which uses the $Sim_{combined}$ similarity score, to
three baselines. The first is ranking the documents by the $Sim_{seed}$ similarity score, as
explained in section 3.3.1. This illustrates a naive method which uses only the most
basic information provided by category names. The second baseline is the
$Sim_{entail}$ similarity score, which uses only entailment expansions to set the score.
Finally, we took the $Sim_{context}$ similarity score which was used in (Gliozzo et al. 2005),
using an LSA similarity score rescaled by the GM algorithm to rank each document
for all categories (footnote 12). The methods were compared within the range of knowledge
acquired by our method, meaning that the number of correct rankings considered for the
MAP calculation of all methods was limited to the first R rankings which our method
also ranked, which is an average of 37% of the documents. This comparison
corresponds to comparing the area under the non-interpolated recall-precision curve
of each of the methods up to the recall level achieved by using the $Sim_{combined}$
similarity score.
Figure 1 presents the recall-precision curves
obtained by using the following similarity scores: $Sim_{combined}$, $Sim_{seed}$, $Sim_{entail}$ and
$Sim_{context}$. It shows that $Sim_{combined}$ outperforms the other methods almost
consistently, by several percentage points. Although the score obtained by $Sim_{seed}$ is
extremely precise, it is limited in recall since it relies on limited knowledge. Consequently,
the curve denoting this scoring method decreases rapidly, since most of the categories'
recall does not reach 100% and the average precision is decreased. Comparison
between the two curves denoting the $Sim_{combined}$ scoring and the $Sim_{entail}$ scoring shows
that integrating the entailment knowledge with the LSA-based context model achieves
higher precision, showing an average improvement of 3 points. Moreover, the $Sim_{combined}$
score shows a steady advantage of an average of 5.5 points in the average precision
over the $Sim_{context}$ scoring method suggested in (Gliozzo et al. 2005). It implies that
using the entailment hypothesis results in a more precise measure than relying on
context modeling alone, as will be discussed in detail in the analysis section below.
Figure 1 - Recall-Precision curve for the sub-set of documents which match entailment
knowledge. The results are shown for the $Sim_{combined}$, $Sim_{context}$, $Sim_{entail}$ and $Sim_{seed}$
similarity methods within this range of documents.
12 The context method originally suggested by (Gliozzo et al. 2005) was replicated by us and the results
reported here are based on that replication
[Recall-precision plot. x-axis: Recall (0 to 1); y-axis: Precision (0.2 to 0.8); curves: $Sim_{combined}$, $Sim_{wiki} \cdot Sim_{context}$, $Sim_{wn} \cdot Sim_{context}$]
In order to evaluate the contribution of each of the entailment expansion
methods used, we also compared the average precision obtained by each of the
scoring methods Simwn ⋅ Simcontext, Simwiki ⋅ Simcontext and Simentail, as can be seen in
Figure 2. Since the Simentail score is based on the combined knowledge
of the WordNet and Wikipedia resources, which are used separately for the
Simwn and Simwiki scores respectively, each of those two scoring methods yields a number
of correct classifications lower than or equal to that of Simentail. Consequently, the curves of
those methods end at a lower recall point. It can be seen that the combination of the
two methods not only improves recall but also improves precision, since relying on
more knowledge results in a more accurate estimation of the entailment evidence
in the documents. It is also clear that the Simwiki score is highly precise, although limited in
recall.
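The reason a method with less knowledge ends at a lower recall point can be illustrated with a small sketch. The function name `recall_precision_points` and the toy rankings are assumptions made for illustration only; this is not the evaluation code used in this work. Recall is computed against the total number of relevant documents in the category, so a ranking which scores fewer documents cannot reach the same final recall.

```python
def recall_precision_points(ranked_relevance, total_relevant):
    """Return (recall, precision) after each relevant document in a ranked list.

    ranked_relevance: list of booleans, True where the ranked document
    belongs to the category according to the gold standard.
    total_relevant: number of relevant documents in the evaluated set.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


# Toy data (hypothetical): 4 relevant documents exist in the category.
full_ranking = [True, False, True, True, False, True]  # scores every document
partial_ranking = [True, True]                         # scores only a subset

full_curve = recall_precision_points(full_ranking, total_relevant=4)
partial_curve = recall_precision_points(partial_ranking, total_relevant=4)
```

In this toy case the partial ranking is perfectly precise but its curve stops at half the recall of the full ranking, which mirrors the behavior described for the Simwn and Simwiki based curves.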
Figure 2 - Recall-Precision curve for the sub-set of documents which match entailment knowledge.
The results are shown for the entailment methods, Simentail, Simwn and Simwiki, integrated with the
context method, Simcontext, within this range of documents.
Simseed ⋅ Simcontext Simwn ⋅ Simcontext Simwiki ⋅ Simcontext Simcontext Simcombined
Atheism 0.31 0.66 0.67 0.56 0.67
Graphics 0.60 0.60 0.60 0.77 0.60
Ms-Windows 0.03 0.03 0.68 0.37 0.68
IBM PC 0.23 0.23 0.23 0.25 0.23
Macintosh 0.51 0.51 0.55 0.34 0.55
Windows-X 0.74 0.74 0.74 0.31 0.74
Forsale 0.77 0.77 0.77 0.92 0.77
Autos 0.74 0.83 0.75 0.86 0.83
Motorcycles 0.30 0.88 0.35 0.92 0.92
Baseball 0.55 0.68 0.56 0.97 0.69
Hockey 0.94 0.94 0.94 0.90 0.94
Cryptography 0.21 0.97 0.23 0.96 0.98
Electronics 0.57 0.60 0.57 0.23 0.61
Medicine 0.39 0.83 0.46 0.78 0.84
Space 0.01 0.91 0.03 0.87 0.90
Christian 0.26 0.59 0.26 0.57 0.57
Guns 0.62 0.77 0.62 0.84 0.72
Middle East 0.10 0.89 0.10 0.68 0.89
Politics 0.03 0.20 0.03 0.13 0.18
Religion 0.07 0.22 0.07 0.17 0.18
Average 0.40 0.64 0.46 0.62 0.67
Table 4 - MAP values for each of the methods used within the range of documents which contain
entailment terms.
To conclude the ranking evaluation, Table 4 presents the MAP values
calculated for each category when ranking according to each of the following scores:
Simcombined, Simwiki, Simwn, Simseed and Simcontext. Ranking by the Simcombined score
achieves a higher average MAP value than all other methods. In particular, it achieves a MAP value
5.5 points higher than ranking according to the Simcontext score, and obtains a higher
MAP value in 12 of the 20 categories and a comparable MAP value in 2 of the
categories. It should be noted that the MAP values for the methods (see footnote 13) which rely on less
knowledge are lower on average, since the R parameter is larger than the number of
correct classifications they achieved.
13 These rankings are based on Simseed, Simwn and Simwiki, which rely on partial entailment knowledge
relative to the rankings according to Simentail and Simcombined.
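The effect of the R parameter can be made concrete with a short sketch. It assumes that R denotes the number of relevant documents for the category within the evaluated range, as in the usual definition of average precision; the function name and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance, r):
    """Average precision of one ranked list, normalized by R.

    ranked_relevance: list of booleans, True for documents which belong
    to the category according to the gold standard.
    r: number of relevant documents for the category (assumed meaning of R).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / r if r else 0.0


# Toy data (hypothetical): R = 4 relevant documents in the category.
ap_partial = average_precision([True, True], r=4)
ap_full = average_precision([True, False, True, True, False, True], r=4)
```

The partial ranking makes no mistakes, yet its average precision is only 0.5 because it retrieves two of the four relevant documents while still being divided by R = 4.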
4.2.3. Analysis
Ranking the documents according to their score revealed interesting characteristics of
each of the scoring methods and of topical categorization in general. As opposed
to single-class classification, the ranking approach exposes the score that each
document obtained from the scoring method used. This allows us to explore the different
phenomena which affect the categorization. Below is a detailed analysis of some of
the aspects in which our method showed improvements, followed by an error analysis
explaining some of the reasons for the errors and suggesting how our method can be
improved in future work.
Figure 3 - Recall-Precision curves for 4 categories for documents for which our method has knowledge
in the 20 Newsgroups collection. "Medicine" – which shows rescaling by context score to increase
accuracy of Simcombined method; "Electronics" – which demonstrates higher accuracy by entailment
score than for context score; "Atheism" – which demonstrates misclassification with high rank;
"Forsale" – which demonstrates results for non-topical category.
Successful ranking of category's documents. We first describe several aspects
in which our method showed improvements and demonstrated appropriate behavior.
1. Passing reference: Our preliminary manual analysis of TC behavior, using the
ranking by the Simseed score as a baseline, showed that the dominant phenomenon which
causes misclassifications is passing references. A passing reference tends to occur
when the topic name or any partial group of its characteristic terms appears in a
[Figure 3 plots: four panels titled "Atheism", "Forsale", "Medicine" and "Electronics", each showing Precision (y-axis) versus Recall (x-axis).]
document in the required sense, but not as the main topic of the document.
This phenomenon is relevant to all types of topics, including named entities such as
software names which are commonly mentioned, general topics which may be
discussed as an allegory, or an object which is widely referred to in the language. Table 5
shows several examples of documents which contain a passing reference to one of the
20 Newsgroups collection topics.
Our method is designed to identify this phenomenon by two different
mechanisms. The first is the entailment expansion of the characteristic terms,
which results in higher scores for documents which contain multiple occurrences of
entailing terms. The entailment score integrated in our ranking method uses the cosine
similarity score based on the square root of the term frequency of the entailing terms
which appear in the document. This scoring method gives higher ranks to documents
with occurrences of several different entailing terms. Recurring terms also
enhance the document score, but more moderately, since the term frequency is
measured using a square root function; occurrences of different terms in the document
contribute more to its total score. This scheme thus prefers
documents which use multiple terms which entail the topic. Documents which only
address the topic as a passing reference would most likely use a single occurrence of
one of the terms and will be ranked lower.
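The behavior of the square-root weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the category vector is assumed to give weight 1 to every entailing term, the document vector is assumed to cover all tokens of the document with square-root term frequency weights, and the expansion list for "Guns" is hypothetical.

```python
import math
from collections import Counter


def entailment_score(doc_tokens, entailing_terms):
    """Cosine similarity between a sqrt-tf document vector and a category vector.

    Assumptions (for illustration only): each entailing term has weight 1
    in the category vector, and the document vector spans all its tokens.
    """
    tf = Counter(doc_tokens)
    terms = set(entailing_terms)
    dot = sum(math.sqrt(tf[t]) for t in terms if t in tf)
    # (sqrt(tf))^2 == tf, so the document norm is sqrt of the token count.
    doc_norm = math.sqrt(sum(tf.values()))
    cat_norm = math.sqrt(len(terms))
    if doc_norm == 0 or cat_norm == 0:
        return 0.0
    return dot / (doc_norm * cat_norm)


guns_terms = ["gun", "rifle", "pistol", "firearm"]  # hypothetical expansion
filler = ["w%d" % i for i in range(11)]             # unrelated tokens

passing = ["gun"] + filler            # one occurrence of one entailing term
repeated = ["gun"] * 4 + filler[:8]   # one entailing term, four occurrences
varied = guns_terms + filler[:8]      # four different entailing terms

score_passing = entailment_score(passing, guns_terms)
score_repeated = entailment_score(repeated, guns_terms)
score_varied = entailment_score(varied, guns_terms)
```

All three toy documents have 12 tokens, so the differences come only from the entailing terms: the document with four different terms scores highest, the one repeating a single term four times scores in between, and the passing reference scores lowest.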
The second mechanism which decreases the score obtained by documents with
a passing reference is the use of context models. When a term which entails a certain
topic appears out of context, a context model should give a lower score to the document,
since its context is irrelevant to this topic. Indeed, by multiplying the score given by
the entailment scoring method by the score given by the context scoring method, documents which
obtained a high entailment score were re-ranked lower due to their low context score. Figure 3
demonstrates the improvement in accuracy when using this mechanism for the
"Medicine" category. It can be noticed that the irrelevant documents which decrease
the precision are re-ranked at the bottom of the ranked list in the Simcombined curve:
the high scores of documents which received a high entailment score are decreased by
the multiplication with the context model score. Unfortunately, when the context model fails to identify
the irrelevant context correctly, this mechanism does not improve the ranking. We will
discuss the misclassifications caused by this later in this section. Figure 4
demonstrates the improvement gained by using the Simcontext score for the "Autos"
category, in which false positive rankings are rescaled to lower score and true positive
rankings are (mostly) left the same.
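The re-ranking effect of the multiplication can be sketched with toy scores. The document identifiers and score values below are hypothetical, and treating a missing context score as 0 is an assumption of this sketch rather than a documented property of our system.

```python
def combined_ranking(entail_scores, context_scores):
    """Rank documents by Simentail * Simcontext, highest score first.

    A document without a context score is given 0.0 here (an assumption).
    Ties are broken by document identifier to keep the order deterministic.
    """
    combined = {
        doc: entail_scores[doc] * context_scores.get(doc, 0.0)
        for doc in entail_scores
    }
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


# Toy data (hypothetical): a passing reference has a high entailment score
# but a low context score for the category.
entail = {"d_passing": 0.9, "d_on_topic": 0.6}
context = {"d_passing": 0.1, "d_on_topic": 0.8}

ranking = combined_ranking(entail, context)
```

Ranked by the entailment score alone, the passing reference would come first; after multiplication its score drops to about 0.09 against 0.48 for the on-topic document, so the order is reversed.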
Gold Standard Category   Method's Classification   Document Example
Religion                 Christianity              "…The fictional Christian or Moslem or Jew who is…"
Windows-X                MS-Windows                "…I was wondering if Microsoft had bought Xhibition? ... I thought Xhibition was for "X-Windows"…"
Forsale                  IBM Pc                    "…the title says it all (not IBM brand)"
Graphics                 Macintosh                 "Which newsgroup discusses graphic design on PCs and macs? ..."
Guns                     Forsale                   "…my opinions are mine and are not for sale…"
Table 5 – Document samples for the passing reference phenomenon.
2. Topically close categories: topically close categories are mostly sister terms at the
same level in the topical taxonomy hierarchy. In the 20 Newsgroups collection, for
instance, topically close categories exist as sister terms in the computers group of
topics, such as "MS-Windows" and "IBM Pc". Topically close categories also
exist as topics in different branches of the taxonomy, such as the "Electronics" topic,
which is part of the science branch. This topic is highly related to the topics in the
computer branch, mainly the computer hardware topics.
Analysis of the ranking obtained using the Simcontext score shows that this
model fails to make clear differentiation between closely related topics. Mostly, for a
document which belongs to a category that has several close topics in the categories
set, the context model will score this document similarly for all those categories. A
document which received high score for computer "Graphics", for instance, would
mostly be ranked likewise in the "Electronics" science topic. However, the entailment
models expand each category name with terms which entail specifically that
particular category, and therefore obtain better results empirically. The similarity of
the context of the terms has minimal effect on the entailment model ranking since it is
not designed to judge context. Nevertheless, there is yet another aspect of topically
close categories, in which our full model tends to misjudge documents. This aspect
will be discussed in more detail below, under the "Taxonomy structure" subsection.
Figure 3 demonstrates the superior results of the Simentail based ranking over the
Simcontext based ranking for the topically close category case, showing the curves for
the "Electronics" category. It can be noticed that the ranking based on the Simentail
score obtains the best accuracy for the "Electronics" category, while the ranking
according to the Simcontext score fails to obtain high precision. Moreover, the
combination of the two methods obtains results comparable to using Simentail
alone, since the Simcontext scores are not accurate enough.
3. Infrequent seed term: a common difficulty which may cause classification errors
is terms which have low frequency in the collection, i.e. infrequent terms. Methods
which base their scoring on the collection's data, such as distributional, statistical and
co-occurrence methods (among them the Simcontext based method), cannot obtain
sufficient data for those categories. Analysis of the ranking according to
the Simcontext scores, for example, suggests that the low frequency of certain terms results in
poor LSA vectors for describing those categories. When the topic name appears rarely in
the collection, most of the examples seem negative to the unsupervised LSA
method, and the method cannot collect sufficient data to capture the category's
characteristics. For example, the category name of the "Windows-x" category is
relatively rare, and the Simcontext scores obtained for it are inaccurate and result in poor
MAP results, as can be seen in Table 4.
Our method is not restricted to the amount of knowledge which exists in the
collection itself. It uses external resources to enrich the knowledge about the
categories and exploits it to make better classifications. Categories such as "Middle
East", "Macintosh" and "Ms-windows" obtained poor ranking results using the
Simcontext score which is based on co-occurrence data from the collection, compared
with our method as demonstrated in Figure 3. Moreover, the extent of external
knowledge used for our system increases both recall and precision. Recall
increases since more entailing words are considered, so a score can be provided for a
larger portion of the documents, and precision increases since the similarity score
calculation is based on more data and thereby reflects the frequency of entailing
words in the document more accurately. The effect of infrequent terms on the scoring
quality is also reflected in Figure 4, which demonstrates the influence of Simcontext on the
Simcombined score. It can be seen that the combined scores for "Middle East" are
mostly decreased by the context score, for both false and true positive rankings. The
context of the "Middle East" category is thus not recognized accurately enough.
Nevertheless, we should point out the advantage of a collection-based method,
which can be used to complement external knowledge. Collections are often biased
and reflect topic frequency which is different from their frequency in general English.
For example, the most frequent sense of the topic name "Space" in the 20
Newsgroups collection is space as a science, while in general English it is ranked as
the fourth sense according to sense frequency. Using scoring methods based on
the collection data can be helpful for disambiguation and tuning, while it does not
limit the knowledge learned by our method.
Error analysis. Several error cases that were detected and categorized are
detailed below. The first type of error causes irrelevant documents to be ranked within
categories to which they do not belong. The second type causes scaling errors, where
documents are ranked higher (or at the same rank) for the parent of the category they
truly belong to than for the true category according to the gold
standard. In this case the documents are mostly relevant to the parent category at
some level, in terms of being a close topic or having some context relevancy. The third
type addresses collection annotation problems and discusses the complexity of single-class
classification. The last case addresses a difficulty which originates in the
definition of the categories.
1. Ambiguity of expanding terms: the seed terms of our method are extracted from
the topical taxonomy of the TC collection, by taking the topic name as the category
seed. Ambiguity of the topic name within the collection is rare since it is chosen to be
very precise and to capture the full meaning behind the topic. However, by using
entailment expansions as part of our method, terms are added to the seed term
to represent the category. One of the reasons for the high ranking of irrelevant documents
is ambiguity of the expanding terms.
As described in section 3.3.1, we employed an initial method to filter
ambiguous terms originating in the WordNet expansion, using the WordNet sense
information as a threshold parameter. Unfortunately, it yielded only a partial
improvement, and infrequent senses of terms were still added as expanding terms. For
example, the noun "steal" was added as an expansion to the category "Baseball" in its
second sense.
2. Taxonomy structure: taxonomy structure is another aspect of topic closeness,
which was discussed earlier as a challenge which our method addresses relatively
successfully. Hypothetically, the structure of a given taxonomy for text categorization
should be hierarchical such that each topic should logically contain the terms below it
in the hierarchy. Miscellaneous topics, by this logic, should include documents that
belong to the parent topic while they do not belong to any co-hyponym topics under
that parent topic. In the 20 Newsgroups collection the documents are categorized by user
annotation, owing to their origin as newsgroup postings. The categorization, as well as the basic
structure of the taxonomy, does not always follow the hierarchical reasoning
described above.
For example, the category "Atheism" is not a sister term of the category
"Religion" in the given taxonomy, nor is the category "Christianity". Each of these
three categories has a different parent in the hierarchy of the collection's taxonomy.
This topical closeness creates classification difficulties for the scoring methods, both
entailment-based and context-based. For the entailment methods, terms which entail
both the term and its hypernym (its parent in the hierarchy) can be used for expansion,
which results in ranking highly the same documents for both. The context model, on
the other hand, would identify similar contexts as the representing context of both
categories.
Figure 3 demonstrates this type of error for the "Atheism" category. It can be noticed
that all methods obtain low precision at low recall, which indicates that documents that are
irrelevant according to the gold standard were ranked at the beginning of the list. It
can also be noticed that better results are obtained by using the Simentail score, since it
relies on entailing words, which can be better distinguished than context for those
categories. We believe that the ranking can be improved to avoid this type of
mistake by exploiting the taxonomy structure in the classification procedure, that is,
by ranking the documents for each level of the taxonomy iteratively, corresponding to the
topics' semantic relations. We discuss the details of this idea in section 5.
Gold Standard Category   Method's Classification   Document Example
Christianity             Atheism                   "…I am interested in finding out why people become atheists…"
MS-Windows               IBM Pc                    "Because of the technology… IBM PC can't read them without special hardware…"
Macintosh                Forsale                   "I have for sale a Hayes 2400…"
Religion                 Christianity              "If a Christian means someone who believes in the divinity of Jesus…"
Guns                     Autos                     "…question quiz and to drive a car around the block…Most states do not require the registration of cars that are not…"
Table 6 - Document samples for missing annotations.
3. Gold standard missing classifications: this type of error causes a
significant number of false rankings and might also cause some errors of
the other types described in this section. The gold standard of the 20 Newsgroups
collection assigns each document to a single category, based on the newsgroup to which
the user posted it. Some of the documents contain a comparison between topics
discussed in different categories, such as "Baseball" vs. "Hockey", or "Autos" vs.
"Motorcycles". Other documents discuss topically related categories, such as a
"Religion" discussion which relates to different religions, among them "Christianity".
Looking at the 10 top-ranked documents of each category, out of 66 errors we identified
30 documents which should be classified to both categories: the gold
standard category and the one identified by our method. Table 6 shows several examples of
missing annotations in the 20 Newsgroups collection.
Figure 4 - Context scoring influence.
The influence of the context scoring on the entailment scoring as reflected in the Simcombined score,
where the y-axis is the score and the x-axis is the number of documents. The dark (blue) curve stands for the
Simentail score and the bright (pink) curve stands for the Simcombined score. It can be noticed that the
context score decreases mostly false positive scores for the "Autos" category, whereas for the "Middle
East" category it decreases true positive scores as well.
4. Non-topical categories: our method aims to capture the topic of each category in
order to make correct classification. Non-topical categories which do not gather
documents concerning a joint topic fall out of the scope defined for the text
classification task as we see it. The category "Forsale" in the 20 Newsgroups
collection is one example of a non-topical category. Figure 3 shows that indeed, our
method obtains lower accuracy for this category than the context model. We aim to
utilize a corpus of topical categories for our future work, since we believe it would
better reflect the essence of the task.
4.3. Classification
4.3.1. Classification measure
The second goal of TC is the binary classification of documents into the predefined
categories. We report the results of the scoring-based classification using cosine
similarity to obtain a classification score for each document-category pair. Similarly
to the ranking evaluation, we report results using the Simcombined score-based method as
the final scoring method of our system, and the Simcontext score-based method as a
baseline proposed by (Gliozzo et al. 2005). In addition, we present the results
obtained by each of the partial score-based methods from which the Simcombined score
is composed: Simseed, Simwn, Simwiki and Simentail. We also report the classification
results obtained for the bootstrapping step for our final method based on the Simcombined
score and the baseline method which is based on the Simcontext score.
Given the gold standard of the collection used for the evaluation, standard
accuracy measures can be calculated. For each method evaluated the following
measures were calculated:
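$$\text{Recall} = \frac{\#\,\text{correct classifications per category}}{\#\,\text{documents in category}}$$

$$\text{Precision} = \frac{\#\,\text{correct classifications per category}}{\#\,\text{documents classified to category}}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$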
We report the micro-average values of these measures for each of the methods
and the detailed results for each category for the methods based on Simcombined and
Simcontext. Similar to the evaluation done for the ranking goal, we calculate the results
of all methods according to the maximum knowledge obtained by our most
comprehensive score, Simcombined. For the 20 Newsgroups collection by using the
Simcombined knowledge about 60% of the documents in the test set were classified, and
therefore this portion was considered as the full collection for the calculations. Only
documents which were given a score by Simcombined participated in the evaluation.
This perspective allows us to measure the results on the amount of
knowledge we are able to obtain at this time, and on which a comparison between
the two methods can be considered relevant.
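The micro-averaged measures can be sketched as follows. The sketch assumes the standard pooling definition of micro-averaging (correct, classified and gold counts are summed over all categories before the ratios are taken) and assumes that a document may be left unclassified; the function name, the dictionaries and the toy data are hypothetical.

```python
def micro_average(gold, predicted):
    """Micro-averaged recall, precision and F1 over the evaluated documents.

    gold: dict mapping each evaluated document to its gold standard category.
    predicted: dict mapping each classified document to the category chosen
    by the method; documents left unclassified are simply absent.
    """
    correct = sum(1 for doc, cat in predicted.items() if gold.get(doc) == cat)
    classified = len(predicted)
    total = len(gold)
    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Toy data (hypothetical): only documents covered by the entailment
# knowledge are included in `gold`, following the evaluation setting above.
gold = {"d1": "Autos", "d2": "Autos", "d3": "Guns", "d4": "Space"}
predicted = {"d1": "Autos", "d3": "Autos"}

recall, precision, f1 = micro_average(gold, predicted)
```

Restricting `gold` to the covered documents is what makes the covered portion count as the full collection in the calculation.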
The classifications for all datasets were performed using the single-class
classification approach for all score-based methods. The classification for the
Reuters-10 collection according to the bootstrapping results in section 4.4 was performed using
the multi-class classification approach, since the gold standard for this collection is
given by a multi-class classification.
4.3.2. Classification results
TC tasks have the advantage that each document can be classified to one or more
categories depending on the application requirements. We followed the classification
standard used in the collection settings. Therefore, the classification for the 20
Newsgroups collection was a single-class classification, meaning each document was
classified to a single category. Documents were classified to the best-scoring category
according to the score obtained by the scoring method employed.
Scoring method Recall Precision F1
Simseed 0.19 0.55 0.28
Simwn 0.29 0.56 0.38
Simwiki 0.22 0.57 0.31
Simentail 0.31 0.57 0.40
Simcombined 0.32 0.58 0.41
Simcontext 0.30 0.55 0.39
Bootstrap Simcombined 0.33 0.63 0.44
Bootstrap Simcontext 0.51 0.53 0.52
Table 7 - Micro-average classification results for all methods within the portion of the 20 Newsgroups
collection covered by the entailment knowledge. 14
Table 7 presents the classification results for the methods composing our
Simcombined score and for the Simcombined based method itself, showing the advantage of
the Simcombined based method in both recall and precision. In addition, Table 7 also
shows the classification results for the bootstrapping step employed on the initial
classified document set obtained by the Simcombined and Simcontext scoring methods.
14 The documents considered for the recall and precision calculations of the Simcontext based method are
only documents which contain at least one piece of entailment evidence, which is 55% of the documents.
The bootstrapping step achieves different results based on each of the labeled
sets of documents created by the two methods. The bootstrapping step achieves higher
precision based on the Simcombined initial set of documents, while it achieves better
recall based on the training set constructed by the Simcontext score. The reason for that
lies in the accuracy obtained by the SVM algorithm separation on the training set (the
SVM algorithm is described in Appendix C). The SVMlight algorithm achieved
97.74% average precision on the training set according to its separation of the sample
documents obtained by using the Simcombined score, while it achieved an average
precision of 50.51% on the sample training documents based on the Simcontext score. It
implies that the separation obtained for the sample documents obtained using our
method, Simcombined, results in a more precise set, which enables a clear separation
with fewer misclassifications according to the SVM separation in the training set.
Simcombined Simcontext
Recall Precision F1 Recall Precision F1
Atheism 0.24 0.73 0.36 0.43 0.49 0.46
Graphics 0.19 0.50 0.28 0.22 0.43 0.29
Ms-Windows 0.37 0.66 0.48 0.35 0.61 0.44
IBM PC 0.24 0.26 0.25 0.32 0.30 0.31
Macintosh 0.35 0.54 0.43 0.14 0.19 0.16
Windows-X 0.06 0.75 0.11 0.06 0.24 0.10
Forsale 0.33 0.68 0.44 0.23 0.83 0.36
Autos 0.61 0.62 0.62 0.66 0.72 0.69
Motorcycles 0.48 0.90 0.63 0.50 0.90 0.65
Baseball 0.34 0.76 0.47 0.38 0.90 0.54
Hockey 0.34 0.96 0.50 0.43 0.72 0.54
Cryptography 0.38 0.86 0.52 0.38 0.93 0.54
Electronics 0.12 0.54 0.20 0.10 0.51 0.17
Medicine 0.38 0.60 0.47 0.28 0.77 0.41
Space 0.33 0.68 0.45 0.34 0.84 0.48
Christianity 0.60 0.57 0.58 0.56 0.58 0.57
Guns 0.25 0.52 0.34 0.32 0.66 0.43
Middle East 0.27 0.80 0.41 0.02 0.53 0.05
Politics 0.15 0.27 0.19 0.08 0.13 0.10
Religion 0.16 0.15 0.15 0.11 0.12 0.11
Micro average 0.32 0.58 0.41 0.30 0.55 0.39
Table 8 - Classification results per category for Simcombined and Simcontext methods for the 20
Newsgroups collection.
Table 8 presents the classification results gained by the Simcombined and
Simcontext scoring methods for each of the categories, as well as the micro-average
results obtained by these scoring methods. Our method shows comparable results to
the Simcontext based method. Details of the full analysis of both tables are given in the
following section.
4.3.3. Analysis
The type of classification used in our method was determined based on the method
applied for the gold standard of the test collection. The 20 Newsgroups collection
gold standard is annotated based on a single-class classification, and therefore the
analysis is based on the results obtained for it under this scheme. Single-class results tend
to be deceptive for three main reasons:
(i) Flat taxonomy structure – TC taxonomies should be hierarchical by
definition, to encapsulate the relations between the topics. The hypernym
of a group of topics should be their parent in the taxonomy. The 20
Newsgroups collection is not structured in this manner. For example, the
"Religion" topic is a sister topic of the topic "Christianity", and the
"MS-Windows" topic is a sister term of the topic "Windows-X". The
intuitive annotation rules would automatically classify documents which
belong to the "Christianity" topic to the "Religion" topic as well, for
instance. Since single-class classification selects a single category per
document, this hierarchical classification cannot be obtained. This is one
of the causes of classification errors which we identify later on in this
section.
(ii) Equal score for different categories – In case a document obtains identical
scores for several categories then only one of the categories can be
selected in the single-class scheme as the "best" category. The method can
then choose the first category in some order or one of the categories
randomly. Our method simply chooses the first category in a lexicographic
order. We also tried random selection which did not obtain better results,
and on the contrary caused inconsistent classification which made the
analysis more difficult.
(iii) Gold standard missing classification – as described in section 4.2.3, taking
those missing classifications into account, it is possible that the
classification method scored a document of this sort highly for both required
categories, while the single category selected was the one which the gold standard
classification left out.
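The tie-breaking rule described in (ii) can be sketched as follows. The function name and the toy scores are hypothetical; the sketch only illustrates choosing the best-scoring category with ties resolved by lexicographic order of the category names.

```python
def best_category(scores):
    """Return the single best-scoring category for one document.

    scores: dict mapping category name to the document's score for it.
    Ties are resolved by taking the lexicographically first category name,
    which keeps the classification deterministic.
    """
    if not scores:
        return None
    return min(scores, key=lambda cat: (-scores[cat], cat))


# Toy data (hypothetical): two categories obtain an identical top score.
doc_scores = {"Religion": 0.4, "Christianity": 0.4, "Atheism": 0.1}
chosen = best_category(doc_scores)
```

In this toy case "Christianity" is chosen over "Religion" only because of the name order, which shows how a document relevant to both categories can be counted as an error under the single-class scheme.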
These three difficulties in single-class classification were taken into consideration
when analyzing the classification results. They underline the advantage of
the ranking-based evaluation, which makes it possible to concentrate on the scoring results
without having to account for these evaluation weaknesses. Below we describe the
classification phenomena which correspond to each type of method, entailment-based and
context-based. The entailment analysis is based on the results using the Simentail score,
and the context analysis is based on using the Simcontext score. We first describe the
entailment phenomena, followed by the context phenomena, and conclude with
an error analysis.
Entailment behavior. Document classification based on entailment knowledge
depends on the quality of the expansions of the seed terms. Following are the main
types of category topics, each of which requires a different type of entailment
expansion. The expansions needed might differ both in the type of semantic
relation needed for their expansion, and in the potential resource in which they can be
found. These differences influence the number of expansion rules our method
succeeded in extracting for each category type, as well as the accuracy of the
classifications. The four types are described below:
1. General topic: includes categories which describe a general topic or a field of
interest. Categories which belong to this group are the science categories, such as
"Medicine" and "Space", religion categories, such as "Atheism" and "Religion", the
"Politics" category and the technical categories "Graphics" and "Electronics" which
describe a general field in the technical world rather than a specific brand or machine.
This type of categories requires expansion to their sub-fields, such as types of
medicine, types of religion etc. This type can also benefit from expansions of names
of people who are identified within this field, such as known politicians, known
cryptography scholars etc.
Therefore, the quality of classification for this type of topic depended on the
amount of knowledge obtained from our resources and the likelihood of the expanding
terms appearing outside their context. The popular fields of interest obtained greater coverage from
WordNet. "Space", for example, in the sense of science was added only in the version
of WordNet used for our method (3.0) and has a limited amount of data related to it.
Moreover, Wikipedia was a potentially good resource for the second
type of expansion (names of people identified with the field); however, the expansion rules reached insufficient
coverage for this type of categories. We wish to research the use of derivation and
synonyms as input for the Wikipedia resource to enhance the amount of data and the
quality of data we can obtain from Wikipedia.
2. Commercial brands: this sub-group of categories relates to a type of products
available on the market. WordNet does not contain most of them, and therefore their
expansions are mostly based on Wikipedia data. We found Wikipedia to be a promising
resource for expansions of this sort, due to its continuously updated and growing nature. New
products and their commercial brands are often covered in Wikipedia articles. Indeed
those categories, "MS-Windows", "IBM Pc", "Macintosh" and "Windows-X"
obtained very precise expansions and showed significant improvement in recall due to
their expansions. The reason for this is the efficiency of the expansions compared
with the general context of those computer products.
3. Classes: under this group of categories we find classes of NPs, such as "Autos",
which contains types of cars, "Motorcycles", which includes types of motorcycles,
"Baseball" and "Hockey", which include sports teams of their types, and "Guns", which
describes the class of guns. On top of synonyms and domain terms which entail the topic,
the main type of entailing terms needed as expansions for this group of categories are
class members for each of them. WordNet and Wikipedia are complementary
resources regarding the types of expansions they can supply, and for that reason they
are both needed in this type of categories to provide each type of expansions required.
The accuracy obtained for this group depends mainly on the likelihood of the
category's entailing terms to appear outside their context and their frequency in the
general language. The expansions obtained for this group are very preliminary in
terms of scope, and we aim to improve them in future work. Therefore, their
likelihood to appear in the context of other categories becomes more influential and
decreases precision on top of the low recall obtained due to the limited expansions
obtained.
It should be noted that the "Middle East" category falls between the types
we defined for the categories. It is close to the Classes type since it includes the
geographical regions and countries in the Middle East. As a collective group, most of
the expansions for that topic originated from the meronym relation in WordNet, which
was found to be highly precise in manual analysis.
Table 9 - Simcombined confusion matrix. Each row specifies the categories' distribution according to the gold standard, while each column specifies the
category the document was classified to by the method. The true positive classifications, which were classified to the correct category by the method,
are shaded.
4. Non topical: this group includes the "Forsale" category. It is hard to expand a
category name which is not a topic, and therefore the accuracy obtained for it mainly
depends on the accuracy of the original category name.
Context behavior The analysis of the context classification showed that the
context model used does not recognize the topic of a document but rather divides the
documents into semantic clusters. This phenomenon can be observed clearly in the
confusion matrix of the Simcontext based method in Table 10. Although this is the
expected behavior of a context model, in practice it means that we can use it only
as an indicator of a general context and not of a specific context. For instance, it
can indicate when the topic name "Medicine" appears in a "Religion" context, by
giving the document a low context score for its likelihood to discuss a "Medicine"
context. However, if the "Atheism" topic name appeared in a "Religion" context, the
context model would be of little help for such disambiguation.
Overall, for each semantic domain which includes several topics we can
identify the dominant topic in this domain: the topic to which the largest number
of documents from this semantic domain is classified. For example, most of the
documents in the religion topical context obtain their highest score for the
"Christianity" category, which results in the highest number of classifications of
religion related documents to this category. Another example can be found in the
technical categories domain, for which the leading category in terms of scores and
number of classifications is the "IBM Pc" category. The leading categories obtained
the best accuracy under this classification method, and the rest of the categories
in the domain received correspondingly lower accuracy results.
It should be noted that for infrequent category names the Simcontext scoring-
method obtained poor statistics and accordingly yielded low accuracy. Examples of
such category names are "Windows-x" and "Middle East", which obtained low accuracy
compared with the Simcombined scoring-method. Table 9 presents the confusion matrix
for the Simcombined score, in which the superior results for infrequent category
names can be seen, since that score is based on an external resource for expansions.
Moreover, as opposed to the phenomenon noticed for other categories, infrequent
category names also suffer from unexplained classification mistakes by the context
model, which are probably caused by general context words rather than topical
context. For example, the space document summarized in the line "Want to obtain
fax/email address for Planetary Society" obtains a high score for the "MS-Windows"
category, with no clear reason for this high score.
Table 10 - Simcontext confusion matrix on the portion of the corpus for which Simcombined acquired knowledge. Each row specifies the categories' distribution according to the
gold standard, while each column specifies the category the document was classified to by the method. The true positive classifications, which were classified to the
correct category by the method, are shaded.
Error analysis The classification mistakes mostly correspond to the behavior
described in this section for each of the methods, and to some of the error
analysis described in section 4.2.2. Below we sum up the main reasons for
classification errors:
1. Taxonomy structure: classification to a single category when the category and its
hypernym are both sister terms in the flat structure given. The document would be
classified to the "stronger" category which might be the wrong one. For example if a
"Christianity" document uses mostly general religion terms it might be classified only
to the "Religion" category, when according to the gold standard it belongs to
"Christianity". Both Simcombined and Simcontext scores are affected by this cause for
errors. For instance, Table 9 presents the Simcombined misclassifications of the
"Windows-X" category documents to the "MS-Windows" category, as well as for the
"Atheism" and "Christianity" documents to the "Religion" category.
2. Identical scores: sometimes documents obtain identical scores for several
categories and are arbitrarily classified to one of them, which may be the incorrect
one. This type of error mostly occurs when the Simcontext based method is used for
context identification for a category which belongs to a broader semantic class. For
example, for a document which received high scores for several computer categories
the context model would not be able to rescale the scores appropriately.
3. Ambiguity of expanding terms as for ranking: when a category name is
expanded with an ambiguous term it would result in low scores for documents from
several unrelated categories in which this term appeared in one of its other senses.
An example of such a term is the noun "steal", which relates to baseball in its
second most frequent sense – "a stolen base; an instance in which a base runner
advances safely during the delivery of a pitch"; however, it is more frequent in its
first sense – "an advantageous purchase". Most of the documents which obtain a high
score according to this type of term do not belong to the corresponding category.
Moreover, the score would mostly be based solely on the single ambiguous term. If
the document was given a score for this reason alone, and no other category matched
any information which appears in it, the document would be classified mistakenly to
an inappropriate category.
4. Single appearance of a single word: the bottom of the ranking list consists mainly
of documents which obtained low scores based on a single appearance of a single
term. This section of the ranking list contains more errors and irrelevant documents.
The context model often cannot significantly decrease the score when the term
appears in a document from a related topical category, such as a computer category
other than the one the document belongs to. If the document obtained a score for a
single category it would be classified to it even though the classification is
likely to be wrong. We aim to expand the amount of knowledge acquired by the
entailment methods to improve accuracy and diminish this error type. Another
possible solution for this type of error, which will be discussed in section 5, is
applying the GM algorithm or a similar algorithm to rescale the classification
scores and enable the creation of a common threshold to filter these scores. In
ranking analysis this is not a dominant error cause, since those documents are
ranked at the bottom of the list.
5. False negative errors: this type of error is generally caused by infrequent terms
and insufficient data. The expansion method acquired minimal knowledge for some
topic names, such as "Space" in the science sense and "Windows-x". The result of this
limited knowledge is low recall, and sometimes decreased precision: documents which
obtain low scores for other categories and no score for the true category are
misclassified due to lack of knowledge.
4.4. Reuters-10 results
Category      Simseed ⋅ Simcontext   Simwn ⋅ Simcontext   Simwiki ⋅ Simcontext   Simcontext   Simcombined
Acquisition   0.24                   0.78                 0.24                   0.93         0.80
Corn          0.66                   0.94                 0.66                   0.61         0.94
Crude         0.21                   0.91                 0.21                   0.80         0.91
Earn          0.08                   0.83                 0.11                   0.53         0.82
Grain         0.42                   0.98                 0.42                   0.93         0.98
Interest      0.47                   0.47                 0.47                   0.84         0.47
Money-fx      0.32                   0.51                 0.38                   0.58         0.51
Ship          0.87                   0.99                 0.87                   1.00         0.99
Trade         0.75                   0.74                 0.75                   0.89         0.81
Wheat         0.97                   0.97                 0.97                   0.85         0.97
Average       0.25                   0.41                 0.25                   0.40         0.41
Table 11 - MAP values for each of the methods used within the range of documents which contain
entailment terms. The best result for each category is given in bold indication.
In this section we present the results obtained for the Reuters-10 collection. As
mentioned above, we decided to focus most of our work on the 20 Newsgroups
collection rather than the Reuters-10 collection for our evaluations. The main
reason is that the 20 Newsgroups categories better suit the topical categorization
scheme addressed by this research: its categories are not domain specific and its
taxonomy is better structured than the flat Reuters-10 taxonomy.
The ranking results for the Reuters-10 corpus for the Simcombined, Simwn, Simwiki,
Simseed and Simcontext scores are given in Table 11. The table presents the MAP
values for all 10 categories within the range of knowledge obtained by the
entailment methods, which covers on average 87% of the documents in the collection.
It shows an advantage for the Simcombined method over Simcontext in half of the
categories, that is, 5 out of 10 categories. Overall, the average MAP achieved by
Simcombined slightly outperforms the average gained by the Simcontext score, by 1.1
points. It should be noted that several of the categories in which Simcombined did
not achieve a better MAP value than Simcontext are non-topical categories, such as
the "Money-fx" category.
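The MAP figures discussed above can be illustrated with a minimal sketch. It assumes the common definition of average precision over a ranked list in which every relevant document appears somewhere in the ranking; the function names and the toy rankings are hypothetical and are not taken from the thesis.

```python
def average_precision(ranked_labels):
    """ranked_labels: booleans for one category, best-scored document first."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: {category: ranked_labels}; MAP is the mean of the per-category AP."""
    values = [average_precision(labels) for labels in rankings.values()]
    return sum(values) / len(values)


# Toy rankings (hypothetical data, for illustration only).
toy_rankings = {
    "corn": [True, False, True],
    "ship": [True, True, False],
}
toy_map = mean_average_precision(toy_rankings)
```

A ranking that places all relevant documents first yields an average precision of 1.0, which is how the 1.00 value for "Ship" under Simcontext in Table 11 should be read.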
Scoring method          Recall   Precision   F1
Simseed                 0.22     0.67        0.33
Simwn                   0.67     0.78        0.72
Simwiki                 0.24     0.68        0.35
Simentail               0.69     0.80        0.74
Simcombined             0.66     0.77        0.71
Simcontext              0.47     0.54        0.50
Bootstrap Simcombined   0.78     0.74        0.76
Bootstrap Simcontext    0.66     0.48        0.55
Table 12 - Micro average classification results for all methods within the entailment knowledge portion
of the Reuters-10 collection (see footnote 15).
The classification results for Reuters-10 are presented in Table 12, for the Simseed,
Simwn, Simwiki, Simentail, Simcombined and Simcontext methods. First, it can be noticed
that the names of the Reuters-10 categories are very indicative and therefore yield
relatively high precision using the Simseed score, which is based on the category
name seed alone. Another result of the indicative category names is that the
WordNet based expansions significantly improve the recall obtained by the Simwn
score-based classification. Some of the categories require just a synonym or
derivational expansion to reach high recall, such as the category "Corn", for which
the expansion based on the rule "maize ⇒ corn" alone nearly doubles the recall.
15 The documents considered for the recall and precision calculations of the Simcontext based method are
only documents which contain at least one piece of entailment evidence, which is 87% of the documents.
On the other hand, Wikipedia based expansions make a small contribution to the
recall, since the categories mostly do not describe a broad topic but rather specify
a narrow topic within the economy domain. Nevertheless, the Wikipedia based
expansions do contribute to the precision and increase it by two percentage points,
which implies that although the expansions do not result in significantly more
classified documents, they improve the accuracy of the score obtained for the
documents which contain the seed terms. Moreover, the combination of WordNet and
Wikipedia expansions results in higher precision and recall than each one of them
alone, since the expansions are complementary in this dataset domain as well.
The accuracy of the context model on this dataset falls well short of the
accuracy of our method, as can be seen from Table 12. It should be noted that the
results obtained by the SimLSA score-based method were higher for the Reuters-10
dataset and exceeded those of the Simcontext score-based method by 10% in all
measures: Recall, Precision and F1. This phenomenon can be explained by the
vulnerability of our system to identical scores. The Simcontext scores, which are
the SimLSA scores rescaled by the GM algorithm as explained in section 3.3.4, are
often identical for several categories, since the high scores are mapped to a
probability value of 1. The system then classifies the document to a single
arbitrary category, while the rest of the categories do not obtain any
classification for this kind of document. This is the case, for example, for the
documents which belong to the "Interest" category but are falsely classified to
other categories.
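The effect of identical scores can be made concrete with a small sketch. It assumes that the rescaled scores of one document are available as a dictionary and that ties are broken alphabetically; both the data layout and the tie-breaking rule are assumptions made for illustration, since the text only states that the choice is arbitrary.

```python
def classify_with_tie_report(scores):
    """Return the chosen category and every category sharing the top score.

    scores: {category: rescaled score} for a single document (hypothetical layout).
    The alphabetical tie-break stands in for the arbitrary choice described above.
    """
    best = max(scores.values())
    tied = sorted(c for c, s in scores.items() if s == best)
    return tied[0], tied


# Hypothetical document whose rescaled scores saturate at 1.0 for two categories.
example_scores = {"interest": 1.0, "money-fx": 1.0, "trade": 0.4}
label, tied = classify_with_tie_report(example_scores)
```

Reporting the tied set next to the chosen label makes it possible to count how many single-label decisions were effectively arbitrary.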
                Simcombined                  Simcontext
Category        Recall   Precision   F1      Recall   Precision   F1
Acquisition     0.80     0.65        0.72    0.90     0.42        0.57
Corn            0.54     0.93        0.68    0.52     0.71        0.60
Crude           0.72     0.91        0.81    0.60     0.83        0.69
Earn            0.68     0.96        0.79    0.14     0.99        0.24
Grain           0.34     0.98        0.50    0.26     0.85        0.40
Interest        0.32     0.35        0.33    0.00     0.00        0.00
Money-fx        0.31     0.56        0.40    0.72     0.58        0.64
Ship            0.42     0.97        0.59    0.73     0.73        0.73
Trade           0.96     0.54        0.69    0.91     0.76        0.83
Wheat           0.62     0.91        0.74    0.58     0.73        0.64
Micro average   0.66     0.77        0.71    0.47     0.54        0.50
Table 13 - Classification results per category for Simcombined and Simcontext methods for the Reuters-10
collection (see footnote 16).
16 The documents considered for the recall and precision calculations of the Simcontext based method are
only documents which contain at least one piece of entailment evidence.
Even though the results for SimLSA are higher than those obtained by the
Simcontext based method, they are still significantly lower than the results
obtained by our method, based on the Simcombined score. This can be explained
simply by the type of categories composing the Reuters-10 category set, where
categories are all domain specific. As mentioned in the analysis section earlier, the
context model is not highly sensitive to context differences within specific
domains. The closely related contexts of the categories in the Reuters-10 data
therefore yield less accurate results.
Table 13 presents the results obtained for each of the categories for the
Simcombined and Simcontext score-based methods. It shows the high recall achieved by the
Simcombined method relative to the recall obtained for most of the 20 Newsgroups
categories, presented in Table 8. It is a result of the indicative nature of the Reuters-10
categories as described above. For that reason, the precision obtained for most
categories is also high.
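As a reading aid for Tables 12 and 13, the F1 column can be recomputed from the recall and precision columns. The sketch below assumes the standard harmonic-mean definition of F1; because the published inputs are already rounded to two decimals, a recomputed value can differ from the printed one in the last digit for some rows.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from Tables 12 and 13.
rows = {
    "Simcombined micro average": (0.66, 0.77),
    "Simcontext micro average": (0.47, 0.54),
    "Acquisition, Simcombined": (0.80, 0.65),
    "Interest, Simcontext": (0.00, 0.00),
}
recomputed = {name: round(f1(r, p), 2) for name, (r, p) in rows.items()}
```

For the four rows above the recomputed values agree with the F1 values printed in the tables (0.71, 0.50, 0.72 and 0.00).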
5. Conclusion and future work
The main contribution of this thesis is a novel approach to TC which is based on the
entailment model. The proposed method integrates entailment models and context
models as a scoring method for the text categorization task, which has so far been
approached mostly using context models. Our investigation highlights the importance
of the entailment assumption for the TC task and the complementary nature of
entailment and context models. We suggest that this line of research be
investigated further, to enrich and optimize the entailment models used in order to
exploit additional entailment knowledge.
Our research revealed several important conclusions about the integration of
the two models and about each type of model separately:
(i) Indeed, our analysis reveals that using the entailment requirement as the
basis for the TC score helps to classify documents according to the topic they
actually discuss, as opposed to using context models, which only reveal the
document's broader context. Strong entailment evidence within a document
implies that the topic is discussed specifically in the document, as a main
topic or as one of the sub-topics.
(ii) Notably, the context model's score complements the entailment model
score to distinguish cases where the entailment model recognized a passing
reference of a topic, or entailing terms in a different sense than the one which
entails the category topic. The multiplication scheme for the integration of the
similarity scores obtained by the two models was found effective, yielding
noticeably improved accuracy.
(iii) Context models tend to split the documents into general semantic
clusters. They do not recognize the specific context discussed in the
document, but rather recognize the domains of contexts the collection
discusses. Therefore, categorizing documents according to context-based
models results in categorizing to the category which most prominently
represents the semantic cluster.
(iv) Combination of dictionary based expansions and encyclopedia based
expansions gives a more complete perspective and better expansion abilities.
On one hand, morphological and general definitions are needed due to
language richness. On the other hand, to better deal with named entities and
current general topics, encyclopedias give important complementary
definitions and knowledge.
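The multiplication scheme mentioned in conclusion (ii) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-category entailment and context scores are taken as given inputs, the function names are hypothetical, and the numeric values are invented to mimic the ambiguous "steal" example discussed in the error analysis.

```python
def combined_score(entail_score, context_score):
    """Multiply the entailment-based and context-based similarity scores."""
    return entail_score * context_score


def classify(doc_scores):
    """doc_scores: {category: (entail_score, context_score)} for one document.

    Returns the best category, or None when no category has a positive
    combined score, together with all combined scores.
    """
    combined = {c: combined_score(e, p) for c, (e, p) in doc_scores.items()}
    best = max(combined, key=combined.get)
    return (best if combined[best] > 0 else None), combined


# Hypothetical document: the ambiguous term "steal" gives strong entailment
# evidence for "baseball", but the context model scores that category low.
label, combined = classify({"baseball": (1.0, 0.05), "forsale": (0.5, 0.9)})
```

With multiplication, a category needs support from both models: a low context score suppresses an entailment match in the wrong sense, and a zero entailment score rules the category out regardless of context.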
Our study highlights the potential in combining the two methods and constructing
meaningful scores by using them. Yet, the results we achieved may be improved in
many aspects. Obviously, the recall of our method can be improved by utilizing
further entailment knowledge resources. By extracting more entailment rules, more
evidence is obtained, statistics become more significant and the number of entailing
terms is likely to increase. To assure increase of accuracy as a whole rather than
increase of recall alone, the accuracy of the additional entailment rules should be
guaranteed so that the precision would increase as well. Apart from technical issues,
such as improvements of the pre-processing steps or employment of additional
Wikipedia-based rule types, we will now describe several promising research
directions:
(i) Hierarchy based categorization: the topical hierarchical taxonomy, which
stands at the base of the topical TC tasks, can be exploited in more ways than just
using its category names as seeds for expansion. Hierarchy related errors might be
avoided by utilizing a hierarchy-oriented categorization approach. Firstly, the
categorization method can be performed iteratively for each level of the taxonomy
hierarchy. By that, documents which belong to a certain branch of the taxonomy
would first be categorized to it, and then distributed between the sub-topics
constructing that branch of the taxonomy. Moreover, the taxonomy hierarchy can
be exploited to identify and eliminate the use of entailing terms as expansions for
multiple sister categories, and by that eliminating classification of documents to
such multiple categories, while the document should belong to only one of them.
Finally, the hierarchy topic names can be used to disambiguate the required sense
of the topic name at the leaves of the taxonomy, by measuring the association
between the entailing term and the topic names which construct the path to that
leaf.
(ii) Entailing terms weighting scheme: our method used a uniform weight scale for
all the entailing terms. Future work may consider using the entailment rule
weighting scheme to apply weights to the entailing terms, or using an independent
scheme to weigh the entailing terms directly. For example, a weighting scheme
based on SemCor probabilities (Miller et al., 1993) or on Information Gain
statistics of the terms can be utilized. Using weights for the entailing terms can
help diminish the influence of ambiguous terms and of terms which entail senses of
the topic name other than the intended one.
(iii) Re-scaling of the complete score – currently, the GM re-scaling is utilized
only for the similarity score obtained by the LSA, due to the sparseness of the data
obtained by the entailing methods. Augmenting the number of entailment rules
used in our method may make it possible to utilize the GM for them as well.
Otherwise, in case the problem of data sparseness is not resolved, a different re-
scaling method may be considered to allow setting a threshold to filter document
scores.
(iv) Context models research – the context model in use for our method is LSA
similarity re-scaled using a GM model. This model tends to obtain similar scores
for similar context, i.e. most documents would obtain equal score for topically
related categories, such as the computer categories. In these cases the context
model does not provide us with the context differentiation we need in order to
filter entailment scores which are mistaken. Since the LSA model is difficult to
analyze and improve, it may be useful to evaluate other context models such as
co-occurrence.
(v) Evaluation and Analysis – we believe that evaluation on a topical collection
with a standard and fully defined taxonomy structure may diminish the errors that
originate from these technical problems, and may even yield additional interesting
analysis conclusions and research directions. Moreover, such a collection with
multiple-class annotations may be useful for a full analysis of the results
obtained for each category.
References
Berry, M. 1992. Large-scale sparse singular value computations. International
Journal of Supercomputer Applications, 6(1):13–49.
Cai, L. and T. Hofmann. 2003. Text Categorization by Boosting Automatically
Extracted Concepts. In Proc. of the 26th Annual Int. ACM SIGIR Conference on
Research and Development in Information Retrieval, Toronto, Canada. ACM
Press.
Clinchant, S., C. Goutte, and E. Gaussier. 2006. Lexical entailment for Information
Retrieval. In Proceedings of the 28th European Conference on Information Retrieval,
volume 3936 of Lecture Notes in Computer Science, pages 217–228. Springer-
Verlag, 2006.
Dagan, I., O. Glickman, and B. Magnini, editors. 2006. The PASCAL Recognising
Textual Entailment Challenge, volume 3944. Lecture Notes in Computer Science.
Deerwester, S., S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing
by latent semantic analysis. Journal of the American Society of Information Science.
El-Yaniv, R., and O. Souroujon. 2001. Iterative double clustering for unsupervised
and semi-supervised learning. Advances in Neural Information Processing Systems
(NIPS) 14, 2001.
Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database (Language,
Speech and Communication). The MIT Press.
Freund, Y., R. Iyer, R. E. Schapire, and Y. Singer. 1998. An efficient boosting
algorithm for combining preferences. In Proceedings 15th International Conference
on Machine Learning, pages 170-178.
Giampiccolo, D., B. Magnini, I. Dagan, and B. Dolan. 2007. The third PASCAL
recognizing textual entailment challenge. In Proceedings of ACL-
WTEP Workshop.
Glickman, O., E. Shnarch and I. Dagan. 2006. Lexical reference: a semantic
matching subtask. In Proceedings of EMNLP.
Gliozzo, A. and C. Strapparava. 2005. Domain kernels for text categorization. In
Proc. of the Ninth Conference on Computational Natural Language Learning
(CoNLL-2005), Ann Arbor, June.
Gliozzo, A., C. Strapparava, and I. Dagan. 2005. Investigating unsupervised learning
for text categorization bootstrapping. In Proc. of the Joint Conference on Human
Language Technology / Empirical Methods in Natural Language Processing
(HLT/EMNLP), Vancouver.
Joachims, T. 1999. Making large-scale SVM learning practical. In B. Scholkopf, C.
Burges, and A. Smola, editors, Advances in kernel methods: support vector learning.
MIT Press, Cambridge, MA, USA, chapter 11, pages 169-184.
Kazama, J. and K. Torisawa. 2007. Exploiting Wikipedia as external knowledge for
named entity recognition. In Proceedings of EMNLP-CoNLL.
Ko, Y. and J. Seo. 2002. Text categorization using feature projections. In Proc. of
COLING'2002.
Ko, Y. and J. Seo. 2004. Learning with unlabeled data for text categorization using
bootstrapping and feature projection techniques. In Proc. of the ACL-04, Barcelona,
Spain, July.
Liu, B., X. Li, W. S. Lee, and P. S. Yu. 2004. Text classification by labeling words. In
Proc. of AAAI-04, San Jose, July.
McCallum, A. and K. Nigam. 1999. Text classification by bootstrapping with
keywords, EM and shrinkage. In ACL99 – Workshop for unsupervised Learning in
Natural Language Processing.
Mihalcea, R. and D. Moldovan. 2000. Semantic Indexing using WordNet Senses. In
Proceedings of ACL Workshop on IR and NLP.
Miller, G.A., C. Leacock, R. Tengi and R.T. Bunker. 1993. A semantic concordance.
In Proceedings of HLT.
Morik, K., P. Brockhausen, and T. Joachims. 1999. Combining statistical learning with
a knowledge-based approach - A case study in intensive care monitoring. In Proc. of
the 16th Int'l Conf. on Machine Learning (ICML-99).
de Buenaga, M., J.M. Gomez, and B. Diaz. 1997. Using WordNet to complement
training information in text categorization. In Recent Advances in Natural Language
Processing II: Selected Papers from RANLP'97, volume 189 of Current Issues in
Linguistic Theory (CILT), pages 353-364. John Benjamins, 2000.
Rodriguez et al. 1997)
Sahami, M., M. Hearst, and E. Saund. 1996. Applying the multiple cause mixture model
to text categorization. In Proceedings of the 13th International Machine Learning
Conference.
Salton, G. and M. H. McGill. 1983. Introduction to modern information retrieval.
McGraw-Hill, New York.
Scott, S. and S. Matwin. 1998. Text classification using WordNet hypernyms. In
Proceedings of the COLING / ACL Workshop on Usage of WordNet in Natural
Language Processing Systems. Montreal, Canada.
Scott, S. and S. Matwin. 1999. Feature engineering for text classification. In Proc. of
the 16th International Conference on Machine Learning, Bled, Slovenia.
Tan, C.-M., Y.-F. Wang, and C.-D. Lee. 2002. The Effectiveness of Bigrams in
Automated Text Categorization. ICMLA 2002: 275-281.
Voorhees, E. and D. Harman, editors. 1999. Proceedings of the Seventh Text
REtrieval Conference (TREC-7), Gaithersburg, MD, USA, July. NIST Special
Publication.
Appendix A – Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a dimension reduction method for co-occurrence
data. The main idea of LSA is to map the original representation of documents to a
lower dimensional space (latent space), in which documents are represented by
"concepts" instead of terms, where the number of "concepts" is significantly lower
than the number of terms.
More formally, let $t$ be the number of terms in the corpus and $N$ the total
number of documents. Define $\mathbf{M} = (m_{ij})$ to be the term-by-document
association matrix with $t$ rows and $N$ columns, where $m_{ij}$ is the weight of
term $i$ in document $j$. The matrix $\mathbf{M}$ is decomposed using SVD into
three matrices:
$$\mathbf{M} = \mathbf{K} \cdot \mathbf{S} \cdot \mathbf{D}^{t}$$
where $\mathbf{K}$ is the matrix of eigenvectors derived from the term-to-term
correlation matrix given by $\mathbf{M} \cdot \mathbf{M}^{t}$, and $\mathbf{D}^{t}$
is the transposed matrix of eigenvectors derived from the document-to-document
matrix given by $\mathbf{M}^{t} \cdot \mathbf{M}$. $\mathbf{S}$ is an $r \times r$
diagonal matrix of singular values, where $r = \min(t, N)$ is the rank of
$\mathbf{M}$.
The dimensions of the original space are then reduced by selecting a rank $s$
which will stand for the number of dimensions or "concepts" in the latent space. Only
the $s$ largest singular values of $\mathbf{M}$ are kept, along with their
corresponding columns in $\mathbf{K}$ and rows in $\mathbf{D}^{t}$. Accordingly,
only the top $s$ singular values in $\mathbf{S}$ are kept, and the rest are set to
zero.
The rank s should be selected so that it will be big enough to represent all the
concepts in the original data, while it will also be small enough to filter out all
unnecessary data represented by the original number of dimensions (i.e. the number of
terms in the original corpus). The original representation of the co-occurrence
captures the first order similarity between terms, meaning their likelihood to co-occur
in the same text. The "concepts" amalgamate the co-occurrence data of the terms and
by that capture second order similarity, which means terms which tend to appear
together will be mapped to the same concepts and as a result the similarity will
measure re-occurrence of terms together.
Finally, the terms in the latent semantic space are represented by the rows of
the reduced matrix $\mathbf{K}$, which has dimensions $t \times s$. That is, the
$i$th term in the lexicon, $w_i$, is represented by the $i$th row of the matrix
$\mathbf{K}$. The document vectors $\vec{d}$ are then represented by the weighted
sum of the LSA vectors representing their constituent terms:
$\vec{d}^{\,t} \cdot \mathbf{K}$.
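The decomposition and projection described in this appendix can be sketched with NumPy. This is an illustrative sketch rather than the implementation used in the thesis: the function names, the five-term toy matrix and the choice of $s = 2$ are all assumptions, and raw term counts stand in for whatever term weighting was actually used.

```python
import numpy as np


def lsa(M, s):
    """Truncated SVD of a term-by-document matrix M (t x N), keeping s concepts."""
    K, singular_values, Dt = np.linalg.svd(M, full_matrices=False)
    # Keep the s largest singular values with their columns of K and rows of D^t.
    return K[:, :s], np.diag(singular_values[:s]), Dt[:s, :]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy corpus (hypothetical): rows are terms, columns are documents.
terms = ["corn", "maize", "grain", "ship", "port"]
M = np.array([
    [1.0, 0.0, 0.0, 0.0],  # corn:  only in document 0
    [0.0, 1.0, 0.0, 0.0],  # maize: only in document 1
    [1.0, 1.0, 0.0, 0.0],  # grain: in documents 0 and 1
    [0.0, 0.0, 1.0, 1.0],  # ship
    [0.0, 0.0, 1.0, 1.0],  # port
])

K_s, S_s, Dt_s = lsa(M, s=2)

# Term vectors are the rows of the reduced K; documents are projected as d^t . K.
doc0 = M[:, 0] @ K_s
doc1 = M[:, 1] @ K_s
first_order = cosine(M[0], M[1])       # corn vs. maize in the original space
second_order = cosine(K_s[0], K_s[1])  # corn vs. maize in the latent space
```

In this toy corpus "corn" and "maize" never co-occur, so their first order similarity is zero, yet both co-occur with "grain" and therefore end up on the same latent concept. This is the second order similarity effect described above.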
Appendix B – Gaussian Mixtures
The Gaussian Mixture (GM) model aims to estimate the probability that a document
should be classified to a given class, based on the similarity score between the
class and the document. In essence, the GM algorithm differentiates between
relevant and non-relevant category documents using similarity statistics of the
unlabeled data. The algorithm assumes that the distribution of a given similarity
function is in fact a mixture of two distributions, and approximates the unknown
density of the two assuming that they are Gaussian functions. Below we give a short
formal description of the algorithm.
For each document $d_i \in T$ and for each category $c$, where $id_c \subset V$ is
the term-based representation of category $c$, we define
$Sim(id_c, d_i) \in \mathbb{R}$ to be the similarity function between the document
and the category id $id_c$. The similarity function is taken to be monotonically
increasing according to the "closeness" of the document and the category id $id_c$.
As a first step, the algorithm obtains similarity scores for each
document-category pair, and assumes that the similarity scores obtained for each
category are a mixture of the distributions of the relevant and non-relevant
category documents.
In the second step, the algorithm aims to estimate the conditional probability
$P(c \mid Sim(id_c, d_i))$ as a mixture of the probability of this pair to be a
positive example and its probability to be a negative example. For that purpose, it
defines the two following probabilities: $P(Sim(id_c, d_i) \mid c)$, the probability
that the similarity between a document $d_i$ and a category id $id_c$ is drawn from
the category's distribution function, hence that they are a positive example, and
$P(Sim(id_c, d_i) \mid \bar{c})$, which is the distribution from which the negative
examples are drawn. Each probability is assumed to follow a Gaussian distribution,
for example the probability of the positive examples:
$$P(Sim(id_c, d_i) \mid c) = G(Sim(id_c, d_i), \mu_c, \sigma_c) = \frac{1}{\sqrt{2\pi} \cdot \sigma_c} \cdot e^{-\frac{(Sim(id_c, d_i) - \mu_c)^2}{2 \sigma_c^2}}$$
(Similarly for the probability of the negative examples.) To achieve its goal
the algorithm should acquire an estimation of the values of the parameters $\mu_c$,
$\sigma_c$, $\mu_{\bar{c}}$ and $\sigma_{\bar{c}}$, that is the mean and variance
values of each probability function, as well as the weight of each function in the
overall Gaussian mixture. The mixture of the functions is defined to be:
$$P(Sim(id_c, d_i) \mid \mu_c, \sigma_c, w_c, \mu_{\bar{c}}, \sigma_{\bar{c}}, w_{\bar{c}}) = w_c \cdot P(Sim(id_c, d_i) \mid c) + w_{\bar{c}} \cdot P(Sim(id_c, d_i) \mid \bar{c})$$
where $w_c$ is the weight of the positive Gaussian function as defined above,
and $w_{\bar{c}}$ is the corresponding weight of the negative Gaussian function. The
weights of the two functions are the prior probabilities $P(c)$ and $P(\bar{c})$,
and therefore $P(c) + P(\bar{c}) = 1$. The calculation of those parameters is
obtained via an EM procedure described in detail in (Gliozzo et al., 2005). Using
the two estimated functions the algorithm can apply the Bayes rule to obtain the
smoothed mixture of the two:
$$P(c \mid d_i) = P(c \mid Sim(id_c, d_i)) = \frac{P(Sim(id_c, d_i) \mid c) \cdot P(c)}{P(Sim(id_c, d_i) \mid c) \cdot P(c) + P(Sim(id_c, d_i) \mid \bar{c}) \cdot P(\bar{c})}$$
By acquiring the final estimation of the probability
$P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$ the algorithm achieves its goal to obtain
the smoothed estimation of the classification probability. Following the
single-class paradigm it can then be used to assign the most likely category to each
document, that is $\arg\max_c P(c \mid d_i)$.
In short, the algorithm consists of the following steps:
(i) For each document $d_i \in T$ and category id $id_c \subset V$, calculate the similarity
score between them, $Sim(id_c, d_i) \in \mathbb{R}$.
(ii) Estimate the probability $P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$ using the following
steps:
a. Define $P(Sim(id_c, d_i) \mid c)$ and $P(Sim(id_c, d_i) \mid \bar{c})$ to be the
complementary probabilities for the positive and negative examples.
b. EM step: estimate the values of the parameters $\mu_c$, $\sigma_c$, $\mu_{\bar{c}}$,
$\sigma_{\bar{c}}$, $w_c$ and $w_{\bar{c}}$ to obtain an estimate of:
$$P(Sim(id_c, d_i) \mid \mu_c, \sigma_c, \mu_{\bar{c}}, \sigma_{\bar{c}}, w_c, w_{\bar{c}}) = w_c \cdot P(Sim(id_c, d_i) \mid c) + w_{\bar{c}} \cdot P(Sim(id_c, d_i) \mid \bar{c})$$
c. Estimate $P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$ using Bayes' rule:
$$P(c \mid Sim(id_c, d_i)) = \frac{P(Sim(id_c, d_i) \mid c) \cdot P(c)}{P(Sim(id_c, d_i) \mid c) \cdot P(c) + P(Sim(id_c, d_i) \mid \bar{c}) \cdot P(\bar{c})}$$
(iii) For each document $d_i$, assign the best category according to $\arg\max_c P(c \mid d_i)$.
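Steps (ii) and (iii) can be illustrated with a minimal Python sketch. It is not the
procedure of (Gliozzo et al., 2005), whose details are not reproduced here; it is a generic
EM fit of a one-dimensional two-Gaussian mixture. The function names, the initialization
(positive component placed at the highest score), the floor on the standard deviation and
the toy scores are all assumptions made for illustration.

```python
import math


def gaussian(x, mu, sigma):
    """Density G(x, mu, sigma) of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def posterior(x, p):
    """Step (ii)c: Bayes rule, P(c | Sim = x), under mixture parameters p."""
    pos = p["w_pos"] * gaussian(x, p["mu_pos"], p["sigma_pos"])
    neg = p["w_neg"] * gaussian(x, p["mu_neg"], p["sigma_neg"])
    total = pos + neg
    # Guard against both densities underflowing to zero.
    return pos / total if total > 0.0 else 0.5


def fit_mixture(scores, iterations=50, min_sigma=1e-3):
    """Step (ii)b: generic EM for a one-dimensional two-Gaussian mixture."""
    lo, hi = min(scores), max(scores)
    spread = max((hi - lo) / 4.0, min_sigma)
    # Assumed initialization: positive component at the highest score,
    # negative component at the lowest score, equal weights.
    p = {"w_pos": 0.5, "mu_pos": hi, "sigma_pos": spread,
         "w_neg": 0.5, "mu_neg": lo, "sigma_neg": spread}
    n = len(scores)
    for _ in range(iterations):
        # E-step: responsibility of the positive component for each score.
        resp = [posterior(x, p) for x in scores]
        n_pos = sum(resp)
        n_neg = n - n_pos
        if n_pos < 1e-9 or n_neg < 1e-9:
            break  # one component collapsed; keep the last valid estimate
        # M-step: re-estimate means, standard deviations and weights.
        mu_pos = sum(r * x for r, x in zip(resp, scores)) / n_pos
        mu_neg = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n_neg
        var_pos = sum(r * (x - mu_pos) ** 2
                      for r, x in zip(resp, scores)) / n_pos
        var_neg = sum((1.0 - r) * (x - mu_neg) ** 2
                      for r, x in zip(resp, scores)) / n_neg
        p = {"w_pos": n_pos / n, "mu_pos": mu_pos,
             "sigma_pos": max(math.sqrt(var_pos), min_sigma),
             "w_neg": n_neg / n, "mu_neg": mu_neg,
             "sigma_neg": max(math.sqrt(var_neg), min_sigma)}
    return p


def assign_category(posteriors):
    """Step (iii): pick the category with the highest P(c | d_i)."""
    return max(posteriors, key=posteriors.get)


# Made-up similarity scores of ten documents against a single category.
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.11, 0.09, 0.75, 0.80, 0.85]
params = fit_mixture(scores)
print(round(posterior(0.80, params), 3), round(posterior(0.10, params), 3))
print(assign_category({"sport": 0.2, "politics": 0.7}))
```

In this sketch the mixture is fitted separately for each category, and the resulting
posteriors of one document are then compared across categories in `assign_category`.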
Appendix C – Support Vector Machines
Support Vector Machines (SVM) are a state-of-the-art framework for supervised
learning that can be used to train linear classifiers. Linear classification maps the
input data, which consists of positive and negative classes of vectors, to a
higher-dimensional space in order to separate the two classes by a hyperplane, that is, a
multi-dimensional plane. The hyperplane is chosen to maximize the distance to the
closest vectors of each of the two classes. This distance is called the margin, and the
vectors closest to the hyperplane are called support vectors. Given a new vector, the
classifier uses this separating hyperplane to assign the vector to one of the
classes, positive or negative.
The goal of the SVM algorithm is to maximize the hyperplane margin in order to
create a clear separation. At the same time, it aims to minimize the risk of
classification mistakes, which may result from a larger margin. To control the
tradeoff between these goals, the SVM algorithm includes a regularization parameter,
denoted c. The margin chosen by the SVM is smaller for higher c values, and
therefore the number of False Positive classifications decreases for higher c values. On
the other hand, for lower c values the margin is larger and therefore more vectors are
left unclassified, meaning that False Negative errors are created. Hence, this parameter
can be used to control the tradeoff between Precision and Recall, since Precision is
associated with the proportion of false positive classifications, and Recall is associated
with false negative classifications.
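The relation between the two error types and the two measures can be made concrete with
a small Python sketch. The function name and the label vectors below are hypothetical and
are not taken from the experiments of this thesis; labels are assumed to be +1 (positive)
and -1 (negative).

```python
def precision_recall(gold, predicted):
    """Precision and Recall for labels in {+1, -1}."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)    # true positives
    fp = sum(1 for g, p in pairs if g == -1 and p == 1)   # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == -1)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = [1, 1, 1, -1, -1, -1]
strict = [1, -1, -1, -1, -1, -1]   # few positive decisions: no false positives
lenient = [1, 1, 1, 1, 1, -1]      # many positive decisions: no false negatives

print(precision_recall(gold, strict))
print(precision_recall(gold, lenient))
```

The strict labeling makes no false positive errors and therefore has perfect Precision but
low Recall, while the lenient labeling makes no false negative errors and has perfect
Recall but lower Precision.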
When using the SVM algorithm, one must also take into account the unbalanced nature of
the labeled instances that the SVM uses to create the separating hyperplane. Often, the
number of negative instances is significantly larger than the number of available
positive instances. The j parameter is a cost factor by which training errors on positive
instances outweigh errors on negative instances, providing a way to compensate for
unbalanced training data.
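One common way to write down the role of the two parameters is the cost-weighted
soft-margin objective, in which c scales the total training error and j multiplies the
error of positive instances only. The Python sketch below evaluates that objective for a
fixed linear classifier. It assumes this standard formulation and does not reproduce
SVMlight code; all names and the toy data are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def soft_margin_objective(w, b, examples, c, j=1.0):
    """0.5 * ||w||^2 + c * (j * positive slack + negative slack)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    weighted_slack = 0.0
    for x, y in examples:
        # Hinge loss: zero when x lies on the correct side, outside the margin.
        slack = max(0.0, 1.0 - y * decision_value(w, b, x))
        # Errors on positive instances are multiplied by the cost factor j.
        weighted_slack += (j if y > 0 else 1.0) * slack
    return regularizer + c * weighted_slack


w, b = [1.0, 0.0], 0.0
examples = [([2.0, 0.0], 1), ([0.5, 0.0], 1),
            ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]

print(soft_margin_objective(w, b, examples, c=1.0, j=1.0))
print(soft_margin_objective(w, b, examples, c=1.0, j=2.0))
print(classify(w, b, [2.0, 0.0]), classify(w, b, [-2.0, 0.0]))
```

Raising j increases the penalty contributed by the positive instance that falls inside the
margin, so a learner minimizing this objective is pushed toward hyperplanes that make fewer
errors on the positive class.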
We used the SVMlight implementation of the SVM algorithm to implement our method;
it supports tuning of the two parameters described above. Our settings of SVMlight
and its parameters are described in section 4.1.1.
Abstract (Hebrew)
This thesis investigates keyword-based text categorization that relies on a topical
taxonomy as the only input for classification. The prevailing research approach to text
categorization is supervised or semi-supervised. The supervised approach requires
extensive manual work to label the texts needed as classified training examples. Although
several legacy text categorization systems exist for which a large document collection was
classified manually, this solution is not feasible for most systems that need text
categorization today. The growth rate of new text collections and new topical taxonomies,
together with the growth in the volume of unclassified text, are only some of the reasons
that text categorization approaches with a higher degree of automation are needed.
Semi-supervised keyword-based text categorization took the first step toward a more
automated approach. Methods based on this approach recognized the great computational
potential of the substantial quantities of unclassified text that are available for
various knowledge domains and applications. The basic idea behind this approach is to
describe each category by a set of representative keywords and to define a similarity
measure between the category representation and the text documents. The keyword
representation should capture the category topic as completely and accurately as possible.
The supervised component in tasks based on this approach is the collection of
representative keywords for each category, which replaces the supervised component that
required manual classification of a large number of texts. This task requires less manual
work than fully supervised categorization approaches. Despite this reduction, it is still
manual work that is specific to each category and requires a degree of expertise in the
topical domain of the classification. Consequently, new topical taxonomies would require
analysis and manual work by experts in the topical domain.
The research in this thesis builds on a new approach to text categorization, first
proposed by (Gliozzo et al., 2005), which requires no manual analysis for each category.
This approach rests on the assumption, also used in supervised and semi-supervised
research, that the category name is highly informative for the text categorization task.
Category names are chosen by professionals from the topical domain so that they describe
the category topic as accurately and completely as possible. For this reason, the category
name contains useful information for the text categorization task and can serve as a
starting point for the automatic algorithm. To create the set of representative keywords
automatically, this approach automatically expands the category names to build the
representative set. The categorization method proposed in (Gliozzo et al., 2005) bases the
automatic expansion on co-occurrence statistics of the lexicon words, computed from the
unlabeled training collection; we refer to this as the context-based approach. Using
Latent Semantic Analysis (LSA) and a standard vector similarity measure, an initial set of
classified documents is produced according to the similarity score. The final
classification results are obtained by training a standard supervised text classifier on
this document set.
Basing the automatic expansion on a context-based approach, that is, on word
co-occurrence statistics, has several drawbacks. The first and most important is that this
statistical information does not represent the semantic relation needed for making text
categorization decisions. Typical models based on co-occurrence statistics capture the
broad context of the text and not necessarily the specific topic it discusses. A high
similarity score between a text and co-occurrence data indicates general membership in a
similar topical domain, not necessarily a discussion of the specific topic that the
category describes. For example, a text about some computer program belongs to the topical
context of computer science, so co-occurrence-based similarity estimates how close the
text is to computer science in general rather than to one particular program.
In this research we propose a new approach to keyword-based text categorization based
on a topical taxonomy, which bases the similarity measure between texts and categories on
Lexical Entailment (LE) rather than on word co-occurrence data alone. The lexical
entailment approach defines a more precise semantic similarity relation, whose goal is to
identify a reference in a specific source text to another text segment. This semantic
relation suits the text categorization task better, since it requires a mention of the
topic under discussion and not merely a general similarity of context.
To determine whether a given topic is discussed as the main topic of the examined text
or is merely mentioned as a side topic, we propose combining the lexical entailment
approach with the context-based approach as an additional component of the overall text
categorization system in this research. The categorization method proposed here requires,
as a basic condition, at least one mention of a text segment (a word or a multi-word
expression) that entails the topic. For texts that meet this condition, the context-based
similarity between them and the category whose topic was mentioned in the text is also
examined. Using this unified similarity measure yields a novel method that models topical
reference and topical context at the same time.
To compute the lexical-entailment-based similarity measure we use two knowledge
sources. The first source used in this research for collecting lexical entailment
information is the WordNet semantic ontology (Fellbaum, 1998). This source allows
semantic relations to be collected from a dictionary-like knowledge source. It provides
semantic relations such as morphological inflections and other entailment relations needed
for the categorization task. As a complementary knowledge source, we use a Wikipedia-based
lexical entailment resource. This encyclopedic knowledge source provides terms such as
entities, commercial products, and terms from general and current knowledge domains that
are semantically related to the category name, which is the basis for expansion through
semantic entailment in the method. These two sources are complementary by nature, and are
therefore used for expansions arising from different semantic relations, as well as for
expansions of different category types.
The context-based similarity measure in this research follows the approach proposed in
(Gliozzo et al., 2005). To that end, the LSA method was implemented as part of the thesis
research to represent context-based similarity between categories and text documents. LSA
is a dimensionality reduction method that maps an original space to a space with
significantly fewer dimensions than the original representation space. The method maps
terms that are similar, in the sense of having similar co-occurrence statistics, to
"concepts" that represent a similar semantic context. The advantage of using LSA to
represent context is that LSA models first-order semantic similarity, which represents the
similarity of words that tend to appear in the same documents, as well as second-order
similarity, of words that tend to appear together. Standard context-representation methods
usually model only first-order similarity. LSA also represents second-order similarity
through the joint mapping of words to the same "concepts", and this is its relative
advantage.
The similarity measure defined above was applied in this research to two different
text categorization goals. The first goal is ranking the text documents for each category
according to the similarity score between the document and the category. The second goal
is binary text classification into one or more categories. The results of the method were
compared with the categorization method proposed by (Gliozzo et al., 2005) and with each
individual similarity measure that makes up the overall method proposed in this research.
Performance evaluation for the ranking task allows the accuracy of the ranking obtained
for each category to be analyzed, since a separate ranked list of documents is produced
for each category. In addition, document ranking allows a detailed analysis of ranking
precision for each category, since the documents are ranked relative to one another by
similarity score. On the other hand, the document classification task allows analysis of
the relative similarity score assigned to each document for each category, and makes it
possible to examine the relations between categories according to the classification
results. Furthermore, analysis of this method exposes errors in which documents are
classified into a category other than the one they belong to, and thereby suggests further
possible research directions.
We present positive empirical results for the complete text categorization method
proposed in this thesis. Indeed, the proposed method achieves higher precision, which
supports the assumption that the lexical entailment similarity measure is more accurate
for text categorization. The results are accompanied by a comprehensive analysis of the
expansion types and of the mechanisms needed to improve the results further.
עידו דגן. עבודה זו נעשתה בהדרכתו של דר
למדעי המחשב מחלקהמן ה
.אילן-של אוניברסיטת בר
Bar-Ilan University
The Department of Computer Science
Keyword-based Text Categorization
Libby Barak
Submitted in partial fulfillment of the requirements for the Master's Degree
in the Department of Computer Science, Bar-Ilan University.
Ramat Gan, Israel June 2008, Sivan 5768