
Page 1

Neural Text Categorizer for Exclusive Text Categorization

Journal of Information Processing Systems, Vol.4, No.2, June 2008

Taeho Jo*

Presenter: 林昱志

Page 2

Outline

Introduction

Related Work

Method

Experiment

Conclusion

Page 3

Introduction

Two types of approaches to text categorization

① Rule-based: rules are defined manually in the form of if-then-else statements

Advantage

1) High precision

Disadvantages

1) Poor recall

2) Poor flexibility

Page 4

Introduction

② Machine learning-based: learns from sample labeled documents

Advantage

1) Much higher recall

Disadvantages

1) Slightly lower precision than rule-based

2) Poor flexibility

Page 5

Introduction

Focuses on the machine learning-based approach, discarding the rule-based one

All the raw data should be encoded into numerical vectors

Encoding documents leads to two main problems (illustrated in the sketch after this list):

1) Huge dimensionality

2) Sparse distribution
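
A rough illustration of both problems: in a bag-of-words encoding, the vector length equals the vocabulary size, while each document uses only a handful of words. A minimal sketch (hypothetical two-document corpus, not from the paper):

def bag_of_words(document, vocabulary):
    # One dimension per vocabulary word: huge dimensionality for a
    # realistic corpus, and mostly zeros for any single document.
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

corpus = ["the cat sat on the mat", "neural networks classify text"]
vocabulary = sorted({w for doc in corpus for w in doc.lower().split()})
vector = bag_of_words(corpus[0], vocabulary)
print(len(vector))                       # dimensionality = vocabulary size
print(sum(1 for v in vector if v == 0))  # most entries are zero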

Page 6

Introduction

Proposes two methods:

1) String vector:

Provides more transparency in classification

2) NTC (Neural Text Categorizer):

Classifies documents with sufficient robustness

Solves the huge dimensionality problem

Page 7

Related Work

Machine learning algorithms applied to text categorization

1) KNN (K Nearest Neighbor)

2) NB (Naïve Bayes)

3) SVM (Support Vector Machine)

4) BP (Back Propagation)

Page 8

Related Work

KNN was evaluated as a simple algorithm, competitive with the Support Vector Machine, by Sebastiani in 2002

Disadvantage

1) Takes a long time to classify objects (see the sketch below)
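
The cost arises because KNN defers all work to query time: classifying one document means computing a similarity against every training example. A minimal sketch, assuming cosine similarity over numerical vectors (hypothetical interface):

from collections import Counter
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training_set, k=3):
    # training_set: list of (vector, label) pairs; the full scan of
    # the training set is what makes classification slow.
    ranked = sorted(training_set, key=lambda ex: cosine(query, ex[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]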

Page 9

Related Work

Mladenic and Grobelnik (1999) evaluated feature selection methods within the application of NB

Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real system based on text categorization

Requires encoding documents into numerical vectors

Page 10

Related Work

SVM has become more popular than the KNN and NB machine learning algorithms

Defines a hyperplane as the boundary between classes

Applicable only to linearly separable distributions of training examples

Optimizes the weights on the inner products between training examples and the input vector, called Lagrange multipliers
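
Spelled out, that formulation is the standard SVM decision function, with the Lagrange multipliers $\alpha_i$ as the optimized weights (a textbook form, not quoted from the paper):

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right)$

where $x_i$ are the training examples, $y_i \in \{-1, +1\}$ their labels, and only the support vectors have $\alpha_i > 0$.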

Page 11

Related Work

Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1)

Figure 1.

Page 12

Related Work

Advantage

1) Tolerant of the huge dimensionality of numerical vectors

Disadvantage

1) Applicable only to binary classification (see the one-vs-rest sketch below)

2) Fragile in representing documents as numerical vectors
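
Since SVM is inherently binary, applying it to multi-class text categorization requires decomposing the task, typically into one classifier per category (one-vs-rest); this is the "decomposition" referred to in the experiment section. A minimal sketch, with the binary classifiers' scoring interface assumed:

def one_vs_rest_predict(document_vector, binary_classifiers):
    # binary_classifiers: dict mapping category -> scoring function
    # (hypothetical interface; any binary SVM would fit here)
    scores = {cat: clf(document_vector) for cat, clf in binary_classifiers.items()}
    return max(scores, key=scores.get)  # highest-scoring category wins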

Page 13

Related Work

Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP

They observed that HME is the better combination of BPs

Disadvantages

1) Costs much time; training is slow

2) Not practical

Page 14

Study Aim

Two problems

1) Huge dimensionality

2) Sparse distribution

Two successful methods

1) String vectors

2) A new neural network

Page 15

Method

Numerical Vectors

Figure 2.

Page 16

Method

Each word $w_k$ in the vector is weighted by TF-IDF:

$\mathrm{tfidf}(w_k) = tf_k \cdot \log \frac{N}{df_k}$

$tf_k$: frequency of the word $w_k$ in the document

$N$: total number of documents in the corpus

$df_k$: the number of documents in the corpus that include the word

Figure 3.
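
A minimal sketch of this weighting (log base and smoothing conventions vary across implementations):

import math

def tf_idf(word, document, corpus):
    # document: list of tokens; corpus: list of such token lists
    tf = document.count(word)                     # frequency in the document
    n = len(corpus)                               # total number of documents
    df = sum(1 for doc in corpus if word in doc)  # documents containing the word
    return tf * math.log(n / df) if df else 0.0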

Page 17

Method

Encoding a document into its string vector

Figure 4.
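
Figure 4 itself is not recoverable here, but the idea is that a document is represented by a fixed number of its own words instead of numeric weights. A minimal sketch, assuming the d most frequent words are selected (the paper's exact selection criterion may differ):

from collections import Counter

def to_string_vector(document, d=10):
    # Keep words (strings) as features rather than mapping them to
    # numbers, avoiding the huge sparse numerical vector.
    tokens = document.lower().split()
    return [word for word, _ in Counter(tokens).most_common(d)]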

Page 18

Method

Text Categorization Systems

Proposed neural network (NTC)

Consists of three layers:

1) Input layer

2) Output layer

3) Learning layer

Page 19

Method

Input layer: corresponds to each word in the string vector

Learning layer: nodes correspond to the predefined categories

Output layer: generates categorical scores; its nodes also correspond to the predefined categories

Figure 5.

Page 20

Method

A string vector is denoted by $x = [t_1, t_2, \ldots, t_d]$, where each $t_i$, $1 \le i \le d$, is a word

The predefined categories are denoted by $C = \{c_1, c_2, \ldots, c_{|C|}\}$, $1 \le j \le |C|$

$w_{ji}$ denotes the weight between the input node for $t_i$ and the learning node for category $c_j$

Figure 6.

Page 21

Method

$O_j$: the output node corresponding to category $c_j$; its value expresses the membership of the given input vector $x$ in category $c_j$

Figure 7.
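
Figure 7's exact formula is not recoverable here; a natural reading is that each output node sums the weights of the input words for its category. A minimal sketch under that assumption:

def ntc_classify(string_vector, weights, categories):
    # weights[c][t]: weight between word t and category c's learning node
    scores = {c: sum(weights[c].get(t, 0.0) for t in string_vector)
              for c in categories}
    return max(scores, key=scores.get)  # category with the highest score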

Page 22

Method

Each string vector in the training set has its own target category, $c_j$

If its classified category $c_k$ is identical to the target category $c_j$, the weights are left unchanged; otherwise they are adjusted

Figure 8.

Page 23

Method

Inhibits the weights toward the misclassified category

Minimizes the classification error

Figure 9.
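
Figures 8 and 9 presumably give the exact update rule; a minimal sketch of the behavior described on these two slides, reusing ntc_classify from above (the fixed learning rate and the reinforcement of the target category are assumptions):

def ntc_train_step(string_vector, target, weights, categories, lr=0.1):
    predicted = ntc_classify(string_vector, weights, categories)
    if predicted == target:
        return  # correctly classified: weights stay as they are
    for t in string_vector:
        # inhibit weights toward the misclassified category ...
        weights[predicted][t] = weights[predicted].get(t, 0.0) - lr
        # ... and (assumed) reinforce weights toward the target category
        weights[target][t] = weights[target].get(t, 0.0) + lr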

Page 24

Experiment

Evaluates the five approaches on a test bed called '20NewsGroups'

Each category contains an identical number of test documents

The test bed consists of 20 categories and 20,000 documents

Micro-averaged and macro-averaged measures are used (see the sketch below)
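
A minimal sketch of the two averaging schemes for per-category precision (hypothetical counts; recall and F1 follow the same pattern):

def micro_macro_precision(per_category):
    # per_category: list of (true_positives, false_positives) pairs,
    # one per category; assumes each category received predictions.
    macro = sum(tp / (tp + fp) for tp, fp in per_category) / len(per_category)
    tp_sum = sum(tp for tp, _ in per_category)
    fp_sum = sum(fp for _, fp in per_category)
    micro = tp_sum / (tp_sum + fp_sum)  # pools counts across categories
    return micro, macro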

Page 25

Experiment

Back propagation is the best approach

NB is the worst approach when the task is decomposed into binary classifications

Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition

Page 26

Experiment

Without decomposition, the classifier answers each test document with one of the 20 categories

Two groups exist:

1) Better group: BP and NTC

2) Worse group: NB and KNN

Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition

Page 27

Conclusion

A full inverted index was used as the basis for operations on string vectors, instead of a restricted-size similarity matrix

Note the trade-off between the two bases for operations on string vectors

NB and BP are considered modifiable into versions adapted to string vectors, but this may be insufficient for modifying other algorithms

Future research: modifying other machine learning algorithms to work on string vectors