
Neural Text Categorizer for Exclusive Text Categorization

Journal of Information Processing Systems, Vol.4, No.2, June 2008

Taeho Jo*

Presenter: 林昱志

Outline

Introduction

Related Work

Method

Experiment

Conclusion

Introduction

Two types of approaches to text categorization

① Rule-based: classification rules are defined manually in the form of if-then-else statements (see the sketch after this list)

Advantage

1) High precision

Disadvantages

1) Poor recall

2) Poor flexibility
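A minimal sketch of what such hand-written rules look like; the keyword tests and categories below are hypothetical, not taken from the paper:

```python
# A sketch of manual if-then-else categorization rules (hypothetical
# keywords and categories, for illustration only).
def categorize(document: str) -> str:
    text = document.lower()
    if "goal" in text and "league" in text:
        return "sports"
    elif "election" in text or "parliament" in text:
        return "politics"
    elif "stock" in text and "market" in text:
        return "economy"
    # A document matching no rule stays uncategorized, which is why
    # rule-based systems tend to have high precision but poor recall.
    return "unknown"

print(categorize("The league title was decided by a late goal."))  # sports
```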

Introduction

② Machine learning: categories are learned from sample labeled documents

Advantage

1) Much higher recall

Disadvantages

1) Slightly lower precision than rule-based

2) Poor flexibility

Introduction

This paper focuses on the machine-learning-based approach and sets the rule-based approach aside

In traditional approaches, all raw documents must be encoded into numerical vectors

Encoding documents this way leads to two main problems (illustrated in the sketch after this list)

1) Huge dimensionality

2) Sparse distribution
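A toy bag-of-words encoding (the corpus is my own example, not the paper's) showing both problems: the vector dimension equals the vocabulary size, and most entries are zero.

```python
# Toy bag-of-words encoding: one dimension per vocabulary word.
corpus = [
    "neural networks classify text",
    "support vector machines classify text",
    "stock markets fell sharply today",
]
vocabulary = sorted({w for doc in corpus for w in doc.split()})

def encode(doc: str) -> list:
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vec = encode(corpus[0])
print(len(vocabulary))                # dimension grows with the corpus
print(sum(1 for x in vec if x == 0))  # most entries are zero (sparse)
```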

Introduction

The paper proposes two solutions

1) String vectors –

Provide more transparency in classification

2) NTC (Neural Text Categorizer) –

Classifies documents with sufficient robustness

Solves the huge-dimensionality problem

Related Work

Machine learning algorithms applied to text categorization

1) KNN (K Nearest Neighbor)

2) NB (Naïve Bayes)

3) SVM (Support Vector Machine)

4) BP (Back Propagation)

Related Work

KNN was evaluated by Sebastiani (2002) as a simple algorithm competitive with the Support Vector Machine

Disadvantage

1) Very slow at classification time, since every query must be compared against the whole training set (see the sketch below)
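A minimal KNN sketch over bag-of-words counts; the cosine similarity measure and the toy training set are my assumptions for illustration.

```python
# Minimal k-nearest-neighbor text classification. Note that classifying one
# query scans the entire training set, which is why KNN is slow at query time.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, query: str, k: int = 3) -> str:
    q = Counter(query.split())
    scored = sorted(train, key=lambda ex: cosine(Counter(ex[0].split()), q),
                    reverse=True)
    top = [label for _, label in scored[:k]]  # labels of the k nearest
    return Counter(top).most_common(1)[0][0]  # majority vote

train = [("goal league match", "sports"), ("vote election party", "politics"),
         ("match cup final", "sports")]
print(knn_classify(train, "final match goal"))  # sports
```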

Related Work

Mladenic and Grobelnik (1999) evaluated feature selection methods within the application of NB

Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real system based on text categorization

Like the other approaches, NB requires encoding documents into numerical vectors

Related Work

SVM has become more popular than the KNN and NB machine learning algorithms

Defines a hyperplane as the boundary between classes

Applicable only to linearly separable distributions of training examples

Optimizes the weights of the inner products between training examples and the input vector, called Lagrange multipliers (see the sketch below)
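A sketch of the decision function this describes, f(x) = sign(Σ_i α_i y_i ⟨x_i, x⟩ + b); the support vectors, multipliers, and bias below are made-up values, not a trained model.

```python
# SVM decision function: a weighted sum of inner products between the input
# and the training examples, weighted by the Lagrange multipliers alpha_i.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def svm_decision(x, support_vectors, labels, alphas, bias):
    # f(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b )
    s = sum(a * y * dot(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + bias >= 0 else -1

support_vectors = [[1.0, 2.0], [2.0, 0.5]]  # hypothetical training examples
labels = [+1, -1]
alphas = [0.4, 0.4]                         # hypothetical Lagrange multipliers
bias = -0.1
print(svm_decision([1.5, 1.8], support_vectors, labels, alphas, bias))  # 1
```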

Related Work

Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1)

Figure 1. Two hyperplanes separating two classes with a maximal margin

Related Work

Advantage

1) Tolerant of the huge dimensionality of numerical vectors

Disadvantage

1) Applicable only to binary classification

2) Fragile with respect to representing documents as numerical vectors

Related Work

Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP

They observed that HME is the better combination of BPs

Disadvantages

1) Training costs much time and is slow

2) Not practical

Study Aim

Two problems

1) Huge dimensionality

2) Sparse distribution

Two successful methods

1) String vectors

2) A new neural network

Method

Numerical Vectors

Figure 2. Encoding a document into a numerical vector

Method

tf_k : frequency of the word w_k in the document

N : total number of documents in the corpus

df_k : number of documents in the corpus that include the word w_k

Figure 3. Weight of the word w_k computed from these quantities (the standard TF-IDF form, tf_k · log(N / df_k))
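Assuming the standard TF-IDF form implied by these definitions (the paper's exact formula is its Figure 3), a worked example; the symbol names and the toy corpus are my own:

```python
# Worked TF-IDF example: weight(w_k) = tf_k * log(N / df_k).
import math

corpus = [
    "neural text categorizer",
    "neural networks learn",
    "text categorization with vectors",
]
N = len(corpus)  # total number of documents in the corpus

def tf_idf(word: str, document: str) -> float:
    tf = document.split().count(word)                 # frequency of w_k
    df = sum(1 for d in corpus if word in d.split())  # docs containing w_k
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("neural", corpus[0]))  # tf=1, df=2 -> log(3/2) ≈ 0.405
```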

Method

Encoding a document into its string vector

Figure 4. Encoding a document into its string vector
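A sketch of string-vector encoding, under the assumption that the d entries are the document's highest-frequency words (the paper's exact ordering scheme is shown in its Figure 4):

```python
# A string vector represents a document by d words themselves, not by a
# numerical vector over the whole vocabulary, avoiding huge dimensionality.
from collections import Counter

def to_string_vector(document: str, d: int = 5) -> list:
    counts = Counter(document.split())
    return [word for word, _ in counts.most_common(d)]

doc = "neural text categorizer classifies text using string vectors of text"
print(to_string_vector(doc, d=3))  # ['text', 'neural', 'categorizer']
```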

Method

Text Categorization Systems

Proposed neural network (NTC)

Consists of three layers

1) Input layer

2) Output layer

3) Learning layer

Method

Input layer - each node corresponds to a word in the string vector

Learning layer - nodes correspond to the predefined categories

Output layer - nodes correspond to the predefined categories and generate the categorical scores

Figure 5. Architecture of the proposed NTC

Method

A string vector is denoted by x = [t_1, t_2, ..., t_d], where each t_i (1 ≤ i ≤ d) is a word

The set of predefined categories is denoted by C = {c_1, c_2, ..., c_|C|}, with 1 ≤ j ≤ |C|

The weight w_ji connects the input node for t_i with the node for category c_j; it is defined by the equation in Figure 6

Figure 6. Definition of the weight w_ji

Method

O_j : output node corresponding to the category c_j

Its score expresses the membership of the given input vector x in the category c_j (see the sketch below)

Figure 7. Computation of the categorical score O_j
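A minimal sketch of this forward pass, assuming O_j sums the weights w_ji of the words in the input string vector (the exact formula is the paper's Figure 7); the weight values are hypothetical:

```python
# NTC forward pass sketch: the output node O_j sums the weights w_ji of the
# words t_i in the input string vector for category c_j.
def categorical_score(string_vector, weights, category):
    # weights maps (category, word) -> w_ji; unseen words contribute 0
    return sum(weights.get((category, word), 0.0) for word in string_vector)

def classify(string_vector, weights, categories):
    # the category with the maximal output score is the classified category
    return max(categories,
               key=lambda c: categorical_score(string_vector, weights, c))

weights = {("sports", "goal"): 0.9, ("sports", "match"): 0.7,
           ("politics", "vote"): 0.8}  # hypothetical learned weights
print(classify(["goal", "match"], weights, ["sports", "politics"]))  # sports
```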

Method

Each string vector in the training set has its own target label c_j

If the classified category c_k is identical to the target category c_j, the weights are left unchanged; otherwise they are updated (Figure 8)

Figure 8. Weight update rule for misclassified examples

Method

Weights for the misclassified category are inhibited

This minimizes the classification error (see the sketch below)

Figure 9. Weight inhibition rule
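A sketch of this learning step; the fixed step size eta is an assumption for illustration, and the paper's exact rule is in its Figures 8 and 9.

```python
# NTC learning step sketch: on a misclassification, reinforce the target
# category's weights and inhibit the misclassified category's weights.
def update(weights, string_vector, target, classified, eta=0.1):
    if classified == target:
        return  # correct classification: weights are left unchanged
    for word in string_vector:
        weights[(target, word)] = weights.get((target, word), 0.0) + eta
        weights[(classified, word)] = weights.get((classified, word), 0.0) - eta

weights = {}
update(weights, ["goal", "match"], target="sports", classified="politics")
print(weights)  # sports weights reinforced, politics weights inhibited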

Experiment

The five approaches are evaluated on a test bed called '20NewsGroups'

Each category contains an identical number of test documents

The test bed consists of 20 categories and 20,000 documents

Results are measured with micro-averaged and macro-averaged methods (see the sketch below)
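A short sketch of the difference between the two averaging methods over per-category (tp, fp, fn) counts; the counts below are made up for illustration.

```python
# Macro-averaging averages per-category F1 scores equally; micro-averaging
# pools the raw counts across categories before computing F1.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = {"c1": (50, 10, 10), "c2": (5, 20, 15)}  # hypothetical per-category
macro = sum(f1(*c) for c in counts.values()) / len(counts)
micro = f1(*(sum(v) for v in zip(*counts.values())))
print(round(macro, 3), round(micro, 3))  # macro ≈ 0.528, micro ≈ 0.667
```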

Experiment

Back propagation is the best approach

With decomposition of the task, NB is the worst approach

Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition

Experiment

Each classifier answers each test document with one of the 20 categories

Two groups emerge

1) Better group - BP and NTC

2) Worse group - NB and KNN

Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition

Conclusion

A full inverted index was used as the basis for the operations on string vectors, instead of a restricted-size similarity matrix

Note the trade-off between the two bases for the operations on string vectors

NB and BP are considered for modification into versions adaptable to string vectors, but this may be insufficient for modifying other algorithms

Future research will address modifying other machine learning algorithms
