Hierarchical Classification with the small set of features. Yongwook Yoon, Apr 3, 2003. NLP Lab., POSTECH.


Page 1: Hierarchical Classification  with the small set of features

Hierarchical Classification

with the small set of features

Yongwook Yoon, Apr 3, 2003

NLP Lab., POSTECH

Page 2: Hierarchical Classification  with the small set of features


Contents

Document Classification
Bayesian Classifier
Feature Selection
Experiment & Results
Candidate Baseline System
Future Work

Page 3: Hierarchical Classification  with the small set of features


Hierarchical Classification

A massive number of documents is produced daily, from the WWW or from intranet environments such as those within an enterprise.
Topic hierarchies already exist for these documents, so we need a classifier capable of classifying documents hierarchically according to those topics.
History of hierarchical collections: MEDLINE (the medical literature collection maintained by NLM), patent document collections, and, more recently, Web search sites such as Yahoo, Infoseek, and Google.

Page 4: Hierarchical Classification  with the small set of features


Simple vs. Hierarchical classification

(Diagram: in the simple scheme, documents D1 D2 D3 ... Dn hang directly under a single Root class; in the hierarchical scheme they are routed from the Root through intermediate topics such as Business and its subtopics Grain and Oil.)

Page 5: Hierarchical Classification  with the small set of features


Why Hierarchy? The simplistic approach breaks down.
It flattens the class space under a single root. For a large corpus with hundreds of classes and thousands of features, the computational cost is prohibitive.
The resulting classifier is very large, and its many thousands of parameters lead to overfitting of the training data.
It also loses the intuition that topics that are close to each other in the hierarchy have a lot more in common with each other.
In hierarchical classification, feature selection can be a useful tool for dealing with these issues.

Page 6: Hierarchical Classification  with the small set of features


Bayesian Classifier
A Bayesian classifier is simply a Bayesian network applied to a classification domain.
It contains a node C for the unobservable class variable and a node $X_i$ for each of the features.
Given a specific instance $x = (x_1, x_2, \ldots, x_n)$, it gives the class probability $P(C = c_k \mid X = x)$ for each possible class $c_k$.

Bayes-optimal classification: select the class $c_k$ for which the probability $P(c_k \mid x)$ is maximized:

$c_k^{*} = \arg\max_{c_k} P(c_k \mid x) = \arg\max_{c_k} P(x \mid c_k)\, P(c_k)$
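Below is a minimal, hedged sketch of this argmax rule using naive Bayes likelihoods with add-one smoothing; the function and variable names (train, classify, docs as (tokens, label) pairs) are illustrative choices, not part of the presentation.

```python
import math
from collections import Counter, defaultdict

# Bayes-optimal rule with naive Bayes likelihoods:
#   c_k* = argmax_k  log P(c_k) + sum_i log P(x_i | c_k)

def train(docs):
    """docs: iterable of (tokens, label) pairs."""
    class_counts = Counter()               # document counts per class
    word_counts = defaultdict(Counter)     # word counts per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total_docs)                  # log prior P(c_k)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:                                    # add-one smoothed P(x_i | c_k)
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```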

Page 7: Hierarchical Classification  with the small set of features


Bayesian Network (1/2)

Naïve Bayesian classifier: makes the very restrictive assumption of conditional independence,

$P(X \mid C) = \prod_i P(X_i \mid C)$

Simple and unrealistic, but still widely used.
More complex forms are more expressive, augmenting the model with some dependencies between the features, but inducing an optimal Bayesian classifier is NP-hard.

Page 8: Hierarchical Classification  with the small set of features


Bayesian Network (2/2)

Two main solutions to this problem:
Tree-augmented network (TAN): restricts each node to at most one additional parent; an optimal classifier can be found in time quadratic in the number of features.
KDB algorithm: each node has at most k parents. It chooses as the parents of a node $X_i$ the k other features that $X_i$ is most dependent on, using the class-conditional mutual information $I(X_i; X_j \mid C)$ as the metric.
Computational complexity: network construction is quadratic in the total number of features; parameter estimation (i.e., building the conditional probability tables) is exponential in k.
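The parent-selection metric can be estimated directly from counts. The sketch below is an assumed maximum-likelihood estimate of $I(X_i; X_j \mid C)$ for discrete feature arrays; the function name and the final ranking comment are illustrative, not taken from the presentation.

```python
import numpy as np

def conditional_mutual_information(xi, xj, c):
    """Empirical class-conditional mutual information I(Xi; Xj | C).

    xi, xj, c are equal-length 1-D arrays of discrete values; the estimate
    is plain maximum likelihood over co-occurrence counts.
    """
    xi, xj, c = (np.asarray(a) for a in (xi, xj, c))
    mi = 0.0
    for cv in np.unique(c):
        in_c = (c == cv)
        p_c = in_c.mean()
        for a in np.unique(xi):
            for b in np.unique(xj):
                p_abc = np.mean(in_c & (xi == a) & (xj == b))   # P(a, b, c)
                p_ac = np.mean(in_c & (xi == a))                # P(a, c)
                p_bc = np.mean(in_c & (xj == b))                # P(b, c)
                if p_abc > 0:
                    # P(a, b, c) * log [ P(a, b | c) / (P(a | c) P(b | c)) ]
                    mi += p_abc * np.log(p_abc * p_c / (p_ac * p_bc))
    return mi

# For KDB, the parents of feature i would be the k other features j with the
# largest conditional_mutual_information(X[:, i], X[:, j], y).
```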

Page 9: Hierarchical Classification  with the small set of features


Feature Selection (1/2)

We have a feature for every word that appears in any document in the corpus, so the computation would be prohibitive even if the TAN or KDB algorithm is applied. Feature selection is therefore imperative.
But a simple reduction of the feature set, without combining it with the hierarchical classifier, does not achieve high performance, because the set of features required to discriminate the topics varies widely from one node to another, e.g. (agriculture vs. computer) near the root versus (corn vs. wheat) deeper in the hierarchy.
We adopt a method based on information-theoretic measures: it determines a subset of the original domain features that seems to best capture the class distribution in the data.

Page 10: Hierarchical Classification  with the small set of features


Feature Selection (2/2)

Formally, the cross-entropy metric between two distributions μ and σ, the "distance" between μ and σ, is defined as

$D(\mu, \sigma) = \sum_{x} \mu(x) \log \frac{\mu(x)}{\sigma(x)}$

The algorithm:
For each feature $X_i$, determine the expected cross-entropy $\delta_i = P(X_i)\, D\big(P(C \mid X),\, P(C \mid X_{-i})\big)$, where $X_{-i}$ is the set of all domain features except $X_i$.
Then eliminate the feature $X_i$ for which $\delta_i$ is minimized. This process can be iterated to eliminate as many features as desired.
To compute $P(C \mid X)$, the algorithm uses the naïve Bayes model for speed and simplicity.
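A rough sketch of this elimination step follows. It assumes the posteriors $P(C \mid x)$ come from the naïve Bayes model mentioned above, and it approximates the expectation in $\delta_i$ by averaging the cross-entropy over the training documents; that averaging, and all the function names, are assumptions of the sketch rather than the exact formulation in the slides.

```python
import numpy as np

def cross_entropy(mu, sigma, eps=1e-12):
    """D(mu, sigma) = sum_x mu(x) log(mu(x) / sigma(x)) for class distributions."""
    mu = np.asarray(mu, dtype=float) + eps
    sigma = np.asarray(sigma, dtype=float) + eps
    return float(np.sum(mu * np.log(mu / sigma)))

def expected_loss(posteriors_full, posteriors_without_i):
    """Approximate delta_i by averaging D(P(C|x), P(C|x_{-i})) over documents.

    posteriors_full[d]      : P(C | x_d) using all currently kept features
    posteriors_without_i[d] : the same posterior with feature i dropped
    """
    return float(np.mean([cross_entropy(p, q)
                          for p, q in zip(posteriors_full, posteriors_without_i)]))

# Greedy elimination (schematic): while more features remain than desired, drop
# the feature i whose expected_loss is smallest, i.e. the feature whose removal
# perturbs P(C | X) the least, then recompute the posteriors and repeat.
```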

Page 11: Hierarchical Classification  with the small set of features


Experiment
The source corpus: the Reuters-22173 dataset. It does not come with a pre-determined hierarchical classification scheme, but each document can carry multiple labels.
Goal: to construct two hierarchically classified document collections.
Refinement of the corpus: select two subsets of the collection, named "Hier1" and "Hier2"; for each document, assign only one major topic and one minor topic; all documents are grouped together at the top level, named "Current Business".
Next, each of these datasets was split 70%/30% into training and test sets.

Page 12: Hierarchical Classification  with the small set of features


(Table of dataset statistics for the Hier1 and Hier2 collections; not reproduced in this transcription.)

Page 13: Hierarchical Classification  with the small set of features


Experimental Methodology

Learning phase:
Feature selection for the first tier of the hierarchy, using just the major topics as classes; next, build a probabilistic classifier with the reduced feature set.
For each major topic, a separate round of probabilistic feature selection is then employed; finally, a separate classifier is constructed for each major topic on its own reduced feature set.
Testing phase:
Each test document is classified by the first-level classifier and then sent down to the chosen second-level classifier (a sketch of this two-tier flow is given below).
In the flat classification scheme, feature selection is still performed, but only one classifier is induced.
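The following is a structural sketch of that two-tier flow under stated assumptions: select_features, train_classifier and classify are hypothetical callables standing in for the probabilistic feature selection and Bayesian classifiers described above, and each document is assumed to expose .major and .minor topic labels.

```python
def train_hierarchy(docs, select_features, train_classifier, k=20):
    # 1) Feature selection and one classifier for the first tier (major topics).
    top_features = select_features(docs, target="major", k=k)
    top_model = train_classifier(docs, top_features, target="major")

    # 2) A separate feature set and classifier per major topic (minor topics).
    sub_features, sub_models = {}, {}
    for major in {d.major for d in docs}:
        subset = [d for d in docs if d.major == major]
        sub_features[major] = select_features(subset, target="minor", k=k)
        sub_models[major] = train_classifier(subset, sub_features[major], target="minor")
    return top_model, top_features, sub_models, sub_features

def classify_hierarchically(doc, top_model, top_features, sub_models, sub_features, classify):
    major = classify(top_model, doc, top_features)                  # first-level decision
    minor = classify(sub_models[major], doc, sub_features[major])   # routed to second level
    return major, minor
```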

Page 14: Hierarchical Classification  with the small set of features


Results - Baseline
Without any probabilistic feature selection, the very large number of features never lets the hierarchical method perform better than the simple flat method, and it allows the more expressive KDB algorithm to overfit the training data.
This leads to the need for feature selection.

Page 15: Hierarchical Classification  with the small set of features


Results – with feature selection (1/2)

The feature set at each node is reduced from 1258 features to 20, and then to 10.
Recall, however, that a potentially very different set of 10 or 20 features is selected at each node in the hierarchy, so the method as a whole actually examines a much larger set of features.

Page 16: Hierarchical Classification  with the small set of features


Results – with feature selection (2/2)

There is an overall improvement in accuracy over the baseline results, and of the hierarchical method over the flat method.
The only exception is the case of (Hier2, KDB-1): that classifier was trained on only 24 instances, so the result is quite possibly a statistical anomaly.
The Hier1 dataset, which has more instances for induction, does not encounter such a problem.

Page 17: Hierarchical Classification  with the small set of features


Results – analysis cont’d

In the case of (Flat, #features = 50), the accuracy is very close to the "Hier" cases such as TAN and KDB.
But the complexity of learning the classifiers is not comparable: it is quadratic in the number of features ($10^2$ vs. $50^2$).

Conclusion:
Feature selection applied together with hierarchical classification yields far better performance than in the simple (flat) classification scheme.
Simple Bayesian networks such as TAN or KDB can be combined well with the hierarchical classification scheme.

Page 18: Hierarchical Classification  with the small set of features


Candidate System
Requirements:
Hierarchical classification; feedback of classification results (online adaptive learning); support for various types of data (heterogeneous data sources).
Experiment environment:
Modeled after a real deployed system, although running on different hardware; it shares the same specification of the system structure and functions, and the training and test datasets come from that real system.

Page 19: Hierarchical Classification  with the small set of features


Enterprise Information Systems - KT

KMS (Knowledge Management System)
Systematically manages the diverse knowledge produced within the company and raises its utilization, contributing to corporate activities.
Each employee registers the information (knowledge) they produce in the system, and searches and extracts the information needed for their work.
Knowledge candidates: documents, meeting materials, business knowledge, proposals, literature, etc.
Management scheme: a knowledge map organized around 569 personnel job categories.

Integrated document system (Groupware)
Systematically manages documents circulating within the company, e-mail, messages, department/personal information, etc.
For documents, all functions from drafting, approval, production, and delivery to storage and retrieval are handled in one unified system.
Jointly developed through a strategic alliance with Microsoft, with a web interface.
Also provides per-department Q&A boards and an employee-to-employee messenger function.

Page 23: Hierarchical Classification  with the small set of features


Future Work
Target baseline system: basic hierarchical classification with some real data.
Research issues (+α):
Hierarchy: utilize other implemented systems (the BOW toolkit).
Online learning: efficient and appropriate adaptive-learning algorithms (Bayesian online perceptron, online Gaussian processes); automatic expansion and extinction of the lowest level of subtopics over time.
Pre-processing of the raw corpus.
Integration of heterogeneous data types: text, tables, images, e-mails, specially formatted texts, etc.

Page 24: Hierarchical Classification  with the small set of features

The End