Hierarchical Classification with the small set of features. Yongwook Yoon, Apr 3, 2003. NLP Lab., POSTECH.


Page 1: Hierarchical Classification  with the small set of features

Hierarchical Classification

with the small set of features

Yongwook Yoon, Apr 3, 2003

NLP Lab., POSTECH

Page 2: Hierarchical Classification  with the small set of features


Contents

Document Classification
Bayesian Classifier
Feature Selection
Experiment & Results
Candidate Baseline System
Future Work

Page 3: Hierarchical Classification  with the small set of features


Hierarchical Classification

A massive number of documents is produced daily, from the WWW or from intranet environments such as those within an enterprise.
Topic hierarchies already exist for these documents, so we need a classifier capable of classifying documents hierarchically according to those topics.
History of hierarchical collections: MEDLINE (the medical literature collection maintained by NLM), patent document collections, and, more recently, Web search sites such as Yahoo, Infoseek, and Google.

Page 4: Hierarchical Classification  with the small set of features


Simple vs. Hierarchical classification

(Diagram: in the simple scheme, documents D1 D2 D3 ... Dn hang directly under a single Root class; in the hierarchical scheme they are routed from the Root through intermediate topics such as Business and its subtopics Grain and Oil.)

Page 5: Hierarchical Classification  with the small set of features


Why Hierarchy? The simplistic approach breaks down.
It flattens the class space under a single root. For a large corpus with hundreds of classes and thousands of features, the computational cost is prohibitive.
The resulting classifier is very large, and its many thousands of parameters lead to overfitting of the training data.
It also loses the intuition that topics that are close to each other in the hierarchy have a lot more in common with each other.
In hierarchical classification, feature selection can be a useful tool for dealing with these issues.

Page 6: Hierarchical Classification  with the small set of features


Bayesian Classifier
A Bayesian classifier is simply a Bayesian network applied to a classification domain.
It contains a node C for the unobservable class variable and a node $X_i$ for each of the features.
Given a specific instance $x = (x_1, x_2, \ldots, x_n)$, it gives the class probability $P(C = c_k \mid X = x)$ for each possible class $c_k$.

Bayes-optimal classification: select the class $c_k$ for which the probability $P(c_k \mid x)$ is maximized:

$c_k^{*} = \arg\max_{c_k} P(c_k \mid x) = \arg\max_{c_k} P(x \mid c_k)\, P(c_k)$
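Below is a minimal, hedged sketch of this argmax rule using naive Bayes likelihoods with add-one smoothing; the function and variable names (train, classify, docs as (tokens, label) pairs) are illustrative choices, not part of the presentation.

```python
import math
from collections import Counter, defaultdict

# Bayes-optimal rule with naive Bayes likelihoods:
#   c_k* = argmax_k  log P(c_k) + sum_i log P(x_i | c_k)

def train(docs):
    """docs: iterable of (tokens, label) pairs."""
    class_counts = Counter()               # document counts per class
    word_counts = defaultdict(Counter)     # word counts per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total_docs)                  # log prior P(c_k)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:                                    # add-one smoothed P(x_i | c_k)
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```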

Page 7: Hierarchical Classification  with the small set of features


Bayesian Network (1/2)

Naïve Bayesian classifier: makes the very restrictive assumption of conditional independence,

$P(X \mid C) = \prod_i P(X_i \mid C)$

Simple and unrealistic, but still widely used.
More complex forms are more expressive, augmenting the model with some dependencies between the features, but inducing an optimal Bayesian classifier is NP-hard.

Page 8: Hierarchical Classification  with the small set of features


Bayesian Network (2/2)

Two main solutions to this problem:
Tree-augmented network (TAN): restricts each node to at most one additional parent; an optimal classifier can be found in time quadratic in the number of features.
KDB algorithm: each node has at most k parents. It chooses as the parents of a node $X_i$ the k other features that $X_i$ is most dependent on, using the class-conditional mutual information $I(X_i; X_j \mid C)$ as the metric.
Computational complexity: network construction is quadratic in the total number of features; parameter estimation (i.e., building the conditional probability tables) is exponential in k.
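The parent-selection metric can be estimated directly from counts. The sketch below is an assumed maximum-likelihood estimate of $I(X_i; X_j \mid C)$ for discrete feature arrays; the function name and the final ranking comment are illustrative, not taken from the presentation.

```python
import numpy as np

def conditional_mutual_information(xi, xj, c):
    """Empirical class-conditional mutual information I(Xi; Xj | C).

    xi, xj, c are equal-length 1-D arrays of discrete values; the estimate
    is plain maximum likelihood over co-occurrence counts.
    """
    xi, xj, c = (np.asarray(a) for a in (xi, xj, c))
    mi = 0.0
    for cv in np.unique(c):
        in_c = (c == cv)
        p_c = in_c.mean()
        for a in np.unique(xi):
            for b in np.unique(xj):
                p_abc = np.mean(in_c & (xi == a) & (xj == b))   # P(a, b, c)
                p_ac = np.mean(in_c & (xi == a))                # P(a, c)
                p_bc = np.mean(in_c & (xj == b))                # P(b, c)
                if p_abc > 0:
                    # P(a, b, c) * log [ P(a, b | c) / (P(a | c) P(b | c)) ]
                    mi += p_abc * np.log(p_abc * p_c / (p_ac * p_bc))
    return mi

# For KDB, the parents of feature i would be the k other features j with the
# largest conditional_mutual_information(X[:, i], X[:, j], y).
```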

Page 9: Hierarchical Classification  with the small set of features


Feature Selection (1/2)

We have a feature for every word that appears in any document in the corpus, so the computation would be prohibitive even if the TAN or KDB algorithm is applied. Feature selection is therefore imperative.
But a simple reduction of the feature set, without combining it with the hierarchical classifier, does not achieve high performance, because the set of features required to discriminate the topics varies widely from one node to another, e.g. (agriculture vs. computer) near the root versus (corn vs. wheat) deeper in the hierarchy.
We adopt a method based on information-theoretic measures: it determines a subset of the original domain features that seems to best capture the class distribution in the data.

Page 10: Hierarchical Classification  with the small set of features


Feature Selection (2/2)

Formally, the cross-entropy metric between two distributions μ and σ, the "distance" between μ and σ, is defined as

$D(\mu, \sigma) = \sum_{x} \mu(x) \log \frac{\mu(x)}{\sigma(x)}$

The algorithm:
For each feature $X_i$, determine the expected cross-entropy $\delta_i = P(X_i)\, D\big(P(C \mid X),\, P(C \mid X_{-i})\big)$, where $X_{-i}$ is the set of all domain features except $X_i$.
Then eliminate the feature $X_i$ for which $\delta_i$ is minimized. This process can be iterated to eliminate as many features as desired.
To compute $P(C \mid X)$, the algorithm uses the naïve Bayes model for speed and simplicity.
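A rough sketch of this elimination step follows. It assumes the posteriors $P(C \mid x)$ come from the naïve Bayes model mentioned above, and it approximates the expectation in $\delta_i$ by averaging the cross-entropy over the training documents; that averaging, and all the function names, are assumptions of the sketch rather than the exact formulation in the slides.

```python
import numpy as np

def cross_entropy(mu, sigma, eps=1e-12):
    """D(mu, sigma) = sum_x mu(x) log(mu(x) / sigma(x)) for class distributions."""
    mu = np.asarray(mu, dtype=float) + eps
    sigma = np.asarray(sigma, dtype=float) + eps
    return float(np.sum(mu * np.log(mu / sigma)))

def expected_loss(posteriors_full, posteriors_without_i):
    """Approximate delta_i by averaging D(P(C|x), P(C|x_{-i})) over documents.

    posteriors_full[d]      : P(C | x_d) using all currently kept features
    posteriors_without_i[d] : the same posterior with feature i dropped
    """
    return float(np.mean([cross_entropy(p, q)
                          for p, q in zip(posteriors_full, posteriors_without_i)]))

# Greedy elimination (schematic): while more features remain than desired, drop
# the feature i whose expected_loss is smallest, i.e. the feature whose removal
# perturbs P(C | X) the least, then recompute the posteriors and repeat.
```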

Page 11: Hierarchical Classification  with the small set of features


Experiment
The source corpus: the Reuters-22173 dataset. It does not come with a pre-determined hierarchical classification scheme, but each document can carry multiple labels.
Goal: to construct two hierarchically classified document collections.
Refinement of the corpus: select two subsets of the collection, named "Hier1" and "Hier2"; for each document, assign only one major topic and one minor topic; all documents are grouped together at the top level, named "Current Business".
Next, each of these datasets was split 70%/30% into training and test sets.

Page 12: Hierarchical Classification  with the small set of features


(Table of dataset statistics for the Hier1 and Hier2 collections; not reproduced in this transcription.)

Page 13: Hierarchical Classification  with the small set of features


Experimental Methodology

Learning phase:
Feature selection for the first tier of the hierarchy, using just the major topics as classes; next, build a probabilistic classifier with the reduced feature set.
For each major topic, a separate round of probabilistic feature selection is then employed; finally, a separate classifier is constructed for each major topic on its own reduced feature set.
Testing phase:
Each test document is classified by the first-level classifier and then sent down to the chosen second-level classifier (a sketch of this two-tier flow is given below).
In the flat classification scheme, feature selection is still performed, but only one classifier is induced.
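The following is a structural sketch of that two-tier flow under stated assumptions: select_features, train_classifier and classify are hypothetical callables standing in for the probabilistic feature selection and Bayesian classifiers described above, and each document is assumed to expose .major and .minor topic labels.

```python
def train_hierarchy(docs, select_features, train_classifier, k=20):
    # 1) Feature selection and one classifier for the first tier (major topics).
    top_features = select_features(docs, target="major", k=k)
    top_model = train_classifier(docs, top_features, target="major")

    # 2) A separate feature set and classifier per major topic (minor topics).
    sub_features, sub_models = {}, {}
    for major in {d.major for d in docs}:
        subset = [d for d in docs if d.major == major]
        sub_features[major] = select_features(subset, target="minor", k=k)
        sub_models[major] = train_classifier(subset, sub_features[major], target="minor")
    return top_model, top_features, sub_models, sub_features

def classify_hierarchically(doc, top_model, top_features, sub_models, sub_features, classify):
    major = classify(top_model, doc, top_features)                  # first-level decision
    minor = classify(sub_models[major], doc, sub_features[major])   # routed to second level
    return major, minor
```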

Page 14: Hierarchical Classification  with the small set of features


Results - Baseline
Without any probabilistic feature selection, the very large number of features never lets the hierarchical method perform better than the simple flat method, and it allows the more expressive KDB algorithm to overfit the training data.
This leads to the need for feature selection.

Page 15: Hierarchical Classification  with the small set of features


Results – with feature selection (1/2)

The feature set at each node is reduced from 1258 features to 20, and then to 10.
Recall, however, that a potentially very different set of 10 or 20 features is selected at each node in the hierarchy, so the method as a whole actually examines a much larger set of features.

Page 16: Hierarchical Classification  with the small set of features


Results – with feature selection (2/2)

There is an overall improvement in accuracy over the baseline results, and of the hierarchical method over the flat method.
The only exception is the case of (Hier2, KDB-1): that classifier was trained on only 24 instances, so the result is quite possibly a statistical anomaly.
The Hier1 dataset, which has more instances for induction, does not encounter such a problem.

Page 17: Hierarchical Classification  with the small set of features


Results – analysis cont’d

In the case of (Flat, #features = 50), the accuracy is very close to the "Hier" cases such as TAN and KDB.
But the complexity of learning the classifiers is not comparable: it is quadratic in the number of features ($10^2$ vs. $50^2$).

Conclusion:
Feature selection applied together with hierarchical classification yields far better performance than in the simple (flat) classification scheme.
Simple Bayesian networks such as TAN or KDB can be combined well with the hierarchical classification scheme.

Page 18: Hierarchical Classification  with the small set of features


Candidate System
Requirements:
Hierarchical classification; feedback of classification results (online adaptive learning); support for various types of data (heterogeneous data sources).
Experiment environment:
Modeled after a real deployed system, although running on different hardware; it shares the same specification of the system structure and functions, and the training and test datasets come from that real system.

Page 19: Hierarchical Classification  with the small set of features


Enterprise Information Systems - KT

KMS (Knowledge Management System)
Systematically manages the diverse knowledge produced within the company and raises its utilization, contributing to corporate activities.
Each employee registers the information (knowledge) they produce in the system, and searches and extracts the information needed for their work.
Knowledge candidates: documents, meeting materials, business knowledge, proposals, literature, etc.
Management scheme: a knowledge map organized around 569 personnel job categories.

Integrated document system (Groupware)
Systematically manages documents circulating within the company, e-mail, messages, department/personal information, etc.
For documents, all functions from drafting, approval, production, and delivery to storage and retrieval are handled in one unified system.
Jointly developed through a strategic alliance with Microsoft, with a web interface.
Also provides per-department Q&A boards and an employee-to-employee messenger function.

Page 23: Hierarchical Classification  with the small set of features


Future Work
Target baseline system: basic hierarchical classification with some real data.
Research issues (+α):
Hierarchy: utilize other implemented systems (the BOW toolkit).
Online learning: efficient and appropriate adaptive-learning algorithms (Bayesian online perceptron, online Gaussian processes); automatic expansion and extinction of the lowest level of subtopics over time.
Pre-processing of the raw corpus.
Integration of heterogeneous data types: text, tables, images, e-mails, specially formatted texts, etc.

Page 24: Hierarchical Classification  with the small set of features

The End