Hierarchical Classification
with a Small Set of Features
Yongwook Yoon
Apr 3, 2003
NLP Lab., POSTECH
2/24
Contents
Document Classification
Bayesian Classifier
Feature Selection
Experiment & Results
Candidate Baseline System
Future Work
3/24
Hierarchical Classification
Massive numbers of documents are produced daily
  From the WWW, or from intranet environments within an enterprise
Topic hierarchies already exist for those documents
  We need a classifier capable of classifying documents hierarchically according to those topics
History of hierarchical collections
  MEDLINE (medical literature maintained by NLM)
  Patent document collections
  Recently, Web search sites: Yahoo, Infoseek, Google
4/24
Simple vs. Hierarchical classification
[Diagram: in the simple (flat) scheme, documents D1 … Dn hang directly off a single Root node; in the hierarchical scheme, the Root branches into topic nodes such as Business, Grain, and Oil, and the documents D1 … Di, Dj, Dj+1 … Dn are grouped under the matching topics.]
5/24
Why Hierarchy?
The simplistic approach breaks down
  A flattened class space with only one root
  For a large corpus, however:
    Hundreds of classes and thousands of features
    The computational cost is prohibitive
    The resulting classifier is very large
      Many thousands of parameters lead to overfitting of the training data
    It loses the intuition that topics close to each other have much more in common with each other
In hierarchical classification,
  Feature selection can be a useful tool in dealing with this issue
6/24
Bayesian Classifier
A Bayesian classifier is simply a Bayesian network applied to a classification domain
  It contains a node C for the unobservable class variable
  And a node Xi for each of the features
Given a specific instance x = (x1, x2, …, xn),
  It gives the class probability P(C=ck | X=x) for each possible class ck
Bayes-optimal classification
  Select the class ck for which the probability P(ck | x) is maximized:
  c* = argmax_ck P(ck | x) = argmax_ck P(x | ck) P(ck)
7/24
Bayesian Network (1/2)
Naïve Bayesian Classifier
  Very restrictive assumption: independence of the features given the class
    P(X | C) = Πi P(Xi | C)
  Simple and unrealistic, but still widely used
More complex forms are more expressive
  Augment the network with some dependencies between features
  However, inducing an optimal Bayesian classifier is NP-hard
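Under the independence assumption above, classification reduces to picking the class that maximizes the prior times a product of per-feature likelihoods. A minimal sketch in Python, using toy binary word-presence features and made-up documents (not the data from this talk):

```python
import math
from collections import Counter, defaultdict

# Toy training data: (features, class), with binary word-presence features.
# These documents are hypothetical, for illustration only.
train = [
    ((1, 0, 1), "grain"),
    ((1, 1, 0), "grain"),
    ((0, 1, 1), "oil"),
    ((0, 0, 1), "oil"),
]

# Estimate P(C) and P(Xi | C) from counts, with Laplace smoothing.
class_counts = Counter(c for _, c in train)
feat_counts = defaultdict(Counter)  # (class, i) -> Counter over feature values
for x, c in train:
    for i, v in enumerate(x):
        feat_counts[(c, i)][v] += 1

def log_posterior(x, c):
    # log P(c) + sum_i log P(x_i | c): the naive Bayes factorization
    lp = math.log(class_counts[c] / len(train))
    for i, v in enumerate(x):
        counts = feat_counts[(c, i)]
        lp += math.log((counts[v] + 1) / (class_counts[c] + 2))
    return lp

def classify(x):
    # Bayes-optimal decision: argmax_c P(c | x)
    return max(class_counts, key=lambda c: log_posterior(x, c))

print(classify((1, 1, 1)))
```

Working in log space avoids underflow when the product runs over thousands of word features.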
8/24
Bayesian Network (2/2)
Two main solutions to this problem
Tree-augmented network (TAN)
  Restricts each node to have at most one additional parent
  An optimal classifier can be found in time quadratic in the number of features
KDB algorithm
  Each node has at most k parents
  Chooses as the parents of a node Xi the k other features that Xi is most dependent on, using a metric of class-conditional mutual information, I(Xi; Xj | C)
Computational complexity
  Network construction: quadratic in the total number of features
  Parameter estimation (i.e., conditional probability table construction): exponential in k
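The class-conditional mutual information used to rank candidate parents can be estimated directly from counts. A sketch under the assumption of discrete features and a plug-in (empirical) estimate; the function name is my own:

```python
import math
from collections import Counter

def cond_mutual_info(samples):
    """I(X; Y | C) from a list of (x, y, c) observations.

    Plug-in estimate: sum over observed (x, y, c) of
    P(x, y, c) * log( P(x, y | c) / (P(x | c) P(y | c)) ).
    """
    n = len(samples)
    pxyc = Counter(samples)
    pxc = Counter((x, c) for x, _, c in samples)
    pyc = Counter((y, c) for _, y, c in samples)
    pc = Counter(c for _, _, c in samples)
    info = 0.0
    for (x, y, c), nxyc in pxyc.items():
        p_joint = nxyc / n
        # P(x,y|c) / (P(x|c) P(y|c)) expressed entirely via counts
        ratio = (nxyc * pc[c]) / (pxc[(x, c)] * pyc[(y, c)])
        info += p_joint * math.log(ratio)
    return info

# If X == Y always, I(X;Y|C) is maximal; if X,Y are independent given C, it is ~0.
dependent = [(v, v, c) for v in (0, 1) for c in ("a", "b")]
print(cond_mutual_info(dependent))
```

KDB would evaluate this metric for every feature pair, which is where the quadratic network-construction cost comes from.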
9/24
Feature Selection (1/2)
We have a feature for every word that appears in any document in the corpus
  The computation would be prohibitive even if the TAN or KDB algorithm were applied
  So feature selection is imperative
But simply reducing the features, without combining the reduction with the hierarchical classifier, does not yield high performance
  Because the set of features required to classify the topics varies widely from one node to another
  Ex) (agriculture and computer) vs. (corn and wheat)
Adopt a method based on information-theoretic measures
  It determines a subset of the original domain features that seems to best capture the class distribution in the data
10/24
Feature Selection (2/2)
Formally, the cross-entropy metric between two distributions μ and σ is defined as
  D(μ, σ) = Σx μ(x) log( μ(x) / σ(x) ),
the "distance" between μ and σ
The algorithm
  For each feature Xi, determine the expected cross-entropy
    δi = Σx P(x) · D( P(C | x), P(C | x−i) )
  where X−i is the set of all domain features except Xi
  Then eliminate the feature Xi for which δi is minimized
  This process can be iterated to eliminate as many features as desired
  To compute P(C | X), the algorithm uses the naïve Bayes model for speed and simplicity
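The elimination step can be sketched as follows. The model parameters below are hypothetical; the naïve Bayes posterior stands in for P(C|x) as the slide suggests, and a simple average over sample instances stands in for the expectation over P(x):

```python
import math

# Hypothetical naive Bayes parameters over two binary features, two classes.
priors = {"a": 0.5, "b": 0.5}
# cond[c][i][v] = P(X_i = v | C = c)
cond = {
    "a": [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
    "b": [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
}

def posterior(x, skip=None):
    # P(C | x) under naive Bayes, optionally ignoring feature `skip`
    scores = {}
    for c, p in priors.items():
        for i, v in enumerate(x):
            if i != skip:
                p *= cond[c][i][v]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def kl(mu, sigma):
    # D(mu, sigma) = sum_x mu(x) log( mu(x) / sigma(x) )
    return sum(mu[c] * math.log(mu[c] / sigma[c]) for c in mu)

def expected_delta(i, instances):
    # delta_i: average distance between P(C|x) and P(C|x_-i)
    return sum(kl(posterior(x), posterior(x, skip=i)) for x in instances) / len(instances)

instances = [(0, 0), (1, 1), (0, 1), (1, 0)]
deltas = [expected_delta(i, instances) for i in range(2)]
# Feature 1 is uninformative here (P(X_1|C) is identical for both classes),
# so its delta is ~0 and it would be eliminated first.
print(deltas)
```

Dropping the feature with minimal δi perturbs the class distribution least, which is exactly the criterion stated above.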
11/24
Experiment
The source corpus
  Reuters-22173 dataset
    Has no pre-determined hierarchical classification scheme
    But each document can have multiple labels
Goal
  To construct two hierarchically classified document collections
Refinement of the corpus
  Select two subsets from the collection, named "Hier1" and "Hier2"
  For each document, assign only one major topic and one minor topic
  All of them are grouped together at the top level, named "Current Business"
  Next, each of these datasets was split 70%/30% into training and test sets
13/24
Experimental Methodology
Learning phase
  Feature selection for the first tier of the hierarchy, using just the major topics as classes
  Next, build a probabilistic classifier with the reduced feature set
  For each major topic, a separate round of probabilistic feature selection is employed
  Finally, construct a separate classifier for each major topic on the appropriate reduced feature set
Testing phase
  Test documents are classified by the first-level classifier, then sent down to the chosen second-level classifier
In the flat classification scheme,
  Feature selection is performed, but only one classifier is induced
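The testing-phase routing can be sketched as a small tree of per-node classifiers, each trained on its own reduced feature set. The word-keyed classifiers below are trivial stand-ins, for illustration only:

```python
# Each node of the hierarchy carries its own classifier; a test document
# is classified at the root, then handed to the chosen child classifier.
class Node:
    def __init__(self, classify, children=None):
        self.classify = classify        # doc -> child label (or leaf topic)
        self.children = children or {}  # label -> Node

def route(doc, node):
    """Send a test document down the tree, collecting the topic path."""
    label = node.classify(doc)
    if label in node.children:
        return [label] + route(doc, node.children[label])
    return [label]

# Trivial stand-in classifiers keyed on single words (hypothetical topics).
root = Node(
    classify=lambda d: "grain" if "wheat" in d else "oil",
    children={
        "grain": Node(lambda d: "corn" if "corn" in d else "wheat"),
        "oil": Node(lambda d: "crude" if "crude" in d else "gas"),
    },
)

print(route({"wheat", "corn"}, root))
```

The flat scheme corresponds to a single root node with no children: one feature set, one classifier, one decision.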
14/24
Results – Baseline
Without employing any probabilistic feature selection:
  The very large number of features never helps the hierarchical method perform better than the simple flat method
  It allows the more expressive KDB algorithm to overfit the training data
This leads to the need for feature selection
15/24
Results – with feature selection (1/2)
Reduce the feature set at each node
  From 1258 features to 20, and then to 10
Recall, however, that a potentially very different set of 10 or 20 features is selected at each node in the hierarchy
  Taken as a whole, the classifier actually examines a much larger set of features
16/24
Results – with feature selection (2/2)
Overall improvement in accuracy over the baseline results
  And of the hierarchical method over the flat method
Only one exception, in the case of (Hier2, KDB-1)
  This classifier was trained on only 24 instances: quite possibly a statistical anomaly
  The Hier1 dataset, which has more instances for induction, does not encounter such a problem
17/24
Results – analysis cont’d
In the case of (Flat, #features=50),
  The accuracy is very close to the "Hier" cases such as TAN and KDB
  But the complexity of the classifier-learning algorithms is not comparable
    Quadratic in the # of features (10² vs. 50²)
Conclusion
  Feature selection applied together with hierarchical classification yields far better performance than it does in the simple classification scheme
  Some simple Bayesian networks such as TAN or KDB combine well with the hierarchical classification scheme
18/24
Candidate System
Requirements
  Hierarchical classification
  Feedback of classification results – online adaptive learning
  Support for various types of data – heterogeneous data sources
Experiment environment
  Modeled after a real, implemented system, although running on different hardware
  Shares the same specification of the system structure and functions
  Training and test datasets come from the above real system
19/24
Enterprise Information Systems - KT
KMS (Knowledge Management System)
  Systematically manages the diverse knowledge produced within the company and raises its utilization, contributing to business activities
  Each person registers the information (knowledge) they produce into the system, and searches for and retrieves the information needed for their work
  Knowledge candidates: documents, meeting materials, business know-how, proposals, literature, etc.
  Management scheme: a knowledge map organized around 569 HR job categories
Integrated Document System (Groupware)
  Systematically manages the documents, e-mail, messages, and department/personal information circulating within the company
  For documents, handles every function, from drafting and approval through production, delivery, storage, and retrieval, in one unified system
  Jointly developed through a strategic alliance with Microsoft – web interface
  Also provides per-department Q&A bulletin boards and an employee-to-employee messenger
23/24
Future Work
Target baseline system
  Basic hierarchical classification with some real data
Research issues (+α)
  Hierarchy
    Utilize other implemented systems (BOW toolkit)
  Online learning
    Efficient, appropriate algorithms for adaptive learning
      Bayesian online perceptron, online Gaussian process
    Automatic expansion and extinction of the lowest level of subtopics over time
    Pre-processing of the raw corpus
  Integration of heterogeneous data types
    Text, tables, images, e-mails, specially formatted texts, etc.
The End