Text Document Categorization by Term Association, Maria-Luiza Antonie and Osmar R. Zaiane, University of Alberta, Canada. 2002 IEEE International Conference on Data Mining (ICDM’02). Presentation by Yu-Kai Lin


Text Document Categorization by Term Association

Maria-Luiza Antonie and Osmar R. Zaiane, University of Alberta, Canada

2002 IEEE International Conference on Data Mining (ICDM’02)

Presentation by Yu-Kai Lin


Outline

  Introduction
  Related work
  Building an Associative Text Classifier
  Experimental Results
  Conclusion


Introduction

  Text categorization is a necessity because of the very large number of text documents that we have to deal with daily.

  A text categorization system can be used to index documents in support of information retrieval tasks, as well as to classify e-mails, memos or web pages in a Yahoo-like manner.


Introduction (cont.)

The data classification process:

  (a) Learning: Training data are analyzed by a classification algorithm, and the learned model is represented in the form of classification rules. (Figure 1)

  (b) Classification: Test data are used to estimate the accuracy of the classification rules. (Figure 2)


Figure 1

Training data:

  name      age     income  credit_rating
  Jones     <= 30   Low     Fair
  Bill Lee  <= 30   Low     Excellent
  Fox       31..40  High    Excellent
  Lake      > 40    Med     Fair
  …         …       …       …

Training data -> Classification algorithm -> Classification rules

Classification rule:

  If age = “31…40” and income = high
  then credit_rating = excellent


Figure 2

Test data:

  name    age     income  credit_rating
  Frank   > 30    high    fair
  Sylvia  <= 30   low     fair
  Anne    31..40  high    excellent
  …       …       …       …

Classification rules

New data:

  (John, 31…40, high)  Credit rating? excellent
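
The two steps in Figures 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the rule is the one shown in Figure 1, the rows are the ones visible in Figure 2, and the names apply_rule, test_data and john are made up for this sketch.

```python
# Rule from Figure 1: if age = "31..40" and income = high then credit_rating = excellent.
def apply_rule(age, income):
    """Return the rating predicted by the single rule, or None if it does not fire."""
    if age == "31..40" and income.lower() == "high":
        return "excellent"
    return None

# Rows visible in Figure 2: (name, age, income, known credit_rating).
test_data = [
    ("Frank", "> 30", "high", "fair"),
    ("Sylvia", "<= 30", "low", "fair"),
    ("Anne", "31..40", "high", "excellent"),
]

# Step (b): estimate the accuracy of the rule on the test rows where it fires.
fired = [(apply_rule(age, inc), rating)
         for _, age, inc, rating in test_data
         if apply_rule(age, inc) is not None]
accuracy = sum(pred == actual for pred, actual in fired) / len(fired)

# If the accuracy is acceptable, the rule is applied to new data.
john = apply_rule("31..40", "high")
print(accuracy, john)
```

A single rule leaves most tuples unclassified; a real classifier would hold many rules plus a default class.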


Related Work

  Text classifier
  Association Rule Mining


Related Work (cont.)

Text classifier

  Naïve Bayesian classifier (chapter 7.4)
  ID3 (decision tree, chapter 7.3)
  C4.5 (chapter 7.6)
  K-NN (chapter 7.7.1)
  Neural Networks
  Support Vector Machines (SVM)


Related Work (cont.)

Association Rule Mining

  Association Rules Generation
  Associative classifiers


Related Work (cont.)

Association Rules Generation

  A rule has the form “X => Y”, with support s and confidence c
  Strong rules: rules that have a support and a confidence greater than given thresholds
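
As a reminder of the two measures, the support of X => Y is the fraction of transactions that contain both X and Y, and the confidence is the fraction of transactions containing X that also contain Y. The sketch below computes both on a made-up set of transactions; the function names and the toy data are assumptions of this example, not part of the presentation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def is_strong(transactions, x, y, min_support, min_confidence):
    # The slide says "greater than" the thresholds; ">=" is a common variant.
    return (support(transactions, set(x) | set(y)) > min_support
            and confidence(transactions, x, y) > min_confidence)

# Made-up toy transactions (each one is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})        # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})   # 2 of the 3 that contain bread
print(s, c)
```

Note that confidence is undefined when X never occurs; the sketch does not guard against that case.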


Related Work (cont.)

Associative classifiers

  The learning method is association rule mining
  Discover strong patterns that are associated with the class labels
  New objects are categorized by these patterns (the classifier)


Building an Associative Text Classifier

  Training Set
  Preprocessing Phase
  Association Rule Mining
  Associative Classifier
  Model Validation
  Testing Set


Building an Associative Text Classifier (cont.)

  Data collection
  Preprocessing
  Association Rules Generation
  Pruning the Set of Association Rules
  Prediction of Classes Associated with New Documents


Building an Associative Text Classifier (cont.)

  Data collection
  Preprocessing
    Weed out uninteresting words
      stopwording
      stemming
    Transform documents into transactions
      categories set C = {c1, c2, … , cm}
      term set T = {t1, t2, … , tn}
      document Di = {c1, c2, … , cm, t1, t2, … , tn}
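
The preprocessing step can be sketched as follows. The stopword list and the suffix-stripping stemmer are crude, hypothetical stand-ins chosen for this illustration (the slides do not say which stopword list or stemmer was used), and the "c:" and "t:" prefixes are just one way to keep category items and term items apart inside a transaction.

```python
import re

# Hypothetical stopword list and suffix rules, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def to_transaction(text, categories):
    """Turn a labelled document into a set of category items and term items."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = {crude_stem(w) for w in words if w not in STOPWORDS}
    # Prefix the items so that category labels and terms cannot collide.
    return {f"c:{c}" for c in categories} | {f"t:{t}" for t in terms}

doc = to_transaction("Wheat prices are rising in the export markets", ["grain"])
print(sorted(doc))
```

Each document becomes one transaction, so the whole training set can be handed to an association rule miner unchanged.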


Building an Associative Text Classifier (cont.)

Association Rules Generation

  Apriori
    Advantage
      Performance studies show its efficiency and scalability
    Drawback of using it on our transactions
      Generates a large number of association rules
      Most of them are irrelevant for classification
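
For readers who have not seen Apriori, here is a minimal level-wise sketch of its frequent-itemset phase: frequent (k-1)-itemsets are joined into size-k candidates, and a candidate is discarded as soon as one of its (k-1)-subsets is not frequent. The function name, the ">=" threshold test and the toy transactions are assumptions of this example; it is not tuned for large collections.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

# Made-up toy transactions.
transactions = [
    {"wheat", "export", "price"},
    {"wheat", "export"},
    {"wheat", "price"},
    {"oil", "price"},
]
frequent_sets = apriori(transactions, min_support=0.5)
print(sorted(sorted(s) for s in frequent_sets))
```

Rules are then read off the frequent itemsets, which is where the large number of irrelevant rules mentioned above comes from.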


ARC-BC

Association Rule-based Categorizer By Category algorithm

  Apriori-based
  Interested in rules that indicate a category label (T => ci)
  Strong rules
  Prune the rules that are of no use for categorization


ARC-BC Algorithm


ARC-BC

  category 1 -> association rules for category 1
  category i -> association rules for category i
  category n -> association rules for category n

  classifier: put the new documents in the correct class
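
The by-category idea in this diagram can be sketched as below: the documents of each category are mined separately for rules T => ci, and the rules of all categories together form the classifier. This is not the authors' algorithm, whose figure did not survive in this text. As assumptions of the sketch, term sets are enumerated by brute force up to max_terms instead of with Apriori candidate generation, each document carries a single category, support is measured inside the category, and confidence is measured over the whole training set.

```python
from itertools import combinations

def mine_rules_by_category(docs, min_support, min_confidence, max_terms=2):
    """docs: list of (category, set_of_terms).
    Returns a list of rules (terms, category, support, confidence)."""
    rules = []
    for cat in sorted({c for c, _ in docs}):
        in_cat = [terms for c, terms in docs if c == cat]
        vocab = sorted(set().union(*in_cat))
        for size in range(1, max_terms + 1):
            for combo in combinations(vocab, size):
                t = frozenset(combo)
                # Local support: share of this category's documents containing T.
                sup = sum(t <= d for d in in_cat) / len(in_cat)
                if sup < min_support:
                    continue
                # Assumed confidence: among all documents containing T,
                # the share that is labelled with this category.
                covered = [c for c, d in docs if t <= d]
                conf = covered.count(cat) / len(covered)
                if conf >= min_confidence:
                    rules.append((t, cat, sup, conf))
    return rules

# Made-up toy training set.
docs = [
    ("grain", {"wheat", "export"}),
    ("grain", {"wheat", "price"}),
    ("crude", {"oil", "price"}),
    ("crude", {"oil", "export"}),
]
classifier = mine_rules_by_category(docs, min_support=0.6, min_confidence=0.8)
print(classifier)
```

Mining per category keeps the support threshold meaningful for small categories, which would otherwise be drowned out by the large ones.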


Examples of association rules composing the classifier


Building an Associative Text Classifier (cont.)

Pruning the Set of Association Rules

  The number of rules that can be generated in the association rule mining phase could be very large
    Noisy information misleads the classification process
    Makes classification time longer
  Pruning method
    Eliminate the specific rules and keep only those that are more general and have high confidence
    Prune unnecessary rules by database coverage
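
The two pruning ideas above can be sketched as follows. The exact definition and algorithm are on slides whose figures are not reproduced in this text, so this is one plausible reading rather than the authors' procedure: a rule is treated as "more specific" when another rule for the same category uses a proper subset of its terms with at least the same confidence, and database coverage is done in the usual greedy way. All names and the toy data are hypothetical.

```python
def prune_rules(rules, docs):
    """rules: list of (terms, category, confidence) with frozenset terms.
    docs: list of (category, set_of_terms). Returns the rules that are kept."""
    # Step 1: drop a rule when a more general rule for the same category
    # (a proper subset of its terms) has at least the same confidence.
    general = [
        r for r in rules
        if not any(o[1] == r[1] and o[0] < r[0] and o[2] >= r[2] for o in rules)
    ]
    # Step 2 (database coverage): scan rules from highest confidence down and
    # keep a rule only if it correctly classifies a document not yet covered.
    general.sort(key=lambda r: (-r[2], len(r[0])))
    remaining = list(docs)
    kept = []
    for terms, cat, conf in general:
        covered = [d for d in remaining if d[0] == cat and terms <= d[1]]
        if covered:
            kept.append((terms, cat, conf))
            remaining = [d for d in remaining if d not in covered]
        if not remaining:
            break
    return kept

# Made-up toy data.
docs = [
    ("grain", {"wheat", "export"}),
    ("grain", {"wheat", "price"}),
    ("crude", {"oil", "price"}),
]
rules = [
    (frozenset({"wheat"}), "grain", 1.0),
    (frozenset({"wheat", "export"}), "grain", 1.0),
    (frozenset({"oil"}), "crude", 1.0),
    (frozenset({"price", "wheat"}), "grain", 1.0),
    (frozenset({"price"}), "crude", 0.5),
]
kept = prune_rules(rules, docs)
print(kept)
```

In the toy run the two-term rules fall to the more general "wheat" rule, and the low-confidence "price" rule is dropped because every training document is already covered.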


Building an Associative Text Classifier (cont.)

Pruning the Set of Association Rules: definition


Pruning the Set of Association Rules: algorithm


Building an Associative Text Classifier (cont.)

Prediction of Classes Associated with New Documents: algorithm
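
The algorithm figure for this slide is not reproduced in the text, so the following is only a sketch of how a rule-based classifier of this kind might label a new document. As assumptions of the sketch, a rule fires when all of its terms occur in the document, each category is scored by the average confidence of its firing rules, and the best-scoring category is returned. The rule values are made up.

```python
from collections import defaultdict

def predict(rules, doc_terms):
    """rules: list of (terms, category, confidence) with frozenset terms.
    Returns the best-scoring category, or None if no rule fires."""
    by_category = defaultdict(list)
    for terms, cat, conf in rules:
        if terms <= doc_terms:  # the rule fires: all its terms occur in the document
            by_category[cat].append(conf)
    if not by_category:
        return None
    # Assumed scoring: average confidence of the firing rules per category.
    scores = {cat: sum(c) / len(c) for cat, c in by_category.items()}
    return max(scores, key=scores.get)

# Made-up rules for two categories.
rules = [
    (frozenset({"wheat"}), "grain", 0.9),
    (frozenset({"export"}), "grain", 0.6),
    (frozenset({"oil"}), "crude", 0.95),
    (frozenset({"price"}), "crude", 0.5),
]
label = predict(rules, {"wheat", "export", "price"})
print(label)
```

A multi-label variant would return every category whose score is close enough to the best one instead of only the top category.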


Experimental results

  9,603 training documents and 3,299 testing documents


Conclusion

  The effectiveness of the associative classifier is comparable to that of most well-known text classifiers
  Relatively fast training time
  Rules generated are understandable and can easily be updated manually
  When retraining with a new document, only the concerned categories are adjusted, and the rules can be incrementally updated