B S Harish
Department of Information Science & Engineering
SJCE, Mysore, INDIA
Mail: [email protected]
https://sites.google.com/site/bsharishsjce/
Data explosion problem
Automated data collection tools, widely used database systems, a computerized society, and the Internet have led to tremendous amounts of data accumulating, and waiting to be analyzed, in databases, data warehouses, the WWW, and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.
What Is Data Mining?
Selection: Obtain data from various sources
Preprocessing: Cleanse data
Transformation: Convert to a common format; transform to the new format
Data Mining: Obtain desired results
Interpretation/Evaluation: Present results to the user in a meaningful manner
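To make the flow concrete, here is a minimal Python sketch of these five stages as composable functions. The function names and the toy "pattern" (a mean value) are illustrative assumptions, not any standard API.

```python
# A toy KDD pipeline: each stage is a plain function, chained at the end.
def select(sources):                      # Selection: gather raw records
    return [row for src in sources for row in src]

def preprocess(rows):                     # Preprocessing: drop incomplete rows
    return [r for r in rows if None not in r.values()]

def transform(rows):                      # Transformation: common (numeric) format
    return [(r["id"], float(r["value"])) for r in rows]

def mine(records):                        # Data mining: trivial "pattern" = mean value
    vals = [v for _, v in records]
    return {"mean_value": sum(vals) / len(vals)}

def interpret(result):                    # Interpretation: present to the user
    print(f"Discovered pattern: {result}")

sources = [[{"id": 1, "value": "3.5"}, {"id": 2, "value": None}],
           [{"id": 3, "value": "2.5"}]]
interpret(mine(transform(preprocess(select(sources)))))
```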
Different views lead to different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Applications of Data Mining
A case study on Text Mining
Introduction
Text mining is an emerging technology for handling the ever-increasing volume of text data
Text classification/clustering is one of the fundamental functions in text mining
Text classification/clustering divides a collection of text documents into category groups so that documents in the same group describe the same topic, such as classical music
Unlike structured data, text data poses a number of new challenges:
Volume
Dimensionality
Sparsity
Complex semantics
Security applications
Search Engines
Online Media applications
Marketing applications
Sentiment analysis
Academic applications
Text Mining Tasks
Text classification
Text clustering
Concept/entity extraction
Sentiment analysis
Document summarization
Entity relation modeling
Text Classification
Text Classification Applications
E-mail spam filtering
Categorize newspaper articles and newswires into topics
Organize Web pages into hierarchical categories
Sort journals and abstracts by subject categories (e.g., MEDLINE)
Assign international clinical codes to patient clinical records
Block Diagram of Text Classification Model
Pool of Documents → Pre-Processing (Stemming / Stopword Elimination / Linguistic Preprocessing) → Representation (BOW/VSM, Probabilistic, Ontology, Term Sequence, N-Grams, Phrase Based, Symbolic, UNL) → Classification (Supervised) or Clustering (Unsupervised) → Evaluating Results
Representation Methods
VSM/BOW Representation
Binary Representation
Ontology Representation
N-Grams
Multi-Word Terms
Latent Semantic Indexing
Locality Preserving Indexing
Universal Networking Language
Classifiers
Naïve Bayes Classifier
K-NN Classifier
Centroid Classifier
Decision Trees
Rocchio Classifier
Neural Networks
SVM
Symbolic Representation (Our Method); ACM Compute, 2010
Preprocessing and representation pipeline: Input Documents → Collection Reader → Detagger → Tokenizer → Stopword Removal → Stemming → Pruning → Representation (VSM with the TF-IDF weighting scheme, or Ontology Based)
Text Data and Representation Models
Different representation formats offer different capabilities to describe context, structure, semantics, and presentation of text content
Text Conversion
Representation Model
In information retrieval and text mining, text data of different formats is represented in a common representation model,
e.g., the Vector Space Model
Text data is converted to the model representation
Vector Space Model (VSM)
The most popular representation model used in information
retrieval and text mining.
In VSM, a text document is represented as a vector of terms <t1, t2, …, ti, …, tn>.
Each term ti represents a word or a phrase.
The set of all n unique terms in a set of text documents forms the vocabulary for the set of documents.
A set of documents is represented as a set of vectors, which can be written as a matrix.
Graphical Representation
Example (vectors in the 3-term space spanned by axes T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
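A quick Python check of the example above, using the angle (cosine) as the similarity measure; a minimal sketch in plain Python:

```python
import math

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def cosine(a, b):
    # cos(angle) = (a . b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(D1, Q))  # ~0.81: D1 points much closer to Q's direction
print(cosine(D2, Q))  # ~0.13
```

By the angle measure, D1 is the document more similar to Q.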
Document Collection
A collection of n documents can be represented in the vector
space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document or it simply doesn’t exist in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
Term Weights: Term Frequency
More frequent terms in a document are more important, i.e. more indicative of the topic.
f_ij = frequency of term i in document j
May want to normalize term frequency (tf) across the entire corpus:
tf_ij = f_ij / max{f_ij}
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of the overall topic.
df_i = document frequency of term i = number of documents containing term i
idf_i = inverse document frequency of term i = log2(N / df_i)
(N: total number of documents)
An indication of a term's discrimination power.
The log is used to dampen the effect relative to tf.
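A minimal sketch of the tf and idf formulas above in plain Python; the two-document corpus is an illustrative toy:

```python
import math
from collections import Counter

docs = [["text", "clustering", "text"], ["cricket", "game", "cricket", "cricket"]]
N = len(docs)

def tf_idf(doc, term):
    counts = Counter(doc)
    tf = counts[term] / max(counts.values())  # tf_ij = f_ij / max{f_ij}
    df = sum(1 for d in docs if term in d)    # document frequency of the term
    idf = math.log2(N / df)                   # idf_i = log2(N / df_i)
    return tf * idf

print(tf_idf(docs[0], "text"))  # 1.0 * log2(2/1) = 1.0
```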
Advantages and Disadvantages of VSM
Advantages
Simple
Easy to calculate similarity between two documents
Data mining algorithms can be applied directly to text
data
Disadvantages
Terms are assumed independent (which is not true in real text documents)
Lack of semantics
High dimensionality and sparsity
To classify documents into 4 classes:
economics, sports, science, life.
There are two approaches:
Rule-based approach
Write a set of rules that classify documents
Machine learning-based approach
Using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data
Rule-based classification
Pros:
Very accurate when rules are written by experts
Classification criteria can be easily controlled when the number of rules is small
Cons:
Sometimes rules conflict with each other
Maintenance of rules becomes more difficult as the number of rules increases
The rules have to be reconstructed when the target domain changes
Low coverage because of the wide variety of expressions
Machine Learning-based approach
Pros:
Domain independent
High predictive performance
Cons:
Classification results are not easily explainable (not accountable)
Training data required
1. Prepare a set of training data
   Attach topic information to the documents in a target domain
2. Create a classifier (model)
   Apply a machine learning tool to the data, e.g., Support Vector Machine (SVM), Maximum Entropy Models (MEM)
3. Classify new documents with the classifier
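As a sketch of this three-step workflow, assuming scikit-learn is available: the toy training set is illustrative, and LinearSVC stands in for the SVM mentioned in step 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: training data with topic labels attached.
train_docs = ["the team won the final game", "electrons orbit the nucleus",
              "healthy recipes for daily life", "players scored in the match"]
train_labels = ["sports", "science", "life", "sports"]

# Step 2: create a classifier (tf-idf features + linear SVM).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Step 3: classify new documents with the trained classifier.
print(model.predict(["the match ended with a late goal"]))  # likely ['sports']
```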
(Diagram: training documents labeled with topics (sports, science, life, …) are fed to a learner, which builds a classifier; the classifier then assigns topics such as sports, science, and life to new documents.)
(Diagram: a document d passes through feature extraction to produce a feature vector x = (f1, f2, f3, f4), e.g. features for the words "game", "play", "ball", "dance". A classifier c(·|·) scores each class, giving c(sports|x), c(science|x), c(economics|x), c(life|x), and the best classification result c(y|x) is selected.)
Case Study
Doc 1: This is a document which contains information on text clustering
Doc 2: Text clustering is the information which is presented in this document
Doc 3: Cricket is an interesting game which is played by eleven players
Doc 4: The maximum numbers of players in the game cricket are eleven
Query: Eleven players will be playing cricket game
Elimination of Stop Words
Doc 1: Document contains information text clustering
Doc 2: Text clustering information presented document
Doc 3: Cricket interesting game played eleven players
Doc 4: Maximum numbers players game cricket eleven
Query: Eleven players playing cricket game
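A minimal sketch of this step; the stop-word list here is a small illustrative subset, not any standard list:

```python
STOPWORDS = {"this", "is", "a", "which", "on", "the", "in", "an", "by", "of",
             "are", "will", "be"}

def remove_stopwords(text):
    # Keep only the words that are not in the stop-word list.
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("Eleven players will be playing cricket game"))
# ['eleven', 'players', 'playing', 'cricket', 'game']
```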
BOW for Document 1: document, contains, information, text, clustering, text, text, document, information
BOW for Document 2: text, clustering, information, presented, document, text, clustering, document
BOW for Document 3: cricket, interesting, game, played, eleven, players, cricket, game, cricket
BOW for Document 4: maximum, numbers, players, game, cricket, eleven, eleven, cricket, numbers
*Assumption: extra occurrences of terms were added to the documents so that term frequencies greater than one appear.
BOW for the Corpus
T1 : Document
T2 : contains
T3 : information
T4 : text
T5 : clustering
T6 : presented
T7 : Cricket
T8 : interesting
T9 : game
T10 : played
T11 : eleven
T12 : players
T13 : maximum
T14 : numbers
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
D1 2 1 2 3 1 0 0 0 0 0 0 0 0 0
D2 2 0 1 2 2 1 0 0 0 0 0 0 0 0
D3 0 0 0 0 0 0 3 1 2 1 1 1 0 0
D4 0 0 0 0 0 0 2 0 1 0 1 1 1 2
Document vs. Term Matrix for the Training Documents
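To make the construction concrete, a small sketch that rebuilds count rows from bags of words like the ones above (Counter does the tallying; the vocabulary order follows T1..T14):

```python
from collections import Counter

vocab = ["document", "contains", "information", "text", "clustering", "presented",
         "cricket", "interesting", "game", "played", "eleven", "players",
         "maximum", "numbers"]

bows = [
    "document contains information text clustering text text document information",
    "text clustering information presented document text clustering document",
    "cricket interesting game played eleven players cricket game cricket",
    "maximum numbers players game cricket eleven eleven cricket numbers",
]

# Each row counts how often each vocabulary term occurs in one document's bag.
matrix = [[Counter(bow.split())[t] for t in vocab] for bow in bows]
for i, row in enumerate(matrix, 1):
    print(f"D{i}", row)
# D1 [2, 1, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```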
BOW for Document Query
Query: Eleven players will be playing cricket game
BOW: eleven, players, playing, cricket, game, players, players, cricket, cricket
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
Query 0 0 0 0 0 0 3 0 1 0 0 3 0 0
Document vs. Term Matrix for the Query Document
Matching
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
D1 2 1 2 3 1 0 0 0 0 0 0 0 0 0
D2 2 0 1 2 2 1 0 0 0 0 0 0 0 0
D3 0 0 0 0 0 0 3 1 2 1 1 1 0 0
D4 0 0 0 0 0 0 2 0 1 0 1 1 1 2
Query 0 0 0 0 0 0 3 0 1 0 0 3 0 0
CosSim(D1, Q) = 0 (no shared terms)
CosSim(D2, Q) = 0 (no shared terms)
CosSim(D3, Q) = 14 / (√17 × √19) ≈ 0.78
CosSim(D4, Q) = 10 / (√12 × √19) ≈ 0.66
Ranking of Documents
Sort the similarity values in descending order and rank the documents
Sorted sequence: D3 (≈0.78), D4 (≈0.66), D2 (0), D1 (0)
Documents related to the query document, in ranked order: D3, D4, D2, D1
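A sketch that reproduces this matching and ranking in plain Python, with the vectors taken from the matrices above:

```python
import math

docs = {
    "D1": [2, 1, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "D2": [2, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "D3": [0, 0, 0, 0, 0, 0, 3, 1, 2, 1, 1, 1, 0, 0],
    "D4": [0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2],
}
query = [0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 3, 0, 0]

def cos_sim(a, b):
    # Cosine of the angle between vectors a and b.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

ranked = sorted(docs, key=lambda d: cos_sim(docs[d], query), reverse=True)
for d in ranked:
    print(d, round(cos_sim(docs[d], query), 2))
# D3 0.78, D4 0.66, D1 0.0, D2 0.0
```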
Applications in several areas: querying, clustering, identifying topics
Others: synonym recognition (TOEFL), psychology tests, essay scoring
Latent Semantic Indexing is
Latent: captures associations which are not explicit
Semantic: represents meaning as a function of similarity to other entities
Cool: lots of spiffy applications, and the potential for some good theory too
Text corpus with many documents (docs)
Given a query, find relevant docs
Classical problems:
synonymy: missing docs that refer to "automobile" when querying on "car"
polysemy: retrieving docs about the internet when querying on "surfing"
Solution: represent docs (and queries) by their underlying latent concepts
Represent each document as a word vector
Represent the corpus as a term-document matrix (T-D matrix)
A classical method: create a new vector from the query terms and find the documents with the highest dot-product
(Diagram: query and document vectors plotted in a two-word space, axes Word 1 and Word 2.)
Process the term-document (T-D) matrix to expose its statistical structure
Convert the high-dimensional space to a lower-dimensional space: throw out the noise, keep the good stuff
Related to principal component analysis (PCA) and multidimensional scaling (MDS)
U = universe of terms
n = number of terms
m = number of docs
A = n × m matrix of rank r; columns represent docs, rows represent terms
A (n × m) = U (n × r) · D (r × r) · V^T (r × m)
• LSI uses SVD, a linear analysis method:
r is the rank of A
D is the diagonal matrix of the r singular values
U and V are matrices composed of orthonormal columns
SVD is always possible, and numerical methods for SVD exist
run time: O(mnc), where c denotes the average number of words per document
A'_k (n × m) = U_k (n × k) · D_k (k × k) · V_k^T (k × m)
Keeping only the k largest singular values gives A'_k, the best rank-k approximation of the term-document matrix A.
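A minimal LSI sketch with NumPy, assuming it is available; A is a small random stand-in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(8, 5)).astype(float)  # 8 terms x 5 docs

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

k = 2                                              # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation A'_k

# Documents (and queries) can now be compared in the k-dim concept space:
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]          # k x 5: one column per doc
print(np.linalg.norm(A - A_k))                     # reconstruction error
```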
Landauer & Dumais
Perform LSI on 30,000 encyclopedia articles
Take the synonym test from TOEFL: choose the most similar word
LSI - 64.4% (52.2% corrected for guessing)
People - 64.5% (52.7% corrected for guessing)
Correlated .44 with incorrect alternatives
The model: topics are sufficiently disjoint
Each doc is drawn from a single (random) topic
Result: with high probability (whp):
Docs from the same topic will be similar
Docs from different topics will be dissimilar
K topics, each corresponding to a set of words
The sets are mutually disjoint
Below, all random choices are made uniformly at random
A corpus of m docs, each doc created as follows:
choose the length l of the doc
choose a topic T
repeat l times: with probability 1 − ε choose a word from topic T; with probability ε choose a word from the other topics
Theory should go beyond explaining (ideally)
Potential for speed-up: project the doc vectors onto a suitably small space, then perform LSI on this space
Yields O(m(n + c log² n)) compared to O(mnc)
Symbolic Representation of Text Documents
D S Guru, B S Harish and S Manjunath
Proceedings of the Third Annual ACM Bangalore Compute 2010
http://portal.acm.org/citation.cfm?id=1754306
Objective of the paper
A novel method of representing a text document by the use of interval-valued symbolic features
Classification of text documents based on the new representation
Extensive experimentation on standard datasets
Comparison with the existing models
Overview of (Isa et al., 2008)
A naïve Bayes classifier is used to vectorize the given text document
Text documents contain lists of words
The probabilistic classifier calculates the posterior probability of a particular document being annotated to a particular category by using the equation
Pr(Category | Word) = Pr(Word | Category) · Pr(Category) / Pr(Word)
Isa, D., Lee, L. H., Kallimani, V. P., and Rajkumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 23-31.
The table below shows the occurrence of words in a document and their probability distribution.

Word   Pr(Category 1)  Pr(Category 2)  Pr(Category 3)  …  Pr(Category k-1)  Pr(Category k)
w1     Pr(C1|w1)       Pr(C2|w1)       Pr(C3|w1)       …  Pr(Ck-1|w1)       Pr(Ck|w1)
w2     Pr(C1|w2)       Pr(C2|w2)       Pr(C3|w2)       …  Pr(Ck-1|w2)       Pr(Ck|w2)
w3     Pr(C1|w3)       Pr(C2|w3)       Pr(C3|w3)       …  Pr(Ck-1|w3)       Pr(Ck|w3)
…      …               …               …               …  …                 …
wn-1   Pr(C1|wn-1)     Pr(C2|wn-1)     Pr(C3|wn-1)     …  Pr(Ck-1|wn-1)     Pr(Ck|wn-1)
wn     Pr(C1|wn)       Pr(C2|wn)       Pr(C3|wn)       …  Pr(Ck-1|wn)       Pr(Ck|wn)

Total: Σi Pr(C1|wi), Σi Pr(C2|wi), Σi Pr(C3|wi), …, Σi Pr(Ck-1|wi), Σi Pr(Ck|wi)
Probability of Input Document: Σi Pr(C1|wi)/n, Σi Pr(C2|wi)/n, Σi Pr(C3|wi)/n, …, Σi Pr(Ck-1|wi)/n, Σi Pr(Ck|wi)/n
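A hedged sketch of this vectorization step: the per-word posteriors below are made-up toy numbers standing in for values learned from training counts.

```python
# Posterior Pr(Category | word) for each word (toy numbers), following the
# table above: one row per word, one entry per category (C1, C2, C3).
posteriors = {
    "ball": (0.7, 0.1, 0.2),
    "game": (0.6, 0.2, 0.2),
    "atom": (0.1, 0.8, 0.1),
}

doc = ["ball", "game", "game"]
n = len(doc)
k = 3

# "Probability of input document" per category = column totals divided by n.
totals = [sum(posteriors[w][c] for w in doc) for c in range(k)]
doc_vector = [t / n for t in totals]
print(doc_vector)  # [0.633..., 0.166..., 0.2] -> the document's feature vector
```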
Limitations
The feature values vary considerably across the training documents
Solution
Assimilating these variations together gives a good representative of each class
Thus we use Symbolic Data Analysis (SDA) to represent a text document
Advantages of using SDA:
Real-life objects can be better described
Symbolic data appear in the form of continuous ratio, discrete absolute interval, multivalued, multivalued with weightage, quantitative, categorical, etc.
Proposed Method
Let D^j = [D1^j, D2^j, D3^j, ..., Dn^j] be a set of n training documents of class Cj, j = 1, 2, 3, ..., k (k denotes the number of categories), and let Wi^j = [wi1^j, wi2^j, wi3^j, ..., wim^j] be the set of m words characterizing the document Di^j of class Cj.
Compute the probability of each word in each category
The probabilities of all words w.r.t. each category are combined to form a feature vector of length k
The process is repeated for all the documents in class Cj and for all the training documents of all other classes
The method has two stages: a Representation Stage and a Classification Stage
Symbolic Representation
To capture the intra-class variations of the class conditional probability, we consider the minimum and maximum class probability values of each class in the form of interval-type features.
The interval-valued features of class Cj are represented as [fjm−, fjm+], giving the lower and upper limits of the m-th feature value of a document class in the knowledge base, where
fjm− = min over i of {Pr(Cm | Di)}, 1 ≤ i ≤ n, and fjm+ = max over i of {Pr(Cm | Di)}, 1 ≤ i ≤ n.
The representative vector for the class Cj is formed by representing each of the k features in the form of an interval and is given by
RFj = {[fj1−, fj1+], [fj2−, fj2+], ..., [fjk−, fjk+]}
This symbolic feature vector is stored in the knowledge base as the representative of the j-th document class.
Similarly, we compute symbolic feature vectors for all the classes and store them in the knowledge base for classification.
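A sketch of building RFj, assuming each training document of the class has already been turned into a length-k probability vector (the values are illustrative):

```python
# Each training document of class Cj yields a length-k probability vector.
class_vectors = [
    [0.61, 0.20, 0.19],
    [0.55, 0.25, 0.20],
    [0.70, 0.12, 0.18],
]

# RF_j: for each of the k features, an interval [min, max] over the n docs.
RF_j = [(min(col), max(col)) for col in zip(*class_vectors)]
print(RF_j)  # [(0.55, 0.70), (0.12, 0.25), (0.18, 0.20)]
```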
Document Classification
A test document is described by a set of k crisp feature values, which are compared with the corresponding interval-type feature values of the respective classes stored in the knowledge base.
Each m-th feature value of the test document is compared with the corresponding interval in RFj to examine whether it lies within that interval.
We use the Belongingness Count Bc as a measure of the degree of belongingness of the test document, to decide whether it belongs to a class or not.
Bc = Σ (m = 1 to k) C(ftm, [fjm−, fjm+]), where
C(ftm, [fjm−, fjm+]) = 1 if (ftm ≥ fjm− and ftm ≤ fjm+), and 0 otherwise.
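A sketch of the belongingness count, reusing the RF_j intervals built above; the test vector is illustrative:

```python
RF_j = [(0.55, 0.70), (0.12, 0.25), (0.18, 0.20)]
test = [0.60, 0.30, 0.19]  # crisp feature values of the test document

# B_c: number of test features falling inside the class's intervals.
B_c = sum(1 for f, (lo, hi) in zip(test, RF_j) if lo <= f <= hi)
print(B_c)  # 2: features 1 and 3 lie within their intervals, feature 2 does not
```

The test document is assigned to the class with the highest belongingness count.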
Experimental Settings
Classification accuracy with varying training and testing sets
Comparative analysis of the proposed method with that of [2]
Text Classification
There are 3 stages: Symbolic Representation, Symbolic Feature Selection, and Document Classification
Symbolic Representation
Number of classes: 3
Number of documents per class: 4
Step 1: Represent the documents in a "term-document matrix"
'X' represents the "term-document matrix" of size (kN × t), where
k is the number of classes present (in this case k = 3),
N is the number of training documents present per class (here N = 5), and
t is the number of terms present in the training set (i.e., 15 terms).
Step 2: Apply a dimensionality reduction technique (RLPI) to the "term-document matrix"
Here 'Y' represents the "reduced term-document matrix" of size (kN × m), where
m is the number of reduced dimensions of the training set (i.e., 5 dimensions).
Step 3: Compute an interval-valued reference term frequency vector for each class on matrix Y. The class representative vectors are stored in a matrix F of size (k × m), where each feature is of type interval.
Step 4: Symbolic Feature Selection
Employ the Symbolic Feature Selection (SFS) scheme on matrix F. After employing SFS we selected Feature 3 and Feature 5 as the best features. These are stored in the knowledge base.
Step 5: Document Classification
We make use of the similarity measure proposed by Guru and Nagendraswamy (2007) to measure the similarity between the test feature vector FDQ and the j-th class representative Rj.
Results
Introduction to Data Mining
Text Mining: Issues
Text Classification: Issues
A Case Study on Text Classification
Symbolic Representation for Text Documents