Classification of Large Text Data: Machine Learning


Page 1

B S Harish
Department of Information Science & Engineering
SJCE, Mysore, INDIA

Mail: [email protected]
https://sites.google.com/site/bsharishsjce/

Page 2

Page 3

Page 4
Page 5

Data explosion problem

Automated data collection tools, widely used database systems, a computerized society, and the Internet lead to tremendous amounts of data being accumulated and/or analyzed in databases, data warehouses, the WWW, and other information repositories.

We are drowning in data, but starving for knowledge!

Solution: data warehousing and data mining

Data warehousing and on-line analytical processing

Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Page 6

What Is Data Mining?

Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.

Page 7

Selection: obtain data from various sources

Preprocessing: cleanse the data

Transformation: convert to a common format; transform to a new format

Data Mining: obtain the desired results

Interpretation/Evaluation: present the results to the user in a meaningful manner

Page 8

Different views lead to different classifications

Kinds of data to be mined

Kinds of knowledge to be discovered

Kinds of techniques utilized

Kinds of applications adapted

Page 9

Page 10

Applications of Data Mining

Page 11

A case study on Text Mining

Page 12

Introduction

Text mining is an emerging technology for handling the ever-increasing volume of text data.

Text classification/clustering is one of the fundamental functions in text mining.

Text classification/clustering divides a collection of text documents into different category groups so that documents in the same category group describe the same topic, such as classical music or Chinese music.

Page 13

Unlike structured data, text data poses a number of new challenges:

Volume

Dimensionality

Sparsity

Complex semantics

Page 14

Security applications

Search Engines

Online Media applications

Page 15

Marketing applications

Sentiment analysis

Academic applications

Page 16

Text Mining Tasks

Text classification

Text clustering

Concept/entity extraction

Sentiment analysis

Document summarization

Entity relation modeling

Text Classification

Page 17

Text Classification Applications

E-mail spam filtering

Categorize newspaper articles and newswires into topics

Organize Web pages into hierarchical categories

Sort journals and abstracts by subject categories (e.g., MEDLINE)

Assign international clinical codes to patient clinical records

Page 18

Block Diagram of Text Classification Model

Pool of Documents → Pre-Processing (stemming, stopword elimination, linguistic preprocessing) → Representation (BOW/VSM, probabilistic, ontology, term sequence, N-grams, phrase-based, symbolic, UNL) → Classification (supervised) or Clustering (unsupervised) → Evaluating Results

Page 19

Representation Methods

VSM/BOW Representation
Binary Representation
Ontology Representation
N-Grams
Multi-Word Terms
Latent Semantic Indexing
Locality Preserving Indexing
Universal Networking Language
Symbolic Representation (our method; ACM Compute, 2010)

Classifiers

Naïve Bayes Classifier
K-NN Classifier
Centroid Classifier
Decision Trees
Rocchio Classifier
Neural Networks
SVM

Page 20

Preprocessing pipeline (ontology-based representation):

Input Documents → Collection Reader → Detagger → Tokenizer → Stopword Removal → Stemming → Pruning → VSM (TF-IDF weighting scheme) → Ontology Representation

Page 21

Text Data and Representation Models

Different representation formats offer different capabilities to describe the context, structure, semantics, and presentation of text content.

Page 22

Text Conversion

In information retrieval and text mining, text data of different formats is represented in a common representation model, e.g., the Vector Space Model. Text data is converted to the model representation.

Page 23

Vector Space Model (VSM)

The most popular representation model used in information retrieval and text mining.

In VSM, a text document is represented as a vector of terms <t1, t2, ..., ti, ..., tn>. Each term ti represents a word or a phrase. The set of all n unique terms in a set of text documents forms the vocabulary for that set of documents.

A set of documents is represented as a set of vectors, which can be written as a matrix.
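As a minimal sketch of this construction (plain Python; the toy documents and names are made up for illustration):

```python
from collections import Counter

docs = [
    "text clustering groups similar text documents",
    "cricket is a game of eleven players",
]

# Tokenize, then build the vocabulary: all unique terms in the collection.
tokenized = [d.lower().split() for d in docs]
vocabulary = sorted({term for doc in tokenized for term in doc})

# Each document becomes a vector of raw term frequencies over the vocabulary.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

term_doc_matrix = [to_vector(tokens) for tokens in tokenized]
print(vocabulary)
print(term_doc_matrix)
```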

Page 24

Graphical Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

(Figure: D1, D2, and Q plotted as vectors in the three-dimensional term space with axes T1, T2, and T3.)

• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle? Projection?

Page 25

Document Collection

A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply doesn't occur in it.

      T1   T2   ...  Tt
D1    w11  w21  ...  wt1
D2    w12  w22  ...  wt2
:     :    :         :
Dn    w1n  w2n  ...  wtn

Page 26

Term Weights: Term Frequency

More frequent terms in a document are more important, i.e., more indicative of its topic.

fij = frequency of term i in document j

We may want to normalize term frequency (tf) across the entire corpus:

tfij = fij / max{fij}

Page 27

Term Weights: Inverse Document Frequency

Terms that appear in many different documents are less indicative of the overall topic.

dfi = document frequency of term i = number of documents containing term i

idfi = inverse document frequency of term i = log2(N / dfi), where N is the total number of documents

idf is an indication of a term's discrimination power; the log is used to dampen the effect relative to tf.
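A minimal sketch combining the two weights above into tf-idf (the small count matrix is made up and variable names are illustrative):

```python
import math

# term_doc_matrix: rows = documents, columns = raw term frequencies f_ij.
term_doc_matrix = [
    [2, 1, 0],
    [0, 3, 1],
]
N = len(term_doc_matrix)

def tf_idf(matrix):
    n_terms = len(matrix[0])
    # df_i: number of documents containing term i.
    df = [sum(1 for row in matrix if row[i] > 0) for i in range(n_terms)]
    weighted = []
    for row in matrix:
        max_f = max(row) or 1  # normalize tf by the largest frequency
        weighted.append([
            (f / max_f) * math.log2(N / df[i]) if df[i] else 0.0
            for i, f in enumerate(row)
        ])
    return weighted

print(tf_idf(term_doc_matrix))
```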

Page 28

Advantages and Disadvantages of VSM

Advantages:

Simple

Easy to calculate the similarity between two documents

Data mining algorithms can be applied directly to text data

Disadvantages:

Terms are assumed to be independent (which is not true in real text documents)

Lack of semantics

High dimensionality and sparsity

Page 29

To classify documents into 4 classes: economics, sports, science, life. There are two approaches:

Rule-based approach: write a set of rules that classify documents.

Machine learning-based approach: using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data.

Page 30

Rule-based classification

Pros:

Very accurate when the rules are written by experts

Classification criteria can be easily controlled when the number of rules is small

Cons:

Rules sometimes conflict with each other

Maintenance of rules becomes more difficult as the number of rules increases

The rules have to be reconstructed when the target domain changes

Low coverage because of the wide variety of expressions

Page 31

Machine learning-based approach

Pros:

Domain independent

High predictive performance

Cons:

Not accountable for classification results

Training data required

Page 32

1. Prepare a set of training data: attach topic information to the documents in a target domain.

2. Create a classifier (model): apply a machine learning tool to the data, e.g., Support Vector Machine (SVM) or Maximum Entropy Models (MEM).

3. Classify new documents with the classifier.

(Figure: training documents labelled sports, science, and life are used to build the classifier, which then assigns labels such as life and sports to new documents.)

Page 33

(Figure: feature extraction. A document d is mapped to a feature vector x = (f1, f2, f3, f4), i.e., the input vector, where the features correspond to words such as game, play, ball, and dance. A classifier c(·|·) scores each class: c(sports|x), c(science|x), c(economics|x), c(life|x). The best classification result is selected.)
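A sketch of this workflow with scikit-learn (assuming it is available; the tiny training set is invented, and multinomial Naive Bayes stands in for the SVM/MEM learners named above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Training data: documents labelled with their topic.
train_docs = [
    "the team won the match in the last over",
    "the new telescope observed a distant galaxy",
    "healthy recipes for a balanced daily diet",
]
train_labels = ["sports", "science", "life"]

# 2. Create a classifier: feature extraction + learning in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# 3. Classify new documents; the best-scoring class is selected.
print(model.predict(["players scored in the final game"]))
```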

Page 34

Case Study

Page 35

Doc 1: This is a document which contains information on text clustering

Doc 2: Text clustering is the information which is presented in this document

Doc 3: Cricket is an interesting game which is played by eleven players

Doc 4: The maximum numbers of players in the game cricket are eleven

Query: Eleven players will be playing cricket game

Page 36

Elimination of Stop Words

Doc 1: Document contains information text clustering

Doc 2: Text clustering information presented document

Doc 3: Cricket interesting game played eleven players

Doc 4: maximum numbers players game cricket eleven

Query: Eleven players playing cricket game
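A minimal sketch of this step (the stopword list below is a small hand-picked subset, not the full list the slides presumably used):

```python
# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"this", "is", "a", "which", "on", "the", "in", "an", "by",
             "are", "of", "will", "be"}

def remove_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

doc1 = "This is a document which contains information on text clustering"
print(remove_stopwords(doc1))
# ['document', 'contains', 'information', 'text', 'clustering']
```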

Page 37

BOW for Document 1: Document, contains, information, text, Clustering, Text, Text, Document, information

BOW for Document 2: Text, clustering, information, presented, document, Text, Clustering, document

BOW for Document 3: Cricket, interesting, game, played, eleven, players, Cricket, Game, cricket

BOW for Document 4: maximum, numbers, players, game, cricket, eleven, Eleven, Cricket, numbers

*Assumption: extra occurrences of terms were added to the documents so that term frequencies greater than one appear.

Page 38

BOW for the Corpus

T1: Document, T2: contains, T3: information, T4: text, T5: clustering, T6: presented, T7: Cricket, T8: interesting, T9: game, T10: played, T11: eleven, T12: players, T13: maximum, T14: numbers

Page 39

Document vs. Term Matrix for Training Documents

     T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13  T14
D1    2   1   2   3   1   0   0   0   0   0    0    0    0    0
D2    2   0   1   2   2   1   0   0   0   0    0    0    0    0
D3    0   0   0   0   0   0   3   1   2   1    1    1    0    0
D4    0   0   0   0   0   0   2   0   1   0    1    1    1    2

Page 40

BOW for the Query Document

Query: Eleven players will be playing cricket game

BOW: Eleven, players, playing, cricket, Game, players, players, cricket, cricket

Page 41

Document vs. Term Matrix for the Query Document

       T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13  T14
Query   0   0   0   0   0   0   3   0   1   0    0    3    0    0

Page 42

Matching

       T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13  T14
D1      2   1   2   3   1   0   0   0   0   0    0    0    0    0
D2      2   0   1   2   2   1   0   0   0   0    0    0    0    0
D3      0   0   0   0   0   0   3   1   2   1    1    1    0    0
D4      0   0   0   0   0   0   2   0   1   0    1    1    1    2
Query   0   0   0   0   0   0   3   0   1   0    0    3    0    0

CosSim(D1, Q) = 0 / 196 = 0
CosSim(D2, Q) = 0 / 196 = 0
CosSim(D3, Q) = 14 / 196 = 0.071
CosSim(D4, Q) = 10 / 196 = 0.051
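For comparison, a sketch of the standard cosine similarity CosSim(D, Q) = (D · Q) / (|D| |Q|) on the vectors above. Note that the slides divide by a fixed 196 rather than by the product of the vector norms, so the absolute scores differ, but the ranking D3 > D4 > D1 = D2 is the same:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

D = {
    "D1": [2, 1, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "D2": [2, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "D3": [0, 0, 0, 0, 0, 0, 3, 1, 2, 1, 1, 1, 0, 0],
    "D4": [0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2],
}
Q = [0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 3, 0, 0]

# Print the documents in descending order of similarity to the query.
for name, vec in sorted(D.items(), key=lambda kv: -cos_sim(kv[1], Q)):
    print(name, round(cos_sim(vec, Q), 3))
```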

Page 43

Ranking of Documents

Sort the values in descending order and rank the documents:

CosSim(D3, Q) = 0.071, CosSim(D4, Q) = 0.051, CosSim(D2, Q) = 0, CosSim(D1, Q) = 0

Sorted sequence: D3, D4, D2, D1

Documents related to the query document: D3, D4, D2, D1

Page 44

Page 45

Applications of LSI in several areas: querying, clustering, identifying topics

Other: synonym recognition (e.g., the TOEFL synonym test), psychology tests, essay scoring

Page 46

Latent Semantic Indexing is

Latent: captures associations which are not explicit

Semantic: represents meaning as a function of similarity to other entities

Cool: lots of spiffy applications, and the potential for some good theory too

Page 47

Text corpus with many documents (docs). Given a query, find the relevant docs.

Classical problems:

Synonymy: missing docs that refer to "automobile" when querying on "car"

Polysemy: retrieving docs about the Internet when querying on "surfing"

Solution: represent docs (and queries) by their underlying latent concepts.

Page 48

Represent each document as a word vector, and the corpus as a term-document matrix (T-D matrix).

A classical method: create a new vector from the query terms, then find the documents with the highest dot-product.

Page 49

(Figure: documents and a query plotted in a two-dimensional term space with axes Word 1 and Word 2.)

Page 50

Process the term-document (T-D) matrix to expose its statistical structure: convert the high-dimensional space to a lower-dimensional space, throw out the noise, keep the good stuff.

Related to principal component analysis (PCA) and multidimensional scaling (MDS).

Page 51

U = universe of terms
n = number of terms
m = number of docs
A = n x m matrix with rank r
  columns represent docs
  rows represent terms

Page 52

LSI uses the SVD, a linear analysis method:

A = U D V^T

where A is an m x n matrix (rows: terms; columns: documents), U is m x r, D is r x r, and V^T is r x n.

Page 53

r is the rank of A

D is the diagonal matrix of the r singular values

U and V are matrices composed of orthonormal columns

The SVD is always possible, and numerical methods for computing it exist

Run time: O(mnc), where c denotes the average number of words per document

Page 54

Keeping only the k largest singular values yields a rank-k approximation:

A'k = Uk Dk Vk^T

where A'k is m x n, Uk is m x k, Dk is k x k, and Vk^T is k x n (rows: terms; columns: documents).
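A sketch of this truncation with NumPy (the random matrix stands in for a real term-document matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))           # toy term-document matrix: 6 terms x 4 docs

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation A'_k

# Documents can now be compared in the k-dimensional latent space:
doc_coords = np.diag(s[:k]) @ Vt[:k, :]       # one k-dim column per document
print(A_k.shape, doc_coords.shape)
```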

Page 55

Landauer & Dumais: perform LSI on 30,000 encyclopedia articles, take the synonym test from TOEFL, and choose the most similar word.

LSI: 64.4% (52.2% corrected for guessing)

People: 64.5% (52.7% corrected for guessing)

Correlated .44 with incorrect alternatives

Page 56

The model: topics are sufficiently disjoint, and each doc is drawn from a single (random) topic.

Result: with high probability (whp), docs from the same topic will be similar, and docs from different topics will be dissimilar.

Page 57

K topics, each corresponding to a set of words; the sets are mutually disjoint.

Below, all random choices are made uniformly at random.

A corpus of m docs, each doc created as follows…

Page 58

Choosing a doc:

choose the length l of the doc

choose a topic T

repeat l times:

  with prob … choose a word from topic T

  with prob … choose a word from the other topics

Page 59

Theory should (ideally) go beyond explaining.

Potential for speed-up: project the doc vectors onto a suitably small space, then perform LSI on this space.

Yields O(m(n + c log² n)) compared to O(mnc).

Page 60

Page 61

Page 62

Page 63

Page 64

Symbolic Representation of Text Documents

D S Guru, B S Harish and S Manjunath

Proceedings of the Third Annual ACM Bangalore Conference (Compute 2010)

http://portal.acm.org/citation.cfm?id=1754306

Page 65

Objectives of the paper

A novel method of representing a text document by the use of interval-valued symbolic features

Classification of text documents based on the new representation

Extensive experimentation on standard datasets

Comparison with existing models

Page 66

Overview of (Isa et al., 2008)

A naïve Bayes classifier is used to vectorize a given text document. A text document contains a list of words; the probabilistic classifier calculates the posterior probability of a particular document being annotated to a particular category using the equation

Pr(Category | Word) = Pr(Word | Category) · Pr(Category) / Pr(Word)

Isa, D., Lee, L. H., Kallimani, V. P. and Rajkumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 23-31.

Page 67

The table below shows the occurrence of words in a document and its probability distribution.

Word    Category 1   Category 2   Category 3   ...  Category k-1    Category k
w1      Pr(C1|w1)    Pr(C2|w1)    Pr(C3|w1)    ...  Pr(Ck-1|w1)     Pr(Ck|w1)
w2      Pr(C1|w2)    Pr(C2|w2)    Pr(C3|w2)    ...  Pr(Ck-1|w2)     Pr(Ck|w2)
w3      Pr(C1|w3)    Pr(C2|w3)    Pr(C3|w3)    ...  Pr(Ck-1|w3)     Pr(Ck|w3)
...
wn-1    Pr(C1|wn-1)  Pr(C2|wn-1)  Pr(C3|wn-1)  ...  Pr(Ck-1|wn-1)   Pr(Ck|wn-1)
wn      Pr(C1|wn)    Pr(C2|wn)    Pr(C3|wn)    ...  Pr(Ck-1|wn)     Pr(Ck|wn)
Total   Pr(C1|W)     Pr(C2|W)     Pr(C3|W)     ...  Pr(Ck-1|W)      Pr(Ck|W)

Probability of input document: Pr(C1|W)/n, Pr(C2|W)/n, Pr(C3|W)/n, ..., Pr(Ck-1|W)/n, Pr(Ck|W)/n
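A sketch of this vectorization step: each document becomes one k-dimensional feature vector of averaged posteriors Pr(Ci|W)/n over its n words. The likelihoods and priors below are invented; a real implementation would estimate them from training counts, with smoothing:

```python
from collections import defaultdict

# Hypothetical word likelihoods Pr(word | category) and priors Pr(category).
likelihood = {
    "sports":  {"game": 0.4, "players": 0.3, "eleven": 0.2},
    "science": {"game": 0.05, "players": 0.05, "eleven": 0.05},
}
prior = {"sports": 0.5, "science": 0.5}

def posterior_vector(words):
    """Average Pr(C | w) over the words of the document, one entry per category."""
    totals = defaultdict(float)
    for w in words:
        evidence = sum(likelihood[c].get(w, 1e-6) * prior[c] for c in prior)
        for c in prior:
            totals[c] += likelihood[c].get(w, 1e-6) * prior[c] / evidence
    n = len(words)
    return {c: totals[c] / n for c in prior}   # Pr(C | W) / n, as in the table

print(posterior_vector(["game", "players", "eleven"]))
```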

Page 68

Limitations

There is a lot of variation in the feature values across training documents.

Solution

A representation that assimilates these variations will be a good representative of each class. Thus, we use Symbolic Data Analysis (SDA) to represent a text document.

Advantages of using SDA:

Real-life objects can be better described.

Symbolic data appear in the form of continuous ratio, discrete absolute interval, multivalued, multivalued with weightage, quantitative, categorical, etc.

Page 69

Proposed Method

The method has two stages: a Representation Stage and a Classification Stage.

Representation Stage:

Let [D1, D2, D3, ..., Dn] be a set of n training documents of a class Cj, j = 1, 2, 3, ..., k (k denotes the number of categories), and let Wi = [wi1, wi2, wi3, ..., wim] be the set of m words characterizing the document Di of the class Cj.

Compute the probabilities of each word in each category.

The obtained probabilities of all words w.r.t. each category are combined to form a feature vector of length k.

The process is repeated for all the documents in class Cj and also for the training documents of all other classes.

Page 70

Symbolic Representation

To capture the intra-class variations of the class conditional probability, we consider the minimum and maximum class probability values of each class, in the form of intervals.

These interval-valued features of class Cj are represented as Cj = [Cj-, Cj+], where Cj- and Cj+ represent the lower and upper limits of the feature value of a document class in the knowledge base.

Page 71

The table below shows the representative vector for different classes, where

Cj- = min{ Pr(Cj | di) }, 1 ≤ i ≤ n    and    Cj+ = max{ Pr(Cj | di) }, 1 ≤ i ≤ n

Page 72

The representative vector for the class Cj is formed by representing each of the k features in the form of an interval, and is given by

RFj = {[fj1-, fj1+], [fj2-, fj2+], ..., [fjk-, fjk+]}

This symbolic feature vector is stored in the knowledge base as the representative of the jth document class. Similarly, we compute symbolic feature vectors for all the classes and store them in the knowledge base for classification.
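A minimal sketch of building RFj from the per-document feature vectors of one class (the numbers are made up):

```python
# Rows: posterior feature vectors of the training documents of one class C_j.
class_vectors = [
    [0.70, 0.10, 0.20],
    [0.60, 0.15, 0.25],
    [0.80, 0.05, 0.15],
]

# RF_j: per-feature [min, max] intervals capturing intra-class variation.
RF_j = [(min(col), max(col)) for col in zip(*class_vectors)]
print(RF_j)   # [(0.6, 0.8), (0.05, 0.15), (0.15, 0.25)]
```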

Page 73

Document Classification

A test document is described by a set of k feature values of crisp type, which are compared with the corresponding interval-type feature values of the respective class, i.e., RFj, stored in the knowledge base. Each mth feature value of the test document is compared with the corresponding interval in RFj to examine whether it lies within that interval.

We make use of the Belongingness Count Bc as a measure of the degree of belongingness of the test document, to decide whether it belongs to the correct class or not:

Bc = Σ (m = 1 to k) C(ftm, [fjm-, fjm+]), where

C(ftm, [fjm-, fjm+]) = 1 if (ftm ≥ fjm- and ftm ≤ fjm+), and 0 otherwise
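A sketch of this count for a crisp test vector against one class's interval representative, continuing the hypothetical RFj above; the class with the highest count would be chosen:

```python
def belongingness(test_vec, rf):
    """Count how many features of the test vector fall inside the class intervals."""
    return sum(1 for f, (lo, hi) in zip(test_vec, rf) if lo <= f <= hi)

RF_j = [(0.6, 0.8), (0.05, 0.15), (0.15, 0.25)]
print(belongingness([0.65, 0.12, 0.23], RF_j))   # 3: all features inside
print(belongingness([0.50, 0.12, 0.30], RF_j))   # 1: only the second feature inside
```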

Page 74

Experimental Settings

Classification accuracy with varying training and testing sets

Comparative analysis of the proposed method with that of [2]

Page 75

Page 76

Text Classification: there are 3 stages

Symbolic Representation

Symbolic Feature Selection

Document Classification

Page 77

Symbolic Representation

Number of classes: 3
Number of documents per class: 4

Step 1: Represent the documents in a term-document matrix.

X represents the term-document matrix of size (kN x t), where k is the number of classes present (in this case, k = 3), N is the number of training documents present per class (here N = 5), and t is the number of terms present in the training set (i.e., 15 terms).

Page 78

Step 2: Apply a dimensionality reduction technique (RLPI) on the term-document matrix.

Y represents the reduced term-document matrix of size (kN x m), where m is the number of reduced dimensions of the training set (i.e., 5 dimensions).

Page 79

Page 80

Step 3: Compute an interval-valued reference term frequency vector for each class on matrix Y. The class representative vectors are stored in a matrix F of size (k x m), where each feature is of type interval.

Page 81

Step 4: Symbolic Feature Selection

Page 82

Employ the Symbolic Feature Selection (SFS) scheme on matrix F. After employing SFS, Feature 3 and Feature 5 are selected as the best features; these are stored in the knowledge base.

Page 83

Step 5: Document Classification

We make use of a similarity measure proposed by Guru and Nagendraswamy (2007) to measure the similarity between the test feature vector FDQ and the jth class representative Rj.

Page 84

Page 85

Page 86

Results

Page 87

Introduction to Data Mining

Text Mining: Issues

Text Classification: Issues

A case study on Text Classification

Symbolic Representation for Text Documents

Page 88

Page 89