CHAPTER 5

EFFECTIVE TEXT MINING USING INTELLIGENCE LEARNING BASED MCMM (ETM-ILM)

5.1 DESIGN OF PROPOSED ETM-ILM

The proposed system design of ETM-ILM is shown in Figures 5.1 and 5.2. Figure 5.1 shows the flow diagram of the proposed ETM-ILM in the training session, and Figure 5.2 shows the flow diagram in the testing session.

The functions of each component in the proposed system design are described below.

The MCMM is used for effective text classification. The proposed MCMM operates through two algorithms: a training algorithm and a testing algorithm.

In the conceptual analysis stage, the terms that appear in each STL are searched for in the given training documents. The meta data analysis (mda) value of a document is given by equation (5.1):

$$\mathrm{mda} = \frac{\text{number of matching terms in STL}}{\text{total number of terms in the document}} \tag{5.1}$$

In the classification stage, the STL field with the highest ‘mda’ value is identified, and the document is clustered under the name of that STL.
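A minimal Python sketch of equation (5.1) follows. It assumes that the document has already been tokenised into a list of terms, that matching is case-insensitive, and that every occurrence of a matching term is counted. The function name `mda` and the sample terms are illustrative only and are not taken from the thesis implementation.

```python
def mda(document_terms, stl_terms):
    """Equation (5.1): matching terms divided by total terms in the document."""
    if not document_terms:
        return 0.0  # an empty document has no matching terms
    vocabulary = {term.lower() for term in stl_terms}
    # Assumption: each occurrence of an STL term in the document counts once.
    matching = sum(1 for term in document_terms if term.lower() in vocabulary)
    return matching / len(document_terms)


# Hypothetical STL and document, for illustration only.
stl = ["network", "computer", "protocol"]
document = ["the", "Network", "uses", "a", "protocol"]
print(mda(document, stl))  # 2 matching terms out of 5 -> 0.4
```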

Figure 5.1 Flow diagram of proposed ETM-ILM in the training session
[Diagram labels: MCMM, Training Data (60%), STL, ABI]

Figure 5.2 Flow diagram of proposed ETM-ILM in the testing session
[Diagram labels: MCMM, Testing Data (40%), STL, Output]

This process continues for each training document, and each additional relevant term identified by the training algorithm is added to the corresponding STL.

In the conceptual analysis stage, the ‘mda’ value of each term that appears in every STL is calculated from the given document. In the classification stage, the STL field with the highest ‘mda’ value is identified, and the document is clustered under the name of that STL.

The performance of algorithms and techniques in any computational domain can be improved by a suitable learning method. Hence, a learning method is proposed to improve the performance of the proposed MCMM.

The proposed learning model learns conceptual terms from the MCMM. The terms learned by the proposed learning algorithm are grouped and added to the STL. Frequent updating of the conceptual terms in the STL is important for effective clustering. To learn such terms, the proposed work applies an ANN-based learning algorithm.

The proposed ANN-based unsupervised learning method is termed Analysis of Bilateral Intelligence (ABI). ABI applies the learning process to identify two equivalent terms that have the same meaning. ABI takes text documents as its data set; the required output is improved accuracy of text clustering, and the goal is error-free clustering in a shorter time.

The working model of the proposed ABI learning method is explained below.

The following sigmoidal function is applied in the proposed ABI:

$$X_A = \frac{1}{1 + e^{-x}} \tag{5.2}$$

where X_A is the output of a neuron in the hidden and output layers, and ‘x’ is the input passed from the input layer to the hidden layer. The connections between the input layer and the hidden layer carry the weights ‘r_ai’, and the values ‘s_ba’ are those computed between the hidden layer and the output layer. Here ‘b’ indexes the neurons in the output layer, ‘a’ the neurons in the hidden layer and ‘i’ the neurons in the input layer.
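The sigmoid of equation (5.2) and its use in both layers can be sketched as below. This is only an illustration under stated assumptions: each neuron applies the sigmoid to a plain weighted sum of its inputs with no bias term, `r` holds the input-to-hidden weights, and `s` is treated as the hidden-to-output weights. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Equation (5.2): X_A = 1 / (1 + e^(-x)). No overflow guard for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights):
    """Sigmoid of the weighted input sum for each neuron.

    weights[j][k] is the weight from input k to neuron j.
    """
    return [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in weights]


def forward(x, r, s):
    """Input layer -> hidden layer (weights r) -> output layer (weights s)."""
    hidden = layer_output(x, r)
    output = layer_output(hidden, s)
    return hidden, output


# Two inputs, three hidden neurons, one output neuron; all weights zero,
# so every neuron outputs sigmoid(0) = 0.5.
hidden, output = forward([1.0, 2.0], [[0.0, 0.0]] * 3, [[0.0, 0.0, 0.0]])
print(hidden, output)  # [0.5, 0.5, 0.5] [0.5]
```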

$$S_{optimum} = A^{-1} \times B \tag{5.3}$$

where

$$A = \sum_{p=1}^{P} Z_a^p Z_i^p, \quad a, i = 1, \ldots, P \tag{5.4}$$

$$B = \sum_{p=1}^{P} Z_a^p t_b^p, \quad a, b = 1, \ldots, P \tag{5.5}$$

where ‘Z^p’ is the scalar output of the hidden neuron for training data ‘p’, ‘A’ and ‘B’ are the outputs of the hidden layer and output layer respectively, ‘a’ and ‘b’ are neurons in the hidden layer and output layer, ‘i’ is a neuron in the input layer, and ‘t’ is the transaction function.
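Equations (5.3) to (5.5) can be illustrated with NumPy as follows. The sketch assumes that `Z` stores the hidden-neuron outputs with one row per training pattern `p`, and that `T` stores the corresponding values of ‘t’ with one row per pattern; reading ‘t’ as a per-pattern target value is an assumption made here for illustration. `np.linalg.solve` is used in place of an explicit inverse, which gives the same result as A^{-1} B whenever A is invertible.

```python
import numpy as np


def optimal_output_weights(Z, T):
    """Solve S_optimum = A^{-1} B with A and B built as in (5.4) and (5.5)."""
    Z = np.asarray(Z, dtype=float)  # shape (P, hidden neurons)
    T = np.asarray(T, dtype=float)  # shape (P, output neurons)
    A = Z.T @ Z                     # A[a, i] = sum over p of Z_a^p * Z_i^p
    B = Z.T @ T                     # B[a, b] = sum over p of Z_a^p * t_b^p
    return np.linalg.solve(A, B)    # shape (hidden neurons, output neurons)


# Synthetic check: build T from known weights and recover them.
rng = np.random.default_rng(0)
Z = rng.random((20, 3))
S_true = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]])
S = optimal_output_weights(Z, Z @ S_true)
print(np.allclose(S, S_true))
```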

The state vector, or simply the state, denoted by ‘x_k’, is defined as the minimal set of data that is sufficient to uniquely describe the unforced dynamical behaviour of the system; the subscript ‘k’ denotes discrete time. In other words, the state is the least amount of data on the past behaviour of the system that is needed to predict its future behaviour. Typically, the state ‘x_k’ is unknown. To estimate it, a set of observed data, denoted by the vector ‘y_k’, is used.

The RMS error (E_RMS) is then calculated by comparing the ‘R_test’ matrix with the ‘S_optimum’ matrices.

a. E_RMS < E (5.6)

The hidden layer weight matrix ‘R’ is updated as ‘R’ = ‘R_test’, and the influence of the penalty term is decreased by decreasing ‘µ’.

b. E_RMS ≥ E (5.7)

The influence of ‘µ’ is increased.

If the RMS error is not within the desired range, the process is repeated; otherwise the training process is ceased. After the successful completion of the training algorithm, sample real-time data are given as input to the system, and the system chooses the comparatively best path. This thesis used 60% of the data set for training and 40% for testing.
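Cases (a) and (b) can be sketched in Python as below. Several details are assumptions rather than statements from the thesis: ‘E’ is treated as a reference error passed in by the caller, ‘µ’ is scaled by a fixed factor of 10, the weight matrix ‘R’ is left unchanged in case (b), and the RMS error is computed over two equal-length lists of numbers. All names are hypothetical.

```python
import math


def rms_error(predicted, target):
    """Root-mean-square difference between two equal-length sequences."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(predicted))


def update_hidden_weights(r, r_test, e_rms, e_reference, mu, factor=10.0):
    """Apply case (a) or case (b) and return the (weights, mu) pair to use next."""
    if e_rms < e_reference:
        # Case (a), equation (5.6): accept the trial weights and decrease mu.
        return r_test, mu / factor
    # Case (b), equation (5.7): keep the current weights (assumed) and increase mu.
    return r, mu * factor


print(update_hidden_weights("R", "R_test", e_rms=0.1, e_reference=0.2, mu=1.0))
print(update_hidden_weights("R", "R_test", e_rms=0.3, e_reference=0.2, mu=1.0))
```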

5.2 PROPOSED ETM-ILM ALGORITHM

The proposed ETM-ILM algorithm is also executes in two

algorithms, which are training algorithm and testing algorithm. The

detailed ETM-ILM training algorithm is described below.

5.2.1 Training Algorithm

Step 1 : Apply preprocessing.

Step 2 : Prepare an STL for each field of study.

Step 3 : Check that the metadata stored in each STL is unique and primary data.

Step 4 : If all training documents have been read, go to Step 9; otherwise continue with Step 5.

Step 5 : Count the number of terms in the given document that match the STL, ‘m’, and the total number of terms in the given document, ‘n’.

Step 6 : Compute ‘mda’, where ‘mda’ = m / n.

Step 7 : Sort the ‘mda’ values in decreasing order and check for terms with higher ‘mda’ in the STL; if available, go to Step 8, otherwise go to Step 9.

Step 8 : Add these new terms to the corresponding STL and go to Step 3.

Step 9 : Apply the ABI learning algorithm.

Step 10 : Compare the output with a minimum threshold. If the output is above the threshold, go to Step 11; otherwise go to Step 12.

Step 11 : Update the STL.

Step 12 : Go to the testing process.

5.2.2 Testing Algorithm

Step 1 : Apply preprocessing.

Step 2 : Collect the STL for each field of study from the training algorithm.

Step 3 : Check that the metadata stored in each STL is unique and primary data.

Step 4 : Apply ABI and verify the interestingness of each keyword present in the STL. If the confidence of interestingness is not at an acceptable level, the keyword may be removed from the corresponding STL.

Step 5 : Apply the input test document.

Step 6 : Read each term in every STL and count the number of terms in the given test document that match the STL, ‘m’, and the total number of terms in the given test document, ‘n’.

Step 7 : Calculate ‘mda’, where ‘mda’ = m / n.

Step 8 : Sort the ‘mda’ values in decreasing order.

Step 9 : Check the terms that have higher ‘mda’.

Step 10 : Check whether this highest ‘mda’ term is available in the given STL; if available, go to Step 11, otherwise go to Step 12.

Step 11 : Classify the given test document as the field of the matching STL.

Step 12 : Identify the next higher ‘mda’ term until the end of the test document and go to Step 9.
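The core of Steps 6 to 11 can be sketched in Python as below. This is a simplification, not the thesis implementation: it computes one ‘mda’ value per STL and assigns the test document to the field whose STL scores highest, instead of walking through the terms one by one. Preprocessing and the ABI check of Step 4 are omitted, and the STL contents and names are hypothetical.

```python
def classify(document_terms, stls):
    """Return (field, mda) for the best-matching STL, or (None, 0.0) if none match.

    stls maps a field name to the list of terms in its STL.
    """
    n = len(document_terms)  # total number of terms in the test document
    if n == 0 or not stls:
        return None, 0.0
    scores = {}
    for field, terms in stls.items():
        vocabulary = {term.lower() for term in terms}
        m = sum(1 for term in document_terms if term.lower() in vocabulary)
        scores[field] = m / n  # mda = m / n
    best = max(scores, key=scores.get)  # highest mda first
    if scores[best] == 0.0:
        return None, 0.0
    return best, scores[best]


stls = {
    "Computer Network": ["network", "computer", "protocol"],
    "Electrical": ["voltage", "current"],
}
print(classify(["the", "network", "uses", "a", "protocol"], stls))
# ('Computer Network', 0.4)
```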

5.3 WORKING MODEL OF ETM-ILM

A sample input file in the computer network field of domain is

shown in Table 5.1. Table 5.2 is prepared based on a frequent item set

based on ETM-ILM.

Table 5.3 shows the STL in the computer network field of

study. The non-technical terms appeared in Table 5.2 is removed in the

training algorithm of the proposed work.

110

Table 5.1 Sample Input File

Computer Network

Computer Communication is a major field of study in circuit branches. Communications between Computers are carried out through networks. The network is a collection of computers. Local Area Network is an integrated network within campus, which is also termed Intranet. Like Intranet, a collection of network is called Internet. The Internet is a world-wide computer integration based on common protocols. The protocol defines the exact applications and implementations of each task used for computer communication.

Table 5.2 Sample Frequent Term List

List of terms             Frequency
Network                   5
Computer                  5
Communication             3
Intranet                  2
Internet                  2
Integrated/integration    2
Collection                2
Protocol                  2
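A frequent term list of this kind can be produced with a simple counter, as in the Python sketch below. The tokenisation rule (lower-case alphabetic runs) and the stop-word set are assumptions for illustration. No stemming is applied, so variants that Table 5.2 groups together, such as "Integrated/integration", would be counted separately here, and the counts need not match the table.

```python
import re
from collections import Counter


def term_frequencies(text, stop_words=frozenset()):
    """Count lower-cased alphabetic tokens, skipping any stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Hypothetical sentence and stop-word set, for illustration only.
counts = term_frequencies(
    "The network links a network of computers",
    stop_words={"the", "a", "of"},
)
print(counts.most_common(1))  # [('network', 2)]
```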

Table 5.3 Sample STL File (After training algorithm)

List of terms: Computer Network
Network
Computer
Communication
Intranet
Internet
Protocol

5.4 RESULT AND ANALYSIS

The proposed ETM-ILM is implemented in MATLAB. Its performance is analysed and compared with the existing TBM, CBMM and PTM methods. The results, in terms of F-Measure and Entropy, are recorded in Tables 5.4 and 5.5 respectively.

Table 5.4 compares the F-Measure of the existing methods with that of the proposed ETM-ILM, and Table 5.5 compares their Entropy. Table 5.6 shows the improvement in F-Measure of ETM-ILM over MCMM, and Table 5.7 shows the corresponding improvement in Entropy.

Table 5.4 Comparison of F-Measure of existing vs. proposed methods

Field of Study   Data Set   TBM     CBMM    PTM     Proposed ETM-ILM
Electrical       IEEE       0.697   0.741   0.780   0.833
Electrical       ACM        0.767   0.812   0.834   0.890
Electrical       Scopus     0.724   0.807   0.828   0.887
Electronics      IEEE       0.688   0.731   0.753   0.825
Electronics      ACM        0.757   0.801   0.836   0.876
Electronics      Scopus     0.715   0.797   0.821   0.861
Civil            IEEE       0.756   0.804   0.847   0.903
Civil            ACM        0.832   0.881   0.906   0.962
Civil            Scopus     0.785   0.875   0.899   0.944
Computer         IEEE       0.746   0.793   0.835   0.892
Computer         ACM        0.821   0.869   0.898   0.950
Computer         Scopus     0.775   0.864   0.886   0.930
Mechanical       IEEE       0.736   0.783   0.807   0.880
Mechanical       ACM        0.810   0.858   0.892   0.936
Mechanical       Scopus     0.765   0.853   0.871   0.918

Table 5.5 Comparison of Entropy of existing vs. proposed methods

Field of Study   Data Set   TBM     CBMM    PTM     Proposed ETM-ILM
Electrical       IEEE       0.329   0.214   0.191   0.125
Electrical       ACM        0.317   0.178   0.160   0.116
Electrical       Scopus     0.412   0.380   0.358   0.260
Electronics      IEEE       0.325   0.211   0.202   0.123
Electronics      ACM        0.313   0.176   0.159   0.114
Electronics      Scopus     0.407   0.375   0.355   0.256
Civil            IEEE       0.357   0.232   0.213   0.136
Civil            ACM        0.344   0.193   0.174   0.125
Civil            Scopus     0.447   0.412   0.393   0.282
Computer         IEEE       0.352   0.229   0.213   0.134
Computer         ACM        0.339   0.191   0.165   0.123
Computer         Scopus     0.441   0.407   0.370   0.278
Mechanical       IEEE       0.348   0.226   0.185   0.132
Mechanical       ACM        0.335   0.188   0.167   0.122
Mechanical       Scopus     0.435   0.401   0.366   0.275

Table 5.6 Improvement of F-Measure in ETM-ILM over MCMM

Field of Study   Data Set   Proposed MCMM   Proposed ETM-ILM
Electrical       IEEE       0.823           0.833
Electrical       ACM        0.876           0.890
Electrical       Scopus     0.859           0.887
Electronics      IEEE       0.812           0.825
Electronics      ACM        0.865           0.876
Electronics      Scopus     0.848           0.861
Civil            IEEE       0.892           0.903
Civil            ACM        0.950           0.962
Civil            Scopus     0.932           0.944
Computer         IEEE       0.881           0.892
Computer         ACM        0.938           0.950
Computer         Scopus     0.919           0.930
Mechanical       IEEE       0.869           0.880
Mechanical       ACM        0.925           0.936
Mechanical       Scopus     0.907           0.918

Table 5.7 Improvement of Entropy in ETM-ILM over MCMM

Field of Study   Data Set   Proposed MCMM   Proposed ETM-ILM
Electrical       IEEE       0.143           0.125
Electrical       ACM        0.132           0.116
Electrical       Scopus     0.297           0.260
Electronics      IEEE       0.141           0.123
Electronics      ACM        0.130           0.114
Electronics      Scopus     0.293           0.256
Civil            IEEE       0.155           0.136
Civil            ACM        0.143           0.125
Civil            Scopus     0.322           0.282
Computer         IEEE       0.153           0.134
Computer         ACM        0.141           0.123
Computer         Scopus     0.318           0.278
Mechanical       IEEE       0.151           0.132
Mechanical       ACM        0.139           0.122
Mechanical       Scopus     0.314           0.275

Data sets from three web sources (IEEE, ACM and Scopus) in five fields (electrical, electronics, civil, computer and mechanical) are taken for analysis and subjected to four methods: TBM, CBMM, PTM and the proposed ETM-ILM. The results are given in Tables 5.4 to 5.7 and Figures 5.3 to 5.22.

Figure 5.3 Comparison of F-Measure on Electrical data
Figure 5.4 Comparison of F-Measure on Electronics data
Figure 5.5 Comparison of F-Measure on Civil data
Figure 5.6 Comparison of F-Measure on Computer data
Figure 5.7 Comparison of F-Measure on Mechanical data
Figure 5.8 Comparison of Entropy on Electrical data
Figure 5.9 Comparison of Entropy on Electronics data
Figure 5.10 Comparison of Entropy on Civil data
Figure 5.11 Comparison of Entropy on Computer data
Figure 5.12 Comparison of Entropy on Mechanical data
[Figures 5.3 to 5.12: x-axis Data Set (IEEE, ACM, Scopus); y-axis F-Measure or Entropy; series TBM, CBMM, PTM, Proposed ETM-ILM]

Figure 5.13 Improvement of F-Measure in ETM-ILM over MCMM on Electrical data
Figure 5.14 Improvement of F-Measure in ETM-ILM over MCMM on Electronics data
Figure 5.15 Improvement of F-Measure in ETM-ILM over MCMM on Civil data
Figure 5.16 Improvement of F-Measure in ETM-ILM over MCMM on Computer data
Figure 5.17 Improvement of F-Measure in ETM-ILM over MCMM on Mechanical data
Figure 5.18 Improvement of Entropy in ETM-ILM over MCMM on Electrical data
Figure 5.19 Improvement of Entropy in ETM-ILM over MCMM on Electronics data
Figure 5.20 Improvement of Entropy in ETM-ILM over MCMM on Civil data
Figure 5.21 Improvement of Entropy in ETM-ILM over MCMM on Computer data
Figure 5.22 Improvement of Entropy in ETM-ILM over MCMM on Mechanical data
[Figures 5.13 to 5.22: x-axis Data Set (IEEE, ACM, Scopus); y-axis F-Measure or Entropy; series Proposed MCMM, Proposed ETM-ILM]

Table 5.4 shows that the proposed ETM-ILM improves the F-Measure over the existing methods in all the fields of study. Figures 5.3 and 5.4 show that the proposed ETM-ILM performs better than the other methods on the electrical and electronics web-based data used in the present study. The same trend is obtained for the civil and computer data, as indicated in Figures 5.5 and 5.6, and for the mechanical data, as seen in Figure 5.7.

Figures 5.3 to 5.7 show that, for the electrical, electronics, civil, computer and mechanical data from all the web sources, the F-Measure is highest for the proposed ETM-ILM compared with the other three existing methods.


The F-Measure of the proposed ETM-ILM improves on that of TBM by a minimum of 16% and a maximum of 23%. It improves on that of CBMM by a minimum of 8% and a maximum of 13%, and on that of PTM by a minimum of 5% and a maximum of 10%.
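The percentages above read as relative gains over the baseline. The Python sketch below shows that calculation for the Electrical rows of Table 5.4 against TBM; the formula (proposed minus baseline, divided by baseline) is an assumption about how the percentages were obtained, and the function name is illustrative.

```python
def relative_gain(baseline, proposed):
    """Percentage increase of `proposed` over `baseline` (higher is better)."""
    return (proposed - baseline) / baseline * 100.0


# Electrical rows of Table 5.4: (data set, TBM, proposed ETM-ILM)
rows = [("IEEE", 0.697, 0.833), ("ACM", 0.767, 0.890), ("Scopus", 0.724, 0.887)]
gains = {name: round(relative_gain(tbm, etm), 1) for name, tbm, etm in rows}
print(gains)  # {'IEEE': 19.5, 'ACM': 16.0, 'Scopus': 22.5}
```

Under this reading, the Electrical rows fall between about 16% and 22.5%, which is in line with the range stated for TBM.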

Table 5.5 shows that the proposed ETM-ILM improves the Entropy over the existing methods in all the fields of study. Figures 5.8 and 5.9 show that the proposed ETM-ILM performs better than the other methods on the electrical and electronics web-based data used in the present study. The same trend is obtained for the civil and computer data, as indicated in Figures 5.10 and 5.11, and for the mechanical data, as seen in Figure 5.12.

Figures 5.8 to 5.12 show that, for the electrical, electronics, civil, computer and mechanical data, the Entropy is lowest for the proposed ETM-ILM compared with the other three existing methods.

The Entropy of the proposed ETM-ILM is lower than that of TBM by a minimum of 37% and a maximum of 64%. It is lower than that of CBMM by a minimum of 31% and a maximum of 42%, and lower than that of PTM by a minimum of 25% and a maximum of 39%.
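Because a lower Entropy is better, the improvement here reads as a relative reduction. The Python sketch below applies that reading to the Electrical/IEEE row of Table 5.5; the formula (baseline minus proposed, divided by baseline) is an assumption about how the percentages were obtained, and the names are illustrative.

```python
def relative_reduction(baseline, proposed):
    """Percentage decrease of `proposed` relative to `baseline` (lower is better)."""
    return (baseline - proposed) / baseline * 100.0


# Electrical / IEEE row of Table 5.5
baselines = {"TBM": 0.329, "CBMM": 0.214, "PTM": 0.191}
etm_ilm = 0.125
reductions = {
    method: round(relative_reduction(value, etm_ilm), 1)
    for method, value in baselines.items()
}
print(reductions)  # {'TBM': 62.0, 'CBMM': 41.6, 'PTM': 34.6}
```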

Table 5.6 shows that the proposed ETM-ILM improves the F-Measure over the proposed MCMM method in all the fields of study. Figures 5.13 and 5.14 show that the proposed ETM-ILM outperforms the proposed MCMM method on the electrical and electronics web-based data used in the present study. The same trend is obtained for the civil and computer data, as indicated in Figures 5.15 and 5.16, and for the mechanical data, as seen in Figure 5.17.

Figures 5.13 to 5.17 show that, for the electrical, electronics, civil, computer and mechanical data from all the web sources, the F-Measure of the proposed ETM-ILM is higher than that of the proposed MCMM method.

Table 5.7 shows that the proposed ETM-ILM improves the Entropy over the proposed MCMM method in all the fields of study. Figures 5.18 and 5.19 show that the proposed ETM-ILM outperforms the proposed MCMM method on the electrical and electronics web-based data used in the present study. The same trend is obtained for the civil and computer data, as indicated in Figures 5.20 and 5.21, and for the mechanical data, as seen in Figure 5.22.

Figures 5.18 to 5.22 show that, for the electrical, electronics, civil, computer and mechanical data from all the web sources, the Entropy of the proposed ETM-ILM is lower than that of the proposed MCMM method.

The improvement in F-Measure of the proposed ETM-ILM over the proposed MCMM is a minimum of 1% and a maximum of 3%. The improvement in Entropy of the proposed ETM-ILM over the proposed MCMM is a minimum of 12% and a maximum of 13%.

From the results and performance analysis, it is concluded that the proposed MCMM with the ABI learning algorithm (ETM-ILM) gives better results than existing and notable recent works in the document clustering domain.