IJESAT_2012_02_01_13

Y JAYA BABU* et al. ISSN: 2250–3676

[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY Volume - 2, Issue - 1, 79 – 84

IJESAT | Jan-Feb 2012

Available online @ http://www.ijesat.org 79

EXTRACTING SPATIAL ASSOCIATION RULES FROM THE MAXIMUM

FREQUENT ITEMSETS BASED ON BOOLEAN MATRIX

Prof. Y Jaya Babu1, G J Phani Bala2, Siva Rama Krishna T3

1Prof. & Head, Dept. of MCA, Pragati Engg. College, Andhra Pradesh, India, [email protected]

2 Asst. Prof., Dept. of IT, Pragati Engg. College, Andhra Pradesh, India, [email protected]

3 Asst. Prof., Dept. of CSE, Vishnu Inst. of Technology, Andhra Pradesh, India, [email protected]

Abstract: Mining spatial association rules is one of the most important branches of Spatial Data Mining (SDM). Because spatial data are complex, the traditional approach to extracting spatial association rules is to transform the spatial database into a general transaction database. The Apriori algorithm is currently one of the most commonly used methods for mining association rules, but it performs inefficiently on large databases. This paper proposes a new algorithm that extracts maximum frequent itemsets based on a Boolean matrix. A case study extracting the spatial association rules between land cover and terrain factors demonstrates the validity of the new algorithm. Finally, a comparison between the Apriori algorithm and the new one shows that the new algorithm improves the efficiency of extracting spatial association rules.

Index Terms: Maximum frequent itemset; Spatial association rule; Apriori algorithm

--------------------------------------------------------------------- *** ------------------------------------------------------------------------

1. INTRODUCTION

Spatial Data Mining (SDM) is a spatial decision support process that aims to extract implicit, previously unknown and potentially useful spatial and non-spatial knowledge from spatial data, including general geometry rules, spatial characteristic rules, spatial classification rules, spatial clustering rules, spatial association rules and so on [1]. A spatial association rule, also termed a spatial association location pattern [2], is a rule indicating certain association relationships among a set of spatial and non-spatial attributes of geographical objects; mining such rules is one of the most important branches of SDM. Because of the complexity of spatial data, the main idea of extracting spatial association rules is to categorize the spatial data into a transaction database and then mine that database with an association rule mining algorithm.

The Apriori algorithm [3] is currently one of the most commonly used algorithms for mining association rules, and its typical application is market basket analysis to discover customer shopping patterns [4]. Subsequently, the algorithm was extended to SDM to discover multi-level spatial association rules based on progressive refinement [5]. A shortcoming of the algorithm, however, is that it performs inefficiently on large databases, and the deficiency is more obvious for large volumes of spatial data. Although meta-rules can reduce the number of unnecessary itemsets that must be computed, the meta-rules were pre-designed and users accepted them passively. Therefore, two models were proposed to learn prior knowledge from users' interactive feedback [10]. In this paper, a new algorithm is proposed that first extracts the maximum frequent itemsets based on the Boolean matrix of the frequent length-1 itemsets generated by the Apriori algorithm, and then generates all the frequent itemsets from the maximum frequent itemsets, using the property that every nonempty subset of a frequent itemset is also frequent. Finally, the Apriori algorithm and the proposed one are compared by mining the spatial association rules between terrain factors and land cover, to validate the efficiency of the new algorithm.

2. THE COMPARISON OF THE PRINCIPLES OF

THE ALGORITHMS

The Apriori algorithm, proposed by R. Agrawal et al. in 1994, is one of the most influential algorithms for mining association rules. According to the principles described in [3], it consists of two steps: extracting all the frequent itemsets, and then generating all the strong association rules from the frequent itemsets [6]. In essence, at each level it generates the set of candidate itemsets of length (k+1) from the frequent itemsets of length k and checks their occurrence frequencies in the database to obtain the frequent itemsets of length (k+1). There are therefore two main reasons for the low efficiency of the Apriori algorithm: it must generate a large number of candidate itemsets at each level, and it must scan the database many times, once per level.

This research therefore presents a new algorithm that first mines the maximum frequent itemsets based on the Boolean matrix of the frequent length-1 itemsets. The main idea is to create a Boolean matrix with the frequent length-1 itemsets as row headings and the transaction records' IDs as column headings (Table 1). The matrix holds only two values, '1' and '0', which indicate whether or not the transaction record contains the corresponding frequent length-1 itemset. Next, the number of 1s in each column is calculated, together with the count of columns that have the same number of 1s. If that count of columns is larger than the minimum support, the number of 1s in those columns may be the size of a maximum frequent itemset; otherwise it is not considered. In this way a set of values is obtained, each of which may be the length of a maximum frequent itemset. Subsequently, for each such value, a set of candidate itemsets of that length is generated from the frequent length-1 itemsets, and the support of each candidate itemset is calculated from the Boolean matrix. If the support is larger than the minimum support, the candidate itemset is frequent; otherwise it is not. Finally, all the frequent itemsets are extracted from the maximum frequent itemsets, since every nonempty subset of a frequent itemset is also frequent. In general, the new algorithm consists of three parts:

Table 1: A part of the Boolean matrix of the frequent length-1 itemsets

2.1 Creating a Boolean Matrix According to

Frequent Length-1 Itemsets

All the frequent length-1 itemsets are generated from the transaction database using the Apriori algorithm when the database is scanned for the first time, and for each frequent length-1 itemset the IDs of all transaction records containing it are recorded in an array. A Boolean array whose length equals the number of transaction records in the database is then created for each frequent length-1 itemset. Each array holds only two values, '0' and '1': if a transaction record contains the frequent length-1 itemset, the corresponding value in the Boolean array is 1; otherwise it is 0. Finally, the Boolean matrix is constructed from the Boolean arrays of all the frequent length-1 itemsets.

1) Definition 1: The Boolean array of each frequent length-1 itemset is $I_m[N] = \{B_{T_1}, B_{T_2}, \ldots, B_{T_N}\}$, where $I_m$ is the $m$th frequent length-1 itemset; $N$ is the number of transaction records in the database; $T_n$ $(1 \le n \le N)$ is the ID of the $n$th transaction record; and the value of $B_{T_n}$ is either 0 or 1.

2) Definition 2: The Boolean matrix of the frequent length-1 itemsets is $I_{M \times N} = \{I_1[N], I_2[N], \ldots, I_M[N]\}$, where $I_m[N]$ $(1 \le m \le M)$ is the $N$-dimensional Boolean array of the $m$th frequent length-1 itemset and $M$ is the number of frequent length-1 itemsets.

3) The pseudocode of the first part (Fig. 1):

Figure 1: The pseudocode of the first part
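The first part can be illustrated with a minimal Python sketch (Python is used for brevity; it is not the language of the implementation reported in this paper). The function name `build_boolean_matrix`, the toy transactions and the use of "count >= min_support" as the frequency test are assumptions made for illustration only.

```python
from collections import Counter


def build_boolean_matrix(transactions, min_support):
    """Return (items, matrix) for the frequent length-1 itemsets.

    transactions: list of sets of items, one set per transaction record
                  (the list position plays the role of the TID).
    min_support:  minimum support as an absolute record count.
    """
    # Single scan of the database: count in how many records each item occurs.
    counts = Counter(item for record in transactions for item in record)

    # Keep the frequent length-1 itemsets (assumed test: count >= min_support).
    items = sorted(item for item, c in counts.items() if c >= min_support)

    # One Boolean array (row) per frequent item, one column per transaction.
    matrix = [[1 if item in record else 0 for record in transactions]
              for item in items]
    return items, matrix


# Hypothetical toy database of four transaction records.
transactions = [
    {"plain", "farmland"},
    {"plain", "farmland", "south"},
    {"steep", "forest"},
    {"plain", "south"},
]
items, matrix = build_boolean_matrix(transactions, min_support=2)
for item, row in zip(items, matrix):
    print(item, row)
```

In this toy case the items "steep" and "forest" occur only once and are dropped, so the matrix has three rows and four columns.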





2.2 Extracting Maximum Frequent Itemsets

from Boolean Matrix

Each column in the Boolean matrix represents one transaction record. A value of 1 in a column means that the corresponding transaction record contains the corresponding frequent length-1 itemset; a value of 0 means that it does not. The number of 1s in each column therefore indicates how many frequent length-1 itemsets the corresponding transaction record contains together. If the number of transaction records with the same number of 1s is larger than the minimum support, that number of 1s may be the size of a maximum frequent itemset; otherwise it is not considered. As a result, a set of values is obtained, each of which may be the length of a maximum frequent itemset. Then, taking these values in descending order, a series of candidate itemsets of the given length is generated from the frequent length-1 itemsets, and the support of each candidate itemset is calculated from the Boolean matrix of the frequent length-1 itemsets. If the support of a candidate itemset is larger than the minimum support, the candidate itemset is frequent; otherwise it is not. If the set of frequent itemsets obtained from the candidate itemsets is not empty, the size of those candidate itemsets is the required length of the maximum frequent itemsets. Otherwise, the operation is repeated with the next value until the set of maximum frequent itemsets is not empty. If no value yields any frequent itemset, the maximum length of the frequent itemsets is one.

1) Definition 3: $Max[n]$ is an array storing the values that may each be the length of a maximum frequent itemset, where $n$ is the size of $Max[n]$.

2) Definition 4: The set of candidate itemsets of the maximum frequent itemsets is $C = \{I_{M_1}, I_{M_2}, \ldots, I_{M_n}\}$; the corresponding Boolean matrix is therefore $CM_{n \times N} = \{I_{M_1}[N], I_{M_2}[N], \ldots, I_{M_n}[N]\}$, where $I_{M_n}$ is a candidate itemset.

3) Definition 5: The support of candidate itemset $C$ is $Support(C) = I_{M_1}[N] \text{ And } I_{M_2}[N] \text{ And } \cdots \text{ And } I_{M_n}[N]$. Fig. 3 shows an example of applying the logical Boolean operator “And” between the Boolean arrays of candidate itemsets: the result at a position is 0 if any of the arrays holds the value 0 at that position. The support count is then the number of 1s in the resulting array.

4) The pseudocode of the second part (Fig. 2):

Figure 2: The pseudocode of the second part
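The second part, as described in the text of this section, might look like the following Python sketch. It follows the description literally: lengths are shortlisted by counting columns with exactly the same number of 1s, and they are tried in descending order. The function names, the toy matrix, the "count >= min_support" test and the restriction of candidate lengths to 2 or more are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def support(rows):
    """Support count: number of columns in which every given row holds 1."""
    return sum(all(bits) for bits in zip(*rows))


def candidate_lengths(matrix, min_support):
    """Values that may be the length of a maximum frequent itemset (Max[n])."""
    ones_per_column = [sum(col) for col in zip(*matrix)]
    histogram = Counter(ones_per_column)  # number of 1s -> number of columns
    # Assumption: only lengths >= 2 are tried; length 1 is the fallback.
    return sorted((k for k, n in histogram.items()
                   if k >= 2 and n >= min_support), reverse=True)


def max_frequent_itemsets(items, matrix, min_support):
    """Try each candidate length in descending order; stop at the first hit."""
    for k in candidate_lengths(matrix, min_support):
        found = []
        for idx in combinations(range(len(items)), k):
            if support([matrix[i] for i in idx]) >= min_support:
                found.append(tuple(items[i] for i in idx))
        if found:
            return found
    # No longer itemset is frequent, so the maximum length is one.
    return [(item,) for item in items]


# Hypothetical 3 x 4 Boolean matrix of frequent length-1 itemsets.
items = ["farmland", "plain", "south"]
matrix = [
    [1, 1, 0, 0],  # farmland
    [1, 1, 0, 1],  # plain
    [0, 1, 0, 1],  # south
]
result = max_frequent_itemsets(items, matrix, min_support=2)
print(result)
```

Because supports are computed from the matrix rows, the transaction database is not rescanned while candidates are tested.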

Figure 3: The logical Boolean operator of the Boolean

arrays of the set of candidate itemsets
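As a small worked example of Definition 5, the following Python sketch applies the logical "And" element by element to three Boolean arrays and counts the 1s in the result. The helper name `boolean_and` and the array values are hypothetical and chosen only to illustrate the operation.

```python
def boolean_and(*arrays):
    """Element-wise logical And of equal-length arrays of 0/1 values."""
    return [int(all(bits)) for bits in zip(*arrays)]


# Hypothetical Boolean arrays of three frequent length-1 itemsets
# over five transaction records.
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
c = [1, 1, 1, 1, 0]

result = boolean_and(a, b, c)
print(result)       # [1, 0, 0, 1, 0]
print(sum(result))  # 2 records contain all three itemsets
```

Only the first and fourth records hold a 1 in all three arrays, so the support count of the combined itemset is 2.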

2.3 Generating All the Frequent Itemsets from

Maximum Frequent Itemsets





All the frequent itemsets can be extracted from the maximum frequent itemsets, because every nonempty subset of a frequent itemset is also frequent. The support of each frequent itemset can be calculated by Definition 5. Finally, all the strong association rules can be mined from the frequent itemsets.
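The third part can be sketched in Python as follows: enumerate every nonempty subset of each maximum frequent itemset, then keep the rules whose confidence reaches a threshold. Confidence is computed in the standard way, support(X ∪ Y) / support(X); the function names, the toy matrix and the threshold value are assumptions made for illustration.

```python
from itertools import combinations


def all_frequent_itemsets(max_itemsets):
    """Collect every nonempty subset of the given maximum frequent itemsets."""
    frequent = set()
    for itemset in max_itemsets:
        for k in range(1, len(itemset) + 1):
            frequent.update(frozenset(c) for c in combinations(itemset, k))
    return frequent


def support(itemset, items, matrix):
    """Support count via the logical And of the rows of the itemset's items."""
    rows = [matrix[items.index(i)] for i in itemset]
    return sum(all(bits) for bits in zip(*rows))


def strong_rules(frequent, items, matrix, min_confidence):
    """Return (lhs, rhs, confidence) for every rule meeting min_confidence."""
    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue
        whole = support(itemset, items, matrix)
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                confidence = whole / support(lhs, items, matrix)
                if confidence >= min_confidence:
                    rhs = tuple(sorted(itemset - set(lhs)))
                    rules.append((lhs, rhs, confidence))
    return rules


# Hypothetical inputs: a 3 x 4 Boolean matrix and two maximum frequent itemsets.
items = ["farmland", "plain", "south"]
matrix = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
max_itemsets = [("farmland", "plain"), ("plain", "south")]

frequent = all_frequent_itemsets(max_itemsets)
rules = sorted(strong_rules(frequent, items, matrix, min_confidence=0.7))
for lhs, rhs, confidence in rules:
    print(lhs, "=>", rhs, round(confidence, 2))
```

The sketch only enumerates subsets of the itemsets passed to it, so its output depends entirely on which maximum frequent itemsets the second part returned.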

3. THE IMPLEMENTATION AND COMPARISON OF

THE ALGORITHMS

In this part, a case study on extracting the spatial association rules between land cover and terrain factors is presented to validate the efficiency of the proposed algorithm. The slope and the aspect, derived from a DEM with a grid cell size of 100 m, and the land cover map, extracted from a SPOT-5 remote sensing image, were taken as the experimental datasets. The Apriori algorithm and the new one were used to mine the spatial association rules from these datasets, and the efficiency of the two algorithms was then compared and analyzed.

3.1 Spatial Data Preprocessing

Under the current approach to mining spatial association rules, spatial datasets need to be preprocessed to construct the transaction database before mining. Imam Mukhlash and Benhard Sitohang put forward a framework for spatial data preprocessing that includes feature (spatial and non-spatial) selection based on spatial parameters, dimension reduction and selection of non-spatial attributes, data categorization based on non-spatial data parameters, join operations for spatial objects based on spatial parameters, and transformation into the output form [16]. Accordingly, all the spatial datasets in the case study were preprocessed in the following three steps:

1) The preparation and preprocessing of spatial datasets: The spatial datasets in the case study included the elevation, the slope and the aspect, with a spatial resolution of 100 m, and the land cover map. The slope and the aspect were derived from the elevation, and the land cover map was derived from the SPOT-5 remote sensing image. Fig. 4 shows the flow chart of the spatial data preprocessing. Finally, all the spatial datasets were masked by the study region boundary layer to ensure the same spatial extent for each spatial dataset.

Figure 4: The flow chart of the spatial data preprocessing

2) The categorization of the attribute values for each spatial dataset: According to the spatial data preprocessing framework, the attribute values of each spatial dataset must be generalized. The elevation was therefore categorized into 5 types according to [17]: Extremely High Mountain (>5000 m), High Mountain (3500~5000 m), Middle Mountain (1000~3500 m), Low Mountain (500~1000 m), and Plain and Hill (<500 m). The slope was generalized into 4 types based on the slope steepness classification of the International Geographical Union Geomorphological Survey and Mapping Council: plain (<2°), slope (2°~6°), abrupt slope (6°~25°) and steep slope (>25°). Fig. 5 shows the compass direction of the aspect. The land cover types included river, estuarine, reservoir, built-up land, farmland, gardens, forest land, mangrove, grass land, and so on.

Figure 5: The compass direction of the aspect





3) The construction of the transaction database: After the spatial data preprocessing was completed, the transaction database was constructed by treating each grid cell as one transaction record with 5 parts: the TID (the grid cell's ID), the slope category, the aspect category, the elevation category and the land cover type. The attribute values of each grid cell were then read from all the raster datasets and classified according to the categories of the spatial datasets to construct the attribute transaction table. Finally, the Apriori algorithm and the new one were applied to extract the frequent itemsets from the constructed transaction database. The program for all the above tasks was implemented in the C# programming language (Fig. 6).

Figure 6: The procedure for extracting the spatial association rules
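The categorization and transaction-construction steps above can be sketched in Python as follows. This is an illustrative sketch, not the C# program used in the study. The function names are hypothetical; the handling of values that fall exactly on a class boundary is an assumption; the elevation classes are treated as contiguous (Middle Mountain is taken to run up to 3500 m); and the aspect is passed in as an already classified compass direction.

```python
def elevation_category(metres):
    """Map an elevation in metres to one of the 5 elevation types."""
    if metres > 5000:
        return "Extremely High Mountain"
    if metres >= 3500:
        return "High Mountain"
    if metres >= 1000:
        return "Middle Mountain"
    if metres >= 500:
        return "Low Mountain"
    return "Plain and Hill"


def slope_category(degrees):
    """Map a slope in degrees to one of the 4 slope steepness types."""
    if degrees < 2:
        return "plain"
    if degrees <= 6:
        return "slope"
    if degrees <= 25:
        return "abrupt slope"
    return "steep slope"


def make_transaction(tid, slope_deg, aspect_class, elevation_m, land_cover):
    """Build one transaction record (5 parts) for a single grid cell."""
    return {
        "TID": tid,
        "slope": slope_category(slope_deg),
        "aspect": aspect_class,
        "elevation": elevation_category(elevation_m),
        "land_cover": land_cover,
    }


# Hypothetical grid cell values.
record = make_transaction(1, 4.5, "south", 820.0, "farmland")
print(record)
```

Each record then contributes its four category values as the items of one transaction in the database that is mined.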

3.2 Comparison of the Algorithms’ Efficiencies

After the transaction database of 157155 transaction records was constructed, the procedure shown in Fig. 6 was run on a computer with a Pentium(R) Dual-Core 2.60 GHz CPU and 2 GB of memory to extract all the frequent itemsets with minimum supports of 100, 200, 400, 800, 1600, 3200, 6400, 12800 and 25600, and the spatial association rules with a minimum support of 2% and a minimum confidence of 30%. The output of the procedure is shown in Fig. 7. The runtime of the new algorithm is less than that of the Apriori algorithm for each minimum support. As the minimum support decreased, the runtimes of both algorithms increased continuously, but the runtime of the new algorithm grew much more slowly than that of the Apriori algorithm. In addition, according to the comparison of the principles of the two algorithms, the new algorithm not only reduces the number of scans of the transaction database but also decreases the number of candidate itemsets. Therefore, the algorithm proposed in this paper is more efficient than the Apriori algorithm on this dataset.

Figure 7: The comparison between the proposed algorithm and the Apriori algorithm

4. CONCLUSIONS

The paper presented a new algorithm for mining spatial association rules that first extracts the maximum frequent itemsets based on a Boolean matrix. The algorithm not only reduces the number of scans of the transaction database but also decreases the number of candidate itemsets. However, some problems with the algorithm still require further consideration. First, the Boolean matrix of the frequent length-1 itemsets may contain long runs of successive 0 values, which wastes memory to some extent; although compressing the matrix can solve this problem, decompressing it may lower the efficiency of the algorithm. Second, the algorithm lacks an evaluation of the quality of the frequent itemsets, in particular a way of interpreting and understanding their significance. Third, the auto-correlation between spatial objects is not considered in the new algorithm. These three aspects will be the focus of future work.

REFERENCES

[1] D. Li, S. Wang, and D. Li, Spatial Data Mining Theories

and Applications, Beijing: Publisher of Science, 2006, pp. 32-

36.

[2] R. Ma, Y. Pu, and X. Ma, Mining Spatial Association

Patterns from GIS Database, Beijing: Publisher of Science,

2007, pp. 68-69.





[3] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. ACM SIGMOD International Conference, pp. 208-216, 1993.

[4] J. Han and M. Kamber, Data Mining Concepts and

Techniques, Beijing: China Machine Press, 2007, pp. 149-152.

[5] K. Koperski and J. Han, “Discovery of spatial association

rules in geographic information databases,” Lecture Notes In

Computer Science, vol. 951, 1995, pp. 47-66.

[6] A. Salleb and C. Vrain, “An application of association

rules discovery to geographic information systems,” Proc. The

4th European Conference on Principles of Data Mining and

Knowledge Discovery PKDD, pp. 613-618, 2000.

[7] G. Chen, Z. He, and B. Yang, “Spatial Association Rules

Data Mining Research on Terrain Feature and Mountain

Climate Change,” Geography and Geo-Information Science,

vol. 26(1), 2010, pp. 37-40.

[8] Y. Fu and J. Han, “Meta-Rule-Guided Mining of Association Rules in Relational Databases,” Proc. Int'l Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases, pp. 39-46, 1995.

[9] C. Yuan and F. Xiong, “Meta-rule-guided Mining

Multiple-level Spatial Association Rules Based on Progressive

Refinement,” Computer Engineering, vol. 30(8), 2004, pp. 34-

36.

[10] D. Xin, X. Shen, Q. Mei, and J. Han, “Discovering Interesting Patterns Through User's Interactive Feedback,” Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, Pennsylvania, USA, August 20-23, 2006.

[11] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent

Patterns without Candidate Generation: A Frequent-Pattern

Tree Approach,” Data Mining and Knowledge Discovery,

2004, pp. 53-87.

[12] R. Ma, X. Ma, and Y. Pu, “Spatial Association Rule

Mining from GIS Database,” Journal of Remote Sensing, vol.

9(6), 2005, pp. 733-741.

[13] L. Wang, K. Xie, T. Chen, and X. Ma, “Efficient

discovery of multilevel spatial association rules using

partitions,” Information and Software Technology, vol. 47,

2005, pp. 829-840.

[14] A. J. T. Lee, R. Hong, W. Ko, W. Tsao, and H. Lin,

“Mining spatial association rules in image databases,”

Information Sciences, vol. 177, 2007, pp. 1593-1608.

[15] Y. Zhang, “Research of Frequent Itemsets Mining

Algorithm Based on 0-1 Matrix,” Computer Engineering and

Design, vol. 30(20), 2009, pp. 4662-4664.

[16] I. Mukhlash and B. Sitohang, “Spatial Data Preprocessing for Mining Spatial Association Rule with Conventional Association Mining Algorithms,” Proc. The International Conference on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, pp. 531-534, June 17-19, 2007.

[17] Physical Regionalization Working Committee of Chinese

Academy of Science, Geomorphological Regionalization of

China, Beijing: Publisher of Science, 1959.

BIOGRAPHIES

Prof. Y Jaya Babu is currently heading the department of Computer Applications, Pragati Engineering College. He is a postgraduate in Computer Science and Technology and has 18 years of teaching and research experience. His research interests include spatial data mining, web mining and data warehousing.

Mrs. G J Phani Bala is an Assistant Professor in the department of Information Technology, Pragati Engineering College. She is a graduate in Computer Science and Engineering and has 5 years of teaching and research experience. Her research interests include data mining, 2D object rendering and image processing.

Mr. Siva Rama Krishna T is an Assistant Professor in the department of Computer Science and Engineering, Vishnu Institute of Technology. He is a postgraduate in Computer Networks and has 3 years of teaching and research experience. His research interests include data mining, cloud computing and security protocols.
