Y JAYA BABU* et al. ISSN: 2250–3676
[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY Volume - 2, Issue - 1, 79 – 84
IJESAT | Jan-Feb 2012
Available online @ http://www.ijesat.org 79
EXTRACTING SPATIAL ASSOCIATION RULES FROM THE MAXIMUM
FREQUENT ITEMSETS BASED ON BOOLEAN MATRIX
Prof. Y Jaya Babu1, G J Phani Bala2, Siva Rama Krishna T3
1Prof. & Head, Dept. of MCA, Pragati Engg. College, Andhra Pradesh, India, [email protected]
2 Asst. Prof., Dept. of IT, Pragati Engg. College, Andhra Pradesh, India, [email protected]
3 Asst. Prof., Dept. of CSE, Vishnu Inst. of Technology, Andhra Pradesh, India, [email protected]
Abstract Mining spatial association rules is one of the most important branches of Spatial Data Mining (SDM). Because of the
complexity of spatial data, the traditional approach to extracting spatial association rules is to transform the spatial database into a
general transaction database. The Apriori algorithm is currently one of the most commonly used methods for mining association rules,
but its shortcoming is that it performs inefficiently on large databases. This paper proposes a new algorithm that extracts maximum
frequent itemsets based on a Boolean matrix. A case study on extracting the spatial association rules between land cover and terrain
factors is presented to validate the new algorithm. Finally, a comparison between the Apriori algorithm and the new one shows that
the new algorithm improves the efficiency of extracting spatial association rules.
Index Terms: Maximum frequent itemset; Spatial association rule; Apriori algorithm
--------------------------------------------------------------------- *** ------------------------------------------------------------------------
1. INTRODUCTION
Spatial Data Mining (SDM) is a spatial decision support
process that aims to extract implicit, previously unknown and
potentially useful spatial and non-spatial knowledge from
spatial data, including general geometry rules, spatial
characteristic rules, spatial classification rules, spatial
clustering rules, spatial association rules and so on [1]. A
spatial association rule, also termed a spatial association
location pattern [2], is a rule indicating certain association
relationships among a set of spatial and non-spatial attributes
of geographical objects; mining such rules is one of the most
important branches of SDM. Because of the complexity of
spatial data, the main idea of extracting spatial association
rules is to categorize the spatial data into a transaction
database and then mine that database with an association rule
mining algorithm.
The Apriori algorithm [3] is currently one of the most
commonly used algorithms for mining association rules; its
typical application is market basket analysis, which discovers
customer shopping patterns [4]. The algorithm was
subsequently extended to SDM to discover multi-level spatial
association rules based on progressive refinement [5]. A
shortcoming of the algorithm is that it performs inefficiently
on large databases, and the deficiency is more obvious for
large amounts of spatial data. Although meta-rules can reduce
the number of unnecessary itemsets that are computed, the
meta-rules were pre-designed and users accepted them
passively. Therefore, two models were proposed to learn prior
knowledge from users' interactive feedback [10]. This paper
proposes a new algorithm that first extracts the maximum
frequent itemsets based on the Boolean matrix of frequent
length-1 itemsets, which are generated using the Apriori
algorithm, and then generates all the frequent itemsets from
the maximum frequent itemsets, since every nonempty subset
of a frequent itemset is also frequent. Finally, the Apriori
algorithm and the proposed one are compared by mining the
spatial association rules between terrain factors and land
cover, to validate the efficiency of the new algorithm.
2. THE COMPARISON OF THE PRINCIPLES OF
THE ALGORITHMS
The Apriori algorithm, proposed by R. Agrawal et al. in 1994,
is one of the most influential algorithms for mining
association rules. According to the principles described in [3],
it consists of two steps: extracting all the frequent itemsets,
and generating all the strong association rules from the
frequent itemsets [6]. In essence, it iteratively generates the
set of candidate itemsets of length (k+1) from the frequent
itemsets of length k and checks their corresponding
occurrence frequencies in the
database to obtain the frequent itemsets of length (k+1) at
each level. There are therefore two main reasons for the low
efficiency of the Apriori algorithm: it must generate a large
number of candidate itemsets at each level, and it must scan
the database many times, once per level. This paper therefore
presents a new algorithm that first mines the maximum
frequent itemsets based on the Boolean matrix of frequent
length-1 itemsets. The main idea of the algorithm is to create a
Boolean matrix with the frequent length-1 itemsets as row
headings and the transaction records' IDs as column headings
(Table 1). The matrix holds only two values, '1' and '0', which
indicate whether or not the transaction record contains the
corresponding frequent length-1 itemset. Next, the number of
1s in each column is calculated, together with the count of
columns that have the same number of 1s. If the count of those
columns is larger than the minimum support, that number of
1s may be the size of a maximum frequent itemset; otherwise
it is not. In this way a set of values is obtained, each of which
may be the length of a maximum frequent itemset.
Subsequently, for each such value, a set of candidate itemsets
of that length is generated from the frequent length-1 itemsets,
and the support of each candidate itemset is calculated from
the Boolean matrix. If the support is larger than the minimum
support, the candidate itemset is frequent; otherwise it is not.
Finally, all the frequent itemsets are extracted from the
maximum frequent itemsets, because every nonempty subset
of a frequent itemset is also frequent. Generally speaking, the
main principles of the new algorithm include three aspects:
Table 1: A part of the Boolean matrix of the frequent
length-1 itemsets
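For reference, the level-wise search of the Apriori algorithm described at the start of this section can be sketched as follows. This is a minimal Python illustration, not the authors' implementation; the function name `apriori`, the use of absolute support counts, and the `>=` threshold test are assumptions made for the example. It makes the two costs named above visible: a candidate set is built at every level, and the whole database is scanned once per level.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset search; returns {frozenset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent.
                if all(frozenset(s) in frequent for s in combinations(union, k)):
                    candidates.add(union)
        # One full scan of the database per level to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```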
2.1 Creating a Boolean Matrix According to
Frequent Length-1 Itemsets
All the frequent length-1 itemsets are generated from the
transaction database using the Apriori algorithm during the
first scan of the database, and for each frequent length-1
itemset the IDs of all the transaction records containing it are
recorded in an array. Then a Boolean array whose length is
the number of transaction records in the database is created
for each frequent length-1 itemset. Each array holds only two
values, '0' and '1': if a transaction record contains the frequent
length-1 itemset, the corresponding value in the Boolean array
is 1; otherwise it is 0. At last, a Boolean matrix is constructed
from the Boolean arrays of all the frequent length-1 itemsets.
1) Definition 1: The Boolean array of each frequent length-1
itemset, Im[N], is {BT1, BT2, ... , BTn} (1≤n≤N), where Im is
the mth frequent length-1 itemset; N is the number of
transaction records in the database; Tn is the ID of the nth
transaction record; and the value of BTn is either 0 or 1.
2) Definition 2: The Boolean matrix of frequent length-1
itemsets, IM*N, is {I1[N], I2[N], ... , Im[N]} (1≤m≤M), where
Im[N] is the N-dimensional Boolean array of the mth frequent
length-1 itemset and M is the number of frequent length-1
itemsets.
3) The pseudo code of the first part (Fig. 1):
Figure 1: The pseudo code of the first part
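The construction described in this subsection can be sketched in Python as follows. This is an illustrative reading of the text and of Definitions 1 and 2, not a reproduction of the pseudo code in Fig. 1; the function name `build_boolean_matrix`, the list-of-lists representation, and the `>=` threshold test are assumptions.

```python
def build_boolean_matrix(transactions, min_support):
    """Return (items, matrix); matrix[m][n] is 1 when transaction n contains items[m]."""
    # Single scan: for every item, note the positions of the records containing it.
    positions = {}
    for n, record in enumerate(transactions):
        for item in set(record):
            positions.setdefault(item, []).append(n)
    # Keep only the frequent length-1 itemsets (rows of the matrix).
    items = sorted(i for i, pos in positions.items() if len(pos) >= min_support)
    matrix = []
    for item in items:
        row = [0] * len(transactions)  # one Boolean array of length N per item
        for n in positions[item]:
            row[n] = 1
        matrix.append(row)
    return items, matrix
```

Each row corresponds to one frequent length-1 itemset and each column to one transaction record, so the database does not have to be scanned again once the matrix exists.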
2.2 Extracting Maximum Frequent Itemsets
from Boolean Matrix
Each column in the Boolean matrix represents one transaction
record. A value of 1 in the column means that the transaction
record contains the corresponding frequent length-1 itemset; a
value of 0 means that it does not. The number of 1s in each
column therefore indicates how many frequent length-1
itemsets the corresponding transaction record contains
together. If the number of transaction records with the same
number of 1s is larger than the minimum support, that number
of 1s may be the size of a maximum frequent itemset;
otherwise it is not. As a result, a set of values is obtained, each
of which may be the length of a maximum frequent itemset.
Then, taking these values in descending order, a series of
candidate itemsets of the given length is generated from the
frequent length-1 itemsets, and the support of each candidate
itemset is calculated from the Boolean matrix of frequent
length-1 itemsets. If the support of a candidate itemset is
larger than the minimum support, the candidate itemset is
frequent; otherwise it is not. If the set of frequent itemsets
found among the candidates is not empty, these are the
maximum frequent itemsets and the candidate size is the
length of the maximum frequent itemset. Otherwise, the
operation is repeated with the next value until the set of
maximum frequent itemsets is not empty. If no value yields
any maximum frequent itemset, the maximum length of a
frequent itemset is one.
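The first step of this procedure, deriving the values that may be the length of a maximum frequent itemset, can be sketched as follows. The function name `candidate_lengths` is hypothetical, the `>=` threshold test is an assumption, and lengths below 2 are skipped on the assumption that length one is the fallback case described above. The sketch applies the criterion as stated in the text, counting the columns with exactly the same number of 1s.

```python
from collections import Counter

def candidate_lengths(matrix, min_support):
    """Values that may be the length of a maximum frequent itemset, longest first."""
    # Number of 1s in each column = frequent items contained in that transaction.
    ones_per_column = [sum(column) for column in zip(*matrix)]
    # Count how many columns share each number of 1s.
    columns_with_count = Counter(ones_per_column)
    return sorted(
        (k for k, n in columns_with_count.items() if k >= 2 and n >= min_support),
        reverse=True,
    )
```

For a 3 x 5 matrix whose columns contain 3, 2, 2, 2 and 3 ones, a minimum support of 2 gives the candidate lengths 3 and 2, while a minimum support of 3 leaves only the length 2.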
1) Definition 3: Max[n] is an array storing the values that may
each be the length of a maximum frequent itemset, where n is
the size of Max[n].
2) Definition 4: The set of candidate itemsets of maximum
frequent itemsets, C, is {IM1, IM2, ... , IMn}; the
corresponding Boolean matrix CMn*N is therefore {IM1[N],
IM2[N], ... , IMn[N]}, where IMn is a candidate itemset.
3) Definition 5: The support of candidate itemset C is
Support(C) = IM1[N] And IM2[N] And ... And IMn[N], where
“And” is the logical Boolean operator applied element by
element: if any operand is 0, the result is 0. The number of 1s
in the resulting array gives the number of transaction records
that contain C. Fig. 3 shows an example of the “And”
operation between the Boolean arrays of candidate itemsets.
4) The pseudo code of the second part (Fig. 2):
Figure 2: The pseudo code of the second part
Figure 3: The logical Boolean operator of the Boolean
arrays of the set of candidate itemsets
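The support calculation of Definition 5 and the descending search over candidate lengths can be sketched in Python as follows. This is a hedged illustration rather than the pseudo code of Fig. 2: the function names `support` and `maximum_frequent_itemsets` are hypothetical, support is taken to be the count of 1s left after the element-wise "And", and the `>=` threshold test is an assumption.

```python
from itertools import combinations

def support(matrix, rows):
    """Count the columns that hold a 1 in every selected row (element-wise And)."""
    return sum(all(bits) for bits in zip(*(matrix[r] for r in rows)))

def maximum_frequent_itemsets(items, matrix, lengths, min_support):
    """Try the candidate lengths in the given (descending) order.

    Returns {itemset tuple: support} for the first length that yields any
    frequent candidate, or an empty dict when none does.
    """
    for k in lengths:
        found = {}
        # Candidates: every combination of k frequent length-1 itemsets.
        for rows in combinations(range(len(items)), k):
            s = support(matrix, rows)
            if s >= min_support:
                found[tuple(items[r] for r in rows)] = s
        if found:
            return found
    return {}
```

No database scan is needed here, since every support is read off the matrix rows. The sketch follows the description literally and stops at the first length that produces a result, so frequent itemsets of a shorter length that are not subsets of the returned ones are not enumerated.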
2.3 Generating All the Frequent Itemsets from
Maximum Frequent Itemsets
All the frequent itemsets can be extracted from the maximum
frequent itemsets, because every nonempty subset of a
frequent itemset is also frequent. The support of each frequent
itemset can be calculated by Definition 5. At last, all the
strong association rules can be mined from the frequent
itemsets.
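A minimal Python sketch of this last step follows. The function names are hypothetical, and the rule test uses the standard definition of confidence, support(A and B together) divided by support(A), which the text does not spell out. Itemsets are assumed to be tuples listed in the same order as the matrix rows.

```python
from itertools import combinations

def support(matrix, rows):
    """Count the columns that hold a 1 in every selected row (element-wise And)."""
    return sum(all(bits) for bits in zip(*(matrix[r] for r in rows)))

def all_frequent_itemsets(items, matrix, maximum_itemsets):
    """Every nonempty subset of each maximum frequent itemset, with its support."""
    index = {item: m for m, item in enumerate(items)}
    frequent = {}
    for itemset in maximum_itemsets:
        for k in range(1, len(itemset) + 1):
            for subset in combinations(itemset, k):
                if subset not in frequent:
                    frequent[subset] = support(matrix, [index[i] for i in subset])
    return frequent

def strong_rules(frequent, min_confidence):
    """Rules (antecedent, consequent, confidence) meeting the confidence threshold."""
    rules = []
    for itemset, s in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(itemset, k):
                consequent = tuple(i for i in itemset if i not in antecedent)
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules
```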
3. THE IMPLEMENTATION AND COMPARISON OF
THE ALGORITHMS
In this part, a case study on extracting the spatial association
rules between land cover and terrain factors is presented to
validate the efficiency of the proposed algorithm. The slope
and the aspect derived from a DEM with a grid cell size of
100m and the land cover map extracted from the SPOT-5
remote sensing image were taken as the experimental
datasets. The Apriori algorithm and the new one were used to
mine the spatial association rules from these datasets, and the
efficiency of the two algorithms was then compared and
analyzed.
3.1 Spatial Data Preprocessing
According to the current main idea of mining spatial
association rules, spatial datasets need to be preprocessed to
construct the transaction database before the rules are mined.
Imam Mukhlash and Benhard Sitohang put forward a
framework of spatial data preprocessing that includes feature
(spatial and non-spatial) selection based on spatial parameters,
dimension reduction and selection of non-spatial attributes,
data categorization based on non-spatial data parameters, join
operations for spatial objects based on spatial parameters, and
transformation into the output form [16]. Therefore, all the
spatial datasets in the case study need to be preprocessed in
the following three aspects:
1) The preparation and preprocessing of spatial datasets: The
spatial datasets in the case study included the elevation, the
slope and the aspect with a spatial resolution of 100m, and the
land cover map. The slope and the aspect were derived from
the elevation, and the land cover map was derived from the
SPOT-5 remote sensing image. Fig. 4 shows the flow chart of
the spatial data preprocessing. At last, all the spatial datasets
were masked by the study region boundary layer to ensure the
same spatial extent for each spatial dataset.
Figure 4: The flow chart of the spatial data preprocessing
2) The categorization of the attribute values for each spatial
dataset: According to the spatial data preprocessing
framework, the attribute values of each spatial dataset must be
generalized. Therefore, the elevation was categorized into 5
types according to [17]: Extremely High Mountain (>5000m),
High Mountain (3500~5000m), Middle Mountain
(1000~3000m), Low Mountain (500~1000m), and Plain and
Hill (<500m). The slope was generalized into 4 types based on
the slope steepness classification of the International
Geographical Union Geomorphological Survey and Mapping
Council: plain (<2°), slope (2°~6°), abrupt slope (6°~25°) and
steep slope (>25°). Fig. 5 shows the compass direction of the
aspect. The land cover types included river, estuarine,
reservoir, built-up land, farmland, gardens, forest land,
mangrove, grass land, and so on.
Figure 5: The compass direction of the aspect
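As an illustration of the categorization step, the slope classes listed above can be expressed as a small Python function. The function name `slope_category` is hypothetical, and the handling of the boundary values is an assumption, since the text does not say to which class exactly 2°, 6° or 25° belongs.

```python
def slope_category(slope_degrees):
    """Map a slope in degrees to one of the four slope classes."""
    if slope_degrees < 2:
        return "plain"
    if slope_degrees < 6:       # assumption: exactly 2 degrees counts as "slope"
        return "slope"
    if slope_degrees <= 25:     # assumption: exactly 6 and 25 degrees count as "abrupt slope"
        return "abrupt slope"
    return "steep slope"
```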
3) The construction of the transaction database: After the
spatial data preprocessing was completed, each grid cell was
treated as one transaction record with 5 parts: the TID (the
grid cell's ID), the slope category, the aspect category, the
elevation category and the land cover type. The attribute
values of each grid cell were then read from all the raster
datasets and classified according to the categories of the
spatial datasets to construct the attribute transaction table. At
last, the Apriori algorithm and the new one were applied to
extract the frequent itemsets from the constructed transaction
database. The program for all the above tasks was
implemented in the C# programming language (Fig. 6).
Figure 6: The procedure of extracting the spatial
association rules
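The construction of one transaction record per grid cell can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' C# program: the function name `build_transactions` is hypothetical, the four inputs are assumed to be equally sized 2D lists of category labels, `None` is assumed to mark cells outside the study region mask, the TID is assumed to be the row-major cell index, and the `name=value` item labels are an assumed convention that keeps equal labels from different layers (for example a "plain" slope and a "Plain and Hill" elevation) distinct.

```python
def build_transactions(slope, aspect, elevation, land_cover):
    """Turn co-registered category grids into (TID, items) transaction records."""
    n_cols = len(slope[0])
    records = []
    for r in range(len(slope)):
        for c in range(n_cols):
            values = {
                "slope": slope[r][c],
                "aspect": aspect[r][c],
                "elevation": elevation[r][c],
                "land_cover": land_cover[r][c],
            }
            # Skip cells masked out in any of the layers.
            if any(v is None for v in values.values()):
                continue
            tid = r * n_cols + c  # grid cell ID in row-major order
            items = [f"{name}={value}" for name, value in values.items()]
            records.append((tid, items))
    return records
```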
3.2 Comparison of the Algorithms’ Efficiencies
After the transaction database of 157155 transaction records
was constructed, the procedure shown in Fig. 6 was run on a
computer with a Pentium(R) Dual-Core 2.60GHz CPU and
2GB of memory to extract all the frequent itemsets with
minimum supports of 100, 200, 400, 800, 1600, 3200, 6400,
12800 and 25600, and the spatial association rules with a
minimum support of 2% and a minimum confidence of 30%.
The output of the procedure is shown in Fig. 7. The runtime of
the new algorithm is less than that of the Apriori algorithm for
each minimum support. As the minimum support grew
smaller, the runtime of both algorithms increased
continuously, but the runtime of the new algorithm grew much
more slowly than that of the Apriori algorithm. Moreover,
according to the comparison of the principles of the two
algorithms, the new algorithm not only reduced the number of
scans of the transaction database but also decreased the
number of candidate itemsets. Therefore, the algorithm
proposed in this paper is superior to the Apriori algorithm.
Figure 7: The comparison between the proposed
algorithm and the Apriori algorithm
4. CONCLUSIONS
The paper presented a new algorithm for mining spatial
association rules that first extracts the maximum frequent
itemsets based on a Boolean matrix. The algorithm not only
reduced the number of scans of the transaction database, but
also decreased the number of candidate itemsets. However,
some problems with the algorithm still need further
consideration. First, the Boolean matrix of frequent length-1
itemsets may contain many successive 0 values, which wastes
memory to some extent; although compressing the matrix can
solve this problem, uncompressing it may lower the efficiency
of the algorithm. Second, the algorithm lacks an evaluation of
the quality of the frequent itemsets, in particular a way of
interpreting and understanding their significance. Third, the
auto-correlation between spatial objects is not considered in
the new algorithm. These three aspects will be emphasized in
future work.
REFERENCES
[1] D. Li, S. Wang, and D. Li, Spatial Data Mining Theories
and Applications, Beijing: Publisher of Science, 2006, pp. 32-
36.
[2] R. Ma, Y. Pu, and X. Ma, Mining Spatial Association
Patterns from GIS Database, Beijing: Publisher of Science,
2007, pp. 68-69.
[3] R. Agrawal, T. Imielinski, and A. Swami, “Mining
Association Rules Between Sets of Items in Large Databases,”
Proc. ACM-SIGMOD International Conference, pp. 208-216,
1993.
[4] J. Han and M. Kamber, Data Mining Concepts and
Techniques, Beijing: China Machine Press, 2007, pp. 149-152.
[5] K. Koperski and J. Han, “Discovery of spatial association
rules in geographic information databases,” Lecture Notes In
Computer Science, vol. 951, 1995, pp. 47-66.
[6] A. Salleb and C. Vrain, “An application of association
rules discovery to geographic information systems,” Proc. The
4th European Conference on Principles of Data Mining and
Knowledge Discovery PKDD, pp. 613-618, 2000.
[7] G. Chen, Z. He, and B. Yang, “Spatial Association Rules
Data Mining Research on Terrain Feature and Mountain
Climate Change,” Geography and Geo-Information Science,
vol. 26(1), 2010, pp. 37-40.
[8] Y. Fu and J. Han, “Meta-Rule-Guided Mining of
Association Rules in Relational Databases,” Proc. Int'l
Workshop on Integration of Knowledge Discovery with
Deductive and Object-Oriented Databases, pp. 39-46, 1995.
[9] C. Yuan and F. Xiong, “Meta-rule-guided Mining
Multiple-level Spatial Association Rules Based on Progressive
Refinement,” Computer Engineering, vol. 30(8), 2004, pp. 34-
36.
[10] D. Xin, X. Shen, Q. Mei, and J. Han, “Discovering
Interesting Patterns Through User's Interactive Feedback,”
Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and
Data Mining (KDD'06), Philadelphia, Pennsylvania, USA,
August 20-23, 2006.
[11] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent
Patterns without Candidate Generation: A Frequent-Pattern
Tree Approach,” Data Mining and Knowledge Discovery,
2004, pp. 53-87.
[12] R. Ma, X. Ma, and Y. Pu, “Spatial Association Rule
Mining from GIS Database,” Journal of Remote Sensing, vol.
9(6), 2005, pp. 733-741.
[13] L. Wang, K. Xie, T. Chen, and X. Ma, “Efficient
discovery of multilevel spatial association rules using
partitions,” Information and Software Technology, vol. 47,
2005, pp. 829-840.
[14] A. J. T. Lee, R. Hong, W. Ko, W. Tsao, and H. Lin,
“Mining spatial association rules in image databases,”
Information Sciences, vol. 177, 2007, pp. 1593-1608.
[15] Y. Zhang, “Research of Frequent Itemsets Mining
Algorithm Based on 0-1 Matrix,” Computer Engineering and
Design, vol. 30(20), 2009, pp. 4662-4664.
[16] I. Mukhlash and B. Sitohang, “Spatial Data Preprocessing
for Mining Spatial Association Rule with Conventional
Association Mining Algorithms,” Proc. The International
Conference on Electrical Engineering and Informatics,
Institute Teknologi Bandung, Indonesia, pp. 531-534, June 17-
19, 2007.
[17] Physical Regionalization Working Committee of Chinese
Academy of Science, Geomorphological Regionalization of
China, Beijing: Publisher of Science, 1959.
BIOGRAPHIES
Prof. Y Jaya Babu currently heads
the department of Computer Applications,
Pragati Engineering College. He is a
postgraduate in Computer Science and
Technology and has 18 years of teaching
and research experience. His research
interests include spatial data mining, web
mining and data warehousing.
Mrs. G J Phani Bala is an Assistant
Professor in the department of Information
Technology, Pragati Engineering College.
She is a graduate in Computer Science and
Engineering and has 5 years of teaching
and research experience. Her research
interests include data mining, 2D object
rendering and image processing.
Mr. Siva Rama Krishna T is an
Assistant Professor in the department of
Computer Science and Engineering,
Vishnu Institute of Technology. He is a
postgraduate in Computer Networks and
has 3 years of teaching and research
experience. His research interests include
data mining, cloud computing and
security protocols.