8/4/2019 Ga Parameters

http://slidepdf.com/reader/full/ga-parameters 1/10

Association Rule Mining Using Genetic Algorithm: The role of Estimation Parameters

Indira. K 1, Dr. S. Kanmani 2

1 Research Scholar, Department of Computer Science, 2 Professor, Department of Information Technology, Pondicherry Engineering College, Puducherry, India

[email protected], [email protected]

Abstract. Genetic Algorithms (GA) have emerged as practical, robust optimization and search methods for generating accurate and reliable association rules. The performance of a GA for mining association rules depends greatly on its parameters, namely population size, crossover rate, mutation rate, the fitness function adopted and the selection method. The objective of this paper is to compare the performance of the genetic algorithm for association rule mining as these parameters are varied. Tests on three datasets, namely Lenses, Iris and Haberman, indicate that the accuracy depends mainly on the fitness function, which is the key parameter of the GA. The appropriate population size depends on the size of the dataset under study. The crossover probability changes the convergence rate with minimal change in accuracy. The size of the dataset and the relationship between its attributes also play a role in achieving the optimum accuracy.

Keywords: Association rules, Genetic Algorithm, Population size, Crossover rate, Fitness function.

1  Introduction

Data mining, also referred to as knowledge discovery in databases, is the process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints and regularities) from data in a database. Data mining combines the theory and technology of several domains, including artificial intelligence, machine learning, statistics and neural networks. Association rule mining is a major area in data mining that discovers the relations between different attributes by analyzing and processing the data in the database.

Many algorithms for generating association rules have been developed over time. Some of the well-known algorithms are Apriori, Eclat and FP-Growth. Many existing algorithms traverse the database many times, so the I/O overhead and computational complexity become very high and cannot meet the requirements of large-scale database mining. A genetic algorithm is a global random search method based on the biological theory of evolution and molecular genetics. It is strongly randomized, robust and implicitly parallel, and it can search for a global optimum quickly and effectively, which makes it an effective way to deal with large-scale data sets. At present, genetic algorithm-based data mining methods have made some progress, and classification systems based on genetic algorithms have also yielded some results.

This paper analyses the mining of association rules by applying genetic algorithms. There have been several attempts at mining association rules using genetic algorithms. Robert Cattral et al. [1] describe the evolution of a hierarchy of rules using a genetic algorithm with chromosomes of varying length and macro mutations. The initial population is seeded rather than randomly selected. Manish Saggar et al. [2] propose an algorithm with binary encoding in which the fitness function is based on the confusion matrix. The individuals are represented using the Michigan approach. Roulette wheel selection is done by first normalizing the values of all candidates.

A genetic algorithm based on the concept of the strength of implication of rules was presented by Zhou et al. [3]. The properties of independence and correlation of descriptions in rules are used for the fitness calculation. Genxiang et al. [4] introduced dynamic immune evolution and the biometric mechanisms of engineering immune computing, namely immune recognition, immune memory and immune regulation, into GA for mining association rules.

Gonzales, E. et al. [5] introduced the Genetic Relation Algorithm (GRA), based on evaluating the distances between rules. The distance is calculated using both matching criteria, namely complete match and partial match. A genetic algorithm can easily converge prematurely or take too much time to converge during the evolution process. Hong Lei et al. [6] propose a GA in which the fitness function is based on predictive accuracy, comprehensibility and an interestingness factor. The selection method is based on elitist recombination.

In Haiying Ma et al. [7] the data is encoded with a gene string structure in which complex concepts are mapped to linear symbols. The fitness function measures the overall performance of the process rather than that of individual rules, since the bit strings are interpreted as a complex process. Adaptive exchange probability (Pc) and mutation probability (Pm) are adopted in that paper.

Hong Guo et al. [8] adopt an adaptive mutation rate to avoid excessive variation that causes non-convergence or convergence to a local optimum. An individual-based selection method is applied during evolution to prevent high-fitness individuals from causing early convergence through rapid growth in their numbers.

As the parameters of the genetic algorithm and the fitness function are the major areas of interest in the above studies, this paper explores the effects of the genetic parameters and the controlling variables of the fitness function on three different datasets.

A brief introduction to association rule mining and GA is given in Section 2, followed by the methodology in Section 3, which describes the basic implementation details of association rule mining with GA. Section 4 presents the parameters that decide the efficiency of the algorithm. Section 5 presents the experimental results, followed by the conclusion in the last section.


2  Association Rules and Genetic Algorithms

2.1 Association Rules

Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules.

The discovered association rules are of the form P → Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present.
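The two measures can be illustrated with a minimal Python sketch. The function name `support_confidence` and the toy transactions are assumptions made for illustration only; they are not taken from the paper's MATLAB implementation.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]


def support_confidence(
    transactions: List[Transaction],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (s, c) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Transactions that contain P (the antecedent).
    with_p = sum(1 for t in transactions if antecedent <= t)
    # Transactions that contain both P and Q.
    with_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    s = with_both / n if n else 0.0            # P(P and Q)
    c = with_both / with_p if with_p else 0.0  # P(Q | P)
    return s, c


# Toy data (hypothetical): four market-basket transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
s, c = support_confidence(transactions, frozenset({"bread"}), frozenset({"milk"}))
print(f"bread -> milk  [s={s:.2f}, c={c:.2f}]")
```

Here bread and milk occur together in 2 of 4 transactions, and bread occurs in 3, so s = 0.5 and c = 2/3.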

2.2 Genetic Algorithm

A Genetic Algorithm (GA) is a procedure used to find approximate solutions to search problems through the application of the principles of evolutionary biology. Genetic algorithms use biologically inspired techniques such as genetic inheritance, natural selection, mutation, and sexual reproduction (recombination, or crossover).

Genetic algorithms are typically implemented as computer simulations in which an optimization problem is specified. For this problem, members of a space of candidate solutions, called individuals, are represented using abstract representations called chromosomes. The GA consists of an iterative process that evolves a working set of individuals, called a population, towards an objective function, or fitness function. Traditionally, solutions are represented using fixed-length strings, especially binary strings, but alternative encodings have also been developed.

3  Methodology

The evolutionary process of a GA is a highly simplified and stylized simulation of the biological version. It starts from a population of individuals randomly generated according to some probability distribution, usually uniform, and updates this population in steps called generations. In each generation, multiple individuals are selected from the current population according to their fitness, recombined through crossover, and modified through mutation to form a new population.

A.  [Start] Generate a random population of n chromosomes.
B.  [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
C.  [New population] Create a new population by repeating the following steps until the new population is complete.
    i.   [Selection] Select two parent chromosomes from the population according to their fitness.
    ii.  [Crossover] With a crossover probability, alter the parents to form new offspring.
    iii. [Mutation] With a mutation probability, mutate the new offspring at each locus.
    iv.  [Accepting] Place the new offspring in the new population.
D.  [Replace] Use the newly generated population for a further run of the algorithm.
E.  [Test] If the end condition is satisfied, stop, and return the best solution in the current population.
F.  [Loop] Go to step B.
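Steps A to F can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the paper's implementation: the name `run_ga`, the binary tournament used for selection, the single-point crossover, the fixed generation budget as the end condition, and the toy "count the ones" fitness are all choices made here for brevity.

```python
import random


def run_ga(fitness, n_bits, pop_size, p_crossover, p_mutation, generations, rng):
    """Generic GA loop over fixed-length bit lists (requires n_bits >= 2)."""
    # A. [Start] random initial population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    # E/F. [Test]/[Loop] the end condition here is a fixed number of generations.
    for _ in range(generations):
        # B. [Fitness] evaluate every chromosome.
        scores = [fitness(ch) for ch in population]
        new_population = []
        # C. [New population] fill the next generation.
        while len(new_population) < pop_size:
            # i. [Selection] two parents, each the winner of a binary tournament.
            parents = []
            for _ in range(2):
                a, b = rng.randrange(pop_size), rng.randrange(pop_size)
                parents.append(population[a] if scores[a] >= scores[b] else population[b])
            # ii. [Crossover] single-point, applied with probability p_crossover.
            child = list(parents[0])
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_bits)
                child = parents[0][:cut] + parents[1][cut:]
            # iii. [Mutation] flip each locus with probability p_mutation.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            # iv. [Accepting]
            new_population.append(child)
        # D. [Replace]
        population = new_population
    return max(population, key=fitness)


# Toy problem (hypothetical): maximise the number of 1 bits.
best = run_ga(fitness=sum, n_bits=12, pop_size=30, p_crossover=0.5,
              p_mutation=0.01, generations=40, rng=random.Random(0))
print(best, sum(best))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing parameter settings.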

4  Parameters in Genetic Algorithm 

The GA parameters are the key components that enable the system to reach a good enough solution before a terminating condition is met.

4.1 Encoding

Encoding is the process of representing individual solutions. The most common way of encoding is binary encoding. Here each chromosome encodes a binary string in which each bit represents some characteristic of the solution. Other encoding schemes are octal, hexadecimal, permutation, value and tree encoding.
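As an illustration of fixed-length binary encoding, the sketch below maps each categorical attribute value to its index within the attribute's domain and writes that index in binary. The helper names and the attribute domains (modelled loosely on the Lenses data) are assumptions for illustration; the paper does not spell out its exact bit layout.

```python
from math import ceil, log2


def bit_width(domain):
    """Bits needed to index every value of one attribute."""
    return max(1, ceil(log2(len(domain))))


def encode(record, domains):
    """Concatenate the binary index of each attribute value."""
    bits = []
    for value, domain in zip(record, domains):
        index = domain.index(value)
        bits.extend(int(b) for b in format(index, "0{}b".format(bit_width(domain))))
    return bits


def decode(bits, domains):
    """Inverse of encode; assumes every index in bits is valid for its domain."""
    record, pos = [], 0
    for domain in domains:
        width = bit_width(domain)
        index = int("".join(str(b) for b in bits[pos:pos + width]), 2)
        record.append(domain[index])
        pos += width
    return tuple(record)


# Hypothetical attribute domains, for illustration only.
domains = [
    ["young", "pre-presbyopic", "presbyopic"],  # 3 values -> 2 bits
    ["myope", "hypermetrope"],                  # 2 values -> 1 bit
    ["no", "yes"],                              # 2 values -> 1 bit
    ["reduced", "normal"],                      # 2 values -> 1 bit
]
record = ("presbyopic", "myope", "yes", "normal")
chromosome = encode(record, domains)
print(chromosome)  # [1, 0, 0, 1, 1]
```

A domain whose size is not a power of two leaves some bit patterns unused, so operators that act below the attribute level can produce invalid chromosomes.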

4.2 Population

Population refers to the number of chromosomes taken up for optimization. A chromosome is the raw genetic information that the GA deals with. If there are too few chromosomes, the GA has few possibilities to perform crossover and only a small part of the search space is explored. On the other hand, if there are too many chromosomes, the GA slows down. The initial population generation and the population size are the two aspects of population. The initial population is either selected randomly from the data or selected with prior knowledge of the data.

The population size is calculated by 

(1)

Where = number of chromosomes in data and k is the average size of the schema of 

interest. If uniform crossover is adopted, we can most likely get by with a population size at least twice as small as the number of instances in the dataset.

4.3 Selection

During each successive generation, a proportion of the existing population is selected to breed a new generation. Individuals are selected through a fitness-based process, where fitter solutions, as measured by a fitness function, are typically more likely to be selected. Tournament, Roulette Wheel, Random, Rank and Boltzmann selection are the commonly used selection methods. Elitism and stochastic universal sampling significantly improve the GA's performance.
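Tournament selection, the method used later in the experiments, can be sketched as below. The function name and the tournament size `k` are assumptions; the paper does not state the tournament size it used.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Draw k distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)


# Toy usage (hypothetical): fitness is the number of 1 bits.
rng = random.Random(42)
population = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
parent = tournament_select(population, fitness=sum, k=2, rng=rng)
print(parent)
```

A larger `k` raises the selection pressure: with `k` equal to the population size the best individual always wins, while `k = 1` reduces to random selection.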

4.4 Fitness Function

A fitness function is a particular type of objective function that quantifies the optimality of a chromosome in a genetic algorithm, so that the chromosome may be ranked against all the other chromosomes [9, 10]. An ideal fitness function correlates closely with the algorithm's goal, and yet may be computed quickly. Speed of execution is very important, as a typical genetic algorithm must be iterated many times in order to produce a usable result for a non-trivial problem.

This paper adopts minimum support and minimum confidence for filtering rules. The correlative degree is then confirmed for rules that satisfy the minimum support and minimum confidence. Taking support and confidence into account together, the fitness function is defined as follows.

(2)

In the above formula, R_s + R_c = 1 (R_s ≥ 0, R_c ≥ 0), and Supp_min and Conf_min are the respective values of minimum support and minimum confidence. By all appearances, if Supp_min and Conf_min are set to higher values, then the value of the fitness function is also found to be high.
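The roles of the weights and thresholds can be illustrated with a hypothetical fitness function. The sketch below is not the paper's equation (2), whose exact form is not reproduced in this text. It only assumes the two ingredients the text does state: rules below either threshold are filtered out, and support and confidence are combined with non-negative weights that sum to one.

```python
def weighted_fitness(supp, conf, supp_min, conf_min, r_s=0.5, r_c=0.5):
    """Hypothetical fitness: threshold filter plus a weighted sum.

    Assumption for illustration only; not the formula used in the paper.
    """
    if r_s < 0 or r_c < 0 or abs(r_s + r_c - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    # Rules that miss either threshold are filtered out with zero fitness.
    if supp < supp_min or conf < conf_min:
        return 0.0
    return r_s * supp + r_c * conf


# A rule with support 0.5 and confidence 0.9 under the default thresholds of Table 1.
print(weighted_fitness(0.5, 0.9, supp_min=0.2, conf_min=0.8))
# The same rule is rejected when the support threshold is raised to 0.9.
print(weighted_fitness(0.5, 0.9, supp_min=0.9, conf_min=0.8))
```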

4.5 Crossover Operator

Crossover entails choosing two individuals to swap segments of their code, producing artificial "offspring" that are combinations of their parents. This process is intended to simulate the analogous process of recombination that occurs to chromosomes during sexual reproduction. Common forms of crossover include single-point crossover and uniform crossover. In single-point crossover, a point of exchange is set at a random location in the two individual genomes; one individual contributes all its code up to the point of crossover and the second individual contributes all its code after that point to produce an offspring. In uniform crossover, the value at any given location in the offspring's genome is either the value of one parent's genome at that location or the value of the other parent's genome at that location, chosen with 50/50 probability [8].
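Both operators are short enough to sketch directly. The function names are assumptions, and the sketch works at the bit level, whereas the experiments in Section 5 apply crossover at the attribute level.

```python
import random


def single_point_crossover(p1, p2, rng):
    """Child takes p1 up to a random cut point and p2 after it (length >= 2)."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def uniform_crossover(p1, p2, rng):
    """Each locus is copied from p1 or p2 with 50/50 probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]


rng = random.Random(7)
p1, p2 = [0] * 8, [1] * 8
child_sp = single_point_crossover(p1, p2, rng)
child_u = uniform_crossover(p1, p2, rng)
print(child_sp)  # a run of 0s followed by a run of 1s
print(child_u)   # 0s and 1s mixed locus by locus
```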

4.6 Mutation Operator 

Partial gene values of individuals are adjusted by the mutation operation [5]. This part of the genetic algorithm requires great care, because two probabilities are involved. The first, usually called Pm, is used to judge whether mutation has to be done or not. When a candidate fulfills this criterion, it is passed to a second probability, the locus probability, which determines the point of the candidate at which the mutation is applied.
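The two-stage decision can be sketched as follows. The function name is an assumption, and the locus is drawn uniformly here because the text does not specify the locus distribution.

```python
import random


def mutate(chromosome, p_m, rng):
    """Two-stage mutation: first decide whether to mutate, then pick the locus."""
    child = list(chromosome)
    # Stage 1: Pm decides whether this candidate is mutated at all.
    if rng.random() < p_m:
        # Stage 2: choose the locus (uniformly, as an assumption) and flip the bit.
        locus = rng.randrange(len(child))
        child[locus] = 1 - child[locus]
    return child


rng = random.Random(3)
original = [0, 1, 1, 0, 1]
print(mutate(original, p_m=1.0, rng=rng))  # exactly one bit differs
print(mutate(original, p_m=0.0, rng=rng))  # unchanged, as with the zero rate in Table 1
```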

4.7 Number of Generations

The generational process of mining association rules by genetic algorithm is repeated until a termination condition has been reached. Common terminating conditions are:

•  A solution is found that satisfies minimum criteria.
•  Fixed number of generations reached.
•  Allocated budget (computation time/money) reached.
•  The highest ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results.
•  Manual inspection.
•  Combinations of the above.
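Several of these conditions can be combined in one check, as in the sketch below. The function name and the `patience` parameter (the number of most recent generations that must bring no improvement) are assumptions introduced for illustration.

```python
def should_stop(best_history, max_generations, patience, target=None):
    """Combine three stopping rules over the per-generation best fitness."""
    generation = len(best_history)
    if generation == 0:
        return False
    # A solution satisfying the minimum criteria has been found.
    if target is not None and best_history[-1] >= target:
        return True
    # Fixed number of generations reached.
    if generation >= max_generations:
        return True
    # Plateau: the last `patience` generations did not beat the earlier best.
    if generation > patience:
        return max(best_history[-patience:]) <= max(best_history[:-patience])
    return False


print(should_stop([0.1, 0.2, 0.3], max_generations=100, patience=2))       # False
print(should_stop([0.5, 0.5, 0.5, 0.5], max_generations=100, patience=2))  # True
```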

5  Experimental Studies

The objective of this study is to compare the accuracy achieved on the datasets by varying the GA parameters. The chromosomes use fixed-length binary encoding. As the crossover is performed at the attribute level, the mutation rate is set to zero so as to retain the original attribute values. The selection method used is tournament selection. The fitness function adopted is as given in equation (2).

Three datasets, namely the Lenses, Haberman's Survival and Iris data sets from the UCI Machine Learning Repository, have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances, the Haberman's Survival dataset has 3 attributes and 306 instances, and the Iris dataset has 5 attributes and 150 instances. The algorithm is implemented using the MATLAB R2008a simulation package. The flow of the system is shown in the flowchart below.

Figure 1. Flow chart of the GA. Nodes: Initialize Population; Evaluate Fitness; Satisfy Constraints (Yes/No); Select Survivors; Crossover; Output Results.


The default values set for the GA parameters are given in Table 1.

Table 1. Default GA Parameters.

Parameter            Value
Population Size      Instances * 1.5
Crossover Rate       0.5
Mutation Rate        0.0
Selection Method     Tournament Selection
Minimum Support      0.2
Minimum Confidence   0.8

The accuracy and the convergence rate obtained by controlling the GA parameters are recorded in Table 2. Accuracy is the number of instances that match between the original dataset and the resulting population, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed. The population size is varied for the three datasets, from the size of the dataset to one and a half times the dataset size, while the other parameters are kept fixed.
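The two measures can be computed as in the following sketch. It rests on one reading of the definitions above: an instance counts as matched if an identical chromosome is present in the final population, and the convergence generation is the first generation from which the best fitness no longer changes. Both function names are assumptions.

```python
def accuracy(dataset, population):
    """Percentage of dataset instances found in the final population."""
    present = {tuple(ch) for ch in population}
    matched = sum(1 for row in dataset if tuple(row) in present)
    return 100.0 * matched / len(dataset)


def convergence_generation(best_history):
    """1-based generation from which the best fitness stays at its final value."""
    final = best_history[-1]
    generation = len(best_history)
    while generation > 1 and best_history[generation - 2] == final:
        generation -= 1
    return generation


# Toy data (hypothetical): 3 of the 4 instances survive in the population.
dataset = [(0, 1), (1, 1), (1, 0), (0, 0)]
population = [(0, 1), (1, 1), (1, 1), (0, 0), (0, 1)]
print(accuracy(dataset, population))                       # 75.0
print(convergence_generation([0.2, 0.5, 0.7, 0.7, 0.7]))   # 3
```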

Table 2: Comparison based on variation in population Size.

             No. of Instances                No. of Instances * 1.25         No. of Instances * 1.5
Dataset      Accuracy %  No. of Generations  Accuracy %  No. of Generations  Accuracy %  No. of Generations
Lenses       75          7                   82          12                  95          17
Haberman     71          114                 68          88                  64          70
Iris         77          88                  87          53                  82          45

It can be seen from Table 2 that for the Lenses dataset, which is small, optimal accuracy is achieved when the population size is one and a half times the size of the dataset, whereas for the larger Haberman dataset the accuracy is maximum when the population size equals the dataset size. For the Iris dataset, which is of moderate size, the population has to be set to 1.25 times the size of the dataset to achieve the optimum result.

As the fitness function is considered to be the crucial factor for the GA, variations are introduced in the fitness function while the other parameters remain unchanged. In Table 3 the minimum confidence and support values are altered while the others are at their default values, and the results are recorded.

From Table 3 it is clear that variation in minimum support and confidence brings greater changes in accuracy. When the values of minimum support and confidence are both set low, the accuracy is found to be low regardless of the size of the dataset. The same is noted when both values are set high. Optimum accuracy is achieved when a tradeoff between minimum confidence and minimum support is set.


Table 3: Comparison based on variation in Minimum Support and Confidence

Minimum Support & Minimum Confidence

             Sup = 0.4 & con = 0.4    Sup = 0.9 & con = 0.9    Sup = 0.9 & con = 0.2    Sup = 0.2 & con = 0.9
Dataset      Accuracy %  No. of Gen.  Accuracy %  No. of Gen.  Accuracy %  No. of Gen.  Accuracy %  No. of Gen.
Lenses       22          20           49          11           70          21           95          18
Haberman     45          68           58          83           71          90           62          75
Iris         40          28           59          37           78          48           87          55

When the parameters R_s and R_c are altered in the fitness function, only minimal alterations in accuracy are noted, and hence their impact is not taken up for analysis.

In Table 4 the crossover probability is altered while the other GA parameters are set to their default values, and the results observed are recorded.

Table 4: Comparison based on variation in Crossover Probability

Cross Over   Pc = .25                        Pc = .5                         Pc = .75
Dataset      Accuracy %  No. of Generations  Accuracy %  No. of Generations  Accuracy %  No. of Generations
Lenses       95          8                   95          16                  95          13
Haberman     69          77                  71          83                  70          80
Iris         84          45                  86          51                  87          55

From Table 4 it is evident that the accuracy achieved is almost the same for all three datasets whatever crossover probability is adopted. The effect of the crossover probability on the convergence rate is noticeable; the dataset size and the population size that is set also alter the convergence rate.

The results observed are compared for the three datasets as shown in Figures 2 and 3.

Figure 2: Population Size Vs Accuracy.

Figure 3: Minimum Support and Confidence Vs Accuracy.


The values of the GA parameters set for the three datasets when maximum efficiency is achieved are shown in Table 5.

Table 5. Comparison of the optimum value of Parameters for maximum Accuracy achieved.

Dataset    No. of Instances   No. of attributes   Minimum Support   Minimum confidence   Crossover rate   Accuracy in %
Lenses     24                 4                   0.2               0.9                  0.25             95
Haberman   306                3                   0.9               0.2                  0.5              71
Iris       150                5                   0.2               0.9                  0.75             87

It is observed from the experimental analysis that the choice of optimum population size for better accuracy depends upon the number of instances in the dataset. If the dataset is larger, a population size equal to the number of instances in the dataset is found to produce better accuracy.

Setting the values for minimum support and confidence depends on the dataset and the relationship between its attributes. A tradeoff between minimum confidence and minimum support has to be struck to attain optimum results. The crossover rate mainly affects the convergence rate of the system and has minimal effect on its accuracy.

6  Conclusion

Genetic algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimum results in mining association rules. When a genetic algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. Minimum support, minimum confidence and population size are the key parameters deciding the accuracy of the system. The setting of the population size is based on the size of the problem under study, whereas the minimum confidence and minimum support to be set depend upon the problem under study. The optimum value of the crossover rate leads to earlier convergence while playing a minimal role in achieving better accuracy. The optimum values of the GA parameters vary from dataset to dataset, and the fitness function plays a major role in optimizing the results. The size of the dataset and the relationship between attributes in the data contribute to the setting of the parameters. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.

References

1.  Cattral, R., Oppacher, F., Deugo, D.: Rule Acquisition with a Genetic Algorithm. In: Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, 1999.

2.  Saggar, M., Agrawal, A.K., Lad, A.: Optimization of Association Rule Mining. In: IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, Page(s): 3725 – 3729, 2004.

3.  Zhou Jun, Li Shu-you, Mei Hong-yan, Liu Hai-xia: A Method for Finding Implicating Rules Based on the Genetic Algorithm. In: Third International Conference on Natural Computation, Volume: 3, Page(s): 400 – 405, 2007.

4.  Genxiang Zhang, Haishan Chen: Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining. In: International Conference on Artificial Intelligence and Computational Intelligence, AICI '09, Volume: 4, Page(s): 341 – 345, 2009.

5.  Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K.: Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction. In: IEEE Congress on Evolutionary Computation, CEC '09, Page(s): 3249 – 3255, 2009.

6.  Xian-Jun Shi, Hong Lei: Genetic Algorithm-Based Approach for Classification Rule Discovery. In: International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII '08, Volume: 1, Page(s): 175 – 178, 2008.

7.  Haiying Ma, Xin Li: Application of Data Mining in Preventing Credit Card Fraud. In: International Conference on Management and Service Science, MASS '09, Page(s): 1 – 6, 2009.

8.  Hong Guo, Ya Zhou: An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application. In: 3rd International Conference on Genetic and Evolutionary Computing, WGEC '09, Page(s): 117 – 120, 2009.

9.  Hua Tang, Jun Lu: Hybrid Algorithm Combined Genetic Algorithm with Information Entropy for Data Mining. In: 2nd IEEE Conference on Industrial Electronics and Applications, Page(s): 753 – 757, 2007.

10.  Wenxiang Dou, Jinglu Hu, Hirasawa, K., Gengfeng Wu: Quick Response Data Mining Model using Genetic Algorithm. In: SICE Annual Conference, Page(s): 1214 – 1219, 2008.