Association Rule Mining In Partitioned Databases:
Performance Evaluation and Analysis
A DISSERTATION
Submitted in partial fulfillment of the requirements for the award of the degree of
MASTER OF TECHNOLOGY
in
INFORMATION TECHNOLOGY (Specialization: SOFTWARE ENGINEERING)
By
Pankaj Kandpal
Under the Guidance of:
Prof. M. Radhakrishna Mr. Manish Kumar
IIIT-Allahabad
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, ALLAHABAD
(A Centre of Excellence in Information Technology Established by Govt. of India)
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(Deemed University)
(A centre of excellence in IT, established by Ministry of HRD, Govt. of India)
Date:
We do hereby recommend that the thesis work prepared under our supervision
by Pankaj Kandpal entitled Association Rule Mining in Partitioned
Databases: Performance Evaluation and Analysis be accepted in partial
fulfillment of the requirements of the degree of Master of Technology in
Information Technology (Software Engineering) for examination.
Countersigned:
Dr. U. S. Tiwary (Dean, Academics)

Thesis Advisers:
Prof. M. Radhakrishna
Mr. Manish Kumar
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(A University Established under sec. 3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 dated 04.08.2000 of the Govt. of India)
(A Centre of Excellence in Information Technology Established by Govt. of India)
CERTIFICATE OF APPROVAL*
The foregoing thesis is hereby approved as a creditable study in the area of
information technology carried out and presented in a manner satisfactory to
warrant its acceptance as a pre-requisite to the degree for which it has been
submitted. It is understood that by this approval the undersigned do not
necessarily endorse or approve any statement made, opinion expressed or
conclusion drawn therein but approve the thesis only for the purpose for
which it is submitted.
COMMITTEE ON FINAL EXAMINATION FOR EVALUATION OF THE THESIS
* Only in case the recommendation is concurred in
Candidate Declaration

This is to certify that the report entitled “Association Rule Mining in
Partitioned Databases: Performance Evaluation and Analysis” which is
submitted by me in partial fulfillment of the requirement for the completion
of M.Tech in Information Technology (with specialization in Software
Engineering) to Indian Institute of Information Technology, Allahabad
comprises only my original work and due acknowledgement has been made in
the text to all other material used.
PANKAJ KANDPAL
M.Tech (INFORMATION TECHNOLOGY)
SPECIALISATION IN SOFTWARE ENGINEERING
MS200512
To My Family and Friends
Acknowledgements
First and foremost, I would like to express my sincere thanks to my thesis advisors, Prof.
M. Radhakrishna and Mr. Manish Kumar, for their precious advice and suggestions. This
thesis would not have been a success without their cooperation and valuable comments.
Next, I would like to express my deep gratitude to my family: my father Mr. Bhuwan
Chandra Kandpal, my mother Smt. Kamla Kandpal and my younger brother Mr.
Devesh Kandpal, for their unconditional love and support in every part of my life.
Without their support I would never have dreamt of pursuing higher studies.
I would like to thank the INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,
ALLAHABAD for providing me such a great opportunity to become a part of its
family. It has been a privilege for me to pursue an M.Tech. in Software Engineering at
this institute.
I would like to thank Blue Martini Software for contributing the KDD Cup 2000 data,
without which the experiments would not have been possible.
My special thanks go to Mr. Balwant Singh for providing me software and hardware
support.
Some of my friends deserve special mention. They are Mr. Nilesh Shukla, Mr. Kamal
Sawan, Mr. Abhay S. Pawne, Mr. Vineet Kumar, Mr. Prabhat Saheja, Mr. Imran
Khan, Mr. Anand Atre, Mr. Adish Singh, Mr. S. K. Mada, Mr. Kamal Singh, Mr.
D. N. Lan and Ms. Mallika Srivastav.
For two full years, several souls of IIITA's M.Tech 2005 batch suffered the burden of
my company. Hearty thanks go to these fellows, who in spite of that maintained a lively
and jovial work environment. These jolly persons are Mr. Parikshit Totawar, Mr.
Niladree Biswas, Mr. Dhirendra Pratap Singh, Mr. Rama Rao, Mr. Dora Babu, Mr.
Ravi Kiran, Mr. Anil Pandey and Mr. Prateek Dayal.
Lastly, I would like to thank everyone who contributed to this thesis, directly or
indirectly.
Pankaj Kandpal
July 2007
Abstract
Association Rule Mining in Partitioned Databases: Performance Evaluation and Analysis
Pankaj Kandpal, M.Tech (Software Engineering), Indian Institute of Information Technology, Allahabad
July 2007
Data mining is the process of extracting useful information from the huge amounts
of data stored in databases. Data mining tools and techniques help to predict business
trends that may occur in the near future. Association rule mining is an important technique
for discovering hidden relationships among the items in transactions.
The goal of this thesis is to experimentally evaluate association rule mining
approaches in the context of horizontal database partitioning. The algorithms are
implemented using SQL and PLSQL stored procedures, and Oracle 10g RDBMS is used
as the database for the experimental evaluation. The Apriori, partitioning and sampling
algorithms have been implemented and their performance is evaluated extensively.
The Apriori algorithm is implemented using the K-way join approach for support counting.
The partitioning approach is implemented both in the traditional manner (using TID lists
for support counting) and in combination with the second pass optimization of the K-way
join. For the sampling algorithm, the dataset is first partitioned into a given number of
partitions and the algorithm is then applied, treating one partition as the sample.
Table of Contents

Candidate Declaration
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 - Introduction
1.1 Data Mining Functionalities
1.1.1 Association Analysis
1.1.2 Clustering Analysis
1.1.3 Classification Analysis
1.1.4 Deviation Analysis
1.2 Architectures - Integrating Mining with DBMS
1.2.1 Loose Coupling (or Cache Mining)
1.2.2 Stored Procedures and User Defined Functions
1.2.3 SQL Based Approach
1.2.4 Integrated Approach
1.3 Database Partitioning and PLSQL
1.3.1 Database Partitioning
1.3.2 An Introduction to PLSQL
1.3.2.1 PLSQL Stored Procedures and Dynamic SQL
1.3.2.2 Why Use Dynamic SQL?
1.4 Focus of the Thesis
1.5 Thesis Organization
Chapter 2 - Association Analysis
2.1 Background
2.2 Association Rule Mining Algorithms
2.2.1 Terminology and Concepts
2.2.2 Example of Association Rules
2.2.3 Classification of Association Rules
2.2.4 Apriori Algorithm
2.2.5 Partitioning Algorithm
2.2.6 Sampling Algorithm
Chapter 3 - Apriori Algorithm for Association Rule Mining
3.1 Datasets for Experiments
3.2 Performance Analysis
Chapter 4 - Partitioned Algorithm for Association Rule Mining
4.1 Performance Analysis of Partition Algorithm
4.2 Partition Algorithm with Second Pass Optimization (SPO)
4.2.1 Second Pass Optimization of K-Way Join Approach for Support Counting
4.2.2 The Approach
4.3 Performance Comparisons
Chapter 5 - Sampling Algorithm for Association Rule Mining
5.1 The Negative Border
5.2 Performance Analysis of the Sampling Algorithm
5.2.2 Errors in Frequent Itemsets for BMSWEBVIEW1 Dataset
5.2.3 Errors in Frequent Itemsets for T10I4D100K Dataset
Chapter 6 - Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
List of Figures

Figure 1.1: Different architectures for integrating mining within DBMS
Figure 2.1: Presentation of candidate and frequent itemsets in the database
Figure 2.2: Partitioning approach for frequent itemsets mining
Figure 3.1: Performance of Apriori for BMSWEBVIEW1 dataset (2 partitions)
Figure 3.2: Performance of Apriori for BMSWEBVIEW1 dataset (3 partitions)
Figure 3.3: Performance of Apriori for BMSWEBVIEW1 dataset (4 partitions)
Figure 3.4: Performance of Apriori for T10I4D100K dataset (2 partitions)
Figure 3.5: Performance of Apriori for T10I4D100K dataset (4 partitions)
Figure 3.6: Performance of Apriori for MUSHROOM dataset (2 partitions)
Figure 4.1: Tidlist creation time for different datasets
Figure 4.2: Performance of partition for BMSWEBVIEW1 dataset (2 partitions)
Figure 4.3: Performance of partition for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.4: Performance of partition for BMSWEBVIEW1 dataset (4 partitions)
Figure 4.5: Performance of partition for T10I4D100K dataset (4 partitions)
Figure 4.6: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (2 partitions)
Figure 4.7: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.8: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (4 partitions)
Figure 4.9: Performance comparisons of partition and partition with SPO algorithm for T10I4D100K dataset (4 partitions)
Figure 5.1: Performance of sampling algorithm for sample size 2484 for BMSWEBVIEW1 dataset
Figure 5.2: Performance of sampling algorithm for sample size 5063 for BMSWEBVIEW1 dataset
Figure 5.3: Performance of sampling algorithm for sample size 9133 for BMSWEBVIEW1 dataset
Figure 5.4: Performance of sampling algorithm for sample size 19246 for BMSWEBVIEW1 dataset
Figure 5.5: Performance of sampling algorithm for sample size 37182 for BMSWEBVIEW1 dataset
Figure 5.6: Performance of sampling algorithm for sample size 74634 for BMSWEBVIEW1 dataset
Figure 5.7: Performance of sampling algorithm for sample size 7881 for T10I4D100K dataset
Figure 5.8: Performance of sampling algorithm for sample size 16703 for T10I4D100K dataset
List of Tables

Table 2.1: Transaction Database D
Table 2.2: Frequent Itemsets F3
Table 2.3: Tidlists for 1-itemsets
Table 2.4: Tidlists for 2-itemsets
Table 3.1: Details of Datasets
Table 3.2: BMSWEBVIEW1 dataset (2 partitions)
Table 3.3: BMSWEBVIEW1 dataset (3 partitions)
Table 3.4: BMSWEBVIEW1 dataset (4 partitions)
Table 3.5: T10I4D100K dataset (2 partitions)
Table 3.6: T10I4D100K dataset (4 partitions)
Table 3.7: MUSHROOM dataset (2 partitions)
Table 3.8: Itemsets in MUSHROOM dataset (1st partition)
Table 4.1: Candidate 2-itemsets (C2) for 0.45% support for 4 partitions
Table 4.2: Performance comparison
Table 5.1: Candidate itemset C2
Table 5.2: Negative Border NBd(F2)
Table 5.3: Frequent itemset F2
Table 5.4: Description of different samples for BMSWEBVIEW1 dataset
Table 5.5: Description of different samples for T10I4D100K dataset
Table 5.6: Candidate itemsets in different samples for BMSWEBVIEW1 dataset
Table 5.7: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.15% support
Table 5.8: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.30% support
Table 5.9: Percentage error for BMSWEBVIEW1 dataset
Table 5.10: Frequent itemsets generated for T10I4D100K dataset for 0.45% support
Table 5.11: Frequent itemsets generated for T10I4D100K dataset for 0.60% support
Table 5.12: Percentage error for T10I4D100K dataset
Chapter 1 Introduction
There are two main reasons that data mining has attracted a great deal of attention
in recent years. First, our capability to collect and store huge amounts of data is
increasing rapidly. Due to the decreasing cost of storage devices and the increasing
processing power of computers, it is now possible to store and process huge amounts of
organizational data. The second, and more important, reason is the need to turn such
data into useful information and knowledge. The knowledge acquired through data
mining can be applied in areas such as business management, retail and market analysis,
engineering design and scientific exploration [1].
Data mining, or knowledge discovery in databases (KDD), is the process of discovering
previously unknown patterns from the huge amounts of data stored in flat files, databases,
data warehouses or any other type of information repository. Database mining deals with
the data stored in database management systems (e.g. Oracle).
Being data rich does not necessarily make us information rich, because useful
information is often hidden in the data. Data mining tools and techniques are used to
generate information from the data that we have stored in our repositories over the years.
To gain a market advantage over competitors, decision makers and managers need
to mine the knowledge hidden in the data collected over the years and use that
information effectively.
1.1 Data mining Functionalities
The process of mining is often driven by the requirements of the users. The user may
be a business analyst or a marketing manager. Different users have different information
needs, and depending on these requirements we can use different data mining
techniques. The different types of data mining functionalities and the patterns they
discover are described below.

1.1.1 Association Analysis [1]
Association rule mining is a data mining technique used to find interesting
patterns or associations among the data items stored in a database. Support
and confidence are two measures of the interestingness of the mined patterns; they are
user-supplied parameters and differ from user to user. Association rule mining is mainly
used in market basket analysis or retail data analysis. In market basket analysis we
identify the buying habits of customers and analyze them to find associations
among the items they purchase. Items that are frequently purchased together by
customers can be identified. Association analysis helps retailers plan marketing,
item placement and inventory management strategies.
When we do association rule mining in relational database management systems we
generally transform the database into (tid, item) format, where tid stands for transaction
ID and item stands for an item purchased by a customer. There will be multiple
entries for a given transaction ID, because one transaction ID represents the purchase
of one particular customer, and a customer can purchase as many items as they want. An
association rule can look like this:
buys(X, Computer) => buys(X, Windows OS CD) [support = 1%, confidence = 50%]
Where:
Support = (the number of transactions that contain Computer and Windows OS CD) / (the total number of transactions)
Confidence = (the number of transactions that contain Computer and Windows OS CD) / (the number of transactions that contain Computer)
The above rule holds if its support and confidence are equal to or greater than the user-specified minimum support and confidence.
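As an illustration, support counts can be computed directly in SQL over the (tid, item) format. The sketch below assumes a hypothetical sales(tid, item) table and assumed item codes 10 for Computer and 20 for Windows OS CD; it mirrors the join-based style of support counting used later in the thesis, but it is not the thesis implementation.

```sql
-- Hypothetical table sales(tid, item); item codes are assumptions:
-- 10 = Computer, 20 = Windows OS CD.
-- Support count of the 2-itemset {Computer, Windows OS CD}:
-- joining the table with itself on tid keeps only transactions
-- that contain both items.
SELECT COUNT(*) AS support_count
FROM sales t1, sales t2
WHERE t1.tid  = t2.tid
  AND t1.item = 10
  AND t2.item = 20;
```

Dividing this count by SELECT COUNT(DISTINCT tid) FROM sales gives the support as a fraction of all transactions.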
1.1.2 Clustering Analysis [1]

In clustering we group data items in such a way that the items in a cluster are
more similar to one another and items in different clusters are more dissimilar. These
data items are sometimes called data points. The goal of clustering is to maximize the
intra-class similarity and minimize the inter-class similarity. The main clustering methods
are: partitioning methods, hierarchical methods, density-based methods, grid-based
methods and model-based methods. With the help of clustering, for example, we
can plan a marketing strategy by dividing market areas into different zones according
to climate or customer behavior, so that each group is targeted differently.
1.1.3 Classification Analysis [1]

In classification, we analyze training data to develop a model which
is then used to predict the class of objects whose class label is not known. The model is
trained so that it can distinguish different data classes. The training data contains data
objects whose class labels are known in advance. There are various representation methods
for the derived model, such as IF-THEN rules, decision trees, neural networks and
mathematical formulas.
The major difference between classification and clustering is that classification is
supervised and clustering is unsupervised. That means in classification the class label is
known in advance, while clustering does not assume any knowledge of clusters.
1.1.4 Deviation Analysis [2]

Deviations are differences between the current data and previously defined normal
values. Deviation analysis is used to detect anomalies in datasets. It is very useful for
time-related data analysis, in which we need to identify data deviations that occur over
time. Deviation analysis tools are helpful in security systems, where authorities can
be warned about deviations in resource utilization by a particular user.
1.2 Architectures - Integrating Mining with DBMS

There are various architectures [3] available for integrating the data mining process with
database management systems. These architectures are depicted in figure 1.1 and
described briefly below:
1.2.1 Loose Coupling (or Cache Mining)

This is an example of multi-tier architecture. Mining applications are integrated into the
client or into the application server, depending on the architecture; the mining kernel can
be considered the application server. Data is first fetched from the database
management system into the mining kernel and then mined according to the user's needs.
Finally, the results are sent back to the DBMS, and any intermediate results generated are
also stored back into the DBMS. In this approach the DBMS runs in a different address
space from the mining process. Cache-based mining is another type of loose coupling,
in which the data is read only once from the DBMS and cached into flat files
on the local disk for future processing.
1.2.2 Stored Procedures and User Defined Functions

Mining logic is embedded as an application on the database server. There are two ways in
which the mining application is stored on the database server side: stored procedures and
user defined functions. The mining application and the DBMS execute in the same address
space. For example, in Oracle we can create PLSQL stored procedures or Java stored
procedures for our mining algorithms, and these procedures are then stored in the
database. In IBM DB2 we can implement a mining algorithm with the help of user defined
functions.
[Figure 1.1 depicts the spectrum of integration, from loose to tight: cache-mine and loose coupling (mining as an application on the client/application server), stored procedures and user defined functions (mining as an application on the database server), the SQL based approach (mining using SQL and extensions), and the integrated approach (mining extenders/blades integrated with the SQL query engine).]
Figure 1.1: Different architectures for integrating mining within DBMS [3]
1.2.3 SQL Based Approach

Here the mining algorithm is presented in the form of SQL queries to the DBMS query
engine, where they are executed by the SQL query processor. A mining-aware
optimizer can be used to optimize these SQL queries. The DBMS provides support for
checkpointing and space management, which is very useful for such long-running
queries.
1.2.4 Integrated Approach

In the integrated approach, querying and mining are treated similarly. There is no
distinction between OLTP, OLAP and mining; the main goal is to get information from
the database in the most effective way. Here mining operators are an essential part of the
database query engine, and these mining operators or extended SQL are used for mining.
1.3 Database Partitioning and PLSQL

1.3.1 Database Partitioning [15]

A database partition is a logical division of a database or its constructs, such as tables or
indexes, into distinct independent parts. Database partitioning is done mainly for the
following reasons:
• Performance
• Manageability
• Availability
A database can be partitioned in two ways:
• Building several smaller databases
• Splitting selected elements (splitting a table into various tables)
Partitioning can be done in two manners:
• In horizontal partitioning we put different rows in different tables (row-wise
partitioning).
• In vertical partitioning we put different columns in different tables (column-wise
partitioning; normalization uses vertical partitioning).
Oracle provides various partitioning options, such as hash partitioning, list
partitioning, range partitioning and various combinations of these. We want to randomize
the data allocated to the partitions. For this purpose the hash partitioning option will be
used, in which a hash function is applied to the partition key of each row, and based on
the result the row is placed into the appropriate partition.
A hash partitioned table can be created like this:
CREATE TABLE my_table
(
  tid  NUMBER,
  item NUMBER
)
PARTITION BY HASH (tid)
PARTITIONS 4;
The above script creates a hash partitioned table with four partitions. The table is
initially empty; when data is inserted, it is allocated to the different partitions
according to the value of the hash function.
But what if the table already contains data? In that case we have to redefine the logical
structure of the table online. For this, Oracle RDBMS provides the facility of online
redefinition: the DBMS_REDEFINITION package [15] is used to partition a
table that already contains data.
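As a sketch of how online redefinition proceeds (the schema name SCOTT and the table names here are placeholders, not from the thesis), one creates an interim table with the desired partitioned layout and then invokes the DBMS_REDEFINITION procedures:

```sql
-- Interim table with the desired hash-partitioned layout.
CREATE TABLE my_table_interim
(
  tid  NUMBER,
  item NUMBER
)
PARTITION BY HASH (tid)
PARTITIONS 4;

-- Redefine the populated table MY_TABLE online;
-- 'SCOTT' is a placeholder schema name.
BEGIN
  DBMS_REDEFINITION.CAN_REDEF_TABLE('SCOTT', 'MY_TABLE');
  DBMS_REDEFINITION.START_REDEF_TABLE('SCOTT', 'MY_TABLE', 'MY_TABLE_INTERIM');
  DBMS_REDEFINITION.FINISH_REDEF_TABLE('SCOTT', 'MY_TABLE', 'MY_TABLE_INTERIM');
END;
/
```

After FINISH_REDEF_TABLE completes, the original table name refers to the partitioned layout while applications continued to access it throughout the process.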
1.3.2 An Introduction to PLSQL

PLSQL is Oracle's extension of the Structured Query Language (SQL). PLSQL can be
used to implement business rules through the creation of stored procedures, functions and
packages, to create triggers that respond to database events, and to add programming
logic to the execution of SQL commands.
1.3.2.1 PLSQL Stored Procedures and Dynamic SQL [19-20]

Stored procedures [18] are stored on the database server side and can be invoked by client
applications. Stored procedures are written by users and include SQL statements.
SQL is a declarative language that allows writing SQL statements and sending them to the
database engine for execution; procedural code cannot be executed by SQL. PL/SQL was
created to overcome this limitation.
A PLSQL stored procedure has a name; parameters can be passed to it as input, and it
can return values to the calling program. The variables it handles can have basic
data types such as characters, integers, numbers and dates, or complex data types such as
large objects (LOBs), varrays and PLSQL tables.
PLSQL is a complete block-structured programming language. PLSQL procedures,
functions and packages are stored on the server side; PLSQL procedures and functions are
collectively called PLSQL stored procedures or subprograms. All PLSQL programs are
made up of blocks, which can be nested within each other. SQL can be easily embedded
inside a PLSQL program, and PLSQL provides additional features that SQL lacks.
SQL DML statements can be directly included in PLSQL, database tables can be
manipulated easily, and after computation the results can be stored in the database directly.
SQL DDL statements can also be included inside PLSQL stored procedures with the
help of dynamic SQL, and PLSQL procedures can easily be called from client programs.
Oracle provides two ways to execute dynamic SQL: native dynamic SQL and the
DBMS_SQL package. Native dynamic SQL is easier to write and its code is compact
compared to code written with the other method [19].
Static SQL remains the same in each execution. Dynamic SQL allows us to
build SQL statements as character strings at run time. The string contains
the text of a SQL statement or PLSQL block and can also contain placeholders for bind
variables. With the help of dynamic SQL we can generalize SQL statements, because the
full text of a SQL statement need not be known at compile time. This gives us the ability
to create general-purpose, flexible applications. Dynamic SQL can be used in several
different development environments, including PLSQL, Pro*C/C++ and Java.
For example, suppose a user wants to run a complex query with a user-specified sort
order. Instead of coding the query twice with a different sort order (ORDER BY)
clause in each query, the query can be built dynamically to include the specified sort
order clause.
1.3.2.2 Why Use Dynamic SQL? [19] Static SQL and dynamic SQL both have advantages and disadvantages. The full text of
static SQL statements is known at the compilation time, which provides the following
advantages:
• Static SQL has better performance than dynamic SQL.
• If a SQL statement complies successfully it states that all the database objects
referenced in the SQL statement are valid and all the necessary privileges are in
place to access the objects.
• Static SQL has some limitations that can be overcome with dynamic SQL.
Dynamic SQL provides the following advantages over static SQL:
• Executing SQL statements whose full text is not known until the PLSQL
procedure runs.
• Executing DDL and other SQL statements that are not supported in static SQL
programs.
• Referencing database objects that do not exist at compile time.
• Optimizing execution at run time.
• Executing dynamic PLSQL blocks.
The following PLSQL block contains several examples of dynamic SQL:

DECLARE
   sql_stmt     VARCHAR2 (200);
   plsql_block  VARCHAR2 (500);
   query_str    VARCHAR2 (100);
   v_deptno     NUMBER;
BEGIN
   query_str := 'SELECT deptno FROM emp WHERE empno = :no';
   EXECUTE IMMEDIATE query_str INTO v_deptno USING 100;
   EXECUTE IMMEDIATE 'CREATE TABLE BMSWEBVIEW1 (tid NUMBER, item NUMBER)';
   EXECUTE IMMEDIATE 'ALTER SYSTEM SET CURSOR_SHARING = SIMILAR';
   plsql_block := 'BEGIN pkg_apriori.apriori (:pass_no, :min_sup); END;';
   EXECUTE IMMEDIATE plsql_block USING 4, 179;
END;
The above PLSQL block has no name; it is called an anonymous PLSQL block. An
anonymous block is not stored on the server side in the database. Because an anonymous
block has no name, it cannot be called from any other block; PLSQL functions and
procedures, however, can be called from an anonymous block.
A typical format of a PLSQL stored procedure is shown below:

CREATE OR REPLACE PROCEDURE MyProcedure (Tid IN NUMBER, Item IN NUMBER)
IS  -- AS may be used in place of IS
/*
   Declaration section: define and initialize the variables and cursors
   used in the block.
*/
BEGIN
/*
   Executable section: uses flow-control commands (such as IF statements
   and loops) to execute statements and assign values to the declared
   variables.
*/
EXCEPTION
/*
   Exception-handling section (optional): provides customized handling
   of error conditions.
*/
END;
PLSQL packages are units of encapsulation used to store related functions and
procedures together; packages in PLSQL are similar to modules in other programming
languages. A PLSQL package consists of two parts: the package specification and the
package body.
The following is an example of a PLSQL package that creates and alters a table at run
time:
CREATE OR REPLACE PACKAGE pkg_new_approach
AS
PROCEDURE table_creation
(initial_tablename VARCHAR2, new_tablename VARCHAR2);
PROCEDURE alter_table_creation
(new_tablename VARCHAR2, buffer_1 VARCHAR2);
END pkg_new_approach;
CREATE OR REPLACE PACKAGE BODY pkg_new_approach
AS
PROCEDURE table_creation
(initial_tablename VARCHAR2, new_tablename VARCHAR2)
IS
Item_1 NUMBER;
buffer_1 VARCHAR2 (50);
buffer_final VARCHAR2 (1000);
type cur_type IS ref CURSOR;
my_rec1 cur_type;
BEGIN
OPEN my_rec1 FOR 'select distinct item from ' || initial_tablename || ' order by item';
EXECUTE IMMEDIATE 'create table ' || new_tablename || '(x number)';
LOOP
FETCH my_rec1
INTO item_1;
EXIT
WHEN my_rec1 % NOTFOUND;
buffer_1 := CONCAT ('x', item_1);
alter_table_creation (new_tablename, buffer_1);
END LOOP;
CLOSE my_rec1;
END;
PROCEDURE alter_table_creation
(new_tablename VARCHAR2, buffer_1 VARCHAR2)
IS
query_str VARCHAR2 (1000);
BEGIN
query_str:= 'alter table ' || new_tablename || ' add ' || buffer_1 || ' number';
EXECUTE IMMEDIATE query_str;
END;
END pkg_new_approach;
1.4 Focus of the Thesis
In this thesis we are concerned with database mining, in which the data is stored in a
relational database management system (e.g. Oracle). An RDBMS provides various
additional benefits that are lacking in file-based mining. SQL and PLSQL stored
procedures [15, 20] are used for the implementation. For the experiments, one
synthetic and two real-life datasets [21, 22] are used.
The goal of the thesis is to evaluate the performance of association rule mining
algorithms in the context of database partitioning. The thesis focuses on the apriori,
partition, and sampling algorithms for frequent itemset mining when the data is
partitioned into a given number of segments. The apriori algorithm scans the database
multiple times to count the support of the itemsets. The partitioning approach
partitions the database for mining frequent itemsets. The sampling algorithm mines a
small sample instead of the entire database.
1.5 Thesis Organization
The structure of the rest of the thesis is as follows:
Chapter 2 presents the background of the various association rule mining approaches
developed so far. It covers in detail the association analysis and the association rule
mining algorithms discussed in the thesis.
Chapter 3 discusses the performance analysis of the apriori algorithm when it is applied
with the partitioning approach.
Chapter 4 presents the performance analysis of partitioning algorithm. It discusses the
TIDLIST approach for support counting and K-way join second pass optimization.
Chapter 5 presents the sampling approach for frequent itemsets mining.
Chapter 6 discusses the conclusion and future directions about the work done in the
thesis.
Chapter 2 Association Analysis
In this chapter a background of various association rule mining algorithms is given. The
chapter also covers in detail the association analysis and the association rule mining
algorithms discussed in the thesis.
2.1 Background

Association rule mining was first introduced in the AIS [4] algorithm and was later
refined in [5]. Since the development of the AIS algorithm, various algorithms have been
proposed to improve performance. Apriori [5] is the most basic and most popular
association rule mining algorithm, and most association rule mining algorithms are based
on it.
The apriori algorithm scans the database multiple times. The FP-tree [6] (frequent-pattern
tree) algorithm builds a special tree structure in main memory so that it can avoid
multiple scans over the database. The turbo-charging [7] algorithm improves
performance with the help of data compression techniques.
The partition algorithm [8] is based on the apriori algorithm. It first partitions the data
into a number of non-overlapping partitions and processes each partition separately to
generate the frequent itemsets local to that partition; finally it combines all the local
frequent itemsets to generate the global frequent itemsets. It reduces the number of
complete database scans to two and hence improves the performance of the mining
algorithm.
The incremental mining algorithm [9] is another useful technique for speeding up the
mining process when new data is added to the database. The sampling algorithm [1, 10]
is also based on the apriori algorithm. Rather than mining the entire database, a random
sample of data is drawn from the database, and frequent itemsets are found in that
sample instead of the entire database. Finally, the rest of the database is used to compute
the actual support of the frequent itemsets found in the sample.
Because we are searching for frequent itemsets in the sample, it is possible that we miss
some globally frequent itemsets. To lessen this risk, a support threshold lower than the
minimum support is used for the sample; in this way some degree of accuracy is traded
for efficiency. There are various mechanisms for finding the frequent itemsets that were
missed in the sample.
Most of these algorithms are in-memory algorithms, in which the data is read directly
from flat files, or is first extracted from the database into flat files and then processed in
main memory. Most of these algorithms build specialized data structures and implement
their own buffer management schemes.
Since then, few attempts have been made to build database-based mining approaches.
Various extensions to standard SQL have also been proposed; these extensions allow
the inclusion of mining operators in SQL. The data mining query language (DMQL)
[11] includes such mining operators for various types of mining tasks.
[12] shows various architectural alternatives for coupling data mining with relational
database systems. [3] compares various SQL-based approaches for association rule
mining: SQL-92 based approaches and SQL-OR based approaches. The SQL-92 based
approaches use standard SQL for mining, while the SQL-OR based approaches use the
object-relational extensions to SQL. [3] also implements the apriori algorithm in the
form of SQL queries.
[13] deals with the partitioned and incremental approaches for association rule mining;
it evaluates the basic k-way join algorithm in the context of multiple databases and
proposes two optimizations of the partitioned approach for multi-database mining.
2.2 Association Rule Mining Algorithms

2.2.1 Terminology and Concepts [1]

Let I be the set of all items in the database D. Database D contains user transactions;
each transaction T contains a set of items such that T ⊂ I. Let X and Y be sets of items.
An association rule is of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and
X ∩ Y = φ. Support and confidence are two measures of rule interestingness.
The rule X ⇒ Y holds in the database D with support s, where s is the percentage of
transactions in D that contain X ∪ Y. The rule has confidence c if c is the percentage of
transactions in D containing X that also contain Y. That is,

Support (X ⇒ Y) = P (X ∪ Y)
Confidence (X ⇒ Y) = P (Y | X)
The rules that satisfy both the user-specified minimum support and minimum confidence
are said to be strong association rules.
[1] A set of items is called an itemset; an itemset that contains k items is called a
k-itemset. The occurrence frequency of an itemset is the number of transactions that
contain the itemset; this is also known as the frequency or support count of the itemset.
An itemset satisfies minimum support if its occurrence frequency is greater than or
equal to the product of the minimum support and the total number of transactions in
the entire database. The number of transactions required for an itemset to satisfy
minimum support is referred to as the minimum support count. If an itemset satisfies
minimum support, it is called a frequent (or large) itemset.
An association rule mining algorithm is divided into two parts:
• Frequent itemset generation, i.e. finding all itemsets whose support is greater than
the user-specified minimum support.
• Generating, from the frequent itemsets found in step 1, the association rules that
satisfy the user-specified minimum confidence.
The first step is more complex and requires more effort. Once the frequent itemsets have
been generated, strong association rule generation is simple; strong association rules
satisfy both minimum support and minimum confidence.
Confidence (X ⇒ Y) = P (Y | X) = support-count (X ∪ Y) / support-count (X)

where support-count (X ∪ Y) is the total number of transactions containing the itemset
{X, Y} and support-count (X) is the total number of transactions containing the itemset {X}.
Association rules are generated as follows:
• For every frequent itemset x, generate all non-empty proper subsets of x.
• For every non-empty subset s of x, output the rule s ⇒ (x − s) if
support-count (x) / support-count (s) is greater than or equal to the minimum confidence.
Since the association rules are generated directly from frequent itemsets, each rule
automatically satisfies minimum support.
2.2.2 Example of association rules

Table 2.1 depicts an example transaction database, and Table 2.2 shows that
{1, 2, 3} and {1, 2, 5} are frequent 3-itemsets. The non-empty proper subsets of {1, 2, 3}
are {1}, {2}, {3}, {1, 2}, {1, 3} and {2, 3}. The association rules generated are:

{1, 2} ⇒ {3}   confidence = 2/4 = 50%
{1, 3} ⇒ {2}   confidence = 2/2 = 100%
{2, 3} ⇒ {1}   confidence = 2/3 = 66%
{1} ⇒ {2, 3}   confidence = 2/4 = 50%
{2} ⇒ {1, 3}   confidence = 2/6 = 33%
{3} ⇒ {1, 2}   confidence = 2/3 = 66%

If the minimum confidence is 66%, then the following rules are strong:
{1, 3} ⇒ {2}, {2, 3} ⇒ {1}, {3} ⇒ {1, 2}.

TID  ITEM
T1   1
T1   2
T1   5
T2   2
T2   4
T3   2
T3   3
T4   1
T4   2
T4   4
T8   1
T8   2
T8   3
T9   1
T9   2
T9   3
T9   5

Table 2.1: Transaction Database D
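The supports and confidences above can be checked mechanically. The following is a small sketch (in Python, with the transactions of Table 2.1 hard coded; it is an illustration, not part of the thesis implementation) that counts supports and derives the strong rules of a frequent itemset:

```python
from itertools import combinations

# Transactions from Table 2.1 (TID -> set of items).
transactions = {
    "T1": {1, 2, 5}, "T2": {2, 4}, "T3": {2, 3},
    "T4": {1, 2, 4}, "T8": {1, 2, 3}, "T9": {1, 2, 3, 5},
}

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def rules(frequent, min_conf):
    """All rules s => (frequent - s) meeting the minimum confidence."""
    out = []
    for r in range(1, len(frequent)):
        for subset in combinations(sorted(frequent), r):
            s = frozenset(subset)
            conf = support_count(frequent) / support_count(s)
            if conf >= min_conf:
                out.append((set(s), set(frequent - s), conf))
    return out

for lhs, rhs, conf in rules(frozenset({1, 2, 3}), 0.66):
    print(lhs, "=>", rhs, round(conf, 2))
```

Running it for the frequent itemset {1, 2, 3} reproduces the three strong rules listed in the text: {3} ⇒ {1, 2}, {1, 3} ⇒ {2}, and {2, 3} ⇒ {1}.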
Table 2.2: Frequent Itemsets F3
ITEM1  ITEM2  ITEM3  SUPPORT
1      2      3      2
1      2      5      2

2.2.3 Classification of Association Rules

Association rules can be classified in various ways [1]:
• If a rule specifies an association between the presence and absence of items, it is
called a Boolean association rule. For example:
Computer ⇒ Windows OS CD
• If a rule specifies associations among quantitative items, it is called a
quantitative association rule; quantitative values are partitioned into intervals. For example:
Age (X, "20...25") ∧ Income (X, "22K...30K") ⇒ Buys (X, "Washing Machine")
• If the rule references only one dimension, it is called a single-dimensional
association rule. For example:
Buys (X, computer) ⇒ Buys (X, Windows OS CD)
• If a rule references two or more dimensions, such as age, income, and buys,
it is a multidimensional association rule. For example:
Age (X, "20...25") ∧ Income (X, "22K...30K") ⇒ Buys (X, "Washing Machine")
The above rule involves three dimensions: age, income, and buys.
• Multilevel association rules. For example:
Age (Z, "20...25") ⇒ Buys (Z, "printer")
Age (Z, "20...25") ⇒ Buys (Z, "color printer")
The above rules are at different levels of abstraction: printers are a higher-level
abstraction of color printers. If the rules do not reference items at different levels of
abstraction, they are called single-level association rules.
2.2.4 Apriori Algorithm [1]

The apriori algorithm [5] is one of the most important algorithms for association rule
mining, because most of the other algorithms are based on it or are extensions of it. It is
a main-memory based algorithm, and main memory imposes a limitation on the size of
the dataset that can be mined.
The algorithm executes in the two steps described above, i.e. frequent itemset
generation and association rule generation. Frequent itemset generation is itself a
two-step process:
• Candidate itemset (Ck) generation, i.e. generating all combinations of items that
are potential candidates for frequent itemsets.
• Frequent itemset (Fk) generation: the support of all candidate itemsets is counted,
and the itemsets with support greater than the user-specified minimum support
qualify as frequent itemsets.
The algorithm makes multiple scans over the database, and the number of scans cannot
be determined in advance.
The algorithm is presented below: [1, 13]
F1 = {frequent 1-itemsets}
For (k = 2; Fk-1 ≠ φ; k++) loop
    Ck = generate (Fk-1);
    For all transactions x ∈ D loop
        Cx = generate_subset (Ck, x); // candidates in Ck contained in x
        For all candidates c ∈ Cx loop
            c.count++;
        end loop;
    end loop;
    Fk = {c ∈ Ck | c.count ≥ minsup};
end loop;
Return ∪k {Fk};
First the apriori algorithm generates the frequent 1-itemsets F1 by reading the database
D directly. It then iterates through the for loop: Fk-1 is used to generate the candidate
itemsets Ck, and in the next pass Ck is used to generate Fk. The generate procedure
produces potential candidate itemsets and then eliminates from this set the itemsets
having a subset that is not frequent. The algorithm builds a special hash-tree data
structure in memory for support counting. [1]
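As a cross-check of the join-and-prune logic just described, here is a minimal in-memory sketch in Python (an illustration, not the thesis's PLSQL implementation; a brute-force support count stands in for the hash tree):

```python
from itertools import combinations

def generate(f_prev, k):
    """Join F(k-1) with itself on the first k-2 items, then prune every
    candidate that has a (k-1)-subset which is not frequent."""
    candidates = set()
    for a in f_prev:
        for b in f_prev:
            sa, sb = sorted(a), sorted(b)
            if sa[:-1] == sb[:-1] and sa[-1] < sb[-1]:
                candidates.add(frozenset(sa + [sb[-1]]))
    return {c for c in candidates
            if all(frozenset(s) in f_prev for s in combinations(c, k - 1))}

def apriori(db, minsup_count):
    """db: list of transactions (sets of items); returns {k: frequent k-itemsets}."""
    fk = {frozenset([i]) for i in set().union(*db)
          if sum(1 for t in db if i in t) >= minsup_count}
    frequent, k = {1: fk}, 2
    while fk:
        ck = generate(fk, k)
        # Support counting: scan the whole database once per pass.
        fk = {c for c in ck if sum(1 for t in db if c <= t) >= minsup_count}
        if fk:
            frequent[k] = fk
        k += 1
    return frequent

# The transaction database of Table 2.1, with a minimum support count of 2:
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3}, {1, 2, 3, 5}]
print(apriori(db, 2)[3])  # the frequent 3-itemsets {1, 2, 3} and {1, 2, 5}
```

On the example database this recovers exactly the frequent 3-itemsets of Table 2.2.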
For SQL based implementation of the algorithm, the candidate itemsets and frequent
itemsets are represented as relational tables. The SQL for candidate generation in pass k
is presented below: [2]
Insert into Ck
Select I1.item1, ..., I1.itemk-1, I2.itemk-1
From Fk-1 I1, Fk-1 I2
Where I1.item1 = I2.item1 and
      ...
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1;
Frequent itemset generation from candidate itemsets is the most time-consuming part of
association rule mining; it is called the support counting phase. For SQL-based
formulations, SQL-92 and SQL-OR based approaches are used for support counting.
The K-way join approach [13] presented below is a SQL-92 based approach to support
counting.
Insert into Fk
Select item1, … , itemk, count(*)
From Ck, T T1, … , T Tk
Where T1.item = Ck.item1 and
:
Tk.item = Ck.itemk and
T1.tid = T2.tid and
:
Tk-1.tid = Tk.tid
Group by item1, item2, … ,itemk
Having count(*) > min_sup;
There have been various optimizations [2] proposed for K-Way join approach. These are:
• Pruning the input data.
• Second pass optimization.
• Reuse of item combinations.
Figure 2.1 shows an example of how the candidate and frequent itemsets generated in
different passes are represented as tables in the database.
Frequent itemsets F1:
ITEM1  SUPPORT
1      6
2      7
3      6
4      2
5      2

Frequent itemsets F2:
ITEM1  ITEM2  SUPPORT
1      2      4
1      3      4
1      5      2
2      3      4
2      4      2
2      5      2

Candidate itemsets C2:
ITEM1  ITEM2
1      2
1      3
1      5
2      3
3      5
3      4
2      4

Candidate itemsets C3:
ITEM1  ITEM2  ITEM3
1      2      3
1      2      5
1      2      4
2      3      5

Figure 2.1: Presentation of candidate and frequent itemsets in the database
2.2.5 Partitioning Algorithm [8]

The partitioning algorithm is based on the apriori algorithm, but it requires only two
complete scans over the database. Figure 2.2 depicts the partitioning approach for
frequent itemset mining [1]. The partition algorithm is divided into two phases:
• The database is divided into a number of non-overlapping partitions, and the
frequent itemsets local to each partition are generated. This is the first complete
scan of the database.
• The local frequent itemsets from all partitions are combined to generate the
global candidate itemsets. The database is then scanned a second time to generate
the global frequent itemsets.
[Diagram omitted. Figure 2.2 shows the flow: divide database D into n partitions;
generate the frequent itemsets local to each partition (first complete scan of D);
combine all local frequent itemsets into the global candidate itemsets; second-phase
support counting then generates the global frequent itemsets (second complete scan of D).]

Figure 2.2: Partitioning approach for frequent itemsets mining [1]

The algorithm is formulated below: [8, 13]
P = partition_database (D)
n = number of partitions
// Phase I
For i = 1 to n loop
    Read-in_partition (pi ∈ P)
    Li = gen_large_itemsets (pi)
End loop;
// Generate global candidate itemsets
For (i = 2; Lij ≠ φ for some j = 1, 2, ..., n; i++) loop
    CiG = ∪j=1,2,...,n Lij
End loop;
// Phase II
For i = 1 to n loop
    Read-in_partition (pi ∈ P)
    For all candidates c ∈ CG: generate_count (c, pi)
End loop;
LG = {c ∈ CG | c.count ≥ minsup}
Return LG;
Here minsup is the minimum support for the entire database D. The minimum support
count for a particular partition is the product of minsup and the total number of
transactions in that partition.
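The two phases can be sketched as follows. This is an illustrative in-memory version in Python, not the thesis's PLSQL implementation; `mine_local` stands for any local miner (such as apriori), and the naive brute-force miner below exists only to make the sketch runnable:

```python
import math
from itertools import combinations

def partition_mine(db, n, minsup_fraction, mine_local):
    """db: list of transactions (sets). Phase I mines each partition with its
    own minimum support count; Phase II counts the global support of the
    union of all local frequent itemsets (the global candidates)."""
    size = math.ceil(len(db) / n)
    parts = [db[i * size:(i + 1) * size] for i in range(n)]
    # Phase I: first complete scan, one partition at a time.
    global_candidates = set()
    for p in parts:
        if p:
            global_candidates |= mine_local(p, math.ceil(minsup_fraction * len(p)))
    # Phase II: second complete scan counts global support.
    need = math.ceil(minsup_fraction * len(db))
    return {c for c in global_candidates
            if sum(1 for t in db if c <= t) >= need}

def brute_local(p, minsup_count):
    """Naive local miner: test every itemset over the partition's items,
    stopping at the first empty level (valid by the apriori property)."""
    items = sorted(set().union(*p))
    found = set()
    for k in range(1, len(items) + 1):
        layer = {frozenset(c) for c in combinations(items, k)
                 if sum(1 for t in p if set(c) <= t) >= minsup_count}
        if not layer:
            break
        found |= layer
    return found

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3}, {1, 2, 3, 5}]
print(partition_mine(db, 2, 0.34, brute_local))
```

The correctness of the scheme rests on the fact that any globally frequent itemset must be locally frequent in at least one partition, so the union of local results always contains the global answer.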
The algorithm [8] builds a special data structure called tidlists. For every itemset a
tidlist is generated; the tidlist of an itemset contains the TIDs of all the transactions in
the partition that contain that itemset, maintained in sorted order. Tidlists are used to
count the support of the candidate itemsets: the cardinality of the tidlist of an itemset
divided by the total number of transactions in the partition gives the support of that
itemset in that partition. Initially the tidlist for the entire partition is generated; from it
the tidlists corresponding to the 1-itemsets are obtained, and higher-level tidlists are
generated by the intersection of tidlists. Table 2.3 and Table 2.4 show the tidlists and
their representation as database tables for 1-itemsets and 2-itemsets respectively.
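Tidlist intersection amounts to a merge of two sorted lists. The Python below is illustrative (the thesis stores tidlists as CLOBs inside Oracle); the transaction contents are an assumption reconstructed here to be consistent with the tidlists of Table 2.3:

```python
def tidlist(db, itemset):
    """TIDs, in sorted order, of the transactions containing the itemset."""
    return [tid for tid, items in sorted(db.items()) if itemset <= items]

def intersect(t1, t2):
    """Merge-intersect two sorted tidlists in a single pass."""
    out, i, j = [], 0, 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            out.append(t1[i]); i += 1; j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical transactions (tid -> items) consistent with Table 2.3.
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

t12 = intersect(tidlist(db, {1}), tidlist(db, {2}))
print(t12)                 # tidlist of the 2-itemset {1, 2}: [300]
print(len(t12) / len(db))  # its support in the partition
```

Intersecting the tidlists of items 1 and 2 from Table 2.3 yields the (1, 2) row of Table 2.4.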
ITEM1  COUNT  TIDLIST
1      2      100, 300
2      3      200, 300, 400
3      3      100, 200, 300
4      1      100
5      3      200, 300, 400

Table 2.3: Tidlists for 1-itemsets
ITEM1  ITEM2  COUNT  TIDLIST
1      2      1      300
1      3      2      100, 300
1      5      1      300
2      3      2      200, 300
2      5      3      200, 300, 400

Table 2.4: Tidlists for 2-itemsets

2.2.6 Sampling Algorithm [10, 14 and 16]

Various sampling algorithms for association rule mining have been proposed in
[10, 14 and 16]. Among them, the sampling algorithm proposed in [10] has the best
performance. The algorithm [10] picks a random sample from the database and then
finds the frequent itemsets in the sample using a support threshold lower than the
user-specified minimum support for the database. These frequent itemsets are denoted
by S. The algorithm then finds the negative border [10] of these itemsets, denoted
NBd (S). The negative border is the set of itemsets that are candidate itemsets but did
not satisfy minimum support; simply, NBd (Fk) = Ck − Fk. After that, for each itemset
X in S ∪ NBd (S), the algorithm checks whether X is frequent in the entire database by
scanning the database. [1, 17]
If NBd (S) contains no frequent itemsets, then all the frequent itemsets have been found.
If NBd (S) contains frequent itemsets, the algorithm constructs a set of candidate
itemsets CG by repeatedly expanding S ∪ NBd (S) with its negative border until the
negative border is empty. For each itemset X in CG the algorithm then scans the
database a second time. In the best case, when all the frequent itemsets are found in the
sample, the algorithm requires only one scan over the database; in the worst case it
requires two scans. [1, 17]
The performance of the sampling algorithm relies on the quality of the sample chosen.
If a bad sample is chosen, the number of candidates generated for the second scan may
be very large, and hence the second scan may be inefficient.
The sample can also be a partition of the database; in that case the partition is treated
just like a randomly chosen sample.
The sampling algorithm [10] is depicted below:
s = Draw_random_sample (D);
// generate frequent itemsets for the sample drawn.
S = generate_frequent_itemsets (s, low_support);
// counting support for the itemsets and their negative border generated in the sample, in
the database D.
F = {X ∈ S U NBd (S) | X.count >= minsup};
// if NBd (S) contains frequent itemsets, expand border
Repeat
S = S U NBd (S);
Until S does not grow;
// another scan of D
F = {X ∈ S | X.count >= minsup};
Output F; // frequent itemsets in the database D
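The negative-border bookkeeping in the algorithm above can be made concrete with a toy sketch (Python; the fixed item universe and the naive candidate enumeration are simplifications for illustration, not the method of [10]):

```python
from itertools import combinations

ITEMS = [1, 2, 3, 4, 5]  # assumed item universe for this toy example

def candidates(freq, k):
    """Apriori-style Ck over ITEMS: the k-itemsets all of whose (k-1)-subsets
    are frequent. Every 1-itemset is a candidate by definition."""
    if k == 1:
        return {frozenset([i]) for i in ITEMS}
    return {frozenset(c) for c in combinations(ITEMS, k)
            if all(frozenset(s) in freq for s in combinations(c, k - 1))}

def negative_border(freq):
    """NBd(F): candidate itemsets at every level that are not in F,
    i.e. the union over k of Ck - Fk."""
    kmax = max((len(f) for f in freq), default=0)
    nbd = set()
    for k in range(1, kmax + 2):
        nbd |= candidates(freq, k) - freq
    return nbd

freq = {frozenset([1]), frozenset([2]), frozenset([3]), frozenset([1, 2])}
print(sorted(sorted(s) for s in negative_border(freq)))
# the itemsets {1, 3}, {2, 3}, {4} and {5}
```

Here {1, 3} and {2, 3} sit on the border because both of their 1-subsets are frequent while they themselves are not; these are exactly the itemsets whose support must be verified against the full database.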
Chapter 3
Apriori Algorithm for Association Rule Mining
This chapter presents a performance analysis of the apriori algorithm [5] for association
rule mining in the context of the partitioning approach. The K-way join approach [2] is
used for support counting. The algorithm executes in two phases. In the first phase the
database (or dataset) is partitioned into a given number of partitions, and the local
frequent itemsets of each partition are generated using the minimum support count for
that partition. All the local frequent itemsets are then combined into the following two
sets:
• Global frequent itemsets
• Global candidate itemsets
In the second phase the support of the global candidate itemsets is counted over the
entire database. Itemsets meeting the minimum support are frequent in the entire
database and are therefore added to the set of global frequent itemsets. The algorithm
scans the database multiple times; the TIDLIST data structure is not used for support
counting here.
The experiments were done on an Oracle 10g RDBMS installed on Microsoft Windows
XP with 1 GB of RAM and a 2.40 GHz processor. Each experiment was performed
several times and the best result was taken.
3.1 Datasets for Experiments
Datasets are needed for the experiments. Some synthetic and real-life datasets [21, 22]
were collected from the internet. Synthetic datasets are generated with a synthetic
dataset generation utility or program; real-life datasets are real transactions on retail
items, collected over the years for analysis. These datasets were stored in flat files and
had to be transferred into database tables. To load the datasets into the Oracle database,
the SQL*Loader utility [15] provided by Oracle RDBMS was used. It uses the
functionality provided by the DBMS and saves the unnecessary effort that would
otherwise be spent writing a program to load the data. After the data had been loaded
into the database, it was converted into the format required by the algorithms.
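The thesis does not reproduce its loader configuration. As a rough sketch, a SQL*Loader control file for a two-column (tid, item) table might look like the following, where the file and table names are purely hypothetical:

```
-- load.ctl (hypothetical names throughout)
LOAD DATA
INFILE 'bmswebview1.dat'
APPEND
INTO TABLE bmswebview1
FIELDS TERMINATED BY ','
(tid  INTEGER EXTERNAL,
 item INTEGER EXTERNAL)
```

Such a file would be passed to the utility with something like `sqlldr userid=<user>/<password> control=load.ctl`.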
The details of the datasets used in the thesis for the experiments are given in Table 3.1.

Name         Size   Total transactions  Total items  Average items per transaction
BMSWEBVIEW1  3 MB   59602               497          5
T10I4D100K   16 MB  100000              870          10
MUSHROOM     3 MB   8124                119          23

Table 3.1: Details of Datasets
3.2 Performance analysis

Table 3.2 shows the total number of records, the number of distinct transactions, and the
number of distinct items contained in the partitions of the BMSWEBVIEW1 dataset.

             Total Rows  Distinct Transactions  Total Items
Partition 1  74634       29718                  486
Partition 2  75005       29884                  486

Table 3.2: BMSWEBVIEW1 dataset (2 partitions)
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.1: Performance of apriori for BMSWEBVIEW1 dataset (2 partitions)

Figure 3.1 shows the performance of the apriori algorithm for the BMSWEBVIEW1
dataset with two partitions. It is evident from the figure that as the support value
increases, the time taken by the algorithm decreases.
Table 3.3 and Table 3.4 show the total number of input records, the number of distinct
transactions, and the number of distinct items contained in the partitions of the
BMSWEBVIEW1 dataset for 3 and 4 partitions respectively.

             Total Rows  Distinct Transactions  Total Items
Partition-1  37067       15025                  473
Partition-2  75005       29884                  486
Partition-3  37567       14693                  464

Table 3.3: BMSWEBVIEW1 dataset (3 partitions)

             Total Rows  Distinct Transactions  Total Items
Partition-1  37067       15025                  473
Partition-2  37182       14979                  466
Partition-3  37567       14693                  464
Partition-4  37823       14905                  467

Table 3.4: BMSWEBVIEW1 dataset (4 partitions)
Figure 3.2 and Figure 3.3 show the performance of the apriori algorithm for 3 and 4
partitions of the BMSWEBVIEW1 dataset respectively for different support values. For
the 0.15% support value, the time taken by the algorithm increases as the number of
partitions increases. As the number of partitions increases from 3 to 4, the time taken
by the algorithm also increases, but the increase is not large.
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.2: Performance of apriori for BMSWEBVIEW1 dataset (3 partitions)
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.3: Performance of apriori for BMSWEBVIEW1 dataset (4 partitions)
Table 3.5 and Table 3.6 show the total number of input records, the number of distinct
transactions, and the number of distinct items contained in the partitions of the
T10I4D100K dataset for 2 and 4 partitions respectively.

             Total Rows  Distinct Transactions  Total Items
Partition 1  503769      50154                  870
Partition 2  504863      49846                  869

Table 3.5: T10I4D100K dataset (2 partitions)

             Total Rows  Distinct Transactions  Total Items
Partition-1  252380      24945                  869
Partition-2  252387      24956                  869
Partition-3  254040      25209                  869
Partition-4  251307      24890                  868

Table 3.6: T10I4D100K dataset (4 partitions)
Figure 3.4 and Figure 3.5 show the performance of the apriori algorithm for 2 and 4
partitions of the T10I4D100K dataset respectively for different support values. For 2
partitions the time taken for frequent itemset generation is greater than the time taken
for 4 partitions at every support value. As the support increases from 0.45% to 0.75%,
the time taken by the algorithm decreases, but the difference between the times for the
0.45% and 0.60% support values is smaller than the difference between the times for
the 0.60% and 0.75% support values.
[Chart omitted: time in seconds vs. minimum support (0.45% to 0.75%).]
Figure 3.4: Performance of apriori for T10I4D100K dataset (2 partitions)
[Chart omitted: time in seconds vs. minimum support (0.45% to 0.75%).]
Figure 3.5: Performance of apriori for T10I4D100K dataset (4 partitions)
Table 3.7 shows the total number of input records, the number of distinct transactions,
and the number of distinct items contained in the partitions of the MUSHROOM dataset
for 2 partitions.

             Total Rows  Distinct Transactions  Total Items
Partition 1  93242       4054                   119
Partition 2  93610       4070                   119

Table 3.7: MUSHROOM dataset (2 partitions)
Figure 3.6 depicts the performance of the apriori algorithm for the MUSHROOM
dataset with 2 partitions for different support values. As the support increases from
0.15% to 1.0%, the time taken by the algorithm decreases, except at the 0.60% support
value, but the decrease in time is not significant compared to the decrease in support.
Table 3.8 shows the candidate and frequent itemsets generated for the first partition of
the MUSHROOM dataset. Figure 3.6 shows the time only up to the frequent 3-itemsets;
the candidate 4-itemsets were not generated even after running the algorithm for two
additional hours after the generation of the frequent 3-itemsets. In later passes larger
collections of candidate and frequent itemsets are generated, which require more time
for support counting than the earlier passes. For example, for the 2.0% support value
the size of C4 is more than 5 times the size of C3 (C3 = 25113, C4 = 127227);
generating C4 from F3 and counting the support of C4 to generate F4 is more time
consuming than the previous pass.
[Chart omitted: time in seconds vs. minimum support (0.15% to 1.00%).]
Figure 3.6: Performance of apriori for MUSHROOM dataset (2 partitions)
Support  F1   C2    F2    C3     F3     C4      F4
0.15%    115  6555  3255  49907  44244
0.30%    111  6105  3023  45400  39846
0.45%    104  5356  2810  41832  36324
1.0%     94   4371  2413  34091  28475
2.0%     89   3916  2008  25113  20487  127227  119150

Table 3.8: Itemsets in MUSHROOM dataset (1st partition)
The main reason for the poor performance on the MUSHROOM dataset is the average
number of items per transaction. As the average number of items per transaction
increases, the number of frequent itemsets generated in each pass also increases, and
hence support counting requires more time.
Chapter 4 Partitioned Algorithm for Association Rule
Mining
In this chapter we present a performance analysis of the partition algorithm [8]. The
partition algorithm finds all frequent itemsets in just two scans over the database. In the
first scan it partitions the database into a given number of partitions and finds all the
local frequent itemsets. It then merges all the local frequent itemsets to form the global
candidate itemsets. In the second scan over the database it finds the support of the
global candidate itemsets in the entire database and outputs the global frequent itemsets.
The experiments were done on an Oracle 10g RDBMS installed on Microsoft Windows
XP with 1 GB of RAM and a 2.40 GHz processor. Each experiment was performed
several times and the best result was taken.
4.1 Performance analysis of Partition Algorithm
For support counting the partition algorithm builds a special data structure called a
tidlist, created as a CLOB (character large object). Table 2.3 and Table 2.4 show
examples of tidlists for 1-itemsets and 2-itemsets respectively. Figure 4.1 shows the
tidlist creation time for the different datasets.
Figure 4.2 compares the time taken by the partition algorithm for the BMSWEBVIEW1
dataset with 2 partitions for different support values on Oracle RDBMS. The time shown
includes the tidlist creation time plus the time taken by the partition algorithm for
frequent itemset generation. For lower support values the algorithm takes more time
because it generates too many candidate itemsets, which are then tested against the
minimum support. The algorithm makes two scans over the database.
[Chart omitted: tidlist creation time in seconds for the MUSHROOM, BMSWEBVIEW1,
and T10I4D100K datasets.]
Figure 4.1: Tidlist creation time for different datasets
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 4.2: Performance of partition for BMSWEBVIEW1 dataset (2 partitions)
Table 3.2 shows the total number of records, the number of distinct transactions, and
the number of distinct items contained in the partitions of the BMSWEBVIEW1 dataset.
Figure 4.3 and Figure 4.4 show the performance of the algorithm for 3 and 4 partitions
of the BMSWEBVIEW1 dataset respectively. Table 3.3 and Table 3.4 describe the
BMSWEBVIEW1 dataset with 3 and 4 partitions respectively. For 0.15% support the
time taken by the partition algorithm for 3 partitions is more than that for 2 partitions.
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 4.3: Performance of partition for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.4: Performance of partition for BMSWEBVIEW1 dataset (4 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
The partition algorithm [13] uses the Tidlist for support counting. For a given
partition, the first pass creates Tidlists for all the local frequent 1-itemsets.
Candidate 2-itemsets are then counted by intersecting the Tidlists of the two
corresponding frequent 1-itemsets. Intersecting two Tidlists is a very time-consuming
process, and the overall performance of the partition algorithm depends mainly on the
second pass: because the set of candidate 2-itemsets C2 is very large, the time taken
by support counting is very high. Table 4.1 shows the candidate 2-itemsets generated
for BMSWEBVIEW1 and T10I4D100K with 4 partitions and 0.45% minimum support.
              Partition-1   Partition-2   Partition-3   Partition-4
T10I4D100K       180300        178503        172578        182106
BMSWEBVIEW1       13530         13203         13695         13530
Table 4.1: Candidate 2-itemsets (C2) for 0.45% support for 4 partitions
The MUSHROOM dataset is about 3 MB in size, roughly equal to the BMSWEBVIEW1
dataset, yet the partition algorithm takes more time on MUSHROOM. The reason is that
the average number of items per transaction in MUSHROOM is much higher than in
BMSWEBVIEW1, so the Tidlist intersections for support counting in pass two
(generation of F2) are far more expensive. The MUSHROOM results are not shown in the
thesis because they are unacceptable for any number of partitions.
Table 3.6 shows the total number of records, distinct transactions and distinct
items contained in the partitions of the T10I4D100K dataset for 4 partitions.
Figure 4.5 shows the performance of the partition algorithm on T10I4D100K with four
partitions. It is obvious from the figure that the time taken is very high and
unacceptable. The very poor performance on T10I4D100K is due to the fact that the
candidate 2-itemsets generated are very numerous and their support counting takes
too much time.
The partition algorithm does not scale well on relational database systems. It is
mainly a main-memory algorithm and works fine for main-memory databases.
Figure 4.5: Performance of partition for T10I4D100K dataset (4 partitions); time in seconds vs. minimum support (0.30% to 0.60%)
4.2 Partition algorithm with second optimization (SPO)
In this section we discuss the partition algorithm combined with the second pass
optimization of the K-Way join approach.
4.2.1 Second pass optimization of K-Way Join approach for support counting [2, 13]
As we have seen earlier, the size of C2 is very large, so second pass support
counting is the most time-consuming of all the passes. The process of generating C2
from F1, counting support for C2 and then producing F2 can be replaced by generating
F2 directly in a single read of the database.
This is shown below:
INSERT INTO F2
SELECT t1.item, t2.item, COUNT(*)
FROM I_Table t1, I_Table t2
WHERE t1.tid = t2.tid AND t1.item < t2.item
GROUP BY t1.item, t2.item
HAVING COUNT(*) > minsup;
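The self-join above can be tried end to end on a toy item table. This is a sketch using SQLite rather than Oracle; the `I_Table` rows and the `minsup` value are invented for illustration, but the query is the one shown above.

```python
import sqlite3

# Hypothetical (tid, item) rows mirroring I_Table; item 1 and item 2
# co-occur in all three transactions, item 3 appears only in tid 1.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 2)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE I_Table (tid INTEGER, item INTEGER)")
conn.executemany("INSERT INTO I_Table VALUES (?, ?)", rows)

minsup = 2  # absolute minimum support threshold, chosen for this toy example
f2 = conn.execute(
    """
    SELECT t1.item, t2.item, COUNT(*)
    FROM I_Table t1, I_Table t2
    WHERE t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) > ?
    """,
    (minsup,),
).fetchall()
print(f2)  # only the pair (1, 2), with count 3, exceeds minsup
```

The `t1.item < t2.item` condition generates each unordered pair exactly once, so F2 is produced directly without materialising C2.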
4.2.2 The Approach
The partition algorithm is implemented in combination with the second pass
optimization of the K-Way join approach. In this approach the database is again
scanned twice: first when the Tidlist is generated for each partition, and second
when F2 is generated by the second pass optimization in each partition. The rest of
the processing per partition remains the same. All the local frequent itemsets are
combined to generate the global candidate itemsets. In the second phase the database
is not scanned for the final support counting; instead, the counts generated in the
first phase along with the Tidlists are used. The approach gives better results than
the plain partition algorithm because the second pass support counting (F2
generation) in each partition no longer intersects Tidlists, which is the most
time-consuming step of the entire support counting process.
The global frequent 3-itemsets generation SQL script for two partitions is shown below:
INSERT INTO Global_F3
SELECT item1, item2, item3, SUM(count)
FROM (
    SELECT item1, item2, item3, count FROM tidt_c3
    UNION ALL
    SELECT item1, item2, item3, count FROM tidt_cc3
)
GROUP BY item1, item2, item3
HAVING SUM(count) >= 179;
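The merge the script performs (UNION ALL of per-partition counts, SUM per itemset, filter by the global minimum support) can be sketched in plain Python. The per-partition counts and the threshold of 150 below are invented for illustration; the thesis's script uses an absolute threshold of 179.

```python
from collections import Counter

# Hypothetical per-partition counts of candidate 3-itemsets
# (analogues of the tidt_c3 / tidt_cc3 tables in the script above).
partition1 = {(1, 2, 3): 100, (1, 2, 4): 60}
partition2 = {(1, 2, 3): 90, (1, 2, 4): 10}
global_minsup = 150  # illustrative absolute global minimum support

totals = Counter()
for part in (partition1, partition2):
    totals.update(part)  # plays the role of UNION ALL + SUM(count)

# HAVING SUM(count) >= global_minsup
global_f3 = {iset: c for iset, c in totals.items() if c >= global_minsup}
print(global_f3)  # only (1, 2, 3) reaches 190 >= 150
```

Because the counts come from the first phase, no further database scan is needed at this step, which is the point made above.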
4.3 Performance Comparisons
Figure 4.6 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 2 partitions.
Details of the partitions are given in Table 3.2.
Figure 4.6: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (2 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Figure 4.7 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 3 partitions.
Details of the partitions are given in Table 3.3.
Figure 4.7: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (3 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Figure 4.8 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 4 partitions.
Details of the partitions are given in Table 3.4.
Figure 4.8: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (4 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Table 4.2 shows the performance improvement of the partition algorithm with SPO
over the plain partition algorithm for the BMSWEBVIEW1 dataset. For every number of
partitions the improvement grows substantially as the support moves from 0.15% to
0.60%.
Support   2 Partitions        3 Partitions        4 Partitions
0.15%     Approx. 5 times     Approx. 2.8 times   Approx. 3 times
0.30%     Approx. 7 times     Approx. 6.5 times   Approx. 6.5 times
0.45%     Approx. 9.5 times   Approx. 8 times     Approx. 7.5 times
0.60%     Approx. 9 times     Approx. 9.4 times   Approx. 8 times
Table 4.2: Performance comparison
Figure 4.9: Performance comparisons of partition and partition with SPO algorithm for T10I4D100K dataset (4 partitions); time in seconds vs. minimum support (0.30% to 0.60%)
Figure 4.9 shows the performance comparison for the T10I4D100K dataset. Table 3.6
gives the total number of records, distinct transactions and distinct items in the
partitions of T10I4D100K for 4 partitions.
The T10I4D100K dataset performed very poorly under the plain partition algorithm
(Figure 4.5). With the second pass optimization, the algorithm performs
approximately 17, 30 and 45 times better than the plain partition algorithm for
support values of 0.30%, 0.45% and 0.60% respectively.
Chapter 5 Sampling Algorithm for Association Rule Mining
This chapter presents a performance analysis of the Sampling algorithm [10] for
association rule mining.
The Sampling algorithm is implemented in the context of the partition algorithm.
The algorithm first partitions the database into a number of partitions and takes
one partition as a sample. It finds all the local frequent itemsets in the sample at
a reduced minimum support. These local frequent itemsets, together with their
negative border, are then tested against the entire database for the actual minimum
support; the itemsets that qualify are frequent in the entire database. Only if the
negative border of the local frequent itemsets contains itemsets that are frequent
in the entire database does the algorithm scan the database a second time, to find
the missing frequent itemsets.
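The one-scan/two-scan decision described above reduces to a simple check once the first full scan has counted the sample's frequent itemsets and their negative border. The sketch below assumes those global counts are already available as a plain dictionary; all itemsets and numbers are illustrative.

```python
def needs_second_scan(negative_border, global_support, minsup):
    """A second full database scan is required only when some itemset in
    the negative border turns out to be frequent in the entire database."""
    return any(global_support.get(iset, 0) >= minsup
               for iset in negative_border)

# Illustrative global counts from the first scan (absolute supports).
global_support = {(1, 5): 40, (3, 5): 10}
print(needs_second_scan({(1, 5), (3, 5)}, global_support, 30))
# (1, 5) reaches support 40 >= 30, so a second scan is needed
```

When the check returns False, everything frequent in the database was already in the sample's frequent itemsets, and the algorithm finishes in a single scan.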
The experiments are run on Oracle 10g relational database management system
installed on Microsoft Windows XP with 1 GB of RAM and a 2.40 GHz processor. Each
experiment is performed several times and the best result is taken.
5.1 The Negative Border
For any pass k, the negative border [10] is the set of candidate itemsets that are
not frequent in that pass; that is, NBd(Fk) = Ck - Fk, where Ck and Fk are the set
of candidate k-itemsets and the set of frequent k-itemsets respectively.
Table 5.1, Table 5.2 and Table 5.3 show examples for the second pass: the candidate
2-itemsets C2, the negative border of the frequent 2-itemsets NBd(F2), and the
frequent 2-itemsets F2 respectively.
ITEM1  ITEM2
1      2
1      3
1      5
2      3
3      5
3      4
2      4
Table 5.1: Candidate itemset C2

ITEM1  ITEM2
1      5
3      5
3      4
2      4
Table 5.2: Negative Border NBd(F2)

ITEM1  ITEM2  SUPPORT
1      2      4
1      3      4
2      3      4
Table 5.3: Frequent itemset F2
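The relation NBd(F2) = C2 - F2 can be checked directly against these example tables; the tuples below are copied from Table 5.1 (C2) and Table 5.3 (F2), and the set difference reproduces Table 5.2.

```python
# Candidate 2-itemsets from Table 5.1 and frequent 2-itemsets from Table 5.3.
C2 = {(1, 2), (1, 3), (1, 5), (2, 3), (3, 5), (3, 4), (2, 4)}
F2 = {(1, 2), (1, 3), (2, 3)}

# Negative border: candidates that failed the minimum support in this pass.
NBd_F2 = C2 - F2
print(sorted(NBd_F2))  # matches Table 5.2: (1,5), (2,4), (3,4), (3,5)
```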
5.2 Performance analysis of the Sampling algorithm
The tables below show the different partition sizes taken as samples for the
analysis of the Sampling algorithm on the BMSWEBVIEW1 and T10I4D100K datasets. They
also give the number of distinct transactions and distinct items contained in each
sample; the last column gives the sample as a percentage of the dataset.
Sample size (in rows)  Distinct Transactions  Distinct items  Percentage of sample
 2484                    954                  388              1.70
 5063                   1885                  384              3.38
 9133                   3775                  415              6.10
19246                   7530                  451             12.86
37182                  14979                  466             24.84
74634                  29718                  486             49.88
Table 5.4: Description of different samples for BMSWEBVIEW1 dataset
Sample size (in rows) Distinct Transactions Distinct items Percentage of sample
7881 780 784 7.88
16703 1637 824 16.70
Table 5.5: Description of different samples for T10I4D100K dataset
Figure 5.1 shows the performance of the Sampling algorithm on the BMSWEBVIEW1
dataset for a sample size of 2484 records. For support values 0.15% and 0.30% the
algorithm requires a second scan over the database because the negative border of
the local frequent itemsets contains itemsets frequent in the entire database. For
support values 0.45%, 0.60% and 0.75% the algorithm completes in one scan. As the
support value increases, the time required to find the frequent itemsets decreases.
It is obvious from Figure 5.1 that for 0.15% and 0.30% support the second scan takes
more time than the local support counting plus the first scan. There are two reasons:
when the candidate itemsets for the second scan are numerous, the second scan can be
costlier than the first, and because the Tidlist data structure is used for support
counting, the Tidlist intersections are very time consuming.
Figure 5.1: Performance of Sampling algorithm for sample size 2484 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.75%), split into second scan and sample+first scan
Sample size   C2      C3      C4      C5
 2484        45451    8310   13062   12884
 5063        50086   31846  179042  178030
 9133        41905    2687     739     177
19246        45150    3504     420      30
37182        43956    1974     126       5
74634        45451    2807     281      17
Table 5.6: Candidate itemsets in different samples for BMSWEBVIEW1 dataset
Figure 5.2 shows the time taken by the algorithm for a sample size of 5063 records.
For all support values except 0.15% and 0.30% the algorithm completes in one scan.
Table 5.6 lists the candidate itemsets generated for the different samples of the
BMSWEBVIEW1 dataset at a support value of 0.15%. For the sample of 5063 records the
candidate itemsets in pass 4 and pass 5 are very numerous, so the time spent in
local support counting is also high.
Figure 5.2: Performance of Sampling algorithm for sample size 5063 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.75%), split into second scan and sample+first scan
Figure 5.3 shows the time taken by the algorithm for a sample size of 9133 records
on the BMSWEBVIEW1 dataset. The algorithm completes in just one scan for the 0.45%
and 0.60% support values.
Figure 5.3: Performance of Sampling algorithm for sample size 9133 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.4: Performance of Sampling algorithm for sample size 19246 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.4 to Figure 5.6 show the performance of the Sampling algorithm for larger
sample sizes. The performance of the Sampling algorithm depends mainly on the sample
chosen for finding frequent itemsets. If the sample is small and still contains all
the global frequent itemsets, the algorithm completes in just one scan and in little
time; if a bad sample is chosen, the performance can be even worse. It is obvious
from the figures that for higher support values the Sampling algorithm performs
better with small samples than with large ones. [8]
Figure 5.4 to Figure 5.6 also show that the time taken by the algorithm for the
different support values increases with the sample size.
Figure 5.5: Performance of Sampling algorithm for sample size 37182 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.6: Performance of Sampling algorithm for sample size 74634 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
The performance on the T10I4D100K dataset for sample sizes of 7881 and 16703
records is shown in Figure 5.7 and Figure 5.8 respectively. The second scan is very
costly for T10I4D100K, and the performance is simply unacceptable. The reason is
again the Tidlist data structure: for support counting the Tidlists are intersected,
which is very time consuming.
Figure 5.7: Performance of Sampling algorithm for sample size 7881 for T10I4D100K dataset; time in seconds vs. minimum support (0.45%, 0.60%, 1.00%), split into second scan and sample+first scan
Figure 5.8: Performance of Sampling algorithm for sample size 16703 for T10I4D100K dataset; time in seconds vs. minimum support (0.45%, 0.60%, 1.00%), split into second scan and sample+first scan
5.2.2 Errors in Frequent itemsets for BMSWEBVIEW1 dataset
The Sampling algorithm shows some errors in the frequent itemsets generated,
because the accuracy of the result depends on the sample chosen. No errors are
reported for the 0.45%, 0.60% and 0.75% support values for any sample size. For
0.30% support only the samples of 2484 and 5063 records show errors in the frequent
itemsets generated. For 0.15% support only the sample of 74634 records shows no
errors. Table 5.7 and Table 5.8 list the frequent itemsets generated by the Sampling
algorithm on the BMSWEBVIEW1 dataset for the different sample sizes at 0.15% and
0.30% support respectively. Table 5.9 gives the percentage error for the different
sample sizes at these two support values.
Sample size   F1    F2    F3    F4    F5
149639       303   715   336    70     4
  2484       295   711   336    70     0
  5063       301   714   336    70     0
  9133       303   714   336    70     4
 19246       303   715   336    70     2
 37182       303   715   336    70     3
 74634       303   715   336    70     4
Table 5.7: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.15% support
Sample size   F1    F2    F3    F4    F5
149639       225   169    39     2     0
  2484       224   169    39     2     0
  5063       225   169    37     2     0
  9133       225   169    39     2     0
 19246       225   169    39     2     0
 37182       225   169    39     2     0
 74634       225   169    39     2     0
Table 5.8: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.30% support
Sample size % error for 0.15% support % error for 0.30% support
2484 1.12 0.23
5063 0.49 0.46
9133 0.07 No error
19246 0.14 No error
37182 0.07 No error
Table 5.9: Percentage error for BMSWEBVIEW1 dataset
5.2.3 Errors in Frequent itemsets for T10I4D100K dataset
Table 5.10 and Table 5.11 list the frequent itemsets generated by the Sampling
algorithm in the different passes for support values of 0.45% and 0.60%
respectively. No errors are reported at 1.0% support for either sample size. Table
5.12 gives the percentage error in the frequent itemsets generated for the 0.45%
and 0.60% support values.
Sample size   F1    F2    F3    F4    F5
100000       596   522   174    49    11
  7881       593   522   172    49    11
 16703       596   522   171    48    11
Table 5.10: Frequent itemsets generated for T10I4D100K dataset for 0.45% support
Sample size   F1    F2    F3    F4    F5
100000       516   191    48    14     2
  7881       516   191    43    12     1
 16703       516   191    44    11     1
Table 5.11: Frequent itemsets generated for T10I4D100K dataset for 0.60% support
Sample size   % error for 0.45% support   % error for 0.60% support
 7881         0.37                        1.04
16703         0.30                        1.04
Table 5.12: Percentage error for T10I4D100K dataset
Chapter 6 Conclusion and Future Work
6.1 Conclusion
In this thesis we have discussed three approaches (Apriori, partition and sampling)
to association rule mining in the context of database partitioning. We have covered
frequent itemset generation only, not the rule generation part of association
analysis; rule generation is very simple compared to frequent itemset mining and
requires much less time. Extensive experiments have been performed to test the
performance of these approaches over two real datasets and one synthetically
generated dataset.
For the Apriori algorithm the K-Way join method is used for support counting.
Apriori gives good results for the BMSWEBVIEW1 dataset, and satisfactory results
for T10I4D100K. For the MUSHROOM dataset, Apriori was not able to generate all the
frequent itemsets even after running for a long time: for supports below 2.0% it
generates up to the frequent 3-itemsets, and for 2.0% support up to the frequent
4-itemsets. The performance of the algorithm depends not only on the size of the
dataset but also on the average number of items per transaction. As the average
number of items per transaction increases, the number of frequent itemsets generated
in each pass also increases, so support counting requires more time.
The partitioning algorithm uses a data structure called the Tidlist for support
counting. The Tidlist is suitable for mining in-memory databases, but not for
RDBMS-based mining. Since the second pass is the most costly in terms of support
counting, it can be optimized to gain good performance: the partition algorithm
performs much better when the K-Way join second pass optimization is used in
conjunction with the Tidlist for support counting.
The sampling approach shows some minute errors in the frequent itemsets generated.
The error depends on the quality of the sample chosen for analysis: if the sample
contains all the items in the dataset, the errors in the frequent itemsets are very
small. The algorithm shows less than 1.12% error on the BMSWEBVIEW1 dataset over
all the samples considered, and less than 1.04% error on T10I4D100K.
6.2 Future Work
Some possible future enhancements of this work are listed below:
• The work presented in the thesis can be extended for multi-level association rule
mining.
• The work can be enhanced to generate multi-dimensional association rules.
• A tool for generating association rules can be developed. This tool can choose the
approach for frequent itemsets mining according to the properties of the dataset to
be mined.
References
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2001: Morgan
    Kaufmann Publishers.
[2] P. Mishra and S. Chakravarthy. Performance Evaluation and Analysis of SQL Based
    Approaches for Association Rule Mining. In BNCOD Proc. 2003.
[3] S. Thomas. Architectures and Optimizations for Integrating Data Mining
    Algorithms with Database Systems, in CSE. 1998, University of Florida:
    Gainesville.
[4] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of
    Items in Large Databases. In ACM SIGMOD International Conference on the
    Management of Data. 1993. Washington, D.C.
[5] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules.
    In 20th International Conference on Very Large Databases (VLDB). 1994.
[6] J. Han, J. Pei and Y. Yin. Mining Frequent Patterns without Candidate
    Generation. In ACM SIGMOD International Conference on Management of Data. 2000.
    Dallas.
[7] P. Shenoy et al. Turbo-Charging Vertical Mining of Large Databases. In ACM
    SIGMOD International Conference on Management of Data. 2000. Dallas.
[8] A. Sarasere, E. Omiecinsky, and S. Navathe. An Efficient Algorithm for Mining
    Association Rules in Large Databases. In 21st International Conference on Very
    Large Databases (VLDB). 1995. Zurich, Switzerland.
[9] S. Thomas et al. An Efficient Algorithm for the Incremental Updation of
    Association Rules in Large Databases. In Knowledge Discovery and Data Mining.
    1997.
[10] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of
    22nd International Conference on Very Large Databases (VLDB), 1996.
[11] J. Han et al. DMQL: A Data Mining Query Language for Relational Database. In
    ACM SIGMOD workshop on research issues on data mining and knowledge discovery.
    1996. Montreal.
[12] S. Sarawagi, S. Thomas and R. Agrawal. Integrating Association Rule Mining
    with Relational Database System: Alternatives and Implications. In ACM SIGMOD
    International Conference on Management of Data. 1998. Seattle, Washington.
[13] H. V. Kona and S. Chakravarthy. Association Rule Mining over Multiple
    Databases: Partitioned and Incremental approaches, 2003.
[14] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient Algorithms for
    Discovering Association Rules. In AAAI Workshop on Knowledge Discovery in
    Databases (KDD-94), 1994.
[15] K. Loney. Oracle Database 10g: The Complete Reference. Osborne ORACLE Press
    Series.
[16] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of Sampling
    for Data Mining of Association Rules. Technical Report TR 617, University of
    Rochester, Computer Science Department, 1996.
[17] J. L. Lin and M. H. Dunham. Mining Association Rules: Anti-skew algorithms.
    In 14th International Conference on Data Engineering, February 1998.
[18] M. Dudgikar. A Layered Optimizer for Mining Association Rules over RDBMS. In
    CSE Department. 2000, University of Florida: Gainesville.
[19] Oracle Database Application Developer's Guide - Fundamentals 10g Release 2.
    http://download-uk.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_dynamic_sql.htm
[20] Oracle Database PL/SQL User's Guide and Reference 10g Release 2.
    http://download-west.oracle.com/docs/cd/B19306_01/appdev.102/b14261.pdf
[21] Frequent Itemset Mining Dataset Repository: http://fimi.cs.helsinki.fi/data/
[22] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000
    organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86-98, 2000.
    http://www.ecn.purdue.edu/KDDCUP.