Association Rule Mining In Partitioned Databases:
Performance Evaluation and Analysis
A DISSERTATION
Submitted in partial fulfillment of the requirements for the award of the degree of
MASTER OF TECHNOLOGY
in
INFORMATION TECHNOLOGY (Specialization: SOFTWARE ENGINEERING)
By
Pankaj Kandpal
Under the Guidance of:
Prof. M. Radhakrishna Mr. Manish Kumar
IIIT-Allahabad
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, ALLAHABAD
(A Centre of Excellence in Information Technology Established by Govt. of India)
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(Deemed University)
(A centre of excellence in IT, established by Ministry of HRD, Govt. of India)
Date:
We do hereby recommend that the thesis work prepared under our supervision
by Pankaj Kandpal entitled Association Rule Mining in Partitioned
Databases: Performance Evaluation and Analysis be accepted in partial
fulfillment of the requirements of the degree of Master of Technology in
Information Technology (Software Engineering) for examination.
Countersigned:
Dr. U. S. Tiwary (Dean, Academics)

Thesis Advisers:
Prof. M. Radhakrishna
Mr. Manish Kumar
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(A University Established under sec. 3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 dated 04.08.2000 of the Govt. of India)
(A Centre of Excellence in Information Technology Established by Govt. of India)
CERTIFICATE OF APPROVAL*
The foregoing thesis is hereby approved as a creditable study in the area of
information technology carried out and presented in a manner satisfactory to
warrant its acceptance as a pre-requisite to the degree for which it has been
submitted. It is understood that by this approval the undersigned do not
necessarily endorse or approve any statement made, opinion expressed or
conclusion drawn therein but approve the thesis only for the purpose for
which it is submitted.
COMMITTEE ON FINAL EXAMINATION FOR EVALUATION OF THE THESIS
* Only in case the recommendation is concurred in
Candidate Declaration

This is to certify that the report entitled “Association Rule Mining in
Partitioned Databases: Performance Evaluation and Analysis” which is
submitted by me in partial fulfillment of the requirement for the completion
of M.Tech in Information Technology (with specialization in Software
Engineering) to Indian Institute of Information Technology, Allahabad
comprises only my original work and due acknowledgement has been made in
the text to all other material used.
PANKAJ KANDPAL
M.Tech (INFORMATION TECHNOLOGY)
SPECIALISATION IN SOFTWARE ENGINEERING
MS200512
To My Family and Friends
Acknowledgements
First and foremost, I would like to express my sincere thanks to my thesis advisors, Prof.
M. Radhakrishna and Mr. Manish Kumar, for their precious advice and suggestions. This
thesis would not have been a success without their cooperation and valuable comments.
Next, I would like to express my deep gratitude to my family: my father Mr. Bhuwan
Chandra Kandpal, my mother Smt. Kamla Kandpal and my younger brother Mr.
Devesh Kandpal, for their unconditional love and support in every part of my life.
Without their support I would never have dreamt of pursuing higher studies.
I would like to thank the INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,
ALLAHABAD for providing me such a great opportunity to become a part of its
family. It has been a privilege for me to pursue an M.Tech. in Software Engineering at
this institute.
I would like to thank Blue Martini Software for contributing the KDD Cup 2000 data,
without which the experiments would not have been possible.
My special thanks go to Mr. Balwant Singh for providing me software and hardware
support.
Some of my friends deserve special mention. They are Mr. Nilesh Shukla, Mr. Kamal
Sawan, Mr. Abhay S. Pawne, Mr. Vineet Kumar, Mr. Prabhat Saheja, Mr. Imran
Khan, Mr. Anand Atre, Mr. Adish Singh, Mr. S. K. Mada, Mr. Kamal Singh, Mr.
D. N. Lan and Ms. Mallika Srivastav.
For two full years, several souls of IIITA's M.Tech 2005 batch suffered the burden of
my company. Hearty thanks go to these fellows, who in spite of that maintained a lively
and jovial work environment. These jolly persons are Mr. Parikshit Totawar, Mr.
Niladree Biswas, Mr. Dhirendra Pratap Singh, Mr. Rama Rao, Mr. Dora Babu, Mr.
Ravi Kiran, Mr. Anil Pandey and Mr. Prateek Dayal.
Lastly, I would like to thank everyone who contributed to this thesis, directly or
indirectly.
Pankaj Kandpal
July 2007
Abstract
Association Rule Mining in Partitioned Databases: Performance Evaluation and Analysis
Pankaj Kandpal, M.Tech (Software Engineering), Indian Institute of Information Technology, Allahabad
July 2007
Data mining is the process of extracting useful information from the huge amounts
of data stored in databases. Data mining tools and techniques help to predict business
trends that may occur in the near future. Association rule mining is an important technique
for discovering hidden relationships among the items in transactions.
The goal of this thesis is to experimentally evaluate association rule mining
approaches in the context of horizontal database partitioning. The algorithms are
implemented using SQL and PLSQL stored procedures, and Oracle 10g RDBMS is used
as the database for the experimental evaluation. The Apriori, partitioning and sampling
algorithms have been implemented and their performance is evaluated extensively.
The Apriori algorithm is implemented using the K-way join approach for support counting.
The partitioning approach is implemented both in the traditional manner (using TID lists
for support counting) and in combination with the second pass optimization of the K-way
join. For the sampling algorithm, the dataset is first partitioned into a given number of
partitions and the algorithm is then applied, treating one partition as the sample.
Table of Contents

Candidate Declaration
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 - Introduction
1.1 Data Mining Functionalities
1.1.1 Association Analysis
1.1.2 Clustering Analysis
1.1.3 Classification Analysis
1.1.4 Deviation Analysis
1.2 Architectures - Integrating Mining with DBMS
1.2.1 Loose Coupling (or Cache Mining)
1.2.2 Stored Procedures and User Defined Functions
1.2.3 SQL Based Approach
1.2.4 Integrated Approach
1.3 Database Partitioning and PLSQL
1.3.1 Database Partitioning
1.3.2 An Introduction to PLSQL
1.3.2.1 PLSQL Stored Procedures and Dynamic SQL
1.3.2.2 Why Use Dynamic SQL?
1.4 Focus of the Thesis
1.5 Thesis Organization
Chapter 2 - Association Analysis
2.1 Background
2.2 Association Rule Mining Algorithms
2.2.1 Terminology and Concepts
2.2.2 Example of Association Rules
2.2.3 Classification of Association Rules
2.2.4 Apriori Algorithm
2.2.5 Partitioning Algorithm
2.2.6 Sampling Algorithm
Chapter 3 - Apriori Algorithm for Association Rule Mining
3.1 Datasets for Experiments
3.2 Performance Analysis
Chapter 4 - Partitioned Algorithm for Association Rule Mining
4.1 Performance Analysis of Partition Algorithm
4.2 Partition Algorithm with Second Pass Optimization (SPO)
4.2.1 Second Pass Optimization of K-Way Join Approach for Support Counting
4.2.2 The Approach
4.3 Performance Comparisons
Chapter 5 - Sampling Algorithm for Association Rule Mining
5.1 The Negative Border
5.2 Performance Analysis of the Sampling Algorithm
5.2.2 Errors in Frequent Itemsets for BMSWEBVIEW1 Dataset
5.2.3 Errors in Frequent Itemsets for T10I4D100K Dataset
Chapter 6 - Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
List of Figures

Figure 1.1: Different architectures for integrating mining within DBMS
Figure 2.1: Presentation of candidate and frequent itemsets in the database
Figure 2.2: Partitioning approach for frequent itemsets mining
Figure 3.1: Performance of Apriori for BMSWEBVIEW1 dataset (2 partitions)
Figure 3.2: Performance of Apriori for BMSWEBVIEW1 dataset (3 partitions)
Figure 3.3: Performance of Apriori for BMSWEBVIEW1 dataset (4 partitions)
Figure 3.4: Performance of Apriori for T10I4D100K dataset (2 partitions)
Figure 3.5: Performance of Apriori for T10I4D100K dataset (4 partitions)
Figure 3.6: Performance of Apriori for MUSHROOM dataset (2 partitions)
Figure 4.1: Tidlist creation time for different datasets
Figure 4.2: Performance of partition for BMSWEBVIEW1 dataset (2 partitions)
Figure 4.3: Performance of partition for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.4: Performance of partition for BMSWEBVIEW1 dataset (4 partitions)
Figure 4.5: Performance of partition for T10I4D100K dataset (4 partitions)
Figure 4.6: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (2 partitions)
Figure 4.7: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.8: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (4 partitions)
Figure 4.9: Performance comparisons of partition and partition with SPO algorithm for T10I4D100K dataset (4 partitions)
Figure 5.1: Performance of sampling algorithm for sample size 2484 for BMSWEBVIEW1 dataset
Figure 5.2: Performance of sampling algorithm for sample size 5063 for BMSWEBVIEW1 dataset
Figure 5.3: Performance of sampling algorithm for sample size 9133 for BMSWEBVIEW1 dataset
Figure 5.4: Performance of sampling algorithm for sample size 19246 for BMSWEBVIEW1 dataset
Figure 5.5: Performance of sampling algorithm for sample size 37182 for BMSWEBVIEW1 dataset
Figure 5.6: Performance of sampling algorithm for sample size 74634 for BMSWEBVIEW1 dataset
Figure 5.7: Performance of sampling algorithm for sample size 7881 for T10I4D100K dataset
Figure 5.8: Performance of sampling algorithm for sample size 16703 for T10I4D100K dataset
List of Tables

Table 2.1: Transaction Database D
Table 2.2: Frequent Itemsets F3
Table 2.3: Tidlists for 1-itemsets
Table 2.4: Tidlists for 2-itemsets
Table 3.1: Details of Datasets
Table 3.2: BMSWEBVIEW1 dataset (2 partitions)
Table 3.3: BMSWEBVIEW1 dataset (3 partitions)
Table 3.4: BMSWEBVIEW1 dataset (4 partitions)
Table 3.5: T10I4D100K dataset (2 partitions)
Table 3.6: T10I4D100K dataset (4 partitions)
Table 3.7: MUSHROOM dataset (2 partitions)
Table 3.8: Itemsets in MUSHROOM dataset (1st partition)
Table 4.1: Candidate 2-itemsets (C2) for 0.45% support for 4 partitions
Table 4.2: Performance comparison
Table 5.1: Candidate itemset C2
Table 5.2: Negative Border NBd(F2)
Table 5.3: Frequent itemset F2
Table 5.4: Description of different samples for BMSWEBVIEW1 dataset
Table 5.5: Description of different samples for T10I4D100K dataset
Table 5.6: Candidate itemsets in different samples for BMSWEBVIEW1 dataset
Table 5.7: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.15% support
Table 5.8: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.30% support
Table 5.9: Percentage error for BMSWEBVIEW1 dataset
Table 5.10: Frequent itemsets generated for T10I4D100K dataset for 0.45% support
Table 5.11: Frequent itemsets generated for T10I4D100K dataset for 0.60% support
Table 5.12: Percentage error for T10I4D100K dataset
Chapter 1 Introduction
There are two main reasons that data mining has attracted a great deal of attention
in recent years. First, our capability to collect and store huge amounts of data is
increasing rapidly. Due to the decreasing cost of storage devices and the increasing
processing power of computers, it is now possible to store and process huge amounts of
organizational data. The second, and more important, reason is the need to turn such
data into useful information and knowledge. The knowledge acquired through data
mining can be applied in areas such as business management, retail and market analysis,
engineering design and scientific exploration [1].
Data mining, or knowledge discovery in databases (KDD), is the process of discovering
previously unknown patterns from the huge amounts of data stored in flat files, databases,
data warehouses or any other type of information repository. Database mining deals with
the data stored in database management systems (e.g. Oracle).
Being data rich does not necessarily make us information rich, because useful
information is often hidden in the data. Data mining tools and techniques are used to
generate information from the data that we have stored in our repositories over the years.
To gain a market advantage over competitors, decision makers and managers need
to mine the knowledge hidden in the data collected over the years and use that
information effectively.
1.1 Data mining Functionalities
The process of mining is often driven by the requirements of the users. The user may
be a business analyst or a marketing manager. Different users have different information
needs, and depending on these requirements we can use different data mining
techniques. The different types of data mining functionalities and the patterns they
discover are described below.

1.1.1 Association Analysis [1]
Association rule mining is a data mining technique used to find interesting
patterns or associations among the data items stored in a database. Support
and confidence are two measures of the interestingness of the mined patterns; they are
user-supplied parameters and differ from user to user. Association rule mining is mainly
used in market basket analysis or retail data analysis. In market basket analysis we
identify the buying habits of customers and analyze them to find associations
among the items they purchase. Items that are frequently purchased together by
customers can be identified. Association analysis helps retailers plan marketing,
item placement and inventory management strategies.
When we do association rule mining in relational database management systems we
generally transform the database into (tid, item) format, where tid stands for transaction
ID and item stands for an item purchased by a customer. There will be multiple
entries for a given transaction ID, because one transaction ID represents the purchase
of one particular customer, and a customer can purchase as many items as they want. An
association rule can look like this:
buys(X, Computer) => buys(X, Windows OS CD) [support = 1%, confidence = 50%]
Where:
Support = (the number of transactions that contain Computer and Windows OS CD) / (the total number of transactions)
Confidence = (the number of transactions that contain Computer and Windows OS CD) / (the number of transactions that contain Computer)
The above rule holds if its support and confidence are equal to or greater than the user-specified minimum support and confidence.
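As an illustration, support counts can be computed directly in SQL over the (tid, item) format. The sketch below assumes a hypothetical sales(tid, item) table and assumed item codes 10 for Computer and 20 for Windows OS CD; it mirrors the join-based style of support counting used later in the thesis, but it is not the thesis implementation.

```sql
-- Hypothetical table sales(tid, item); item codes are assumptions:
-- 10 = Computer, 20 = Windows OS CD.
-- Support count of the 2-itemset {Computer, Windows OS CD}:
-- joining the table with itself on tid keeps only transactions
-- that contain both items.
SELECT COUNT(*) AS support_count
FROM sales t1, sales t2
WHERE t1.tid  = t2.tid
  AND t1.item = 10
  AND t2.item = 20;
```

Dividing this count by SELECT COUNT(DISTINCT tid) FROM sales gives the support as a fraction of all transactions.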
1.1.2 Clustering Analysis [1]

In clustering we group data items in such a way that the items in a cluster are
more similar to one another and items in different clusters are more dissimilar. These
data items are sometimes called data points. The goal of clustering is to maximize the
intra-class similarity and minimize the inter-class similarity. The main clustering methods
are: partitioning methods, hierarchical methods, density-based methods, grid-based
methods and model-based methods. With the help of clustering, for example, we
can plan a marketing strategy by dividing market areas into different zones according
to climate or customer behavior, so that each group is targeted differently.
1.1.3 Classification Analysis [1]

In classification, we analyze training data to develop a model which
is then used to predict the class of objects whose class label is not known. The model is
trained so that it can distinguish different data classes. The training data contains data
objects whose class labels are known in advance. There are various representation methods
for the derived model, such as IF-THEN rules, decision trees, neural networks and
mathematical formulas.
The major difference between classification and clustering is that classification is
supervised and clustering is unsupervised. That means in classification the class label is
known in advance, while clustering does not assume any knowledge of clusters.
1.1.4 Deviation Analysis [2]

Deviations are differences between the current data and previously defined normal
values. Deviation analysis is used to detect anomalies in datasets. It is very useful for
time-related data analysis, in which we need to identify data deviations that occur over
time. Deviation analysis tools are helpful in security systems, where authorities can
be warned about deviations in resource utilization by a particular user.
1.2 Architectures - Integrating Mining with DBMS

There are various architectures [3] available for integrating the data mining process with
database management systems. These architectures are depicted in figure 1.1 and
described briefly below:
1.2.1 Loose Coupling (or Cache Mining)

This is an example of multi-tier architecture. Mining applications are integrated into the
client or into the application server, depending on the architecture; the mining kernel can
be considered the application server. Data is first fetched from the database
management system into the mining kernel and then mined according to the user's needs.
Finally, the results are sent back to the DBMS, and any intermediate results generated are
also stored back into the DBMS. In this approach the DBMS runs in a different address
space from the mining process. Cache-based mining is another type of loose coupling,
in which the data is read only once from the DBMS and cached into flat files
on the local disk for future processing.
1.2.2 Stored Procedures and User Defined Functions

Mining logic is embedded as an application on the database server. There are two ways in
which the mining application is stored on the database server side: stored procedures and
user defined functions. The mining application and the DBMS execute in the same address
space. For example, in Oracle we can create PLSQL stored procedures or Java stored
procedures for our mining algorithms, and these procedures are then stored in the
database. In IBM DB2 we can implement a mining algorithm with the help of user defined
functions.
[Figure 1.1 depicts the spectrum of integration, from loose to tight: cache-mine and loose coupling (mining as an application on the client/application server), stored procedures and user defined functions (mining as an application on the database server), the SQL based approach (mining using SQL and extensions), and the integrated approach (mining extenders/blades integrated with the SQL query engine).]
Figure 1.1: Different architectures for integrating mining within DBMS [3]
1.2.3 SQL Based Approach

Here the mining algorithm is presented in the form of SQL queries to the DBMS query
engine, where they are executed by the SQL query processor. A mining-aware
optimizer can be used to optimize these SQL queries. The DBMS provides support for
checkpointing and space management, which is very useful for such long-running
queries.
1.2.4 Integrated Approach

In the integrated approach, querying and mining are treated similarly. There is no
distinction between OLTP, OLAP and mining; the main goal is to get information from
the database in the most effective way. Here mining operators are an essential part of the
database query engine, and these mining operators or extended SQL are used for mining.
1.3 Database Partitioning and PLSQL

1.3.1 Database Partitioning [15]

A database partition is a logical division of a database or its constructs, such as tables or
indexes, into distinct independent parts. Database partitioning is done mainly for the
following reasons:
• Performance
• Manageability
• Availability
A database can be partitioned in two ways:
• Building several smaller databases
• Splitting selected elements (splitting a table into various tables)
Partitioning can be done in two manners:
• In horizontal partitioning we put different rows in different tables (row-wise
partitioning).
• In vertical partitioning we put different columns in different tables (column-wise
partitioning; normalization uses vertical partitioning).
Oracle provides various partitioning options, such as hash partitioning, list
partitioning, range partitioning and various combinations of these. We want to randomize
the data allocated to the partitions. For this purpose the hash partitioning option will be
used, in which a hash function is applied to the partition key of each row, and based on
the result the row is placed into the appropriate partition.
A hash partitioned table can be created like this:
CREATE TABLE my_table
(
  tid  NUMBER,
  item NUMBER
)
PARTITION BY HASH (tid)
PARTITIONS 4;
The above script creates a hash partitioned table with four partitions. The table is
initially empty; when data is inserted, it is allocated to the different partitions
according to the value of the hash function.
But what if the table already contains data? In that case we have to redefine the logical
structure of the table online. For this, Oracle RDBMS provides the facility of online
redefinition: the DBMS_REDEFINITION package [15] is used to partition a
table that already contains data.
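As a sketch of how online redefinition proceeds (the schema name SCOTT and the table names here are placeholders, not from the thesis), one creates an interim table with the desired partitioned layout and then invokes the DBMS_REDEFINITION procedures:

```sql
-- Interim table with the desired hash-partitioned layout.
CREATE TABLE my_table_interim
(
  tid  NUMBER,
  item NUMBER
)
PARTITION BY HASH (tid)
PARTITIONS 4;

-- Redefine the populated table MY_TABLE online;
-- 'SCOTT' is a placeholder schema name.
BEGIN
  DBMS_REDEFINITION.CAN_REDEF_TABLE('SCOTT', 'MY_TABLE');
  DBMS_REDEFINITION.START_REDEF_TABLE('SCOTT', 'MY_TABLE', 'MY_TABLE_INTERIM');
  DBMS_REDEFINITION.FINISH_REDEF_TABLE('SCOTT', 'MY_TABLE', 'MY_TABLE_INTERIM');
END;
/
```

After FINISH_REDEF_TABLE completes, the original table name refers to the partitioned layout while applications continued to access it throughout the process.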
1.3.2 An Introduction to PLSQL

PLSQL is Oracle's extension of the Structured Query Language (SQL). PLSQL can be
used to implement business rules through the creation of stored procedures, functions and
packages, to create triggers that respond to database events, and to add programming
logic to the execution of SQL commands.
1.3.2.1 PLSQL Stored Procedures and Dynamic SQL [19-20]

Stored procedures [18] are stored on the database server side and can be invoked by client
applications. Stored procedures are written by users and include SQL statements.
SQL is a declarative language that allows writing SQL statements and sending them to the
database engine for execution; procedural code cannot be executed by SQL. PL/SQL was
created to overcome this limitation.
A PLSQL stored procedure has a name; parameters can be passed to it as input, and it
can return values to the calling program. The variables it handles can have basic
data types such as characters, integers, numbers and dates, or complex data types such as
large objects (LOBs), varrays and PLSQL tables.
PLSQL is a complete block-structured programming language. PLSQL procedures,
functions and packages are stored on the server side; PLSQL procedures and functions are
collectively called PLSQL stored procedures or subprograms. All PLSQL programs are
made up of blocks, which can be nested within each other. SQL can be easily embedded
inside a PLSQL program, and PLSQL provides additional features that SQL lacks.
SQL DML statements can be directly included in PLSQL, database tables can be
manipulated easily, and after computation the results can be stored in the database directly.
SQL DDL statements can also be included inside PLSQL stored procedures with the
help of dynamic SQL, and PLSQL procedures can easily be called from client programs.
Oracle provides two ways to execute dynamic SQL: native dynamic SQL and the
DBMS_SQL package. Native dynamic SQL is easier to write and its code is compact
compared to code written with the other method [19].
Static SQL remains the same in each execution. Dynamic SQL allows us to
build SQL statements as character strings at run time. The string contains
the text of a SQL statement or PLSQL block and can also contain placeholders for bind
variables. With the help of dynamic SQL we can generalize SQL statements, because the
full text of a SQL statement need not be known at compile time. This gives us the ability
to create general-purpose, flexible applications. Dynamic SQL can be used in several
different development environments, including PLSQL, Pro*C/C++ and Java.
For example, suppose a user wants to run a complex query with a user-specified sort
order. Instead of coding the query twice with a different sort order (ORDER BY)
clause in each query, the query can be built dynamically to include the specified sort
order clause.
1.3.2.2 Why Use Dynamic SQL? [19] Static SQL and dynamic SQL both have advantages and disadvantages. The full text of
static SQL statements is known at the compilation time, which provides the following
advantages:
• Static SQL has better performance than dynamic SQL.
• If a SQL statement complies successfully it states that all the database objects
referenced in the SQL statement are valid and all the necessary privileges are in
place to access the objects.
• Static SQL has some limitations that can be overcome with dynamic SQL.
Dynamic SQL provides the following advantages over static SQL:
• Executing SQL statements whose full text is not known until the PLSQL
procedure runs.
• Executing DDL and other SQL statements that are not supported in static SQL
programs.
• Referencing database objects that do not exist at compile time.
• Optimizing execution at run time.
• Executing dynamic PLSQL blocks.
The following PLSQL block contains several examples of dynamic SQL:

DECLARE
   sql_stmt     VARCHAR2 (200);
   plsql_block  VARCHAR2 (500);
   query_str    VARCHAR2 (100);
   v_deptno     NUMBER;
BEGIN
   query_str := 'SELECT deptno FROM emp WHERE empno = :no';
   EXECUTE IMMEDIATE query_str INTO v_deptno USING 100;
   EXECUTE IMMEDIATE 'CREATE TABLE BMSWEBVIEW1 (tid NUMBER, item NUMBER)';
   EXECUTE IMMEDIATE 'ALTER SYSTEM SET CURSOR_SHARING = SIMILAR';
   plsql_block := 'BEGIN pkg_apriori.apriori (:pass_no, :min_sup); END;';
   EXECUTE IMMEDIATE plsql_block USING 4, 179;
END;
The above PLSQL block has no name; it is called an anonymous PLSQL block. An
anonymous block is not stored on the server side in the database. Because an anonymous
block has no name, it cannot be called from any other block; PLSQL functions and
procedures, however, can be called from an anonymous block.
A typical format of a PLSQL stored procedure is shown below:

CREATE OR REPLACE PROCEDURE MyProcedure (Tid IN NUMBER, Item IN NUMBER)
IS  -- AS may be used in place of IS
/*
   Declaration section: define and initialize the variables and cursors
   used in the block.
*/
BEGIN
/*
   Executable section: uses flow-control commands (such as IF statements
   and loops) to execute statements and assign values to the declared
   variables.
*/
EXCEPTION
/*
   Exception-handling section (optional): provides customized handling
   of error conditions.
*/
END;
PLSQL packages are units of encapsulation used to store related functions and
procedures together; packages in PLSQL are similar to modules in other programming
languages. A PLSQL package consists of two parts: the package specification and the
package body.
The following is an example of a PLSQL package that creates and alters a table at run
time:
CREATE OR REPLACE PACKAGE pkg_new_approach
AS
PROCEDURE table_creation
(initial_tablename VARCHAR2, new_tablename VARCHAR2);
PROCEDURE alter_table_creation
(new_tablename VARCHAR2, buffer_1 VARCHAR2);
END pkg_new_approach;
CREATE OR REPLACE PACKAGE BODY pkg_new_approach
AS
PROCEDURE table_creation
(initial_tablename VARCHAR2, new_tablename VARCHAR2)
IS
Item_1 NUMBER;
buffer_1 VARCHAR2 (50);
buffer_final VARCHAR2 (1000);
type cur_type IS ref CURSOR;
my_rec1 cur_type;
BEGIN
OPEN my_rec1 FOR 'select distinct item from ' || initial_tablename || ' order by item';
EXECUTE IMMEDIATE 'create table ' || new_tablename || '(x number)';
LOOP
FETCH my_rec1
INTO item_1;
EXIT
WHEN my_rec1 % NOTFOUND;
buffer_1 := CONCAT ('x', item_1);
alter_table_creation (new_tablename, buffer_1);
END LOOP;
CLOSE my_rec1;
END;
PROCEDURE alter_table_creation
(new_tablename VARCHAR2, buffer_1 VARCHAR2)
IS
query_str VARCHAR2 (1000);
BEGIN
query_str:= 'alter table ' || new_tablename || ' add ' || buffer_1 || ' number';
EXECUTE IMMEDIATE query_str;
END;
END pkg_new_approach;
1.4 Focus of the Thesis
In this thesis we are concerned with database mining, in which the data is stored in a
relational database management system (e.g. Oracle). An RDBMS provides various
additional benefits that are lacking in file-based mining. SQL and PLSQL stored
procedures [15, 20] are used for the implementation. For the experiments, one
synthetic and two real-life datasets [21, 22] are used.
The goal of the thesis is to evaluate the performance of association rule mining
algorithms in the context of database partitioning. The thesis focuses on the apriori,
partition, and sampling algorithms for frequent itemset mining when the data is
partitioned into a given number of segments. The apriori algorithm scans the database
multiple times to count the support of the itemsets. The partitioning approach
partitions the database for mining frequent itemsets. The sampling algorithm mines a
small sample instead of the entire database.
1.5 Thesis Organization
The structure of the rest of the thesis is as follows:
Chapter 2 presents the background of the various association rule mining approaches
developed so far. It covers in detail the association analysis and the association rule
mining algorithms discussed in the thesis.
Chapter 3 discusses the performance analysis of the apriori algorithm when it is applied
with the partitioning approach.
Chapter 4 presents the performance analysis of partitioning algorithm. It discusses the
TIDLIST approach for support counting and K-way join second pass optimization.
Chapter 5 presents the sampling approach for frequent itemsets mining.
Chapter 6 discusses the conclusion and future directions about the work done in the
thesis.
Chapter 2 Association Analysis
In this chapter a background of various association rule mining algorithms is given. The
chapter also covers in detail the association analysis and the association rule mining
algorithms discussed in the thesis.
2.1 Background

Association rule mining was first introduced in the AIS [4] algorithm and was later
refined in [5]. Since the development of the AIS algorithm, various algorithms have been
proposed to improve performance. Apriori [5] is the most basic and most popular
association rule mining algorithm, and most association rule mining algorithms are based
on it.
The apriori algorithm scans the database multiple times. The FP-tree [6] (frequent-pattern
tree) algorithm builds a special tree structure in main memory so that it can avoid
multiple scans over the database. The turbo-charging [7] algorithm improves
performance with the help of data compression techniques.
The partition algorithm [8] is based on the apriori algorithm. It first partitions the data
into a number of non-overlapping partitions and processes each partition separately to
generate the frequent itemsets local to that partition; finally it combines all the local
frequent itemsets to generate the global frequent itemsets. It reduces the number of
complete database scans to two and hence improves the performance of the mining
algorithm.
The incremental mining algorithm [9] is another useful technique for speeding up the
mining process when new data is added to the database. The sampling algorithm [1, 10]
is also based on the apriori algorithm. Rather than mining the entire database, a random
sample of data is drawn from the database, and frequent itemsets are found in that
sample instead of the entire database. Finally, the rest of the database is used to compute
the actual support of the frequent itemsets found in the sample.
Because we are searching for frequent itemsets in the sample, it is possible that we miss
some globally frequent itemsets. To lessen this risk, a support threshold lower than the
minimum support is used for the sample; in this way some degree of accuracy is traded
for efficiency. There are various mechanisms for finding the frequent itemsets that were
missed in the sample.
Most of these algorithms are in-memory algorithms, in which the data is read directly
from flat files, or is first extracted from the database into flat files and then processed in
main memory. Most of these algorithms build specialized data structures and implement
their own buffer management schemes.
Since then, few attempts have been made to build database-based mining approaches.
Various extensions to standard SQL have also been proposed; these extensions allow
the inclusion of mining operators in SQL. The data mining query language (DMQL)
[11] includes such mining operators for various types of mining tasks.
[12] shows various architectural alternatives for coupling data mining with relational
database systems. [3] compares various SQL-based approaches for association rule
mining: SQL-92 based approaches and SQL-OR based approaches. The SQL-92 based
approaches use standard SQL for mining, while the SQL-OR based approaches use the
object-relational extensions to SQL. [3] also implements the apriori algorithm in the
form of SQL queries.
[13] deals with the partitioned and incremental approaches for association rule mining;
it evaluates the basic k-way join algorithm in the context of multiple databases and
proposes two optimizations of the partitioned approach for multi-database mining.
2.2 Association Rule Mining Algorithms

2.2.1 Terminology and Concepts [1]

Let I be the set of all items in the database D. Database D contains user transactions;
each transaction T contains a set of items such that T ⊂ I. Let X and Y be sets of items.
An association rule is of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and
X ∩ Y = φ. Support and confidence are two measures of rule interestingness.
The rule X ⇒ Y holds in the database D with support s, where s is the percentage of
transactions in D that contain X ∪ Y. The rule has confidence c if c is the percentage of
transactions in D containing X that also contain Y. That is,

Support (X ⇒ Y) = P (X ∪ Y)
Confidence (X ⇒ Y) = P (Y | X)
The rules that satisfy both the user-specified minimum support and minimum confidence
are said to be strong association rules.
[1] A set of items is called an itemset; an itemset that contains k items is called a
k-itemset. The occurrence frequency of an itemset is the number of transactions that
contain the itemset; this is also known as the frequency or support count of the itemset.
An itemset satisfies minimum support if its occurrence frequency is greater than or
equal to the product of the minimum support and the total number of transactions in
the entire database. The number of transactions required for an itemset to satisfy
minimum support is referred to as the minimum support count. If an itemset satisfies
minimum support, it is called a frequent (or large) itemset.
An association rule mining algorithm is divided into two parts:
• Frequent itemset generation, i.e. finding all itemsets whose support is greater than
the user-specified minimum support.
• Generating, from the frequent itemsets found in step 1, the association rules that
satisfy the user-specified minimum confidence.
The first step is more complex and requires more effort. Once the frequent itemsets have
been generated, strong association rule generation is simple; strong association rules
satisfy both minimum support and minimum confidence.
Confidence (X ⇒ Y) = P (Y | X) = support-count (X ∪ Y) / support-count (X)

where support-count (X ∪ Y) is the total number of transactions containing the itemset
{X, Y} and support-count (X) is the total number of transactions containing the itemset {X}.
Association rules are generated as follows:
• For every frequent itemset x, generate all non-empty proper subsets of x.
• For every non-empty subset s of x, output the rule s ⇒ (x − s) if
support-count (x) / support-count (s) is greater than or equal to the minimum confidence.
Since the association rules are generated directly from frequent itemsets, each rule
automatically satisfies minimum support.
2.2.2 Example of association rules

Table 2.1 depicts an example transaction database, and Table 2.2 shows that
{1, 2, 3} and {1, 2, 5} are frequent 3-itemsets. The non-empty proper subsets of {1, 2, 3}
are {1}, {2}, {3}, {1, 2}, {1, 3} and {2, 3}. The association rules generated are:

{1, 2} ⇒ {3}   confidence = 2/4 = 50%
{1, 3} ⇒ {2}   confidence = 2/2 = 100%
{2, 3} ⇒ {1}   confidence = 2/3 = 66%
{1} ⇒ {2, 3}   confidence = 2/4 = 50%
{2} ⇒ {1, 3}   confidence = 2/6 = 33%
{3} ⇒ {1, 2}   confidence = 2/3 = 66%

If the minimum confidence is 66%, then the following rules are strong:
{1, 3} ⇒ {2}, {2, 3} ⇒ {1}, {3} ⇒ {1, 2}.

TID  ITEM
T1   1
T1   2
T1   5
T2   2
T2   4
T3   2
T3   3
T4   1
T4   2
T4   4
T8   1
T8   2
T8   3
T9   1
T9   2
T9   3
T9   5

Table 2.1: Transaction Database D
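The supports and confidences above can be checked mechanically. The following is a small sketch (in Python, with the transactions of Table 2.1 hard coded; it is an illustration, not part of the thesis implementation) that counts supports and derives the strong rules of a frequent itemset:

```python
from itertools import combinations

# Transactions from Table 2.1 (TID -> set of items).
transactions = {
    "T1": {1, 2, 5}, "T2": {2, 4}, "T3": {2, 3},
    "T4": {1, 2, 4}, "T8": {1, 2, 3}, "T9": {1, 2, 3, 5},
}

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def rules(frequent, min_conf):
    """All rules s => (frequent - s) meeting the minimum confidence."""
    out = []
    for r in range(1, len(frequent)):
        for subset in combinations(sorted(frequent), r):
            s = frozenset(subset)
            conf = support_count(frequent) / support_count(s)
            if conf >= min_conf:
                out.append((set(s), set(frequent - s), conf))
    return out

for lhs, rhs, conf in rules(frozenset({1, 2, 3}), 0.66):
    print(lhs, "=>", rhs, round(conf, 2))
```

Running it for the frequent itemset {1, 2, 3} reproduces the three strong rules listed in the text: {3} ⇒ {1, 2}, {1, 3} ⇒ {2}, and {2, 3} ⇒ {1}.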
Table 2.2: Frequent Itemsets F3
ITEM1  ITEM2  ITEM3  SUPPORT
1      2      3      2
1      2      5      2

2.2.3 Classification of Association Rules

Association rules can be classified in various ways [1]:
• If a rule specifies an association between the presence and absence of items, it is
called a Boolean association rule. For example:
Computer ⇒ Windows OS CD
• If a rule specifies associations among quantitative items, it is called a
quantitative association rule; quantitative values are partitioned into intervals. For example:
Age (X, "20...25") ∧ Income (X, "22K...30K") ⇒ Buys (X, "Washing Machine")
• If the rule references only one dimension, it is called a single-dimensional
association rule. For example:
Buys (X, computer) ⇒ Buys (X, Windows OS CD)
• If a rule references two or more dimensions, such as age, income, and buys,
it is a multidimensional association rule. For example:
Age (X, "20...25") ∧ Income (X, "22K...30K") ⇒ Buys (X, "Washing Machine")
The above rule involves three dimensions: age, income, and buys.
• Multilevel association rules. For example:
Age (Z, "20...25") ⇒ Buys (Z, "printer")
Age (Z, "20...25") ⇒ Buys (Z, "color printer")
The above rules are at different levels of abstraction: printers are a higher-level
abstraction of color printers. If the rules do not reference items at different levels of
abstraction, they are called single-level association rules.
2.2.4 Apriori Algorithm [1]

The apriori algorithm [5] is one of the most important algorithms for association rule
mining, because most of the other algorithms are based on it or are extensions of it. It is
a main-memory based algorithm, and main memory imposes a limitation on the size of
the dataset that can be mined.
The algorithm executes in the two steps described above, i.e. frequent itemset
generation and association rule generation. Frequent itemset generation is itself a
two-step process:
• Candidate itemset (Ck) generation, i.e. generating all combinations of items that
are potential candidates for frequent itemsets.
• Frequent itemset (Fk) generation: the support of all candidate itemsets is counted,
and the itemsets with support greater than the user-specified minimum support
qualify as frequent itemsets.
The algorithm makes multiple scans over the database, and the number of scans cannot
be determined in advance.
The algorithm is presented below: [1, 13]
F1 = {frequent 1-itemsets}
For (k = 2; Fk-1 ≠ φ; k++) loop
    Ck = generate (Fk-1);
    For all transactions x ∈ D loop
        Cx = generate_subset (Ck, x); // candidates in Ck contained in x
        For all candidates c ∈ Cx loop
            c.count++;
        end loop;
    end loop;
    Fk = {c ∈ Ck | c.count ≥ minsup};
end loop;
Return ∪k {Fk};
First the apriori algorithm generates the frequent 1-itemsets F1 by reading the database
D directly. It then iterates through the for loop: Fk-1 is used to generate the candidate
itemsets Ck, and in the next pass Ck is used to generate Fk. The generate procedure
produces potential candidate itemsets and then eliminates from this set the itemsets
having a subset that is not frequent. The algorithm builds a special hash-tree data
structure in memory for support counting. [1]
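As a cross-check of the join-and-prune logic just described, here is a minimal in-memory sketch in Python (an illustration, not the thesis's PLSQL implementation; a brute-force support count stands in for the hash tree):

```python
from itertools import combinations

def generate(f_prev, k):
    """Join F(k-1) with itself on the first k-2 items, then prune every
    candidate that has a (k-1)-subset which is not frequent."""
    candidates = set()
    for a in f_prev:
        for b in f_prev:
            sa, sb = sorted(a), sorted(b)
            if sa[:-1] == sb[:-1] and sa[-1] < sb[-1]:
                candidates.add(frozenset(sa + [sb[-1]]))
    return {c for c in candidates
            if all(frozenset(s) in f_prev for s in combinations(c, k - 1))}

def apriori(db, minsup_count):
    """db: list of transactions (sets of items); returns {k: frequent k-itemsets}."""
    fk = {frozenset([i]) for i in set().union(*db)
          if sum(1 for t in db if i in t) >= minsup_count}
    frequent, k = {1: fk}, 2
    while fk:
        ck = generate(fk, k)
        # Support counting: scan the whole database once per pass.
        fk = {c for c in ck if sum(1 for t in db if c <= t) >= minsup_count}
        if fk:
            frequent[k] = fk
        k += 1
    return frequent

# The transaction database of Table 2.1, with a minimum support count of 2:
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3}, {1, 2, 3, 5}]
print(apriori(db, 2)[3])  # the frequent 3-itemsets {1, 2, 3} and {1, 2, 5}
```

On the example database this recovers exactly the frequent 3-itemsets of Table 2.2.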
For SQL based implementation of the algorithm, the candidate itemsets and frequent
itemsets are represented as relational tables. The SQL for candidate generation in pass k
is presented below: [2]
Insert into Ck
Select I1.item1, ..., I1.itemk-1, I2.itemk-1
From Fk-1 I1, Fk-1 I2
Where I1.item1 = I2.item1 and
      ...
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1;
Frequent itemset generation from candidate itemsets is the most time-consuming part of
association rule mining; it is called the support counting phase. For SQL-based
formulations, SQL-92 and SQL-OR based approaches are used for support counting.
The K-way join approach [13] presented below is a SQL-92 based approach to support
counting.
Insert into Fk
Select item1, … , itemk, count(*)
From Ck, T T1, … , T Tk
Where T1.item = Ck.item1 and
:
Tk.item = Ck.itemk and
T1.tid = T2.tid and
:
Tk-1.tid = Tk.tid
Group by item1, item2, … ,itemk
Having count(*) > min_sup;
There have been various optimizations [2] proposed for K-Way join approach. These are:
• Pruning the input data.
• Second pass optimization.
• Reuse of item combinations.
Figure 2.1 shows an example of how the candidate and frequent itemsets generated in
different passes are represented as tables in the database.
Frequent itemsets F1:
ITEM1  SUPPORT
1      6
2      7
3      6
4      2
5      2

Frequent itemsets F2:
ITEM1  ITEM2  SUPPORT
1      2      4
1      3      4
1      5      2
2      3      4
2      4      2
2      5      2

Candidate itemsets C2:
ITEM1  ITEM2
1      2
1      3
1      5
2      3
3      5
3      4
2      4

Candidate itemsets C3:
ITEM1  ITEM2  ITEM3
1      2      3
1      2      5
1      2      4
2      3      5

Figure 2.1: Presentation of candidate and frequent itemsets in the database
2.2.5 Partitioning Algorithm [8]

The partitioning algorithm is based on the apriori algorithm, but it requires only two
complete scans over the database. Figure 2.2 depicts the partitioning approach for
frequent itemset mining [1]. The partition algorithm is divided into two phases:
• The database is divided into a number of non-overlapping partitions, and the
frequent itemsets local to each partition are generated. This is the first complete
scan of the database.
• The local frequent itemsets from all partitions are combined to generate the
global candidate itemsets. The database is then scanned a second time to generate
the global frequent itemsets.
[Diagram omitted. Figure 2.2 shows the flow: divide database D into n partitions;
generate the frequent itemsets local to each partition (first complete scan of D);
combine all local frequent itemsets into the global candidate itemsets; second-phase
support counting then generates the global frequent itemsets (second complete scan of D).]

Figure 2.2: Partitioning approach for frequent itemsets mining [1]

The algorithm is formulated below: [8, 13]
P = partition_database (D)
n = number of partitions
// Phase I
For i = 1 to n loop
    Read-in_partition (pi ∈ P)
    Li = gen_large_itemsets (pi)
End loop;
// Generate global candidate itemsets
For (i = 2; Lij ≠ φ for some j = 1, 2, ..., n; i++) loop
    CiG = ∪j=1,2,...,n Lij
End loop;
// Phase II
For i = 1 to n loop
    Read-in_partition (pi ∈ P)
    For all candidates c ∈ CG: generate_count (c, pi)
End loop;
LG = {c ∈ CG | c.count ≥ minsup}
Return LG;
Here minsup is the minimum support for the entire database D. The minimum support
count for a particular partition is the product of minsup and the total number of
transactions in that partition.
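The two phases can be sketched as follows. This is an illustrative in-memory version in Python, not the thesis's PLSQL implementation; `mine_local` stands for any local miner (such as apriori), and the naive brute-force miner below exists only to make the sketch runnable:

```python
import math
from itertools import combinations

def partition_mine(db, n, minsup_fraction, mine_local):
    """db: list of transactions (sets). Phase I mines each partition with its
    own minimum support count; Phase II counts the global support of the
    union of all local frequent itemsets (the global candidates)."""
    size = math.ceil(len(db) / n)
    parts = [db[i * size:(i + 1) * size] for i in range(n)]
    # Phase I: first complete scan, one partition at a time.
    global_candidates = set()
    for p in parts:
        if p:
            global_candidates |= mine_local(p, math.ceil(minsup_fraction * len(p)))
    # Phase II: second complete scan counts global support.
    need = math.ceil(minsup_fraction * len(db))
    return {c for c in global_candidates
            if sum(1 for t in db if c <= t) >= need}

def brute_local(p, minsup_count):
    """Naive local miner: test every itemset over the partition's items,
    stopping at the first empty level (valid by the apriori property)."""
    items = sorted(set().union(*p))
    found = set()
    for k in range(1, len(items) + 1):
        layer = {frozenset(c) for c in combinations(items, k)
                 if sum(1 for t in p if set(c) <= t) >= minsup_count}
        if not layer:
            break
        found |= layer
    return found

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3}, {1, 2, 3, 5}]
print(partition_mine(db, 2, 0.34, brute_local))
```

The correctness of the scheme rests on the fact that any globally frequent itemset must be locally frequent in at least one partition, so the union of local results always contains the global answer.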
The algorithm [8] builds a special data structure called tidlists. For every itemset a
tidlist is generated; the tidlist of an itemset contains the TIDs of all the transactions in
the partition that contain that itemset, maintained in sorted order. Tidlists are used to
count the support of the candidate itemsets: the cardinality of the tidlist of an itemset
divided by the total number of transactions in the partition gives the support of that
itemset in that partition. Initially the tidlist for the entire partition is generated; from it
the tidlists corresponding to the 1-itemsets are obtained, and higher-level tidlists are
generated by the intersection of tidlists. Table 2.3 and Table 2.4 show the tidlists and
their representation as database tables for 1-itemsets and 2-itemsets respectively.
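Tidlist intersection amounts to a merge of two sorted lists. The Python below is illustrative (the thesis stores tidlists as CLOBs inside Oracle); the transaction contents are an assumption reconstructed here to be consistent with the tidlists of Table 2.3:

```python
def tidlist(db, itemset):
    """TIDs, in sorted order, of the transactions containing the itemset."""
    return [tid for tid, items in sorted(db.items()) if itemset <= items]

def intersect(t1, t2):
    """Merge-intersect two sorted tidlists in a single pass."""
    out, i, j = [], 0, 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            out.append(t1[i]); i += 1; j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical transactions (tid -> items) consistent with Table 2.3.
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

t12 = intersect(tidlist(db, {1}), tidlist(db, {2}))
print(t12)                 # tidlist of the 2-itemset {1, 2}: [300]
print(len(t12) / len(db))  # its support in the partition
```

Intersecting the tidlists of items 1 and 2 from Table 2.3 yields the (1, 2) row of Table 2.4.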
ITEM1  COUNT  TIDLIST
1      2      100, 300
2      3      200, 300, 400
3      3      100, 200, 300
4      1      100
5      3      200, 300, 400

Table 2.3: Tidlists for 1-itemsets
ITEM1  ITEM2  COUNT  TIDLIST
1      2      1      300
1      3      2      100, 300
1      5      1      300
2      3      2      200, 300
2      5      3      200, 300, 400

Table 2.4: Tidlists for 2-itemsets

2.2.6 Sampling Algorithm [10, 14 and 16]

Various sampling algorithms for association rule mining have been proposed in
[10, 14 and 16]. Among them, the sampling algorithm proposed in [10] has the best
performance. The algorithm [10] picks a random sample from the database and then
finds the frequent itemsets in the sample using a support threshold lower than the
user-specified minimum support for the database. These frequent itemsets are denoted
by S. The algorithm then finds the negative border [10] of these itemsets, denoted
NBd (S). The negative border is the set of itemsets that are candidate itemsets but did
not satisfy minimum support; simply, NBd (Fk) = Ck − Fk. After that, for each itemset
X in S ∪ NBd (S), the algorithm checks whether X is frequent in the entire database by
scanning the database. [1, 17]
If NBd (S) contains no frequent itemsets, then all the frequent itemsets have been found.
If NBd (S) contains frequent itemsets, the algorithm constructs a set of candidate
itemsets CG by repeatedly expanding S ∪ NBd (S) with its negative border until the
negative border is empty. For each itemset X in CG the algorithm then scans the
database a second time. In the best case, when all the frequent itemsets are found in the
sample, the algorithm requires only one scan over the database; in the worst case it
requires two scans. [1, 17]
The performance of the sampling algorithm relies on the quality of the sample chosen.
If a bad sample is chosen, the number of candidates generated for the second scan may
be very large, and hence the second scan may be inefficient.
The sample can also be a partition of the database; in that case the partition is treated
just like a randomly chosen sample.
The sampling algorithm [10] is depicted below:
s = Draw_random_sample (D);
// generate frequent itemsets for the sample drawn.
S = generate_frequent_itemsets (s, low_support);
// counting support for the itemsets and their negative border generated in the sample, in
the database D.
F = {X ∈ S U NBd (S) | X.count >= minsup};
// if NBd (S) contains frequent itemsets, expand border
Repeat
S = S U NBd (S);
Until S does not grow;
// another scan of D
F = {X ∈ S | X.count >= minsup};
Output F; // frequent itemsets in the database D
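The negative-border bookkeeping in the algorithm above can be made concrete with a toy sketch (Python; the fixed item universe and the naive candidate enumeration are simplifications for illustration, not the method of [10]):

```python
from itertools import combinations

ITEMS = [1, 2, 3, 4, 5]  # assumed item universe for this toy example

def candidates(freq, k):
    """Apriori-style Ck over ITEMS: the k-itemsets all of whose (k-1)-subsets
    are frequent. Every 1-itemset is a candidate by definition."""
    if k == 1:
        return {frozenset([i]) for i in ITEMS}
    return {frozenset(c) for c in combinations(ITEMS, k)
            if all(frozenset(s) in freq for s in combinations(c, k - 1))}

def negative_border(freq):
    """NBd(F): candidate itemsets at every level that are not in F,
    i.e. the union over k of Ck - Fk."""
    kmax = max((len(f) for f in freq), default=0)
    nbd = set()
    for k in range(1, kmax + 2):
        nbd |= candidates(freq, k) - freq
    return nbd

freq = {frozenset([1]), frozenset([2]), frozenset([3]), frozenset([1, 2])}
print(sorted(sorted(s) for s in negative_border(freq)))
# the itemsets {1, 3}, {2, 3}, {4} and {5}
```

Here {1, 3} and {2, 3} sit on the border because both of their 1-subsets are frequent while they themselves are not; these are exactly the itemsets whose support must be verified against the full database.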
Chapter 3
Apriori Algorithm for Association Rule Mining
This chapter presents a performance analysis of the apriori algorithm [5] for association
rule mining in the context of the partitioning approach. The K-way join approach [2] is
used for support counting. The algorithm executes in two phases. In the first phase the
database (or dataset) is partitioned into a given number of partitions, and the local
frequent itemsets of each partition are generated using the minimum support count for
that partition. All the local frequent itemsets are then combined into the following two
sets:
• Global frequent itemsets
• Global candidate itemsets
In the second phase the support of the global candidate itemsets is counted over the
entire database. Itemsets meeting the minimum support are frequent in the entire
database and are therefore added to the set of global frequent itemsets. The algorithm
scans the database multiple times; the TIDLIST data structure is not used for support
counting here.
The experiments were done on an Oracle 10g RDBMS installed on Microsoft Windows
XP with 1 GB of RAM and a 2.40 GHz processor. Each experiment was performed
several times and the best result was taken.
3.1 Datasets for Experiments
Datasets are needed for the experiments. Some synthetic and real-life datasets [21, 22]
were collected from the internet. Synthetic datasets are generated with a synthetic
dataset generation utility or program; real-life datasets are real transactions on retail
items, collected over the years for analysis. These datasets were stored in flat files and
had to be transferred into database tables. To load the datasets into the Oracle database,
the SQL*Loader utility [15] provided by Oracle RDBMS was used. It uses the
functionality provided by the DBMS and saves the unnecessary effort that would
otherwise be spent writing a program to load the data. After the data had been loaded
into the database, it was converted into the format required by the algorithms.
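The thesis does not reproduce its loader configuration. As a rough sketch, a SQL*Loader control file for a two-column (tid, item) table might look like the following, where the file and table names are purely hypothetical:

```
-- load.ctl (hypothetical names throughout)
LOAD DATA
INFILE 'bmswebview1.dat'
APPEND
INTO TABLE bmswebview1
FIELDS TERMINATED BY ','
(tid  INTEGER EXTERNAL,
 item INTEGER EXTERNAL)
```

Such a file would be passed to the utility with something like `sqlldr userid=<user>/<password> control=load.ctl`.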
The details of the datasets used in the thesis for the experiments are given in Table 3.1.

Name         Size   Total transactions  Total items  Average items per transaction
BMSWEBVIEW1  3 MB   59602               497          5
T10I4D100K   16 MB  100000              870          10
MUSHROOM     3 MB   8124                119          23

Table 3.1: Details of Datasets
3.2 Performance analysis

Table 3.2 shows the total number of records, the number of distinct transactions, and the
number of distinct items contained in the partitions of the BMSWEBVIEW1 dataset.

             Total Rows  Distinct Transactions  Total Items
Partition 1  74634       29718                  486
Partition 2  75005       29884                  486

Table 3.2: BMSWEBVIEW1 dataset (2 partitions)
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.1: Performance of apriori for BMSWEBVIEW1 dataset (2 partitions)

Figure 3.1 shows the performance of the apriori algorithm for the BMSWEBVIEW1
dataset with two partitions. It is evident from the figure that as the support value
increases, the time taken by the algorithm decreases.
Table 3.3 and Table 3.4 show the total number of input records, the number of distinct
transactions, and the number of distinct items contained in the partitions of the
BMSWEBVIEW1 dataset for 3 and 4 partitions respectively.

             Total Rows  Distinct Transactions  Total Items
Partition-1  37067       15025                  473
Partition-2  75005       29884                  486
Partition-3  37567       14693                  464

Table 3.3: BMSWEBVIEW1 dataset (3 partitions)

             Total Rows  Distinct Transactions  Total Items
Partition-1  37067       15025                  473
Partition-2  37182       14979                  466
Partition-3  37567       14693                  464
Partition-4  37823       14905                  467

Table 3.4: BMSWEBVIEW1 dataset (4 partitions)
Figure 3.2 and Figure 3.3 show the performance of the apriori algorithm for 3 and 4
partitions of the BMSWEBVIEW1 dataset respectively for different support values. For
the 0.15% support value, the time taken by the algorithm increases as the number of
partitions increases. As the number of partitions increases from 3 to 4, the time taken
by the algorithm also increases, but the increase is not large.
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.2: Performance of apriori for BMSWEBVIEW1 dataset (3 partitions)
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 3.3: Performance of apriori for BMSWEBVIEW1 dataset (4 partitions)
Table 3.5 and Table 3.6 show the total number of input records, the number of distinct
transactions, and the number of distinct items contained in the partitions of the
T10I4D100K dataset for 2 and 4 partitions respectively.

             Total Rows  Distinct Transactions  Total Items
Partition 1  503769      50154                  870
Partition 2  504863      49846                  869

Table 3.5: T10I4D100K dataset (2 partitions)

             Total Rows  Distinct Transactions  Total Items
Partition-1  252380      24945                  869
Partition-2  252387      24956                  869
Partition-3  254040      25209                  869
Partition-4  251307      24890                  868

Table 3.6: T10I4D100K dataset (4 partitions)
Figure 3.4 and Figure 3.5 show the performance of the apriori algorithm for 2 and 4
partitions of the T10I4D100K dataset respectively for different support values. For 2
partitions the time taken for frequent itemset generation is greater than the time taken
for 4 partitions at every support value. As the support increases from 0.45% to 0.75%,
the time taken by the algorithm decreases, but the difference between the times for the
0.45% and 0.60% support values is smaller than the difference between the times for
the 0.60% and 0.75% support values.
[Chart omitted: time in seconds vs. minimum support (0.45% to 0.75%).]
Figure 3.4: Performance of apriori for T10I4D100K dataset (2 partitions)
[Chart omitted: time in seconds vs. minimum support (0.45% to 0.75%).]
Figure 3.5: Performance of apriori for T10I4D100K dataset (4 partitions)
Table 3.7 shows the total number of input records, the number of distinct transactions,
and the number of distinct items contained in the partitions of the MUSHROOM dataset
for 2 partitions.

             Total Rows  Distinct Transactions  Total Items
Partition 1  93242       4054                   119
Partition 2  93610       4070                   119

Table 3.7: MUSHROOM dataset (2 partitions)
Figure 3.6 depicts the performance of the apriori algorithm for the MUSHROOM
dataset with 2 partitions for different support values. As the support increases from
0.15% to 1.0%, the time taken by the algorithm decreases, except at the 0.60% support
value, but the decrease in time is not significant compared to the decrease in support.
Table 3.8 shows the candidate and frequent itemsets generated for the first partition of
the MUSHROOM dataset. Figure 3.6 shows the time only up to the frequent 3-itemsets;
the candidate 4-itemsets were not generated even after running the algorithm for two
additional hours after the generation of the frequent 3-itemsets. In later passes larger
collections of candidate and frequent itemsets are generated, which require more time
for support counting than the earlier passes. For example, for the 2.0% support value
the size of C4 is more than 5 times the size of C3 (C3 = 25113, C4 = 127227);
generating C4 from F3 and counting the support of C4 to generate F4 is more time
consuming than the previous pass.
[Chart omitted: time in seconds vs. minimum support (0.15% to 1.00%).]
Figure 3.6: Performance of apriori for MUSHROOM dataset (2 partitions)
Support  F1   C2    F2    C3     F3     C4      F4
0.15%    115  6555  3255  49907  44244
0.30%    111  6105  3023  45400  39846
0.45%    104  5356  2810  41832  36324
1.0%     94   4371  2413  34091  28475
2.0%     89   3916  2008  25113  20487  127227  119150

Table 3.8: Itemsets in MUSHROOM dataset (1st partition)
The main reason for the poor performance on the MUSHROOM dataset is the average
number of items per transaction. As the average number of items per transaction
increases, the number of frequent itemsets generated in each pass also increases, and
hence support counting requires more time.
Chapter 4 Partitioned Algorithm for Association Rule
Mining
In this chapter we present a performance analysis of the partition algorithm [8]. The
partition algorithm finds all frequent itemsets in just two scans over the database. In the
first scan it partitions the database into a given number of partitions and finds all the
local frequent itemsets. It then merges all the local frequent itemsets to form the global
candidate itemsets. In the second scan over the database it finds the support of the
global candidate itemsets in the entire database and outputs the global frequent itemsets.
The experiments were done on an Oracle 10g RDBMS installed on Microsoft Windows
XP with 1 GB of RAM and a 2.40 GHz processor. Each experiment was performed
several times and the best result was taken.
4.1 Performance analysis of Partition Algorithm
For support counting the partition algorithm builds a special data structure called a
tidlist, created as a CLOB (character large object). Table 2.3 and Table 2.4 show
examples of tidlists for 1-itemsets and 2-itemsets respectively. Figure 4.1 shows the
tidlist creation time for the different datasets.
Figure 4.2 compares the time taken by the partition algorithm for the BMSWEBVIEW1
dataset with 2 partitions for different support values on Oracle RDBMS. The time shown
includes the tidlist creation time plus the time taken by the partition algorithm for
frequent itemset generation. For lower support values the algorithm takes more time
because it generates too many candidate itemsets, which are then tested against the
minimum support. The algorithm makes two scans over the database.
[Chart omitted: tidlist creation time in seconds for the MUSHROOM, BMSWEBVIEW1,
and T10I4D100K datasets.]
Figure 4.1: Tidlist creation time for different datasets
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 4.2: Performance of partition for BMSWEBVIEW1 dataset (2 partitions)
Table 3.2 shows the total number of records, the number of distinct transactions, and
the number of distinct items contained in the partitions of the BMSWEBVIEW1 dataset.
Figure 4.3 and Figure 4.4 show the performance of the algorithm for 3 and 4 partitions
of the BMSWEBVIEW1 dataset respectively. Table 3.3 and Table 3.4 describe the
BMSWEBVIEW1 dataset with 3 and 4 partitions respectively. For 0.15% support the
time taken by the partition algorithm for 3 partitions is more than that for 2 partitions.
[Chart omitted: time in seconds vs. minimum support (0.15% to 0.60%).]
Figure 4.3: Performance of partition for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.4: Performance of partition for BMSWEBVIEW1 dataset (4 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
The partition algorithm [13] uses the Tidlist for support counting. For a given
partition, the first pass creates Tidlists for all the local frequent 1-itemsets.
Candidate 2-itemsets are then counted by intersecting the Tidlists of the two
corresponding frequent 1-itemsets. Intersecting two Tidlists is a very time-consuming
process, and the overall performance of the partition algorithm depends mainly on the
second pass: because the set of candidate 2-itemsets C2 is very large, the time taken
by support counting is very high. Table 4.1 shows the candidate 2-itemsets generated
for BMSWEBVIEW1 and T10I4D100K with 4 partitions and 0.45% minimum support.
              Partition-1   Partition-2   Partition-3   Partition-4
T10I4D100K       180300        178503        172578        182106
BMSWEBVIEW1       13530         13203         13695         13530
Table 4.1: Candidate 2-itemsets (C2) for 0.45% support for 4 partitions
The MUSHROOM dataset is about 3 MB in size, roughly equal to the BMSWEBVIEW1
dataset, yet the partition algorithm takes more time on MUSHROOM. The reason is that
the average number of items per transaction in MUSHROOM is much higher than in
BMSWEBVIEW1, so the Tidlist intersections for support counting in pass two
(generation of F2) are far more expensive. The MUSHROOM results are not shown in the
thesis because they are unacceptable for any number of partitions.
Table 3.6 shows the total number of records, distinct transactions and distinct
items contained in the partitions of the T10I4D100K dataset for 4 partitions.
Figure 4.5 shows the performance of the partition algorithm on T10I4D100K with four
partitions. It is obvious from the figure that the time taken is very high and
unacceptable. The very poor performance on T10I4D100K is due to the fact that the
candidate 2-itemsets generated are very numerous and their support counting takes
too much time.
The partition algorithm does not scale well on relational database systems. It is
mainly a main-memory algorithm and works fine for main-memory databases.
Figure 4.5: Performance of partition for T10I4D100K dataset (4 partitions); time in seconds vs. minimum support (0.30% to 0.60%)
4.2 Partition algorithm with second optimization (SPO)
In this section we discuss the partition algorithm combined with the second pass
optimization of the K-Way join approach.
4.2.1 Second pass optimization of K-Way Join approach for support counting [2, 13]
As we have seen earlier, the size of C2 is very large, so second pass support
counting is the most time-consuming of all the passes. The process of generating C2
from F1, counting support for C2 and then producing F2 can be replaced by generating
F2 directly in a single read of the database.
This is shown below:
INSERT INTO F2
SELECT t1.item, t2.item, COUNT(*)
FROM I_Table t1, I_Table t2
WHERE t1.tid = t2.tid AND t1.item < t2.item
GROUP BY t1.item, t2.item
HAVING COUNT(*) > minsup;
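The self-join above can be tried end to end on a toy item table. This is a sketch using SQLite rather than Oracle; the `I_Table` rows and the `minsup` value are invented for illustration, but the query is the one shown above.

```python
import sqlite3

# Hypothetical (tid, item) rows mirroring I_Table; item 1 and item 2
# co-occur in all three transactions, item 3 appears only in tid 1.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 2)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE I_Table (tid INTEGER, item INTEGER)")
conn.executemany("INSERT INTO I_Table VALUES (?, ?)", rows)

minsup = 2  # absolute minimum support threshold, chosen for this toy example
f2 = conn.execute(
    """
    SELECT t1.item, t2.item, COUNT(*)
    FROM I_Table t1, I_Table t2
    WHERE t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) > ?
    """,
    (minsup,),
).fetchall()
print(f2)  # only the pair (1, 2), with count 3, exceeds minsup
```

The `t1.item < t2.item` condition generates each unordered pair exactly once, so F2 is produced directly without materialising C2.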
4.2.2 The Approach
The partition algorithm is implemented in combination with the second pass
optimization of the K-Way join approach. In this approach the database is again
scanned twice: first when the Tidlist is generated for each partition, and second
when F2 is generated by the second pass optimization in each partition. The rest of
the processing per partition remains the same. All the local frequent itemsets are
combined to generate the global candidate itemsets. In the second phase the database
is not scanned for the final support counting; instead, the counts generated in the
first phase along with the Tidlists are used. The approach gives better results than
the plain partition algorithm because the second pass support counting (F2
generation) in each partition no longer intersects Tidlists, which is the most
time-consuming step of the entire support counting process.
The global frequent 3-itemsets generation SQL script for two partitions is shown below:
INSERT INTO Global_F3
SELECT item1, item2, item3, SUM(count)
FROM (
    SELECT item1, item2, item3, count FROM tidt_c3
    UNION ALL
    SELECT item1, item2, item3, count FROM tidt_cc3
)
GROUP BY item1, item2, item3
HAVING SUM(count) >= 179;
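The merge the script performs (UNION ALL of per-partition counts, SUM per itemset, filter by the global minimum support) can be sketched in plain Python. The per-partition counts and the threshold of 150 below are invented for illustration; the thesis's script uses an absolute threshold of 179.

```python
from collections import Counter

# Hypothetical per-partition counts of candidate 3-itemsets
# (analogues of the tidt_c3 / tidt_cc3 tables in the script above).
partition1 = {(1, 2, 3): 100, (1, 2, 4): 60}
partition2 = {(1, 2, 3): 90, (1, 2, 4): 10}
global_minsup = 150  # illustrative absolute global minimum support

totals = Counter()
for part in (partition1, partition2):
    totals.update(part)  # plays the role of UNION ALL + SUM(count)

# HAVING SUM(count) >= global_minsup
global_f3 = {iset: c for iset, c in totals.items() if c >= global_minsup}
print(global_f3)  # only (1, 2, 3) reaches 190 >= 150
```

Because the counts come from the first phase, no further database scan is needed at this step, which is the point made above.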
4.3 Performance Comparisons
Figure 4.6 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 2 partitions.
Details of the partitions are given in Table 3.2.
Figure 4.6: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (2 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Figure 4.7 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 3 partitions.
Details of the partitions are given in Table 3.3.
Figure 4.7: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (3 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Figure 4.8 compares the performance of the partition algorithm and the partition
algorithm with the second pass optimization for BMSWEBVIEW1 with 4 partitions.
Details of the partitions are given in Table 3.4.
Figure 4.8: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (4 partitions); time in seconds vs. minimum support (0.15% to 0.60%)
Table 4.2 shows the performance improvement of the partition algorithm with SPO
over the plain partition algorithm for the BMSWEBVIEW1 dataset. For every number of
partitions the improvement grows substantially as the support moves from 0.15% to
0.60%.
Support   2 Partitions        3 Partitions        4 Partitions
0.15%     Approx. 5 times     Approx. 2.8 times   Approx. 3 times
0.30%     Approx. 7 times     Approx. 6.5 times   Approx. 6.5 times
0.45%     Approx. 9.5 times   Approx. 8 times     Approx. 7.5 times
0.60%     Approx. 9 times     Approx. 9.4 times   Approx. 8 times
Table 4.2: Performance comparison
Figure 4.9: Performance comparisons of partition and partition with SPO algorithm for T10I4D100K dataset (4 partitions); time in seconds vs. minimum support (0.30% to 0.60%)
Figure 4.9 shows the performance comparison for the T10I4D100K dataset. Table 3.6
gives the total number of records, distinct transactions and distinct items in the
partitions of T10I4D100K for 4 partitions.
The T10I4D100K dataset performed very poorly under the plain partition algorithm
(Figure 4.5). With the second pass optimization, the algorithm performs
approximately 17, 30 and 45 times better than the plain partition algorithm for
support values of 0.30%, 0.45% and 0.60% respectively.
Chapter 5 Sampling Algorithm for Association Rule Mining
This chapter presents a performance analysis of the Sampling algorithm [10] for
association rule mining.
The Sampling algorithm is implemented in the context of the partition algorithm.
The algorithm first partitions the database into a number of partitions and takes
one partition as a sample. It finds all the local frequent itemsets in the sample at
a reduced minimum support. These local frequent itemsets, together with their
negative border, are then tested against the entire database for the actual minimum
support; the itemsets that qualify are frequent in the entire database. Only if the
negative border of the local frequent itemsets contains itemsets that are frequent
in the entire database does the algorithm scan the database a second time, to find
the missing frequent itemsets.
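The one-scan/two-scan decision described above reduces to a simple check once the first full scan has counted the sample's frequent itemsets and their negative border. The sketch below assumes those global counts are already available as a plain dictionary; all itemsets and numbers are illustrative.

```python
def needs_second_scan(negative_border, global_support, minsup):
    """A second full database scan is required only when some itemset in
    the negative border turns out to be frequent in the entire database."""
    return any(global_support.get(iset, 0) >= minsup
               for iset in negative_border)

# Illustrative global counts from the first scan (absolute supports).
global_support = {(1, 5): 40, (3, 5): 10}
print(needs_second_scan({(1, 5), (3, 5)}, global_support, 30))
# (1, 5) reaches support 40 >= 30, so a second scan is needed
```

When the check returns False, everything frequent in the database was already in the sample's frequent itemsets, and the algorithm finishes in a single scan.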
The experiments are run on Oracle 10g relational database management system
installed on Microsoft Windows XP with 1 GB of RAM and a 2.40 GHz processor. Each
experiment is performed several times and the best result is taken.
5.1 The Negative Border
For any pass k, the negative border [10] is the set of candidate itemsets that are
not frequent in that pass; that is, NBd(Fk) = Ck - Fk, where Ck and Fk are the set
of candidate k-itemsets and the set of frequent k-itemsets respectively.
Table 5.1, Table 5.2 and Table 5.3 show examples for the second pass: the candidate
2-itemsets C2, the negative border of the frequent 2-itemsets NBd(F2), and the
frequent 2-itemsets F2 respectively.
ITEM1  ITEM2
1      2
1      3
1      5
2      3
3      5
3      4
2      4
Table 5.1: Candidate itemset C2

ITEM1  ITEM2
1      5
3      5
3      4
2      4
Table 5.2: Negative Border NBd(F2)

ITEM1  ITEM2  SUPPORT
1      2      4
1      3      4
2      3      4
Table 5.3: Frequent itemset F2
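The relation NBd(F2) = C2 - F2 can be checked directly against these example tables; the tuples below are copied from Table 5.1 (C2) and Table 5.3 (F2), and the set difference reproduces Table 5.2.

```python
# Candidate 2-itemsets from Table 5.1 and frequent 2-itemsets from Table 5.3.
C2 = {(1, 2), (1, 3), (1, 5), (2, 3), (3, 5), (3, 4), (2, 4)}
F2 = {(1, 2), (1, 3), (2, 3)}

# Negative border: candidates that failed the minimum support in this pass.
NBd_F2 = C2 - F2
print(sorted(NBd_F2))  # matches Table 5.2: (1,5), (2,4), (3,4), (3,5)
```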
5.2 Performance analysis of the Sampling algorithm
The tables below show the different partition sizes taken as samples for the
analysis of the Sampling algorithm on the BMSWEBVIEW1 and T10I4D100K datasets. They
also give the number of distinct transactions and distinct items contained in each
sample; the last column gives the sample as a percentage of the dataset.
Sample size (in rows)  Distinct Transactions  Distinct items  Percentage of sample
 2484                    954                  388              1.70
 5063                   1885                  384              3.38
 9133                   3775                  415              6.10
19246                   7530                  451             12.86
37182                  14979                  466             24.84
74634                  29718                  486             49.88
Table 5.4: Description of different samples for BMSWEBVIEW1 dataset
Sample size (in rows) Distinct Transactions Distinct items Percentage of sample
7881 780 784 7.88
16703 1637 824 16.70
Table 5.5: Description of different samples for T10I4D100K dataset
Figure 5.1 shows the performance of the Sampling algorithm on the BMSWEBVIEW1
dataset for a sample size of 2484 records. For support values 0.15% and 0.30% the
algorithm requires a second scan over the database because the negative border of
the local frequent itemsets contains itemsets frequent in the entire database. For
support values 0.45%, 0.60% and 0.75% the algorithm completes in one scan. As the
support value increases, the time required to find the frequent itemsets decreases.
It is obvious from Figure 5.1 that for 0.15% and 0.30% support the second scan takes
more time than the local support counting plus the first scan. There are two reasons:
when the candidate itemsets for the second scan are numerous, the second scan can be
costlier than the first, and because the Tidlist data structure is used for support
counting, the Tidlist intersections are very time consuming.
Figure 5.1: Performance of Sampling algorithm for sample size 2484 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.75%), split into second scan and sample+first scan
Sample size   C2      C3      C4      C5
 2484        45451    8310   13062   12884
 5063        50086   31846  179042  178030
 9133        41905    2687     739     177
19246        45150    3504     420      30
37182        43956    1974     126       5
74634        45451    2807     281      17
Table 5.6: Candidate itemsets in different samples for BMSWEBVIEW1 dataset
Figure 5.2 shows the time taken by the algorithm for a sample size of 5063 records.
For all support values except 0.15% and 0.30% the algorithm completes in one scan.
Table 5.6 lists the candidate itemsets generated for the different samples of the
BMSWEBVIEW1 dataset at a support value of 0.15%. For the sample of 5063 records the
candidate itemsets in pass 4 and pass 5 are very numerous, so the time spent in
local support counting is also high.
Figure 5.2: Performance of Sampling algorithm for sample size 5063 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.75%), split into second scan and sample+first scan
Figure 5.3 shows the time taken by the algorithm for a sample size of 9133 records
on the BMSWEBVIEW1 dataset. The algorithm completes in just one scan for the 0.45%
and 0.60% support values.
Figure 5.3: Performance of Sampling algorithm for sample size 9133 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.4: Performance of Sampling algorithm for sample size 19246 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.4 to Figure 5.6 show the performance of the Sampling algorithm for larger
sample sizes. The performance of the Sampling algorithm depends mainly on the sample
chosen for finding frequent itemsets. If the sample is small and still contains all
the global frequent itemsets, the algorithm completes in just one scan and in little
time; if a bad sample is chosen, the performance can be even worse. It is obvious
from the figures that for higher support values the Sampling algorithm performs
better with small samples than with large ones. [8]
Figure 5.4 to Figure 5.6 also show that the time taken by the algorithm for the
different support values increases with the sample size.
Figure 5.5: Performance of Sampling algorithm for sample size 37182 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
Figure 5.6: Performance of Sampling algorithm for sample size 74634 for BMSWEBVIEW1 dataset; time in seconds vs. minimum support (0.15% to 0.60%), split into second scan and sample+first scan
The performance on the T10I4D100K dataset for sample sizes of 7881 and 16703
records is shown in Figure 5.7 and Figure 5.8 respectively. The second scan is very
costly for T10I4D100K, and the performance is simply unacceptable. The reason is
again the Tidlist data structure: for support counting the Tidlists are intersected,
which is very time consuming.
Figure 5.7: Performance of Sampling algorithm for sample size 7881 for T10I4D100K dataset; time in seconds vs. minimum support (0.45%, 0.60%, 1.00%), split into second scan and sample+first scan
Figure 5.8: Performance of Sampling algorithm for sample size 16703 for T10I4D100K dataset; time in seconds vs. minimum support (0.45%, 0.60%, 1.00%), split into second scan and sample+first scan
5.2.2 Errors in Frequent itemsets for BMSWEBVIEW1 dataset
The Sampling algorithm shows some errors in the frequent itemsets generated,
because the accuracy of the result depends on the sample chosen. No errors are
reported for the 0.45%, 0.60% and 0.75% support values for any sample size. For
0.30% support only the samples of 2484 and 5063 records show errors in the frequent
itemsets generated. For 0.15% support only the sample of 74634 records shows no
errors. Table 5.7 and Table 5.8 list the frequent itemsets generated by the Sampling
algorithm on the BMSWEBVIEW1 dataset for the different sample sizes at 0.15% and
0.30% support respectively. Table 5.9 gives the percentage error for the different
sample sizes at these two support values.
Sample size   F1    F2    F3    F4    F5
149639       303   715   336    70     4
  2484       295   711   336    70     0
  5063       301   714   336    70     0
  9133       303   714   336    70     4
 19246       303   715   336    70     2
 37182       303   715   336    70     3
 74634       303   715   336    70     4
Table 5.7: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.15% support
Sample size   F1    F2    F3    F4    F5
149639       225   169    39     2     0
  2484       224   169    39     2     0
  5063       225   169    37     2     0
  9133       225   169    39     2     0
 19246       225   169    39     2     0
 37182       225   169    39     2     0
 74634       225   169    39     2     0
Table 5.8: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.30% support
Sample size % error for 0.15% support % error for 0.30% support
2484 1.12 0.23
5063 0.49 0.46
9133 0.07 No error
19246 0.14 No error
37182 0.07 No error
Table 5.9: Percentage error for BMSWEBVIEW1 dataset
5.2.3 Errors in Frequent itemsets for T10I4D100K dataset
Table 5.10 and Table 5.11 list the frequent itemsets generated by the Sampling
algorithm in the different passes for support values of 0.45% and 0.60%
respectively. No errors are reported at 1.0% support for either sample size. Table
5.12 gives the percentage error in the frequent itemsets generated for the 0.45%
and 0.60% support values.
Sample size   F1    F2    F3    F4    F5
100000       596   522   174    49    11
  7881       593   522   172    49    11
 16703       596   522   171    48    11
Table 5.10: Frequent itemsets generated for T10I4D100K dataset for 0.45% support
Sample size   F1    F2    F3    F4    F5
100000       516   191    48    14     2
  7881       516   191    43    12     1
 16703       516   191    44    11     1
Table 5.11: Frequent itemsets generated for T10I4D100K dataset for 0.60% support
Sample size   % error for 0.45% support   % error for 0.60% support
 7881         0.37                        1.04
16703         0.30                        1.04
Table 5.12: Percentage error for T10I4D100K dataset
Chapter 6 Conclusion and Future Work
6.1 Conclusion
In this thesis we have discussed three approaches (Apriori, partition and sampling)
to association rule mining in the context of database partitioning. We have covered
frequent itemset generation only, not the rule generation part of association
analysis; rule generation is very simple compared to frequent itemset mining and
requires much less time. Extensive experiments have been performed to test the
performance of these approaches over two real datasets and one synthetically
generated dataset.
For the Apriori algorithm the K-Way join method is used for support counting.
Apriori gives good results for the BMSWEBVIEW1 dataset, and satisfactory results
for T10I4D100K. For the MUSHROOM dataset, Apriori was not able to generate all the
frequent itemsets even after running for a long time: for supports below 2.0% it
generates up to the frequent 3-itemsets, and for 2.0% support up to the frequent
4-itemsets. The performance of the algorithm depends not only on the size of the
dataset but also on the average number of items per transaction. As the average
number of items per transaction increases, the number of frequent itemsets generated
in each pass also increases, so support counting requires more time.
The partitioning algorithm uses a data structure called the Tidlist for support
counting. The Tidlist is suitable for mining in-memory databases, but not for
RDBMS-based mining. Since the second pass is the most costly in terms of support
counting, it can be optimized to gain good performance: the partition algorithm
performs much better when the K-Way join second pass optimization is used in
conjunction with the Tidlist for support counting.
The sampling approach shows some minute errors in the frequent itemsets generated.
The error depends on the quality of the sample chosen for analysis: if the sample
contains all the items in the dataset, the errors in the frequent itemsets are very
small. The algorithm shows less than 1.12% error on the BMSWEBVIEW1 dataset over
all the samples considered, and less than 1.04% error on T10I4D100K.
6.2 Future Work
Some possible future enhancements of this work are listed below:
• The work presented in the thesis can be extended for multi-level association rule
mining.
• The work can be enhanced to generate multi-dimensional association rules.
• A tool for generating association rules can be developed. This tool can choose the
approach for frequent itemsets mining according to the properties of the dataset to
be mined.
References
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2001: Morgan
    Kaufmann Publishers.
[2] P. Mishra and S. Chakravarthy. Performance Evaluation and Analysis of SQL Based
    Approaches for Association Rule Mining. In BNCOD Proc. 2003.
[3] S. Thomas. Architectures and Optimizations for Integrating Data Mining
    Algorithms with Database Systems, in CSE. 1998, University of Florida:
    Gainesville.
[4] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of
    Items in Large Databases. In ACM SIGMOD International Conference on the
    Management of Data. 1993. Washington, D.C.
[5] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules.
    In 20th International Conference on Very Large Databases (VLDB). 1994.
[6] J. Han, J. Pei and Y. Yin. Mining Frequent Patterns without Candidate
    Generation. In ACM SIGMOD International Conference on Management of Data. 2000.
    Dallas.
[7] P. Shenoy et al. Turbo-Charging Vertical Mining of Large Databases. In ACM
    SIGMOD International Conference on Management of Data. 2000. Dallas.
[8] A. Sarasere, E. Omiecinsky, and S. Navathe. An Efficient Algorithm for Mining
    Association Rules in Large Databases. In 21st International Conference on Very
    Large Databases (VLDB). 1995. Zurich, Switzerland.
[9] S. Thomas et al. An Efficient Algorithm for the Incremental Updation of
    Association Rules in Large Databases. In Knowledge Discovery and Data Mining.
    1997.
[10] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of
    22nd International Conference on Very Large Databases (VLDB), 1996.
[11] J. Han et al. DMQL: A Data Mining Query Language for Relational Database. In
    ACM SIGMOD workshop on research issues on data mining and knowledge discovery.
    1996. Montreal.
[12] S. Sarawagi, S. Thomas and R. Agrawal. Integrating Association Rule Mining
    with Relational Database System: Alternatives and Implications. In ACM SIGMOD
    International Conference on Management of Data. 1998. Seattle, Washington.
[13] H. V. Kona and S. Chakravarthy. Association Rule Mining over Multiple
    Databases: Partitioned and Incremental approaches, 2003.
[14] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient Algorithms for
    Discovering Association Rules. In AAAI Workshop on Knowledge Discovery in
    Databases (KDD-94), 1994.
[15] K. Loney. Oracle Database 10g: The Complete Reference. Osborne ORACLE Press
    Series.
[16] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of Sampling
    for Data Mining of Association Rules. Technical Report TR 617, University of
    Rochester, Computer Science Department, 1996.
[17] J. L. Lin and M. H. Dunham. Mining Association Rules: Anti-skew algorithms.
    In 14th International Conference on Data Engineering, February 1998.
[18] M. Dudgikar. A Layered Optimizer for Mining Association Rules over RDBMS. In
    CSE Department. 2000, University of Florida: Gainesville.
[19] Oracle Database Application Developer's Guide - Fundamentals 10g Release 2.
    http://download-uk.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_dynamic_sql.htm
[20] Oracle Database PL/SQL User's Guide and Reference 10g Release 2.
    http://download-west.oracle.com/docs/cd/B19306_01/appdev.102/b14261.pdf
[21] Frequent Itemset Mining Dataset Repository: http://fimi.cs.helsinki.fi/data/
[22] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000
    organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86-98, 2000.
    http://www.ecn.purdue.edu/KDDCUP.