finalmain prjct ppt 24-3-11


Guide

Mr. D. S. SHARMA M.Tech, (Ph.D.,)

Associate Professor

Dept of CSIT

Sri Sivani College of Engg.

Sri Sivani College of Engineering

K. KISHORE 07W61A0522

CH. SAINATH 07W61A0510

Y. RAMESH 07W61A0542

K. NAVYA 07W61A0525

B. MANASA 07W61A0529




Traditionally, business analysts extracted useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches.

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.

A typical and widely used example of association rule mining is Market Basket Analysis.

This project is built on the Boolean matrix algorithm, a technique for implementing association rules under bi-directional search. Our main objective is to make a comparative analysis between the Apriori and the Boolean matrix algorithms, so as to prove



Hardware Requirements

•Processor: Intel Pentium 4 or equivalent or above. (Greater clock speeds and FSB speeds with single-core mainstream compatibility.)

•RAM: Minimum of 256 MB or higher. (Oracle 8i alone requires a minimum of 256 MB of RAM.)

•Video card: Average: 32 bit, Recommended: 64 bit. (Makes output screens such as graphs, bar charts, and pie charts appear sharper.)

•HDD: 20 GB or higher. (The software installed along with the database takes a minimum of 20 GB of hard disk space.)

•Monitor: 15" or 17" color monitor. Screen resolution: 1024 by 768 pixels. Color quality: 32 bit. Color scheme: Windows Standard. (This is the standard configuration of most PCs, with optimized viewable space.)

•Mouse: Scroll mouse or optical mouse.



Software Requirements

•Operating system: Windows XP (almost all versions), Unix, or any operating system that can support Java.

•Front-end: Java (displayed in the form of applets and frames). Almost all versions of Java are compatible to run the code.

•Back-end: Oracle 10g (used to design the database).





Apriori Algorithm

Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.

It generates candidate item sets of length k from item sets of length k - 1.

Then it prunes the candidates that have an infrequent sub-pattern. According to the downward closure lemma, the pruned candidate set still contains all frequent k-length item sets.

After that, it scans the transaction database to determine the frequent item sets among the candidates.

To determine frequent item sets quickly, the algorithm stores the candidate item sets in a hash tree.



Apriori Itemset Generation

Pass 1

Generate the candidate itemsets in C1

Save the frequent itemsets in L1

Pass k

Generate the candidate itemsets in Ck from the frequent itemsets in L(k-1)

Join L(k-1) p with L(k-1) q, as follows:

insert into Ck
select p.item1, p.item2, ..., p.item(k-1), q.item(k-1)
from L(k-1) p, L(k-1) q
where p.item1 = q.item1, ..., p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)

Generate all (k-1)-subsets from the candidate itemsets in Ck

Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset L(k-1)

Scan the transaction database to determine the support for each candidate itemset in Ck

Save the frequent itemsets in Lk
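The join and prune steps of pass k can be sketched in Python as below. This is only an illustration under assumptions, not the project's Java code: the function name generate_candidates is made up, itemsets are represented as sorted tuples, and the sample L2 is the list of frequent 2-itemsets from the worked example in this deck.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build C_k from L_(k-1): join step followed by prune step.

    prev_frequent is a set of sorted tuples, each of length k-1.
    """
    prev = sorted(prev_frequent)
    joined = set()
    # Join: pair itemsets that agree on their first k-2 items.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Prune: drop a candidate if any (k-1)-subset is not frequent.
    return {
        c for c in joined
        if all(sub in prev_frequent for sub in combinations(c, k - 1))
    }

# Frequent 2-itemsets taken from the worked example (illustrative input).
l2 = {("A", "C"), ("A", "E"), ("A", "F"), ("C", "D"), ("C", "E"),
      ("C", "F"), ("C", "G"), ("E", "F"), ("E", "G")}
c3 = generate_candidates(l2, 3)
print(sorted(c3))
```

For this input, candidates such as (C, D, E) or (C, F, G) are formed by the join but removed by the prune step, because (D, E) and (F, G) are not in L2. The survivors still have to be counted against the database before they enter L3.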



Example

Assume the user-specified minimum support is 40%, then generate all frequent itemsets. Given: the transaction database shown below.

TID  A  B  C  D  E  F  G
T1   1  0  1  0  0  1  0
T2   0  0  1  1  0  0  1
T3   1  1  0  0  1  0  0
T4   0  0  1  0  1  0  1
T5   1  0  1  0  1  1  0
T6   0  0  0  1  1  0  0
T7   0  0  1  0  1  1  1
T8   1  0  0  0  0  1  0
T9   0  1  1  1  0  0  0
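Counting the support of each single item amounts to summing one column of this table. A minimal Python sketch follows; the names ITEMS, ROWS and item_supports are illustrative and not taken from the project.

```python
ITEMS = "ABCDEFG"

# One string per transaction T1..T9; character j is the 0/1 flag for ITEMS[j].
ROWS = [
    "1010010",  # T1
    "0011001",  # T2
    "1100100",  # T3
    "0010101",  # T4
    "1010110",  # T5
    "0001100",  # T6
    "0010111",  # T7
    "1000010",  # T8
    "0111000",  # T9
]

def item_supports(rows, items):
    """Return {item: number of transactions that contain it}."""
    return {
        item: sum(int(row[j]) for row in rows)
        for j, item in enumerate(items)
    }

supports = item_supports(ROWS, ITEMS)
for item, count in supports.items():
    print(item, count, f"{100 * count / len(ROWS):.0f}%")
```

Dividing each count by the nine transactions gives the support as a percentage, which is the figure a percentage threshold would be compared against.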



Pass 1

C1

Itemset X   supp(X)
A           4
B           2
C           6
D           3
E           5
F           4
G           3

L1 after pruning

A, B, C, D, E, F, G



Pass 2

C2

Itemset X   supp(X)
A,B         1
A,C         2
A,D         0
A,E         2
A,F         3
A,G         0
B,C         1
B,D         1
B,E         1
B,F         0
B,G         0
C,D         2
C,E         3
C,F         3
C,G         3
D,E         1
D,F         0
D,G         1
E,F         2
E,G         2
F,G         1

L2 after pruning

A,C
A,E
A,F
C,D
C,E
C,F
C,G
E,F
E,G



Pass 3

C3

Join              Itemset X   supp(X)
AC with AD        A,C,D       0
AC with AE        A,C,E       1
AC with AF        A,C,F       2
AC with AG        A,C,G       0
AE with AF        A,E,F       1
AE with AG        A,E,G       0
CE with CF        C,E,F       2
CE with CG        C,E,G       2

L3 after pruning

A,C,F
C,E,F
C,E,G



Pass 4

C4

Itemset X   supp(X)
C,E,F,G     1

Pass 5

For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.
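The passes above can be reproduced with a compact level-wise loop. The sketch below is a plain Python illustration, not the project's implementation. It assumes a minimum support count of 2, which is the cutoff the L2 and L3 lists in the worked tables are consistent with; the names TRANSACTIONS, support and apriori are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "C", "F"}, {"C", "D", "G"}, {"A", "B", "E"},
    {"C", "E", "G"}, {"A", "C", "E", "F"}, {"D", "E"},
    {"C", "E", "F", "G"}, {"A", "F"}, {"B", "C", "D"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def apriori(transactions, min_count):
    """Return {k: sorted list of frequent k-itemsets} (assumed threshold)."""
    items = sorted(set().union(*transactions))
    level = {(i,) for i in items if support((i,), transactions) >= min_count}
    result, k = {}, 1
    while level:
        result[k] = sorted(level)
        k += 1
        prev = sorted(level)
        # Join itemsets sharing their first k-2 items.
        joined = {p + (q[-1],)
                  for i, p in enumerate(prev)
                  for q in prev[i + 1:] if p[:-1] == q[:-1]}
        # Prune candidates that have an infrequent (k-1)-subset.
        pruned = {c for c in joined
                  if all(s in level for s in combinations(c, k - 1))}
        # Count the survivors against the database.
        level = {c for c in pruned if support(c, transactions) >= min_count}
    return result

levels = apriori(TRANSACTIONS, min_count=2)
for k, itemsets in levels.items():
    print(k, itemsets)
```

Each iteration of the while loop corresponds to one pass, and every pass calls support once per surviving candidate, so the database is scanned repeatedly.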



Disadvantages:

•If any one of the frequent item sets becomes longer, the algorithm has to go through many iterations, and as a result the performance decreases in terms of response time.

•During the execution, every frequent item set is explicitly considered.

•A large number of scans is required.



In our proposed system, we implement association rules using the Boolean matrix algorithm.

•It operates in a top-down process.

•It scans the database only once.

•So the search (response) time becomes shorter.

•It finds the maximum frequent itemsets that meet the minimum support in a short time through vector and matrix operations.

•It works better than most currently existing algorithms, such as Apriori.



Boolean matrix

• The Boolean matrix algorithm needs to scan the database only once. Because the structure of the Boolean matrix is simple, it is easy to understand and easy to compute, without generating a large number of candidate item sets.

• Because the database is translated into matrix files and these files are very small, much less time is spent scanning the database, so the algorithm is efficient.

• This is better and faster than the Apriori approach.

• Thus the technique evaluates the required data much more quickly than most currently existing algorithms.



For the same example, consider the original matrix R and the transpose T of that matrix.

R (one row per transaction T1 to T9, one column per item A to G) =

1 0 1 0 0 1 0
0 0 1 1 0 0 1
1 1 0 0 1 0 0
0 0 1 0 1 0 1
1 0 1 0 1 1 0
0 0 0 1 1 0 0
0 0 1 0 1 1 1
1 0 0 0 0 1 0
0 1 1 1 0 0 0

T (one row per item A to G, one column per transaction T1 to T9) =

1 0 1 0 1 0 0 1 0
0 0 1 0 0 0 0 0 1
1 1 0 1 1 0 1 0 1
0 1 0 0 0 1 0 0 1
0 0 1 1 1 1 1 0 0
1 0 0 0 1 0 1 1 0
0 1 0 1 0 0 1 0 0

Now multiply these two matrices to get S = R x T:

3 1 1 1 3 0 2 2 1
1 3 0 2 1 1 2 0 2
1 0 3 1 2 1 1 1 1
1 2 1 3 2 1 3 0 1
3 1 1 2 3 1 3 2 1
0 1 1 1 1 2 1 0 1
2 2 1 3 3 1 4 1 1
2 0 1 0 2 0 1 2 0
1 2 1 1 1 1 1 0 2
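The product of R with its transpose can be computed mechanically. The following is a minimal pure-Python sketch (no matrix library assumed); the helper names transpose and matmul are illustrative.

```python
R = [
    [1, 0, 1, 0, 0, 1, 0],  # T1
    [0, 0, 1, 1, 0, 0, 1],  # T2
    [1, 1, 0, 0, 1, 0, 0],  # T3
    [0, 0, 1, 0, 1, 0, 1],  # T4
    [1, 0, 1, 0, 1, 1, 0],  # T5
    [0, 0, 0, 1, 1, 0, 0],  # T6
    [0, 0, 1, 0, 1, 1, 1],  # T7
    [1, 0, 0, 0, 0, 1, 0],  # T8
    [0, 1, 1, 1, 0, 0, 0],  # T9
]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Ordinary matrix product of two lists of lists."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

T = transpose(R)
S = matmul(R, T)

for row in S:
    print(" ".join(str(v) for v in row))
```

Entry S[i][j] counts the items that transactions i+1 and j+1 have in common, so S is symmetric. Each diagonal entry equals the number of 1s in the corresponding row of R, which gives a quick check on hand-computed entries.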



Consider the upper triangular matrix Su:

3 1 1 1 3 0 2 2 1
  3 0 2 1 1 2 0 2
    3 1 2 1 1 1 1
      3 2 1 3 0 1
        3 1 3 2 1
          2 1 0 1
            4 1 1
              2 0
                2



Consider the maximum element on the longest diagonal and find the rows in which that element occurs.

The maximum element is 4.

So find the rows in which 4 occurs: t77 = 4, B7(7,1).

It is less than the min sup value, so there are no 4-frequent itemsets.

The next maximum element is 3: t11 = 3, t22 = 3, t33 = 3, t44 = 3, t55 = 3.

B1(1,5,2); B4(4,7,2); B5(5,7,2)

First consider B1(1,5,2). The logical AND of a1 and a5 = 1 0 1 0 0 1 0.

This indicates B1 = {A,C,F}.




Similarly, B4 and B5 are obtained by performing the logical AND of a4 and a7, and of a5 and a7, respectively.

So the 3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }.

Similarly, the 2-frequent itemsets are generated:

{ (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }

Therefore, both algorithms give the same output.

2-frequent itemsets are { (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }

3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }
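The row-wise logical AND used to obtain B1, B4 and B5 can be sketched as follows. This is an illustration in Python with made-up helper names (and_rows, items_of, support_count); it only reproduces the AND step and a support count, not the full Boolean matrix procedure.

```python
ITEMS = "ABCDEFG"

R = [
    [1, 0, 1, 0, 0, 1, 0],  # a1
    [0, 0, 1, 1, 0, 0, 1],  # a2
    [1, 1, 0, 0, 1, 0, 0],  # a3
    [0, 0, 1, 0, 1, 0, 1],  # a4
    [1, 0, 1, 0, 1, 1, 0],  # a5
    [0, 0, 0, 1, 1, 0, 0],  # a6
    [0, 0, 1, 0, 1, 1, 1],  # a7
    [1, 0, 0, 0, 0, 1, 0],  # a8
    [0, 1, 1, 1, 0, 0, 0],  # a9
]

def and_rows(rows, indices):
    """Column-wise AND over the chosen rows (1-based, as in a1, a5)."""
    return [int(all(rows[i - 1][j] for i in indices))
            for j in range(len(rows[0]))]

def items_of(mask):
    """Names of the items whose bit is set in the mask."""
    return {ITEMS[j] for j, bit in enumerate(mask) if bit}

def support_count(mask, rows):
    """Number of rows that contain every item set in the mask."""
    return sum(1 for r in rows if all(x >= m for x, m in zip(r, mask)))

for pair in [(1, 5), (4, 7), (5, 7)]:
    mask = and_rows(R, pair)
    print(pair, mask, sorted(items_of(mask)), support_count(mask, R))
```

The three row pairs yield {A,C,F}, {C,E,G} and {C,E,F}, and each of these itemsets occurs in two of the nine transactions.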



Advantages:

•Helps in improving the marketing tactics.

•Used for collaborative filtering in setting business trends.

•Comprehensive analysis of the customer choices.

•Comprehensive analysis of the product demand inflations.

•Helps to minimize losses and maximize profits.



Performance Analysis:

Assumption

1. Apriori's Best Case == Boolean's Best Case

2. Apriori's Worst Case == Boolean's Average Case

Axiom

Let i = 1, and let k be the total number of candidate item sets.

1. Best Case: most frequent item set evaluated in pass = i.

2. Average Case: most frequent item set evaluated in pass > i && < k.

3. Worst Case: most frequent item set



Proof by Example:

Best Case

T1={bread} T2={bread} T3={bread} T4={bread}. Let min supp = 50%.

Apriori

Pass 1:

C1: bread - 100% (not pruned)

MFS = { {bread} } stop!

Boolean

R =
1
1
1
1

T = 1 1 1 1

C =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

U =
1 1 1 1
  1 1 1
    1 1
      1

Pass 1:

C1: bread - 100% (not pruned)

MFS = { {bread} }

MFCS = { {bread} - 100% } not pruned, stop!



Average Case

T1={bread, jam, sugar, cheese} T2={jam, sugar, cheese} T3={sugar, cheese} T4={cheese}. Let min supp = 50%.

Apriori

Pass 1:

C1:
bread - 25% (pruned)
jam - 50% (not pruned)
sugar - 75% (not pruned)
cheese - 100% (not pruned)

MFS = { {jam}, {sugar}, {cheese} }

Pass 2:

C2:
{jam, sugar} - 50% (not pruned)
{jam, cheese} - 50% (not pruned)
{sugar, cheese} - 75% (not pruned)

MFS = { {jam, sugar}, {jam, cheese}, {sugar, cheese} }

Pass 3:

C3:
{jam, sugar, cheese} - 50% (not pruned)

MFS = { {jam, sugar, cheese} }



Boolean

R =
1 1 1 1
0 1 1 1
0 0 1 1
0 0 0 1

T =
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1

C =
4 3 2 1
3 3 2 1
2 2 2 1
1 1 1 1

U =
4 3 2 1
  3 2 1
    2 1
      1

Pass 1:

Max element = 4

4 doesn't exist in any row, so there are no 4-frequent itemsets.

Next element = 3

U(1,1) = 4; U(1,2) = 3

B(1,2,2) = 3 = 0 1 1 1

So, the 3-frequent itemsets are { {jam, sugar, cheese} }

Pass 2:

Next max element = 2

B(1,2,3,3) = { {sugar, cheese}, {jam, sugar}, {jam, cheese} }

B(2,3,2) = { {sugar, cheese} }

So, the 2-frequent itemsets are { {sugar, cheese}, {jam, sugar}, {jam, cheese} }

Thus, the algorithm terminates in two passes.
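The matrix C and the mask B(1,2,2) = 0 1 1 1 for this average-case example can be checked with a few lines of Python. This is a sketch under assumptions: the function names are illustrative, and the support check is a straightforward row scan added here for verification rather than a step taken from the slides.

```python
ITEMS = ["bread", "jam", "sugar", "cheese"]

R = [
    [1, 1, 1, 1],  # T1
    [0, 1, 1, 1],  # T2
    [0, 0, 1, 1],  # T3
    [0, 0, 0, 1],  # T4
]

def cooccurrence(rows):
    """C[i][j] = number of items shared by transactions i and j (R times its transpose)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows]
            for r1 in rows]

def and_rows(rows, i, j):
    """Element-wise AND of two transaction rows (1-based indices)."""
    return [a & b for a, b in zip(rows[i - 1], rows[j - 1])]

def support_percent(mask, rows):
    """Percentage of transactions that contain every item set in the mask."""
    hits = sum(1 for r in rows if all(x >= m for x, m in zip(r, mask)))
    return 100 * hits / len(rows)

C = cooccurrence(R)
mask = and_rows(R, 1, 2)
itemset = [name for name, bit in zip(ITEMS, mask) if bit]

for row in C:
    print(row)
print(mask, itemset, support_percent(mask, R))
```

The AND of rows 1 and 2 leaves jam, sugar and cheese, and two of the four transactions contain all three, which matches the 50% reported for {jam, sugar, cheese} in the Apriori passes.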



Worst Case

T1={bread, jam, sugar, cheese} T2={bread, jam, sugar, cheese}. Let min supp = 50%.

Apriori

Pass 1:

C1:
bread - 100%, jam - 100%, sugar - 100%, cheese - 100% (nothing pruned)

MFS = { {jam}, {sugar}, {cheese}, {bread} }

Pass 2:

C2:
{bread, jam} - 100%, {bread, sugar} - 100%, {bread, cheese} - 100%, {sugar, jam} - 100%, {cheese, jam} - 100%, {sugar, cheese} - 100% (nothing pruned)

MFS = { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }

Pass 3:

C3:
{bread, jam, cheese} - 100%, {bread, jam, sugar} - 100%, {bread, sugar, cheese} - 100%, {jam, sugar, cheese} - 100% (nothing pruned)

MFS = { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }

Pass 4:

C4:
{bread, jam, sugar, cheese} - 100% (not pruned)

MFS = { {jam, sugar, cheese, bread} }



Boolean

R =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

T =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

C =
4 4 4 4
4 4 4 4
4 4 4 4
4 4 4 4

U =
4 4 4 4
  4 4 4
    4 4
      4

Pass 1:

Max element = 4

So, the 4-frequent itemsets are { {jam, sugar, cheese, bread} }

Hence the 3-frequent itemsets are { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }

The 2-frequent itemsets are { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }
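When every transaction contains every item, each non-empty subset of the items is frequent. The brute-force enumeration below is neither Apriori nor the Boolean matrix algorithm; it is only a check on how many frequent itemsets exist per size, and the names used are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"bread", "jam", "sugar", "cheese"},  # T1
    {"bread", "jam", "sugar", "cheese"},  # T2
]
MIN_SUPPORT = 0.5

def frequent_by_size(transactions, min_support):
    """Enumerate every item combination and keep those meeting min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        frequent = [
            c for c in combinations(items, k)
            if sum(set(c) <= t for t in transactions) / len(transactions)
            >= min_support
        ]
        if not frequent:
            break
        result[k] = frequent
    return result

levels = frequent_by_size(TRANSACTIONS, MIN_SUPPORT)
print({k: len(v) for k, v in levels.items()})
```

With four items this gives 4, 6, 4 and 1 frequent itemsets of sizes 1 to 4, so a level-wise method has to run through all four passes before it reaches the single 4-itemset.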



Apriori Time Complexity

•In the best case there is one pass, so O(1)

•In the worst case all items are considered, so O(n)

•In the average case only about half the number of items are taken, so O(n-k)

Boolean matrix Time Complexity

•Best case: O(1)

•Average case: O(n)

•Worst case: O(n)



Dataflow diagram

UML Diagrams

Sequence diagram for Apriori

Sequence diagram for Boolean

Sequence diagram for pruning the item sets



Testing:

Testing is the process of executing a program with the intent of finding an error.

Usability Test:

Test Case Id | Input Format | Input data | Expected | Obtained
UT_1.1 | Time string: 'morning' or 'afternoon' or 'evening' or 'night' | morning | No Error | No Error
UT_1.2 | Hrs:mins:secs | 11:35:65 | No Error | Error
UT_1.3 | Hrs:mins | 03:75 | No Error | Error
UT_1.4 | Hrs | 32 | No Error | Error
UT_1.5 | Random character string | H2 | Error | Error
UT_1.6 | Integer Number | 12.9 | Error | Error
UT_1.7 | Floating Point | 15 | Error | Error
UT_1.8 | Alphanumeric | @w | Error | Error
UT_1.9 | Special Characters | 9,0 | Error | Error
UT_1.10 | Empty String | (empty) | Error | Error

Path testing:

Test Case Id | Input Module | Next Module (Expected) | Next Module (Obtained)
PT_1 | Receive user inputs | Retrieve raw candidate transaction table | Raw candidate transaction table retrieved
PT_2 | Retrieve raw candidate transaction table | Construct binary candidate transaction table | Binary candidate transaction table constructed
PT_3 | Binary candidate transaction table constructed | Algorithm begins | Algorithm began
PT_4 | Algorithm began | Subset selection | Subset selected
PT_5 | Subset selected | Support evaluation | Support evaluated
PT_6 | Support evaluated | Prune | Pruned
PT_7 | Pruned | Most frequent item set evaluation | Most frequent item set evaluated
PT_8 | Most frequent item sets evaluated | Frame association rules | Association rules framed
PT_9 | Association rules framed | Pictorial representation using paint method | Pictorial representation began
PT_10 | Pictorial representation began | Draw bar chart | Bar chart drawn
PT_11 | Bar chart drawn | Draw pie chart | Pie chart drawn
PT_12 | Pie chart drawn | Draw support for single pass graph | Support for single pass graph drawn
PT_13 | Support for single pass graph drawn | Draw time complexity graph | Time complexity graph drawn



Conclusion:

Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules.

The Boolean matrix algorithm (BMA) overcomes this difficulty efficiently, better than most currently existing algorithms, including Apriori.

We evaluate the performance of the algorithm using well-known synthetic benchmark databases and real-life census and stock market databases.



Further Developments:

Parallelizing the Boolean matrix algorithm: this is a way to minimize duplicate calculations and to maximize the use of available processors.




