Guide: Mr. D. S. SHARMA, M.Tech, (Ph.D.), Associate Professor, Dept. of CSIT, Sri Sivani College of Engineering

K. KISHORE 07W61A0522
CH. SAINATH 07W61A0510
Y. RAMESH 07W61A0542
K. NAVYA 07W61A0525
B. MANASA 07W61A0529

finalmain prjct ppt 24-3-11


8/7/2019 finalmain prjct ppt 24-3-11

http://slidepdf.com/reader/full/finalmain-prjct-ppt-24-3-11 1/36



Traditionally, business analysts performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches.

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.

A typical and widely used example of association rule mining is Market Basket Analysis.

This project is built on the Boolean matrix algorithm, a technique for implementing association rules under bi-directional search. Our main objective is to make a comparative analysis between the Apriori and the Boolean matrix algorithms, so as to prove ...


Hardware Requirements

• Processor: Intel Pentium 4 or equivalent or above.
  (Greater clock speeds and FSB speeds with single-core mainstream compatibility.)

• RAM: Minimum of 256 MB or higher.
  (Oracle 8i alone requires a minimum of 256 MB of RAM.)

• Video card: Average: 32 bit; recommended: 64 bit.
  (Makes output screens such as graphs, bar charts and pie charts appear sharper and better.)

• HDD: 20 GB or higher.
  (The software installed along with the database takes a minimum of 20 GB of hard disk space.)

• Monitor: 15'' or 17'' color monitor. Screen resolution: 1024 by 768 pixels. Color quality: 32 bit. Color scheme: Windows Standard.
  (This is the standard configuration of most PCs, with optimized viewable space.)

• Mouse: Scroll mouse or optical mouse.


Software Requirements

• Operating system: Windows XP (almost all versions), Unix, or any operating system that can support Java.

• Front-end: Java, displayed in the form of applets and frames. (Almost all versions of Java are compatible to run the code.)

• Back-end: Oracle 10g. (This is used to design the database.)


Apriori Algorithm

Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.

It generates candidate itemsets of length k from itemsets of length k − 1. Then it prunes the candidates that have an infrequent sub-pattern.

According to the downward closure lemma, the candidate set contains all frequent k-length itemsets.

After that, it scans the transaction database to determine the frequent itemsets among the candidates.

To determine frequent itemsets quickly, the algorithm uses a hash tree to store the candidate itemsets.


Apriori Itemset Generation

Pass 1

  Generate the candidate itemsets in C1.

  Save the frequent itemsets in L1.

Pass k

  Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1.

  Join Lk-1 p with Lk-1 q, as follows:

    insert into Ck
    select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  Generate all (k-1)-subsets from the candidate itemsets in Ck.

  Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1.

  Scan the transaction database to determine the support for each candidate itemset in Ck.

  Save the frequent itemsets in Lk.
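The join and prune steps of pass k can be sketched in Python as follows. This is a minimal illustration, not the project's Java code: the function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for the example.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    prev_frequent: iterable of sorted tuples, each of length k-1 (assumed format).
    """
    prev = sorted(set(prev_frequent))
    prev_set = set(prev)
    candidates = []
    # Join step: pair itemsets that agree on their first k-2 items.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:k - 2] != q[:k - 2]:
                break  # sorted order: no later q shares this prefix
            cand = p + (q[-1],)
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates


# Hypothetical input: a small set of frequent 2-itemsets.
L2 = [("A", "C"), ("A", "E"), ("A", "F"), ("C", "E"), ("C", "F")]
C3 = apriori_gen(L2, 3)
```

Here `A,E,F` is joined but then pruned because `E,F` is not in the input, so only `A,C,E` and `A,C,F` remain as candidates. Support counting against the database is a separate step.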


Example

Assume the user-specified minimum support is 40%, then generate all frequent itemsets.

Given: the transaction database shown below.

TID  A  B  C  D  E  F  G
T1   1  0  1  0  0  1  0
T2   0  0  1  1  0  0  1
T3   1  1  0  0  1  0  0
T4   0  0  1  0  1  0  1
T5   1  0  1  0  1  1  0
T6   0  0  0  1  1  0  0
T7   0  0  1  0  1  1  1
T8   1  0  0  0  0  1  0
T9   0  1  1  1  0  0  0
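The support counts for this database can be checked with a short Python sketch. The names `TRANSACTIONS` and `count_support` are illustrative assumptions; each transaction is written as the set of items whose column holds a 1 in the table above.

```python
from itertools import combinations

# One set of items per transaction, read off the 0/1 table.
TRANSACTIONS = [
    {"A", "C", "F"},        # T1
    {"C", "D", "G"},        # T2
    {"A", "B", "E"},        # T3
    {"C", "E", "G"},        # T4
    {"A", "C", "E", "F"},   # T5
    {"D", "E"},             # T6
    {"C", "E", "F", "G"},   # T7
    {"A", "F"},             # T8
    {"B", "C", "D"},        # T9
]


def count_support(transactions, candidates):
    """Return {candidate: number of transactions containing all of its items}."""
    counts = {tuple(c): 0 for c in candidates}
    for t in transactions:              # a single scan of the database
        for c in counts:
            if t.issuperset(c):
                counts[c] += 1
    return counts


items = sorted(set().union(*TRANSACTIONS))
single = count_support(TRANSACTIONS, [(i,) for i in items])
pairs = count_support(TRANSACTIONS, combinations(items, 2))
```

With this data the single-item counts come out as A 4, B 2, C 6, D 3, E 5, F 4, G 3. Dividing a count by 9 (the number of transactions) gives the support as a fraction.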


Pass 1

C1

Itemset X   supp(X)
A           4
B           2
C           6
D           3
E           5
F           4
G           3

L1 after pruning: A, B, C, D, E, F, G


Pass 2

C2

Itemset X   supp(X)      Itemset X   supp(X)
A,B         1            C,D         2
A,C         2            C,E         3
A,D         0            C,F         3
A,E         2            C,G         3
A,F         3            D,E         1
A,G         0            D,F         0
B,C         1            D,G         1
B,D         1            E,F         2
B,E         1            E,G         2
B,F         0            F,G         1
B,G         0

L2 after pruning: A,C; A,E; A,F; C,D; C,E; C,F; C,G; E,F; E,G


Pass 3

C3

Join               Itemset X   supp(X)
Join AC with AD    A,C,D       0
Join AC with AE    A,C,E       1
Join AC with AF    A,C,F       2
Join AC with AG    A,C,G       0
Join AE with AF    A,E,F       1
Join AE with AG    A,E,G       0
Join CE with CF    C,E,F       2
Join CE with CG    C,E,G       2

L3 after pruning: A,C,F; C,E,F; C,E,G


Pass 4

C4

Itemset X   supp(X)
C,E,F,G     1

Pass 5

For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.


Disadvantages:

• If any one of the frequent itemsets becomes longer, the algorithm has to go through many iterations and, as a result, the performance decreases in terms of response time.

• During the execution, every frequent itemset is explicitly considered.

• A number of scans is required.


In our proposed system, we implement association rules using the Boolean matrix algorithm.

• It operates in a top-down process.

• It scans the database only once.

• So the search time (response time) becomes shorter.

• It finds the maximum frequent itemsets that meet the minimum support in a short time through vector and matrix operations.

• It works better than most currently existing algorithms, such as Apriori.


Boolean matrix

• The Boolean matrix algorithm needs to scan the database only once. Because the structure of the Boolean matrix is simple, it is easy to understand and easy to compute, without generating a large number of candidate itemsets.

• Because the databases are translated into matrix files and these files are very small, much less time is spent scanning the database, so the algorithm is efficient.

• This is better and faster than the Apriori approach.

• Thus the technique evaluates the required data much more quickly than most currently existing algorithms.
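The translation of a transaction database into a Boolean matrix can be sketched as below. This is an illustrative assumption about the representation (one row per transaction, one column per item); the function name `to_boolean_matrix` is hypothetical and the slides do not specify the file format used by the project.

```python
def to_boolean_matrix(transactions, items=None):
    """Return (items, matrix), where matrix[t][c] is 1 if transaction t holds items[c]."""
    if items is None:
        # Fix a column order: every item seen in the database, sorted.
        items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


# Hypothetical two-transaction database, read in a single scan.
items, matrix = to_boolean_matrix([{"bread", "jam"}, {"jam"}])
```

After this one pass the remaining work can be done on the matrix alone, which is the property the bullets above rely on.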


For the same example,

Consider the original matrix R and the transpose T of that matrix:

R =
1 0 1 0 0 1 0
0 0 1 1 0 0 1
1 1 0 0 1 0 0
0 0 1 0 1 0 1
1 0 1 0 1 1 0
0 0 0 1 1 0 0
0 0 1 0 1 1 1
1 0 0 0 0 1 0
0 1 1 1 0 0 0

T =
1 0 1 0 1 0 0 1 0
0 0 1 0 0 0 0 0 1
1 1 0 1 1 0 1 0 1
0 1 0 0 0 1 0 0 1
0 0 1 1 1 1 1 0 0
1 0 0 0 1 0 1 1 0
0 1 0 1 0 0 1 0 0

Now multiply both these matrices (S = R × T); we get

S =
3 1 1 1 3 0 2 2 1
1 3 0 2 1 1 2 0 2
1 0 3 1 2 1 1 1 1
1 2 1 3 2 1 3 0 1
3 1 2 2 4 1 3 2 1
0 1 1 1 1 2 1 0 1
2 2 1 3 3 1 4 1 1
2 0 1 0 2 0 1 2 0
1 2 1 1 1 1 1 0 3

Consider the upper triangular matrix

Su =
3 1 1 1 3 0 2 2 1
  3 0 2 1 1 2 0 2
    3 1 2 1 1 1 1
      3 2 1 3 0 1
        4 1 3 2 1
          2 1 0 1
            4 1 1
              2 0
                3

Consider the maximum element on the longest diagonal and find the rows in which it occurs.

The maximum element is 4, at s55 and s77, giving B5(5,1) and B7(7,1). Each count is 1 because rows 5 and 7 share only s57 = 3 items. This is less than the minimum support value, so there are no 4-frequent itemsets.

The next maximum element is 3: s11 = 3, s22 = 3, s33 = 3, s44 = 3 and s99 = 3 on the diagonal, and s15 = 3, s47 = 3 and s57 = 3 above it. The off-diagonal entries give B1(1,5,2), B4(4,7,2) and B5(5,7,2).

First consider B1(1,5,2). The logical AND of rows a1 and a5 is 1 0 1 0 0 1 0.

This indicates B1 = {A,C,F}.
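The matrix product and the row-wise AND can be reproduced with a short Python sketch. It only illustrates the two operations used in this walkthrough, under the assumption that an off-diagonal entry of S equal to 3 marks a pair of rows sharing exactly three items; it is not a complete implementation of the Boolean matrix algorithm, and the names `cooccurrence` and `shared_itemsets` are hypothetical.

```python
ITEMS = "ABCDEFG"
R = [
    [1, 0, 1, 0, 0, 1, 0],  # T1
    [0, 0, 1, 1, 0, 0, 1],  # T2
    [1, 1, 0, 0, 1, 0, 0],  # T3
    [0, 0, 1, 0, 1, 0, 1],  # T4
    [1, 0, 1, 0, 1, 1, 0],  # T5
    [0, 0, 0, 1, 1, 0, 0],  # T6
    [0, 0, 1, 0, 1, 1, 1],  # T7
    [1, 0, 0, 0, 0, 1, 0],  # T8
    [0, 1, 1, 1, 0, 0, 0],  # T9
]


def cooccurrence(r):
    """S = R x R^T: S[i][j] counts the items shared by rows i and j."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in r] for ri in r]


def shared_itemsets(r, size):
    """Itemsets of exactly `size` items shared by some pair of distinct rows."""
    s = cooccurrence(r)
    found = {}
    for i in range(len(r)):
        for j in range(i + 1, len(r)):          # upper triangle only
            if s[i][j] == size:
                anded = [a & b for a, b in zip(r[i], r[j])]   # logical AND of the rows
                itemset = "".join(x for x, bit in zip(ITEMS, anded) if bit)
                found.setdefault(itemset, []).append((i + 1, j + 1))
    return found


S = cooccurrence(R)
triples = shared_itemsets(R, 3)
```

The diagonal of `S` holds the number of items in each transaction, and `triples` maps each shared 3-itemset to the 1-based row pairs that contain it.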


Similarly, B4 and B5 are obtained by performing the logical AND of a4 and a7, and of a5 and a7, respectively.

So the 3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }

Similarly, the 2-frequent itemsets are generated:

{ (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }

Therefore, both algorithms give the same output.
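One way to check these lists with vector operations is to keep one bit vector per item (a column of the Boolean matrix) and AND the columns together. The sketch below is an assumption-laden illustration: `MIN_COUNT = 2` is chosen because it reproduces the lists in this example, it is not taken from the stated 40% figure, and the helper names are hypothetical.

```python
from functools import reduce
from itertools import combinations
from operator import and_

ITEMS = "ABCDEFG"
ROWS = ["1010010", "0011001", "1100100", "0010101", "1010110",
        "0001100", "0010111", "1000010", "0111000"]   # T1..T9
MIN_COUNT = 2   # assumed absolute support threshold


def item_vectors(rows):
    """One integer per item; bit t is set when transaction t contains the item."""
    vectors = {}
    for col, item in enumerate(ITEMS):
        bits = 0
        for t, row in enumerate(rows):
            if row[col] == "1":
                bits |= 1 << t
        vectors[item] = bits
    return vectors


def support(itemset, vectors):
    """Support count = number of set bits in the AND of the item columns."""
    anded = reduce(and_, (vectors[i] for i in itemset))
    return bin(anded).count("1")


V = item_vectors(ROWS)
frequent_pairs = ["".join(p) for p in combinations(ITEMS, 2)
                  if support(p, V) >= MIN_COUNT]
frequent_triples = ["".join(p) for p in combinations(ITEMS, 3)
                    if support(p, V) >= MIN_COUNT]
```

Enumerating every combination is acceptable for seven items; it is used here only to verify the result, not as an efficient search.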


Advantages:

• Helps in improving marketing tactics.

• Used for collaborative filtering in setting business trends.

• Comprehensive analysis of customer choices.

• Comprehensive analysis of product demand inflations.

• Helps to minimize losses and maximize profits.


Performance Analysis:

Assumption

1. Apriori's best case == Boolean's best case

2. Apriori's worst case == Boolean's average case

Axiom

Let i = 1, and let k be the total number of candidate itemsets.

1. Best case: most frequent itemset evaluated in pass = i.

2. Average case: most frequent itemset evaluated in pass > i && < k.

3. Worst case: most frequent itemset ...


Proof by Example:

Best Case

T1 = {bread}  T2 = {bread}  T3 = {bread}  T4 = {bread}
Let min supp = 50%

Apriori

Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }
stop!

Boolean

R =  1      T =  1 1 1 1
     1
     1
     1

C =  1 1 1 1      U =  1 1 1 1
     1 1 1 1             1 1 1
     1 1 1 1               1 1
     1 1 1 1                 1

Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }
MFCS = { {bread} - 100% } not pruned
stop!


Average Case

T1 = {bread, jam, sugar, cheese}  T2 = {jam, sugar, cheese}  T3 = {sugar, cheese}  T4 = {cheese}
Let min supp = 50%

Apriori

Pass 1:
C1:
bread - 25% (pruned)
jam - 50% (not pruned)
sugar - 75% (not pruned)
cheese - 100% (not pruned)
MFS = { {jam}, {sugar}, {cheese} }

Pass 2:
C2:
{jam, sugar} - 50% (not pruned)
{jam, cheese} - 50% (not pruned)
{sugar, cheese} - 75% (not pruned)
MFS = { {jam, sugar}, {jam, cheese}, {sugar, cheese} }

Pass 3:
C3:
{jam, sugar, cheese} - 50% (not pruned)
MFS = { {jam, sugar, cheese} }


Boolean

R =  1 1 1 1    T =  1 0 0 0    C =  4 3 2 1    U =  4 3 2 1
     0 1 1 1         1 1 0 0         3 3 2 1           3 2 1
     0 0 1 1         1 1 1 0         2 2 2 1             2 1
     0 0 0 1         1 1 1 1         1 1 1 1               1

Pass 1:
Max element = 4
The value 4 occurs only at U(1,1), so there are no 4-frequent itemsets.
Next element = 3
U(1,1) = 4; U(1,2) = 3
B(1,2,2) = 3 = 0 1 1 1
So, the 3-frequent itemsets are { {jam, sugar, cheese} }

Pass 2:
Next max element = 2
B(1,2,3,3) = { {sugar, cheese}, {jam, sugar}, {jam, cheese} }
B(2,3,2) = { {sugar, cheese} }
So, the 2-frequent itemsets are { {sugar, cheese}, {jam, sugar}, {jam, cheese} }

Thus, the algorithm terminates in two passes.


Worst Case

T1 = {bread, jam, sugar, cheese}  T2 = {bread, jam, sugar, cheese}
Let min supp = 50%

Apriori

Pass 1:
C1:
bread - 100%, jam - 100%, sugar - 100%, cheese - 100% (nothing pruned)
MFS = { {jam}, {sugar}, {cheese}, {bread} }

Pass 2:
C2:
{bread, jam} - 100%, {bread, sugar} - 100%, {bread, cheese} - 100%, {sugar, jam} - 100%, {cheese, jam} - 100%, {sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }

Pass 3:
C3:
{bread, jam, cheese} - 100%, {bread, jam, sugar} - 100%, {bread, sugar, cheese} - 100%, {jam, sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }

Pass 4:
C4:
{bread, jam, sugar, cheese} - 100% (not pruned)
MFS = { {jam, sugar, cheese, bread} }


Boolean

R =  1 1 1 1    T =  1 1 1 1    C =  4 4 4 4    U =  4 4 4 4
     1 1 1 1         1 1 1 1         4 4 4 4           4 4 4
     1 1 1 1         1 1 1 1         4 4 4 4             4 4
     1 1 1 1         1 1 1 1         4 4 4 4               4

Pass 1:
Max element = 4
So, the 4-frequent itemsets are { {jam, sugar, cheese, bread} }

Hence the 3-frequent itemsets are { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }

The 2-frequent itemsets are { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }


Apriori Time Complexity

• In the best case there is one pass, so O(1).
• In the worst case all items are considered, so O(n).
• In the average case only about half the number of items are taken, so O(n-k).

Boolean matrix Time Complexity

• Best case: O(1)
• Average case: O(n)
• Worst case: O(n)
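The number of Apriori passes in the three proof-by-example databases can be counted with the sketch below. It is a simplified level-wise loop written for illustration: candidates are built as combinations of the surviving items whose (k-1)-subsets are all frequent, which yields the same candidate set as the join and prune steps, and the names `apriori_passes`, `BEST`, `AVERAGE` and `WORST` are assumptions. It measures only the pass count, not running time.

```python
from itertools import combinations

BEST = [{"bread"}] * 4
AVERAGE = [{"bread", "jam", "sugar", "cheese"}, {"jam", "sugar", "cheese"},
           {"sugar", "cheese"}, {"cheese"}]
WORST = [{"bread", "jam", "sugar", "cheese"}] * 2


def apriori_passes(transactions, min_fraction):
    """Return (database scans performed, list of frequent itemsets per level)."""
    min_count = min_fraction * len(transactions)
    items = sorted(set().union(*transactions))
    candidates = [(i,) for i in items]
    levels, passes, k = [], 0, 1
    while candidates:
        passes += 1                                   # one scan per level
        frequent = [c for c in candidates
                    if sum(1 for t in transactions if t.issuperset(c)) >= min_count]
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        kept = set(frequent)
        alive = sorted({i for c in frequent for i in c})
        # Next candidates: k-subsets whose (k-1)-subsets are all frequent.
        candidates = [c for c in combinations(alive, k)
                      if all(s in kept for s in combinations(c, k - 1))]
    return passes, levels


best_passes, _ = apriori_passes(BEST, 0.5)
avg_passes, avg_levels = apriori_passes(AVERAGE, 0.5)
worst_passes, worst_levels = apriori_passes(WORST, 0.5)
```

On these inputs the loop makes 1, 3 and 4 scans respectively, matching the pass counts in the Apriori walkthroughs of the best, average and worst cases.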


Data flow diagram


UML Diagrams

Sequence diagram for Apriori

Sequence diagram for Boolean


Sequence Diagram for pruning the Item sets


Testing:

Testing is the process of executing a program with the intent of finding an error.

Usability Test:

Test Case Id | Input Format | Input data | Expected | Obtained
UT_1.1 | Time string ('morning' or 'afternoon' or 'evening' or 'night') | morning | No Error | No Error
UT_1.2 | Hrs:mins:secs | 11:35:65 | No Error | Error
UT_1.3 | Hrs:mins | 03:75 | No Error | Error
UT_1.4 | Hrs | 32 | No Error | Error
UT_1.5 | Random character string | H2 | Error | Error
UT_1.6 | Integer Number | 12.9 | Error | Error
UT_1.7 | Floating Point | 15 | Error | Error
UT_1.8 | Alphanumeric | @w | Error | Error
UT_1.9 | Special Characters | 9,0 | Error | Error
UT_1.10 | Empty String | | Error | Error

Path testing:

Test Case Id | Input Module | Next Module (Expected) | Next Module (Obtained)
PT_1 | Receive user inputs | Retrieve raw candidate transaction table | Raw candidate transaction table retrieved
PT_2 | Retrieve raw candidate transaction table | Construct binary candidate transaction table | Binary candidate transaction table constructed
PT_3 | Binary candidate transaction table constructed | Algorithm begins | Algorithm began
PT_4 | Algorithm began | Subset selection | Subset selected
PT_5 | Subset selected | Support evaluation | Support evaluated
PT_6 | Support evaluated | Prune | Pruned
PT_7 | Pruned | Most frequent itemset evaluation | Most frequent itemset evaluated
PT_8 | Most frequent itemsets evaluated | Frame association rules | Association rules framed
PT_9 | Association rules framed | Pictorial representation using paint method | Pictorial representation began
PT_10 | Pictorial representation began | Draw bar chart | Bar chart drawn
PT_11 | Bar chart drawn | Draw pie chart | Pie chart drawn
PT_12 | Pie chart drawn | Draw support for single pass graph | Support for single pass graph drawn
PT_13 | Support for single pass graph drawn | Draw time complexity graph | Time complexity graph drawn


Conclusion:

Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules.

The Boolean matrix algorithm (BMA) overcomes this difficulty efficiently and can perform better than existing algorithms such as Apriori.

We evaluate the performance of the algorithm using well-known synthetic benchmark databases and real-life census and stock market databases.


Further Developments:

Parallelizing the Boolean matrix algorithm: this is a way to minimize duplicate calculations and to maximize the use of the available processors.
