8/7/2019 finalmain prjct ppt 24-3-11
http://slidepdf.com/reader/full/finalmain-prjct-ppt-24-3-11 1/36
Guide
Mr. D. S. SHARMA M.Tech, (Ph.D.,)
Associate Professor
Dept of CSIT
Sri sivani College Of Engg.
Sri sivani college of
engineering
K.KISHORE 07W61A0522
CH. SAINATH 07W61A0510
Y. RAMESH 07W61A0542
K. NAVYA 07W61A0525
B. MANASA 07W61A0529
Traditionally, business analysts performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches.
In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.
A typical and widely used example of association rule mining is Market Basket Analysis.
This project is built on the Boolean matrix algorithm, a technique for implementing association rules under bi-directional search.
Our main objective is to make a comparative analysis between the Apriori and the Boolean matrix algorithms, so as to prove ...
Hardware Requirements
• Processor: Intel Pentium 4 or equivalent or above. (Greater clock speeds and FSB speeds with single-core mainstream compatibility)
• RAM: Minimum of 256 MB or higher. (Since Oracle 8i alone requires a minimum RAM of 256 MB)
• Video card: Average: 32 bit, Recommended: 64 bit. (Makes output screens such as graphs, bar charts and pie charts appear sharper and better)
• HDD: 20 GB or higher (apt). (The total volume of software installed along with the database takes a minimum of 20 GB of hard disk space)
• Monitor: 15" or 17" color monitor. Screen resolution: 1024 by 768 pixels. Color quality: 32 bit. Color scheme: Windows Standard. (It is the standard configuration of most PCs with optimized viewable space)
• Mouse: Scroll mouse or optical mouse.
Software Requirements
• Operating system: Windows XP (almost all versions) or Unix (or any operating system that can support Java)
• Front-end: Java (displayed in the form of applets and frames) (almost all versions of Java are compatible to run the code)
• Back-end: Oracle 10g. (This is used to design the database)
Apriori Algorithm
Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.
It generates candidate item sets of length k from item sets of length k − 1.
Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent k-length item sets.
After that, it scans the transaction database to determine frequent item sets among the candidates.
For determining frequent items quickly, the algorithm uses a hash tree to store candidate item sets.
Apriori Itemset Generation
Pass 1
  Generate the candidate itemsets in C1
  Save the frequent itemsets in L1
Pass k
  Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1
    Join Lk-1 p with Lk-1 q, as follows:
      insert into Ck
      select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
    Generate all (k-1)-subsets from the candidate itemsets in Ck
    Prune all candidate itemsets from Ck where some (k-1)-subset of the candidate itemset is not in the frequent itemset Lk-1
  Scan the transaction database to determine the support for each candidate itemset in Ck
  Save the frequent itemsets in Lk
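The join and prune steps of pass k can be sketched in Python as follows. This is a minimal illustration, not the project's Java implementation: `apriori_gen` is a hypothetical helper name, itemsets are assumed to be sorted tuples of item labels, and the sample input `l2` is assumed to be the set of pairs that occur in at least two transactions of the example data.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from frequent (k-1)-itemsets (join, then prune)."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: p and q agree on their first k-2 items
            # and p's last item sorts before q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev_set
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates


# Assumed input: frequent 2-itemsets over the items A..G.
l2 = [("A", "C"), ("A", "E"), ("A", "F"), ("C", "D"), ("C", "E"),
      ("C", "F"), ("C", "G"), ("E", "F"), ("E", "G")]
c3 = apriori_gen(l2)
print(c3)
# [('A', 'C', 'E'), ('A', 'C', 'F'), ('A', 'E', 'F'), ('C', 'E', 'F'), ('C', 'E', 'G')]
```

Candidates such as (C,D,E) are joined but then pruned because (D,E) is not in the input, so they never reach the support-counting scan.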
Example: Assume the user-specified minimum support count is 2 (an itemset is frequent if it occurs in at least 2 of the 9 transactions), then generate all frequent itemsets.
Given: the transaction database shown below.
TID  A  B  C  D  E  F  G
T1   1  0  1  0  0  1  0
T2   0  0  1  1  0  0  1
T3   1  1  0  0  1  0  0
T4   0  0  1  0  1  0  1
T5   1  0  1  0  1  1  0
T6   0  0  0  1  1  0  0
T7   0  0  1  0  1  1  1
T8   1  0  0  0  0  1  0
T9   0  1  1  1  0  0  0
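The support of each single item is just the column sum of this table. A small Python sketch, with the rows copied from the table above (`item_support` is a hypothetical helper name used only for illustration):

```python
ITEMS = "ABCDEFG"
ROWS = [
    "1010010",  # T1
    "0011001",  # T2
    "1100100",  # T3
    "0010101",  # T4
    "1010110",  # T5
    "0001100",  # T6
    "0010111",  # T7
    "1000010",  # T8
    "0111000",  # T9
]


def item_support(rows, items):
    """Count, for each item, the transactions whose column entry is 1."""
    return {item: sum(int(row[i]) for row in rows)
            for i, item in enumerate(items)}


counts = item_support(ROWS, ITEMS)
print(counts)
# {'A': 4, 'B': 2, 'C': 6, 'D': 3, 'E': 5, 'F': 4, 'G': 3}
```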
Pass 1
C1
Itemset X   supp(X)
A           4
B           2
C           6
D           3
E           5
F           4
G           3
L1 after pruning
A, B, C, D, E, F, G
Pass 2
C2
Itemset X   supp(X)
A,B         1
A,C         2
A,D         0
A,E         2
A,F         3
A,G         0
B,C         1
B,D         1
B,E         1
B,F         0
B,G         0
C,D         2
C,E         3
C,F         3
C,G         3
D,E         1
D,F         0
D,G         1
E,F         2
E,G         2
F,G         1
L2 after pruning
A,C
A,E
A,F
C,D
C,E
C,F
C,G
E,F
E,G
Pass 3
C3
Join              Itemset X   supp(X)
Join AC with AD   A,C,D       0
Join AC with AE   A,C,E       1
Join AC with AF   A,C,F       2
Join AC with AG   A,C,G       0
Join AE with AF   A,E,F       1
Join AE with AG   A,E,G       0
Join CE with CF   C,E,F       2
Join CE with CG   C,E,G       2
L3 after pruning
A,C,F
C,E,F
C,E,G
Pass 4
C4
Itemset X   supp(X)
C,E,F,G     1
Pass 5
For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.
Disadvantages:
• If any one of the frequent itemsets becomes longer, the algorithm has to go through many iterations, and as a result the performance decreases in terms of response time.
• During the execution, every frequent itemset is explicitly considered.
• A large number of scans is required.
In our proposed system, we implement association rules using the Boolean matrix algorithm.
• It operates as a top-down process.
• It scans the database only once.
• So the search time (response time) becomes shorter.
• It finds the maximum frequent itemsets which meet the minimum support in a short time through vector and matrix operations.
• It works better than most of the currently existing algorithms, such as Apriori.
Boolean matrix
• The Boolean matrix algorithm needs to scan the database only once. As the structure of the Boolean matrix is simple, it can be understood easily, and it is easy to compute without generating a large number of candidate item sets.
• Because the databases are translated into matrix files and these files are very small, the time spent on scanning the database is greatly reduced. So the algorithm is efficient.
• This is better and faster than the Apriori approach.
• Thus the technique evaluates the required data much quicker than most of the currently existing algorithms.
For the same example, consider the original matrix R and the transpose T of that matrix.

R =
1 0 1 0 0 1 0
0 0 1 1 0 0 1
1 1 0 0 1 0 0
0 0 1 0 1 0 1
1 0 1 0 1 1 0
0 0 0 1 1 0 0
0 0 1 0 1 1 1
1 0 0 0 0 1 0
0 1 1 1 0 0 0

T =
1 0 1 0 1 0 0 1 0
0 0 1 0 0 0 0 0 1
1 1 0 1 1 0 1 0 1
0 1 0 0 0 1 0 0 1
0 0 1 1 1 1 1 0 0
1 0 0 0 1 0 1 1 0
0 1 0 1 0 0 1 0 0

Now multiply both these matrices. Entry (i, j) of the product is the number of items that transactions i and j have in common. We get

S =
3 1 1 1 3 0 2 2 1
1 3 0 2 1 1 2 0 2
1 0 3 1 2 1 1 1 1
1 2 1 3 2 1 3 0 1
3 1 2 2 4 1 3 2 1
0 1 1 1 1 2 1 0 1
2 2 1 3 3 1 4 1 1
2 0 1 0 2 0 1 2 0
1 2 1 1 1 1 1 0 3

Consider the upper triangular matrix

Su =
3 1 1 1 3 0 2 2 1
  3 0 2 1 1 2 0 2
    3 1 2 1 1 1 1
      3 2 1 3 0 1
        4 1 3 2 1
          2 1 0 1
            4 1 1
              2 0
                3

Consider the maximum element on the main diagonal and find in which rows the element occurs.
The maximum element is 4: t55 = 4 and t77 = 4, giving B5(5,1) and B7(7,1).
In both rows the value 4 occurs only once, on the diagonal, so each 4-itemset is supported by a single transaction. This is less than the min sup value, so there are no 4-frequent itemsets.
The next maximum element is 3: t11 = 3, t22 = 3, t33 = 3, t44 = 3, t99 = 3.
Off the diagonal, 3 occurs at (1,5), (4,7) and (5,7), giving B1(1,5,2); B4(4,7,2); B5(5,7,2).
First consider B1(1,5,2). The logical AND of a1 and a5 is 1 0 1 0 0 1 0.
This indicates B1 = {A,C,F}.
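The matrix product and the row-wise AND can be sketched in plain Python. This is an illustration under assumptions rather than the project's code: `shared_item_matrix` and `itemsets_of_size` are hypothetical names, and the sketch only finds itemsets that are the exact intersection of two transactions, which corresponds to a minimum support count of 2.

```python
ITEMS = "ABCDEFG"
R = [
    [1, 0, 1, 0, 0, 1, 0],  # T1
    [0, 0, 1, 1, 0, 0, 1],  # T2
    [1, 1, 0, 0, 1, 0, 0],  # T3
    [0, 0, 1, 0, 1, 0, 1],  # T4
    [1, 0, 1, 0, 1, 1, 0],  # T5
    [0, 0, 0, 1, 1, 0, 0],  # T6
    [0, 0, 1, 0, 1, 1, 1],  # T7
    [1, 0, 0, 0, 0, 1, 0],  # T8
    [0, 1, 1, 1, 0, 0, 0],  # T9
]


def shared_item_matrix(r):
    """S[i][j] = number of items rows i and j share (R times its transpose)."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in r] for ri in r]


def itemsets_of_size(r, items, k):
    """AND every pair of rows whose off-diagonal entry in S equals k."""
    s = shared_item_matrix(r)
    found = set()
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            if s[i][j] == k:
                found.add("".join(item for item, a, b
                                  in zip(items, r[i], r[j]) if a and b))
    return sorted(found)


S = shared_item_matrix(R)
print(S[0])
# [3, 1, 1, 1, 3, 0, 2, 2, 1]
print(itemsets_of_size(R, ITEMS, 3))
# ['ACF', 'CEF', 'CEG']
print(itemsets_of_size(R, ITEMS, 4))
# []
```

The three itemsets come from the row pairs (1,5), (4,7) and (5,7), the same pairs named by B1, B4 and B5.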
Similarly, B4 and B5 are obtained by performing the logical AND of a4 and a7, and of a5 and a7, respectively.
So the 3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }.
Similarly, the 2-frequent itemsets are generated:
{ (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }
Therefore, both algorithms give the same output.
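As a cross-check that is independent of both algorithms, the supports can be counted by brute force over every k-subset of items. The sketch below is neither Apriori nor the Boolean matrix method. It assumes the nine transactions of the example and a minimum support count of 2, and `frequent_itemsets` is a hypothetical helper name.

```python
from itertools import combinations

# The nine example transactions, written as sets of item labels.
TRANSACTIONS = [set("ACF"), set("CDG"), set("ABE"), set("CEG"), set("ACEF"),
                set("DE"), set("CEFG"), set("AF"), set("BCD")]


def frequent_itemsets(transactions, k, min_count):
    """Count every k-subset of the items and keep those seen min_count times."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, k):
        count = sum(1 for t in transactions if set(combo) <= t)
        if count >= min_count:
            result["".join(combo)] = count
    return result


print(frequent_itemsets(TRANSACTIONS, 2, 2))
# {'AC': 2, 'AE': 2, 'AF': 3, 'CD': 2, 'CE': 3, 'CF': 3, 'CG': 3, 'EF': 2, 'EG': 2}
print(frequent_itemsets(TRANSACTIONS, 3, 2))
# {'ACF': 2, 'CEF': 2, 'CEG': 2}
print(frequent_itemsets(TRANSACTIONS, 4, 2))
# {}
```

Brute force is exponential in the number of items, so it is only useful for validating small examples like this one.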
Advantages:
• Helps in improving the marketing tactics.
• Used for collaborative filtering in setting business trends.
• Comprehensive analysis of the customer choices.
• Comprehensive analysis of the product demand inflations.
• Helps to minimize losses and maximize profits.
Performance Analysis:
Assumption
1. Apriori's Best Case == Boolean's Best Case
2. Apriori's Worst Case == Boolean's Average Case
Axiom
Let i = 1, and let k be the total number of candidate item sets.
1. Best Case: most frequent item set evaluated in pass = i.
2. Average Case: most frequent item set evaluated in pass > i && < k.
3. Worst Case: most frequent item set ...
Proof by Example:
Best Case
T1={bread} T2={bread} T3={bread} T4={bread}
Let min supp = 50%

Apriori
Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} } stop!

Boolean
R =
1
1
1
1

T =
1 1 1 1

C =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

U =
1 1 1 1
  1 1 1
    1 1
      1

Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }
MFCS = { {bread} - 100% } not pruned, stop!
Average Case
T1={bread, jam, sugar, cheese} T2={jam, sugar, cheese} T3={sugar, cheese} T4={cheese}
Let min supp = 50%

Apriori
Pass 1:
C1:
bread - 25% (pruned)
jam - 50% (not pruned)
sugar - 75% (not pruned)
cheese - 100% (not pruned)
MFS = { {jam}, {sugar}, {cheese} }
Pass 2:
C2:
{jam, sugar} - 50% (not pruned)
{jam, cheese} - 50% (not pruned)
{sugar, cheese} - 75% (not pruned)
MFS = { {jam, sugar}, {jam, cheese}, {sugar, cheese} }
Pass 3:
C3:
{jam, sugar, cheese} - 50% (not pruned)
MFS = { {jam, sugar, cheese} }
Boolean
R =
1 1 1 1
0 1 1 1
0 0 1 1
0 0 0 1

T =
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1

C =
4 3 2 1
3 3 2 1
2 2 2 1
1 1 1 1

U =
4 3 2 1
  3 2 1
    2 1
      1

Pass 1:
Max element = 4
4 doesn't occur off the diagonal in any row, so there are no 4-frequent itemsets.
Next element = 3
U(1,1) = 4; U(1,2) = 3
B(1,2,2) = 3 = 0 1 1 1
So, the 3-frequent itemsets are { {jam, sugar, cheese} }
Pass 2:
Next max element = 2
B(1,2,3,3) = { {sugar, cheese}, {jam, sugar}, {jam, cheese} }
B(2,3,2) = { {sugar, cheese} }
So, the 2-frequent itemsets are { {sugar, cheese}, {jam, sugar}, {jam, cheese} }
Thus, the algorithm terminates in two passes.
Worst Case
T1={bread, jam, sugar, cheese} T2={bread, jam, sugar, cheese}
Let min supp = 50%

Apriori
Pass 1:
C1:
bread - 100%, jam - 100%, sugar - 100%, cheese - 100% (nothing pruned)
MFS = { {jam}, {sugar}, {cheese}, {bread} }
Pass 2:
C2:
{bread, jam} - 100%, {bread, sugar} - 100%, {bread, cheese} - 100%, {sugar, jam} - 100%, {cheese, jam} - 100%, {sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }
Pass 3:
C3:
{bread, jam, cheese} - 100%, {bread, jam, sugar} - 100%, {bread, sugar, cheese} - 100%, {jam, sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }
Pass 4:
C4:
{bread, jam, sugar, cheese} - 100% (not pruned)
MFS = { {jam, sugar, cheese, bread} }
Boolean
R =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

T =
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

C =
4 4 4 4
4 4 4 4
4 4 4 4
4 4 4 4

U =
4 4 4 4
  4 4 4
    4 4
      4

Pass 1:
Max element = 4
So, the 4-frequent itemsets are { {jam, sugar, cheese, bread} }
Hence the 3-frequent itemsets are { {bread, jam, cheese}, {bread, jam, sugar}, {bread, sugar, cheese}, {jam, sugar, cheese} }
The 2-frequent itemsets are { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }
Apriori Time Complexity
• In the best case there is one pass, so O(1)
• In the worst case all items are considered, so O(n)
• In the average case only about half the number of items are taken, so O(n-k)
Boolean matrix Time Complexity
• Best Case: O(1)
• Average Case: O(n)
• Worst Case: O(n)
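The pass counts behind the three Apriori cases can be reproduced with a short Python sketch. It counts level-wise passes only and does not measure running time. `apriori_passes` is a hypothetical helper name, and the transactions are the ones from the best, average and worst case examples with min supp = 50%.

```python
from itertools import combinations


def apriori_passes(transactions, min_support):
    """Return how many level-wise passes have at least one candidate itemset."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    candidates = [(item,) for item in items]
    passes = 0
    while candidates:
        passes += 1
        # One scan of the transactions per pass to count support.
        frequent = [c for c in candidates
                    if sum(1 for t in transactions if set(c) <= t) / n
                    >= min_support]
        known = set(frequent)
        # Join itemsets sharing a prefix, then prune by downward closure.
        candidates = []
        for i, p in enumerate(frequent):
            for q in frequent[i + 1:]:
                if p[:-1] == q[:-1]:
                    cand = p + (q[-1],)
                    if all(s in known for s in combinations(cand, len(p))):
                        candidates.append(cand)
    return passes


best = [{"bread"}] * 4
average = [{"bread", "jam", "sugar", "cheese"}, {"jam", "sugar", "cheese"},
           {"sugar", "cheese"}, {"cheese"}]
worst = [{"bread", "jam", "sugar", "cheese"}] * 2

print(apriori_passes(best, 0.5), apriori_passes(average, 0.5),
      apriori_passes(worst, 0.5))
# 1 3 4
```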
Dataflow diagram
UML Diagrams
Sequence diagram for Apriori
Sequence diagram for Boolean
Sequence Diagram for pruning the Item sets
Testing:
Testing is the process of executing a program with the intent of finding an error.

Usability Test:
Test Case Id | Input Format | Input Data | Expected | Obtained
UT_1.1 | Time string ('morning' or 'afternoon' or 'evening' or 'night') | morning | No Error | No Error
UT_1.2 | Hrs:mins:secs | 11:35:65 | No Error | Error
UT_1.3 | Hrs:mins | 03:75 | No Error | Error
UT_1.4 | Hrs | 32 | No Error | Error
UT_1.5 | Random character string | H2 | Error | Error
UT_1.6 | Integer Number | 12.9 | Error | Error
UT_1.7 | Floating Point | 15 | Error | Error
UT_1.8 | Alphanumeric | @w | Error | Error
UT_1.9 | Special Characters | 9,0 | Error | Error
UT_1.10 | Empty String | (empty) | Error | Error
Path Testing:
Test Case Id | Input Module | Next Module (Expected) | Next Module (Obtained)
PT_1 | Receive user inputs | Retrieve raw candidate transaction table | Raw candidate transaction table retrieved
PT_2 | Retrieve raw candidate transaction table | Construct binary candidate transaction table | Binary candidate transaction table constructed
PT_3 | Binary candidate transaction table constructed | Algorithm begins | Algorithm began
PT_4 | Algorithm began | Subset selection | Subset selected
PT_5 | Subset selected | Support evaluation | Support evaluated
PT_6 | Support evaluated | Prune | Pruned
PT_7 | Pruned | Most frequent item set evaluation | Most frequent item set evaluated
PT_8 | Most frequent item sets evaluated | Frame association rules | Association rules framed
PT_9 | Association rules framed | Pictorial representation using paint method | Pictorial representation began
PT_10 | Pictorial representation began | Draw bar chart | Bar chart drawn
PT_11 | Bar chart drawn | Draw pie chart | Pie chart drawn
PT_12 | Pie chart drawn | Draw support for single pass graph | Support for single pass graph drawn
PT_13 | Support for single pass graph drawn | Draw time complexity graph | Time complexity graph drawn
Conclusion:
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules.
BMA addresses this problem efficiently: it performs better than most of the currently existing algorithms, including Apriori.
We evaluate the performance of the algorithm using well-known synthetic benchmark databases, real-life census and stock market databases.
Further Developments:
Parallelizing the Boolean matrix algorithm: this is a way to minimize duplicate calculations and to maximize the use of available processors.
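One possible shape for such a parallel version is sketched below in Python, as an assumption about how the work could be split rather than a design taken from the project. Each row of the upper-triangular product is independent of the others, so rows can be handed to separate workers, and computing only the entries with j >= i skips the duplicate half of the symmetric product. `upper_row` and `upper_triangle_parallel` are hypothetical names. Threads are used here only to keep the sketch short. In CPython they do not speed up CPU-bound work, so a real implementation would need processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor


def upper_row(r, i):
    """Row i of the upper triangle: shared-item counts with rows j >= i."""
    return [sum(a * b for a, b in zip(r[i], r[j])) for j in range(i, len(r))]


def upper_triangle_parallel(r, workers=4):
    """Compute every row of the upper triangle on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the row indices.
        return list(pool.map(lambda i: upper_row(r, i), range(len(r))))


# Boolean matrix of the average case example (rows T1..T4).
R = [[1, 1, 1, 1],
     [0, 1, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]
U = upper_triangle_parallel(R)
print(U)
# [[4, 3, 2, 1], [3, 2, 1], [2, 1], [1]]
```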