Database Mining: Bringing Algorithms to Data
Sharma Chakravarthy
Information Technology Laboratory
Computer Science and Engineering Department
The University of Texas at Arlington, Arlington, TX 76009
Email: [email protected]
URL: http://itlab.uta.edu/sharma
Acknowledgments
• This presentation is based on the work of many of my students, especially Shiby Thomas, Mahesh Dudkiar, Hongen Zhang, Pratyush Mishra, and Himavalli Kona (and others)
• National Science Foundation and other agencies for their support
2/13/2012 © Sharma Chakravarthy 2
Outline
• Data Mining and DW
• Database Mining
  – Overview
  – Spectrum
• Database Mining Architectures
  – Performance comparison
• Association Mining Using SQL‐92 and SQL‐OR
  – Approaches
  – Optimization
  – Performance comparison
• Summary and Challenges
Role of Data Warehouses
• DW makes DM a lot cheaper
• DM is one of the reasons for DW
• OLAP: verification‐driven
– sales in CA Vs. FL in Q1 of 2003
• DM: discovery‐driven
– What factors contribute to non‐payment of loans ?
– Will Microsoft come back?
OLAP Vs. Data Mining
• OLAP is user driven
  – Analyst generates hypothesis, uses OLAP to verify
  – e.g., “people with high debt are bad credit risks”
• Data mining tool generates the hypothesis
  – Tool performs exploration
    • e.g., find risk factors for granting credit
  – Discovers new patterns that analysts didn’t think of
    • e.g., debt-to-income ratio
• OLAP and DM complement each other
Why Database Mining?
• Proliferation of relational DW and the need to mine them without siphoning the data out
• Make mining ‘co‐exist’ with OLAP and other decision‐support applications
• DM needs to be a sub‐process in next‐generation Business Intelligence (BI) systems
• Leverage the RDBMS technology for mining
– 3+ decades worth of research
• Provide an integrated decision‐support environment for analysts
Data Mining Vs. Database Mining
• Data mining refers to main memory algorithms for mining
+ Can use arbitrary data structures
+ Can optimize algorithms with proper representation (hash tree for example)
- Limited memory, need for buffer management (need to be implemented separately for each approach !)
- Data has to be siphoned out of its location (mostly from a DBMS or a Data Warehouse)
- Works well only for small data sizes (no scalability)
- Every time data is added to the DB, the process has to be repeated
Solution? Database Mining – Bringing algorithms to data instead of taking data to algorithms
SQL‐based Mining: Implications
+ Leverage 3+ decades of DBMS R&D
+ Buffer management comes for free !
+ Portability due to standardization (SQL)
+ Fast development of mining algorithms
+ SMP parallelism for free for parallel database engines
+ Data is not replicated outside of DBMS
+ SQL may be extended to include ad hoc mining queries
‐ However, no specialized data structures and memory management (UDFs are an exception)
Data Mining Evolution
• File‐Based or Main Memory mining algorithms
– Data mining
• SQL‐Based mining algorithms
– Database mining
• Parallel mining Algorithms
– Both main memory and SQL‐based
Mining Spectrum
• Study architectural alternatives
• Performance evaluation
• Extend the capability of current query processors
The spectrum, from loose to tight integration with the database:
– Loose Coupling / Cache-Mine: mining as an application on a client/app. server
– Stored Procedure / User-defined function: mining as an application on the database server
– SQL-based approach: mining using SQL + extensions
– Mining extenders/blades, integrated with the SQL query engine: the integrated approach
Database Mining Spectrum
• Database Mining
– Single database (Directly) e.g. Intelligent Miner
– Single relation (using JDBC)
– Layered (multiple relations, using JDBC)
– Layered (Across databases, using JDBC)
– Integrated Database mining
Long‐Term Vision
• Unbundle bulky mining operations (e.g., mine)
• Identification of Common operators
• Integration of the above into the Query Optimizer
• No distinction between OLAP and mining
Alternative Architectures
• Loose‐coupling: data read through a cursor
• Stored‐procedure: mining algorithm encapsulated as a stored procedure (SP)
• Cache‐mine‐store: data cached in files outside DB in binary form
• UDF: “heavy‐weight” UDFs placed in SQL queries
• SQL: mining algorithm formulated as SQL-92/SQL-OR queries
Loosely Coupled
• Loose‐coupling: data read through a cursor
• Intermediate results are stored in the database
• DBMS used as file system
• High context‐switching cost between address spaces
• Even with block reads, performance is poor
Stored procedures
• Mining algorithms executed as stored procedures on the server
• Programming flexibility
• Existing file code can be reused
Cache‐Mine‐Store
• Special case of loose coupling
• Data is read once and cached in files outside DB (in binary form)
• Advantages of SP + better performance
• Additional disk space for caching
UDFs
• UDF: “heavy‐weight” UDFs placed in SQL queries
• Little use of DB query processing
• Fenced or unfenced mode
• Significant code rewrite
• Performance is good
• Portability is poor
Stored Procedures and UDFs
• Server side: advantages
  – Less traffic congestion
  – Better development, modularization, and integration
  – Can return anything from basic data types like integers to complex structures like tables
• Client side
  – Coding of these functions takes time and effort
Stored Procedures and UDFs
• A stored procedure uses PL/SQL, so DML can be used inside it
• As a consequence, the result can be stored in a table
• A UDF, however, is written in a host programming language, such as Java or C
Stored Procedures and UDFs
The difference between calling a UDF (DB2) and a stored procedure (Oracle):

DB2: call the UDF in an SQL query directly.
  insert into tidT1
  select T_item, T_cnt, T_tids
  from (select item, tid from T5D1K group by item, tid) as tt0,
       table(saveTid(item, tid)) as tt2

Oracle: use the JDBC “CallableStatement” class.
  CallableStatement stmt;
  qs = "{Call SaveTid(?)}";          // the prototype is SaveTid(int rowCount)
  stmt = LogIn.con.prepareCall(qs);  // prepare the CALL statement
  int rowCount = getRowCount();
  stmt.setInt(1, rowCount);          // set the parameter value rowCount
  stmt.execute();                    // call the stored procedure
SQL‐Based
• SQL: mining algorithm formulated as SQL-92/SQL-OR queries
• Several alternatives (query/subquery, Kway join)
• Can exploit SQL parallelism (where available)
• No specific extensions for mining
• May facilitate identification of needed extensions
[Architecture figure: a GUI issues SQL-92 / SQL-OR through an extended-SQL preprocessor + optimizer to an (object) relational DBMS over the DB]
Integrated Approach
• Unbundle bulky mining operations (e.g., mine)
• Identification of Common operators
• Integration of the above into the Query Optimizer
• No distinction between OLAP and mining
• New optimization for handling large number of joins and self‐joins
• Availability of data structures in a meaningful manner
Architecture comparison
Experiments on four real-life data sets. Used IBM DB2 Universal Server version 5 on RS/6000: 200 MHz CPU, 256 MB main memory, 9 GB disk.
Architecture comparisons (DB2)
[Charts: execution time in seconds per pass (Pass 1–4) for Cache, Sproc, UDF, and SQL — data set D at support 0.2%, 0.07%, 0.02%, and data set A at support 0.50%, 0.35%, 0.20%]
Summary
• SQL performance good for smaller data sets
• As the size of the data set increases, SQL does not do as well as Cache‐Mine (which benefits from better data representation, special data structures, …)
• Stored procedure and UDF did not perform well (no optimization, limited data structures)
• SQL seems comparable with its own advantages!
Storage Comparison
• Additional storage for Cache‐Mine and SQL to cache/transform data
• Space for indexing / sorting
• Similar storage overhead for Cache‐mine and SQL
[Chart: storage requirement in millions of integers (cache + sort space) for Cache, Sproc, and SQL on Dataset-A through Dataset-D]
SQL‐92 based approachesto Association Rules
Input to Mining
TID  Item1  Item2  Item3  Item4  Item5
100    1             1      1
200           1      1             1
300    1      1      1             1
400           1                    1
Table format for database mining
TID  ITEM
100   1
100   3
100   4
200   2
200   3
200   5
300   1
300   2
300   3
300   5
400   2
400   5
• DBMSs have an upper limit on the number of attributes in a table!
• Hence, a table cannot be used for a horizontal layout
Frequent itemsets table format
ITEM1  ITEM2  ITEM3  NULLM  COUNT    (ITEM4–ITEM8, all 0, omitted)
  1                    2      2
  2                    2      3
  3                    2      3
  5                    2      3
  1      3             3      2
  2      3             3      2
  2      5             3      3
  3      5             3      2
  2      3      5      4      2
Rule table format
ITEM1  ITEM2  ITEM3  NULLM  RULEM  CONF   SUP    (ITEM4–ITEM8, all 0, omitted)
  1      3             3      2    100     50
  3      1             3      2    66.67   50
  2      3             3      2    66.67   50
  3      2             3      2    66.67   50
  2      5             3      2    100     75
  5      2             3      2    100     75
  3      5             3      2    66.67   50
  5      3             3      2    66.67   50
  2      3      5      4      2    66.67   50
  3      2      5      4      2    66.67   50
  5      2      3      4      2    66.67   50
  2      3      5      4      3    100     50
  2      5      3      4      3    66.67   50
  3      5      2      4      3    100     50
SQL‐based Association rule mining
• SQL‐92
– K-way joins
– Subquery
– 3‐way joins
– 2-GroupBy
– Set-oriented apriori (an improvement of K-way join)
• SQL‐OR (uses table functions and other features: blobs, clobs, etc.)
– GatherJoin
– GatherPrune
– GatherCount
– Horizontal
– vertical
– SQL‐bodied functions
SQL‐92 approaches
• Uses only the features of SQL‐92 standard
– “vanilla SQL”
– No table functions
– No stored procedures
– No UDFs
• Portable (can be used on any RDBMS)
Performance of SQL‐92 approaches
Experiments on four real-life data sets. Used IBM DB2 Universal Server version 5 on RS/6000: 200 MHz CPU, 256 MB main memory, 9 GB disk.
Comparison of SQL-92 approaches
SQL‐92 approaches
• Set-oriented Apriori was the overall winner
• Subquery approach performed well
  – The candidate generation time and the time for the first pass are much smaller than the total time
• K-way join was also good for this data set
  – As good or better than loose-coupling only for high support
  – For low support, considerably worse than loose-coupling
• 2-GroupBy has the worst performance
• 3-WayJoin is also not good
Observations on apriori
• C1 (typically input relation) can be substituted by F1 for subsequent passes
• Second pass is the most expensive one; no pruning at all; it is a Cartesian product
• As the number of passes increases (i.e., longer frequent itemsets), the number of joins increases; hence materialization may be effective
Optimizations (Set‐oriented apriori)
• Pruning non‐frequent items
– Reduces the transaction table size
• Eliminating candidate generation in second pass
– Large C2 in many cases
– Significant performance improvement
• Reusing item combinations
• Space/time tradeoff
SQL‐OR approaches
• GatherJoin: based on generating k‐item combinations using a table function
• GatherPrune: non‐candidate itemset pruning pushed inside the table function
• GatherCount: support counting pushed inside the table function
• Vertical: input data transformed into vertical format; support counting using UDF
• SQL‐bodied functions: uses the control structures in SQL
• Horizontal: input data transformed into horizontal format
Performance of SQL‐OR approaches
[Charts: execution time in seconds per phase (Prep, Pass 1–4) for Vertical, Gjoin, Gprune, and Gcount — data set A at support 0.5%, 0.35%, 0.2%, and data set D at support 0.20%, 0.07%, 0.02%]
SQL‐OR approaches
• Vertical approach is the best for higher passes
• Vertical performs poorly when Ck is too large
• GatherJoin is good when the number of frequent items per transaction (Nf) is small
• GatherCount is better when Ck and Nf are large
• Choose the best approach for each pass
• Cost estimates use statistics collected in each pass and the input data parameters
Data Characteristics
• Number of items : in thousands
• Number of Transactions: in Millions
• Data set sizes: High Gigabytes
– Discovering all rules rather than verifying if a rule holds
– Completeness and soundness
– Performance
– Scalability
Input/Output Formats
• Input transaction table in the normal form
• Two attributes (tid, item)
• Example: 1: A, B, C
• Output is a collection of rules
• Rule table schema:
(item1, …, itemk, len, rulem, confidence, support)
• Rule AB ‐> CD, conf. 90%, support 5%
(A, B, C, D, NULL, 4, 2, 0.9, 0.05)
Tid  Item
 1    A
 1    B
 1    C
K‐way Join
• The process of support counting in Kwj is as follows:
In any pass k:
– Frequent itemsets of length k‐1 are used to generate candidate itemsets of length k (Ck)
– Some of the generated candidates are pruned
– For support counting of these candidate itemsets, k copies of the input relation are joined with Ck
Candidate Generation in SQL
• Join step: join 2 copies of Fk‐1
insert into Ck
select I1.item1, …, I1.itemk‐1, I2.itemk‐1
from Fk‐1 I1, Fk‐1 I2
where I1.item1 = I2.item1 and … and
      I1.itemk‐2 = I2.itemk‐2 and
      I1.itemk‐1 < I2.itemk‐1
Candidate Generation and Pruning
• Prune step: additional joins with (k‐2) more copies of Fk‐1
• Join predicates are enumerated by skipping one item at a time
• A k-itemset has k (k‐1)-item subsets; two of them have already been used to generate the k-itemset, so there is no need to check them. Hence, the other (k‐2) subsets need to be checked, by doing (k‐2) joins
Candidate Set Ck Generation
Insert into Ck
Select I1.item1, I1.item2, …, I1.itemk-1, I2.itemk-1
From Fk-1 I1, Fk-1 I2
Where I1.item1 = I2.item1 AND
      I1.item2 = I2.item2 AND
      …
      I1.itemk-2 = I2.itemk-2 AND
      I1.itemk-1 < I2.itemk-1

Example: F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
=> C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
Pruning explanation
• Consider Ck {1 3 4 5}
• The subsets are
– {1 3 4} generated by skipping item at position 4
– {1 3 5} generated by skipping item at position 3
– {1 4 5} generated by skipping item at position 2
– {3 4 5} generated by skipping item at position 1
• The first 2 have already been used in the generation of {1 3 4 5}
• Hence, only the subsets obtained by skipping positions 1 through k‐2 (here k is 4, so positions 1 and 2) need to be checked!
Pruning
Prune step: if any (k-1)-subset of a k-itemset in Ck is not in Fk-1, delete that k-itemset from Ck.

(Skip item1)
I1.item2 = I3.item1
…
I1.itemk-1 = I3.itemk-2
I2.itemk-1 = I3.itemk-1
…
(Skip itemk-2)
I1.item1 = Ik.item1
…
I1.itemk-1 = Ik.itemk-2
I2.itemk-1 = Ik.itemk-1

In the above example, one of the 4-itemsets in C4 is {1, 3, 4, 5}. This 4-itemset needs to be deleted because one of its 3-item subsets, {3, 4, 5}, is not in F3.
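The same prune condition can be stated in a few lines of plain Python (this is an illustrative equivalent of the SQL prune joins, not the deck's formulation; the function name `prune` is mine):

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    """Apriori prune: keep a k-itemset only if all its (k-1)-subsets are frequent."""
    fprev = set(frequent_prev)
    return [c for c in candidates
            if all(s in fprev for s in combinations(c, len(c) - 1))]

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = [(1, 2, 3, 4), (1, 3, 4, 5)]
# {1,3,4,5} is dropped because its subset {3,4,5} is not in F3
print(prune(C4, F3))   # [(1, 2, 3, 4)]
```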
Candidate Set Ck Generation
[Complete query diagram —
 Candidate generation: Fk-1 I1 joined with Fk-1 I2 on
   I1.item1 = I2.item1, …, I1.itemk-2 = I2.itemk-2, I1.itemk-1 < I2.itemk-1
 Prune: further joins with Fk-1 I3 … Fk-1 Ik:
   (Skip item1)   I1.item2 = I3.item1, …, I1.itemk-1 = I3.itemk-2, I2.itemk-1 = I3.itemk-1
   (Skip itemk-2) I1.item1 = Ik.item1, …, I1.itemk-1 = Ik.itemk-2, I2.itemk-1 = Ik.itemk-1]

Example: F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
=> C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
SQL‐92 support counting ‐ Kway
• Join k copies of the input table (T) with Ck and do a group by on the itemsets

insert into Fk
select item1, …, itemk, count(*)
from Ck, T t1, …, T tk
where t1.item = Ck.item1 and
      …
      tk.item = Ck.itemk and
      t1.tid = t2.tid and t1.item < t2.item and
      …
      tk‐1.tid = tk.tid and tk‐1.item < tk.item
group by item1, item2, …, itemk
having count(*) >= minsup

• Requires k joins for the kth pass
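For pass 2 this query is small enough to run end-to-end. A sketch on a 4-transaction toy table (again via sqlite3 as an illustrative engine, minsup = 2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table T (tid int, item int)")
con.executemany("insert into T values (?,?)",
    [(1, 1), (1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
     (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)])
con.execute("create table C2 (item1 int, item2 int)")
con.executemany("insert into C2 values (?,?)",
    [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)])

# Pass 2 of K-way join: 2 copies of T joined with C2, group by, minsup filter
f2 = con.execute("""
    select item1, item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1 and t2.item = C2.item2
      and t1.tid = t2.tid and t1.item < t2.item
    group by item1, item2
    having count(*) >= 2
""").fetchall()
print(sorted(f2))   # [(2, 3, 2), (2, 5, 3), (3, 4, 2), (3, 5, 2)]
```

The output matches the F2 of the running example; {2,4} and {4,5} occur in only one transaction and are filtered out.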
Support Counting for Kwj in pass k
[Query diagram: Ck joined with T t1 … T tk on Ck.item1 = t1.item, …, Ck.itemk = tk.item; t1.tid = t2.tid, t1.item < t2.item; …; tk-1.tid = tk.tid, tk-1.item < tk.item; followed by group by item1 … itemk and having count(*) >= minsup]

• Join Ck with k copies of T
• Follow up the join with a group by on the items and a filter on minsup
• Requires k joins for the kth pass
• Attributes of the input table (T): (tid, item)
• Note that Ck is used as an inner relation!
K‐way join plan (Ck outer)
• Series of joins of Ck with k copies of T
• Final join result is grouped on the k items
Difference between inner and outer Ck
• Ck needs to be materialized if inner
  – Requires additional I/O
• No materialization needed if outer
  – Can write a single (large) query for candidate generation, pruning, and support counting
  – May involve 10’s of joins!
  – Current optimizers were not designed for that many join optimizations!
Example: Frequent itemsets generation using Kwj

Transaction (T):     F1:
Tid  Iid             Item  Count
 1    1               2     3
 1    3               3     3
 1    4               4     2
 2    2               5     3
 2    3
 2    5
 3    2
 3    3
 3    4
 3    5
 4    2
 4    5

C2:              F2:                    C3:
Item1 Item2      Item1 Item2 Count      Item1 Item2 Item3
 2     3          2     3     2          2     3     5
 2     4          2     5     3          3     4     5
 2     5          3     4     2
 3     4          3     5     2
 3     5
 4     5

(T t1 and T t2 in the support-counting join are two copies of the same transaction table.)
SQL‐92 support counting ‐ Kway
• Simple and easy to write in SQL
• The only downside is the number of joins
• Also, optimizers were never stressed with so many joins
• Optimizers were never optimized for self‐joins!
• The SQL is generated by a script!
Subquery‐based
• Differs in the way Fk is generated
• In Kwj, Fk is generated by joining two Fk‐1 tables, pruning (which involves k‐2 additional joins), and then support counting (which involves k more joins)
  – A total of (2k‐1) joins
  – At higher passes, this becomes large!
• Instead, in this approach Fk is generated by using subqueries recursively!
Subquery‐based
• Makes use of common prefixes between the itemset in Ck to reduce the amount of work done in support counting
• Break up the support counting phase into a cascade of k subqueries
• The lth subquery Ql finds all tids that match the distinct itemsets formed by the first l columns of Ck
• The output of Ql is joined with T and dl+1 (the distinct itemset formed by the first l+1 columns of Ck) to get Ql+1
• The final output is obtained by doing a group-by on the k items to count support
SQL‐92 support counting ‐ Subquery
• Similar to K‐way join, we generate Ck and Fk
• In the first pass, generate F1 as

select item, count(*)
from input
group by item
having count(*) >= minsup
TID ITEM
100 1
100 3
100 4
200 2
200 3
200 5
300 2
300 3
300 4
300 5
400 2
400 5
Input table
F1:
ITEM  COUNT
 2     3
 3     3
 4     2
 5     3

TIDITEM (sorted by item):
ITEM  TID
 1    100
 2    200
 2    300
 2    400
 3    100
 3    200
 3    300
 4    100
 4    300
 5    200
 5    300
 5    400
Subquery: Example

TIDITEM (T):
TID  ITEM
100   1
100   3
100   4
200   2
200   3
200   5
300   2
300   3
300   4
300   5
400   2
400   5

C2:              D1 (distinct item1 of C2):
ITEM1 ITEM2      ITEM1
 2     3          2
 2     4          3
 2     5          4
 3     4
 3     5
 4     5

R1 (tids of D1 items):   D2 (distinct item1, item2 of C2):
ITEM1  TID               ITEM1 ITEM2
 2     200                2     3
 2     300                2     4
 2     400                2     5
 3     100                3     4
 3     200                3     5
 3     300                4     5
 4     100
 4     300

Join D2, R1, and TIDITEM:
ITEM1 ITEM2  TID
 2     3     200
 2     3     300
 2     4     300
 2     5     200
 2     5     300
 2     5     400
 3     4     100
 3     4     300
 3     5     200
 3     5     300
 4     5     300

Group by item1, item2 (minsup = 2):
F2:
ITEM1 ITEM2 COUNT
 2     3     2
 2     5     3
 3     4     2
 3     5     2
Support counting: subquery

// Generate the frequent set in pass k
Insert into Fk
Select item1, item2, …, itemk, count(*)
From (Subquery Qk) t
Group by item1, item2, …, itemk
Having count(*) > minsup

// Finds all distinct itemsets formed by the first l-1 columns of Ck
Create table Dl-1 As
Select distinct item1, item2, …, iteml-1
From Ck

// Finds all tids that match the distinct itemsets formed by the first l-1 columns of Ck
Create table Rl-1 as
Select item1, item2, …, iteml-1, tid
from tiditem t1, Dl-1
where t1.item = Dl-1.item1 AND
      …
      t1.item = Dl-1.iteml-1

Subquery Ql (for any l between 1 and k):
Select item1, item2, …, iteml, tid
From T tl, Rl-1, Dl
Where Rl-1.item1 = Dl.item1 AND
      …
      Rl-1.iteml-1 = Dl.iteml-1 AND
      Rl-1.tid = tl.tid AND
      tl.item = Dl.iteml

Subquery Q0: No subquery Q0

• Its performance is close to K-way join and better than the 2-GroupBy approach because it exploits common prefixes between candidate itemsets
• Needs to materialize the intermediate table
SQL‐92 support counting ‐ Subquery
• For pass k (k > 1) we generate the Ck and Fk tables
• Ck is generated as mentioned earlier
• For Fk, we generate the subqueries:

insert into F2
select item1, item2, count(*)
from (Subquery Q2) as t
group by item1, item2
having count(*) >= minsup

Subquery Q2:
select d2.item1, d2.item2, t2.tid
from input t2,
     (select item1, tid
      from input t1, (select distinct item1 from C2) as d1
      where t1.item = d1.item1) as r1,
     (select distinct item1, item2 from C2) as d2
where r1.item1 = d2.item1 and
      r1.tid = t2.tid and
      t2.item = d2.item2
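The pass-2 cascade can be executed as-is. A sketch over the example data, via sqlite3 (an illustrative engine choice) with minsup = 2:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table input (tid int, item int)")
con.executemany("insert into input values (?,?)",
    [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
     (300, 2), (300, 3), (300, 4), (300, 5), (400, 2), (400, 5)])
con.execute("create table C2 (item1 int, item2 int)")
con.executemany("insert into C2 values (?,?)",
    [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)])

# F2 via the cascaded subquery Q2 (d1, r1, d2 as in the slide)
f2 = con.execute("""
    select item1, item2, count(*)
    from (select d2.item1, d2.item2, t2.tid
          from input t2,
               (select item1, tid
                from input t1, (select distinct item1 from C2) as d1
                where t1.item = d1.item1) as r1,
               (select distinct item1, item2 from C2) as d2
          where r1.item1 = d2.item1
            and r1.tid = t2.tid
            and t2.item = d2.item2)
    group by item1, item2
    having count(*) >= 2
""").fetchall()
print(sorted(f2))   # [(2, 3, 2), (2, 5, 3), (3, 4, 2), (3, 5, 2)]
```

r1 holds the (item1, tid) matches for the distinct prefixes {2, 3, 4}; extending each by one item and grouping reproduces the F2 of the example.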
[Query diagram for Subquery Ql: T tl joined with Rl-1 (= Subquery Ql-1) and Dl (select distinct item1, …, iteml from Ck) on Rl-1.item1 = Dl.item1, …, Rl-1.iteml-1 = Dl.iteml-1, Rl-1.tid = tl.tid, and tl.item = Dl.iteml, producing (item1, item2, …, iteml, tid). Subquery Q0: no subquery Q0]
Subquery optimization

insert into Fk
select item1, …, itemk, count(*)
from (Subquery Qk) t
group by item1, …, itemk
having count(*) > :minsup

Subquery Ql:
select item1, …, iteml, tid
from T t, (Subquery Ql‐1) as Rl‐1,
     (select distinct item1, …, iteml from Ck) as Dl
where Rl‐1.item1 = Dl.item1 and … and
      Rl‐1.iteml‐1 = Dl.iteml‐1 and
      Rl‐1.tid = t.tid and
      t.item = Dl.iteml

Subquery Q0: No Subquery Q0
SQL‐92 support counting ‐ Subquery

insert into Fk
select item1, item2, …, itemk, count(*)
from (Subquery Qk) t
group by item1, item2, …, itemk
having count(*) >= minsup

Subquery Qi (1 <= i <= k):
select item1, item2, …, itemi, tid
from T ti, (Subquery Qi-1) as ri-1,
     (select distinct item1, item2, …, itemi from Ck) as di
where ri-1.item1 = di.item1 and
      …
      ri-1.itemi-1 = di.itemi-1 and
      ri-1.tid = ti.tid and
      ti.item = di.itemi

Subquery Q0: No subquery Q0
Support counting: 2‐GroupBy
• Another way to avoid multi‐way joins is to first join T and Ck based on whether the ‘item’ of a (tid, item) pair of T is equal to any of the k items of Ck
• Then to do a group by on the itemsets
Support Counting: 2-GroupBy

Create Table temp AS
Select item1, item2, …, itemk, tid
From Ck, T
Where item = Ck.item1 OR
      item = Ck.item2 OR
      …
      item = Ck.itemk
Group by item1, item2, …, itemk, tid
Having count(*) = k
-- gives all (itemset, tid) pairs such that the tid has all the items in the itemset

Insert into Fk
Select item1, item2, …, itemk, count(*)
From temp
Group by item1, item2, …, itemk
Having count(*) > minsup

• No multi-way joins
• Much time is lost in the grouping, comparison, and having clauses
• Needs to materialize the intermediate table in Oracle
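The two group-bys can be demonstrated on the running example (sqlite3 again as an illustrative engine; k = 2, minsup = 2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table T (tid int, item int)")
con.executemany("insert into T values (?,?)",
    [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
     (300, 2), (300, 3), (300, 4), (300, 5), (400, 2), (400, 5)])
con.execute("create table C2 (item1 int, item2 int)")
con.executemany("insert into C2 values (?,?)",
    [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)])

# First group-by: (itemset, tid) pairs where the tid contains all k = 2 items
con.execute("""
    create table temp as
    select item1, item2, tid
    from C2, T
    where item = C2.item1 or item = C2.item2
    group by item1, item2, tid
    having count(*) = 2
""")
# Second group-by: count supporting tids per itemset, filter on minsup
f2 = con.execute("""
    select item1, item2, count(*)
    from temp
    group by item1, item2
    having count(*) >= 2
""").fetchall()
print(sorted(f2))   # [(2, 3, 2), (2, 5, 3), (3, 4, 2), (3, 5, 2)]
```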
Performance comparisons of SQL‐92 approaches
• In both datasets, K-way join emerged as the winner
• For lower supports on larger datasets, pass two is the most time-consuming
• Intelligent Miner is not SQL-based and hence appears to be faster
Scale-Up Experiment

[Chart — Figure 3.21, Scale-Up Experiments for SQL-92 Approaches: time in seconds vs. number of records (0 to 500K) for Kway Join and Subquery]

• Both the Kway Join and Subquery approaches have similar scale-up behavior
Optimizations
• Reduce the size of the input dataset
  – Non‐frequent 1‐itemsets are pruned from the input table, and this pruned input table is used instead in further passes
  – Effective for higher supports
• Optimize the second pass
  – Skip generation of F1 and C2 and directly generate F2 by joining 2 copies of the input dataset
  – Effective for all large data sets
• Reduce the number of joins done in any pass
  – Materialize all the frequent itemsets contained in any transaction at the end of pass k and use them for support counting in pass k+1
  – Effective for higher iterations
K‐way join Optimizations
• Pruning non‐frequent items from T

insert into Tf
select t.tid, t.item
from T t, F1 f
where t.item = f.item
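The effect of this pruning join is easy to see on the toy data: item 1 is non-frequent (it occurs in only one transaction at minsup = 2), so Tf loses its (tid, item) record. A sketch via sqlite3 (illustrative engine choice):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table T (tid int, item int)")
con.executemany("insert into T values (?,?)",
    [(1, 1), (1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
     (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)])

# F1: frequent 1-itemsets at minsup = 2 (item 1 occurs once and is dropped)
con.execute("create table F1 as "
            "select item, count(*) as cnt from T group by item having count(*) >= 2")
# Tf: transaction table restricted to frequent items
con.execute("create table Tf as "
            "select t.tid, t.item from T t, F1 f where t.item = f.item")

n_t = con.execute("select count(*) from T").fetchone()[0]
n_tf = con.execute("select count(*) from Tf").fetchone()[0]
print(n_t, n_tf)   # 12 11
```

On real datasets with low-support items, the reduction is far larger than one row, which is what makes the later joins cheaper.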
Non‐frequent item pruning
• Reducing the size of T
• T is stored in normal form
• Simply drop the (tid, item) records of non‐frequent items
• Join T with F1
• Significant reduction in the size of T
• Improved performance, especially for higher support

[Chart: number of records in R vs. pruned Rf for various support values, on T10.I4.D100K and T5.I2.D100K]
Experiments
[Charts: Effect of Pruning; Effect of Pruned Input]
Second pass optimization
• C2 is the Cartesian product of F1 with itself
• Usually C2 is very large
• Avoid generating C2
• F2 is found by joining 2 copies of the pruned T
• Significant performance gains

insert into F2
select p.item, q.item, count(*)
from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item
having count(*) > :minsup
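Note that this query touches no candidate table at all: F2 comes straight from a self-join of the pruned input. A runnable sketch (sqlite3 as an illustrative engine, minsup = 2, item 1 already pruned from Tf):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table Tf (tid int, item int)")
con.executemany("insert into Tf values (?,?)",
    [(1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
     (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)])

# F2 directly from a self-join of Tf: no C2 is materialized
f2 = con.execute("""
    select p.item, q.item, count(*)
    from Tf p, Tf q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item
    having count(*) >= 2
""").fetchall()
print(sorted(f2))   # [(2, 3, 2), (2, 5, 3), (3, 4, 2), (3, 5, 2)]
```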
[Chart: Pass 2 optimization on T10.I4.D100K — time in seconds with and without the optimization, for support 2%, 1%, 0.75%, 0.33%]
Number of Candidate Itemsets in Different Passes

                          C2      C3    C4   C5   C6   C7  C8  C9
T5I2D500K,  sup = 0.10%   307720  126   7    0    --   --  --  --
T5I2D1000K, sup = 0.10%   309291  127   61   0    --   --  --  --
T10I4D100K, sup = 0.75%   12470   65    3    0    --   --  --  --
T10I4D100K, sup = 0.33%   216153  2453  905  354  109  20  2   0
Experiments
Reusing item combinations
• The SQL formulations of association rule mining generate item combinations in each pass
• We can reuse them if we store the item combinations generated in each pass
• This and other optimizations constitute the Set‐oriented apriori approach
• The k‐way join is replaced by a single join
• Very useful in later passes
• Similar to the AprioriTid algorithm
• A classic space/time trade‐off
Set‐oriented apriori
• In the kth pass of the support counting phase, we generate a table Tk which contains all k‐item combinations that are candidates
• Tk has the schema (tid, item1, …, itemk)
• We join Tk‐1, Tf, and Ck to generate Tk
• The frequent itemset Fk is obtained by grouping the tuples of Tk on the k items and applying the minimum support filter
• In the kth pass, create a relation Combk (tid, item1, item2, …, itemk)
• Join Combk‐1 with T and Ck, and insert into Combk only those transactions from T that have candidate itemsets which are one‐item extensions of the candidate itemsets in Combk‐1
• Do a group by on Combk to generate Fk
• Thus in any pass k, we have only 3 joins, instead of k+1 joins
• Comb1 and Comb2 are quite large; hence they take more time
Set‐oriented apriori

insert into Tk
select p.tid, p.item1, …, p.itemk−1, q.item
from Ck, Tk−1 p, Tf q
where p.item1 = Ck.item1 and
      …
      p.itemk−1 = Ck.itemk−1 and
      q.item = Ck.itemk and
      p.tid = q.tid

• T2 is not generated and stored, due to the second pass optimization
• Only T3 onwards are generated and stored!
T3 generation

insert into T3
select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and
      q.item = C3.item2 and
      r.item = C3.item3 and
      p.tid = q.tid and
      q.tid = r.tid
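With the C3 of the running example, T3 and the F3 derived from it are small enough to trace by hand; a runnable sketch (sqlite3 as an illustrative engine, minsup = 2):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table Tf (tid int, item int)")
con.executemany("insert into Tf values (?,?)",
    [(1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
     (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)])
con.execute("create table C3 (item1 int, item2 int, item3 int)")
con.executemany("insert into C3 values (?,?,?)", [(2, 3, 5), (3, 4, 5)])

# T3: every candidate 3-item combination actually present in some transaction
con.execute("""
    create table T3 as
    select p.tid, p.item as item1, q.item as item2, r.item as item3
    from Tf p, Tf q, Tf r, C3
    where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
      and p.tid = q.tid and q.tid = r.tid
""")
# F3: group T3 on the three items and apply the minsup filter
f3 = con.execute("""
    select item1, item2, item3, count(*)
    from T3 group by item1, item2, item3 having count(*) >= 2
""").fetchall()
print(sorted(f3))   # [(2, 3, 5, 2)]
```

{2,3,5} occurs in transactions 2 and 3 and survives; {3,4,5} occurs only in transaction 3 and is filtered out. In later passes, Tk is built from Tk-1 with a single join instead of this k-way join.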
Performance experiments
• Comparison of the Set‐oriented apriori and Subquery approaches for 2 datasets

[Charts: total time in seconds per pass for Setapriori vs. Subquery — T10.I4.D100K (support 2%, 1%, 0.75%, 0.33%; up to 7 passes) and T5.I2.D100K (support 2%, 1%, 0.5%, 0.1%; up to 4 passes)]
Set‐oriented apriori Vs. subquery

[Chart: time in seconds for Set-Apriori vs. Subquery in passes 3–7]
Scale‐up of set‐oriented Apriori
• Scales linearly with the number of transactions and with transaction size

[Charts: time in seconds vs. number of transactions in thousands (datasets T10.I4 and T5.I2), and vs. average transaction length (for 1000K, 750K, and 500K transactions)]
Summary
• Explored SQL‐aware implementations of association rule mining
• Analyzed the best SQL‐92 option
• Cost formulae to characterize execution time
• Identified optimizations
– Set‐oriented apriori approach
• Performance experiments and scale‐up properties
• Moves us towards our short term vision
• Basis for optimizations for the long term vision
Mining‐aware Optimizer
• Typically, data is stored in different DBMSs
• How can we perform mining on any RDBMS?
• Our experiments indicated that different RDBMSs optimize queries in different ways
  – Even support variations had an impact on performance in different DBMSs
• Hence a global approach to mining did not seem appropriate!
• Analyze SQL‐92 and SQL‐OR to generate and consolidate heuristics as metadata, to be used by a mining‐aware optimizer
Motivation
• Limitation: Existing mining tools can’t connect to multiple Database Management Systems
  – Solution: Use Java Database Connectivity (JDBC)
• Limitation: Most mining tools use the Cache-Mine architecture; data are copied onto the local disk
  – Solution: Use the SQL-based approach. Three of the approaches are based purely on SQL-92 and three are based on SQL-OR (Oracle)
• Limitation: Existing mining products do not provide expressive rule visualization
  – Solution: We use the “rule-item” relationship in association rule visualization to replace the “item-item” relationship [MINESET] or the directed graph [MINER]
  – Java 3D, a feature provided by JDK 1.2, is used to implement the three-dimensional display
Motivation
• Limitation: Most available products can only use data from one table (DBMiner is limited to 64K transactions).
  Solution: We provide an interface for the user to choose the tables to be used as the data source. For each table, the user can specify the columns that correspond to items. A set of JOIN/UNION operations is transparently applied to generate the input data set.
• Limitation: Existing products use only one mining algorithm. However, the choice of algorithm needs to be based on data as well as DBMS characteristics.
  Solution: Implement a mining optimizer that uses metadata to decide which algorithm to use, based on the data set and the underlying DBMS.
2/13/2012 © Sharma Chakravarthy 89
Short‐term Goal
• Layered architecture
• JDBC provides the database connection and SQL interface
• VMO generates and visualizes the association rules
2/13/2012 © Sharma Chakravarthy 90
2/13/2012 © Sharma Chakravarthy 91
Mapping
• Mining is done on a relation with two attributes (Tid, Item)
• However, the user's data resides in relations and has to be mapped into this integer (Tid, Item) format
• Most mining tools accept a single Tid column and a single item column
• Our mining optimizer accepts multiple Tid columns and single/multiple item columns (attributes), specified by the user, from multiple relations
• Example: Table input(Date, CustomerID, item1, item2, item3)
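The mapping step can be sketched in a few lines. This is a minimal illustration, not the actual tool code: rows are assumed to arrive as dictionaries, and composite Tids and items are numbered in order of first appearance (the slides number items alphabetically; first-appearance order is used here only for brevity).

```python
# Sketch (hypothetical helper, not the original VMO code): map relations with
# multiple Tid columns and item columns into the integer (Tid, Item) format.
def map_to_tid_item(rows, tid_cols, item_cols):
    """rows: list of dicts; returns ((Tid, Item) pairs, tid_map, item_map)."""
    tid_map, item_map, pairs = {}, {}, []
    for row in rows:
        tid_key = tuple(row[c] for c in tid_cols)        # composite Tid
        tid = tid_map.setdefault(tid_key, len(tid_map) + 1)
        for c in item_cols:
            item = row[c]
            if item is None:                             # missing item column
                continue
            iid = item_map.setdefault(item, len(item_map) + 1)
            pairs.append((tid, iid))
    return pairs, tid_map, item_map

rows = [
    {"Date": "1/1/00", "CustID": 100, "ITEM": "Milk"},
    {"Date": "1/1/00", "CustID": 100, "ITEM": "Eggs"},
    {"Date": "1/2/00", "CustID": 200, "ITEM": "Sugar"},
]
pairs, tids, items = map_to_tid_item(rows, ["Date", "CustID"], ["ITEM"])
# pairs now holds the FinalInputTable content: [(1, 1), (1, 2), (2, 3)]
```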
2/13/2012 © Sharma Chakravarthy 92
Mapping (Contd.)
Date CustID ITEM
1/1/00 100 Milk
1/1/00 100 Eggs
1/1/00 100 Bread
1/2/00 200 Sugar
1/2/00 200 Eggs
1/2/00 200 Cake
Date CustID ITEM
1/3/00 300 Milk
1/3/00 300 Sugar
1/3/00 300 Eggs
1/3/00 300 Cake
1/4/00 400 Sugar
1/4/00 400 Cake
InputTable1 and InputTable2 (above) are mapped via two lookup tables:

MappedTidsTable
Date (TIDD1)  CustID (TIDD2)  TIDI
1/1/00        100             1
1/2/00        200             2
1/3/00        300             3
1/4/00        400             4

MappedItemsTable
ITEMD  ITEMI
Bread  1
Cake   2
Eggs   3
Milk   4
Sugar  5
TID ITEM
1 1
1 3
1 4
2 2
2 3
2 5
3 2
3 3
3 4
3 5
4 2
4 5
FinalInputTable
2/13/2012 © Sharma Chakravarthy 93
RULES (FINAL)
Rule Head    =>  Rule Body     Confidence(%)  Support(%)
Cake         =>  Eggs          67             50
Eggs         =>  Cake          67             50
Eggs         =>  Milk          67             50
Milk         =>  Eggs          100            50
Cake         =>  Sugar         100            75
Sugar        =>  Cake          100            75
Eggs         =>  Sugar         67             50
Sugar        =>  Eggs          67             50
Cake         =>  Eggs, Sugar   67             50
Eggs         =>  Cake, Sugar   67             50
Sugar        =>  Cake, Eggs    67             50
Cake, Eggs   =>  Sugar         100            50
Cake, Sugar  =>  Eggs          67             50
Eggs, Sugar  =>  Cake          100            50
Rules table with descriptions mapped back
Rule Visualization
Rule Table with Filter capability
The key is to construct a WHERE clause using standard SQL operators such as LIKE, NOT, IN, AND, etc.
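Because the rules end up in a relational table, filtering is just an assembled WHERE clause. A small sketch using SQLite with an illustrative schema (the actual tool's table layout and column names may differ):

```python
import sqlite3

# Hypothetical rules table; the real tool stores the mined rules similarly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules(head TEXT, body TEXT, conf REAL, supp REAL)")
conn.executemany("INSERT INTO rules VALUES (?,?,?,?)", [
    ("Milk", "Eggs", 100, 50),
    ("Sugar", "Cake", 100, 75),
    ("Eggs", "Sugar", 67, 50),
])

# The GUI's filter choices are assembled into one WHERE clause string.
where = "head LIKE 'S%' AND conf >= 90"
rows = conn.execute(f"SELECT head, body FROM rules WHERE {where}").fetchall()
# rows -> [("Sugar", "Cake")]
```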
2/13/2012 © Sharma Chakravarthy 94
Rule Visualization
Rule Table with Sort capability
2/13/2012 © Sharma Chakravarthy 95
Rule Visualization
Graph: number of rules vs. number of items in the rule head
• Customize the graph by modifying the attributes
• Add more constraints by modifying the WHERE clause
2/13/2012 © Sharma Chakravarthy 96
3-D Visualization
• Obtained by clicking on the previous graph (head: 3)
• All database access uses relational operators
• Each column is a rule; e.g., the 1st column: coat, lock, pump => bike, eggs (50% sup, 80% conf)
2/13/2012 © Sharma Chakravarthy 97
Rule Generation Steps
• Combine the frequent itemsets Fk from all passes into one table FISETS with columns ITEM1, ITEM2, ..., ITEMk, NULLM, COUNT
• Generate all non-empty subsets of each tuple in FISETS and store them in table Subsets with columns TITEM1, TITEM2, ..., TITEMk, TNULLM, TRULEM, TCOUNT
• Join FISETS and Subsets to get the association rules:

Insert into Rules
Select TITEM1, TITEM2, ..., TITEMk, TNULLM, TRULEM,
       TCOUNT, (TCOUNT/COUNT)*100
From   Subsets t1, FISETS t2
Where  (t1.TITEM1 = t2.ITEM1 or t1.TRULEM <= 1) AND
       (t1.TITEM2 = t2.ITEM2 or t1.TRULEM <= 2) AND
       ...
       (t1.TITEMk = t2.ITEMk or t1.TRULEM <= k) AND
       t1.TRULEM = t2.NULLM AND
       (TCOUNT/COUNT)*100 >= :minconfidence
2/13/2012 © Sharma Chakravarthy 98
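The rule-generation join can be exercised end-to-end on the slides' running example. A sketch in SQLite for k = 3; column names follow the slides, and the 0-padding and NULLM/TRULEM conventions are as described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fisets(item1, item2, item3, nullm, count);
CREATE TABLE subsets(titem1, titem2, titem3, tnullm, trulem, tcount);
""")
# Frequent 2-itemsets that serve as rule heads (from the example slides).
conn.executemany("INSERT INTO fisets VALUES (?,?,?,?,?)",
                 [(2, 3, 0, 3, 2), (2, 5, 0, 3, 3), (3, 5, 0, 3, 2)])
# Non-empty subsets of the frequent 3-itemset {2,3,5}; TRULEM marks where
# the rule head ends (2-item heads here, so TRULEM = 3).
conn.executemany("INSERT INTO subsets VALUES (?,?,?,?,?,?)",
                 [(2, 3, 5, 4, 3, 2), (2, 5, 3, 4, 3, 2), (3, 5, 2, 4, 3, 2)])

min_conf = 80
rules = conn.execute("""
SELECT t1.titem1, t1.titem2, t1.titem3,
       t1.tcount * 100.0 / t2.count AS conf
FROM subsets t1, fisets t2
WHERE (t1.titem1 = t2.item1 OR t1.trulem <= 1)
  AND (t1.titem2 = t2.item2 OR t1.trulem <= 2)
  AND (t1.titem3 = t2.item3 OR t1.trulem <= 3)
  AND t1.trulem = t2.nullm
  AND t1.tcount * 100.0 / t2.count >= ?""", (min_conf,)).fetchall()
# Only the two heads with 100% confidence survive: {2,3}=>5 and {3,5}=>2.
```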
Rule Generation

FISETS table (the last tuple, {2,3,5}, is used below)
ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM COUNT
1     0     0     0     0     0     0     0     2     2
2     0     0     0     0     0     0     0     2     3
3     0     0     0     0     0     0     0     2     3
5     0     0     0     0     0     0     0     2     3
1     3     0     0     0     0     0     0     3     2
2     3     0     0     0     0     0     0     3     2
2     5     0     0     0     0     0     0     3     3
3     5     0     0     0     0     0     0     3     2
2     3     5     0     0     0     0     0     4     2

Subsets table, generated by a table function: the non-empty subsets with 2 items in the rule head for the tuple {2,3,5}
TITEM1 TITEM2 TITEM3 TITEM4 TITEM5 TITEM6 TITEM7 TITEM8 TNULLM TRULEM TCOUNT
2      3      5      0      0      0      0      0      4      3      2
2      5      3      0      0      0      0      0      4      3      2
3      5      2      0      0      0      0      0      4      3      2

Join FISETS and Subsets to obtain the association rules
ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM RULEM CONF  SUP
2     3     5     0     0     0     0     0     4     3     100   50
2     5     3     0     0     0     0     0     4     3     66.67 50
3     5     2     0     0     0     0     0     4     3     100   50
2/13/2012 © Sharma Chakravarthy 99
Reverse Mapping of Items

Input (TID, ITEM)
100 Milk, 100 Eggs, 100 Bread, 200 Sugar, 200 Eggs, 200 Cake,
300 Milk, 300 Sugar, 300 Eggs, 300 Cake, 400 Sugar, 400 Cake

Description (ITEM NUMBER, DESCRIPTION)
1 Milk, 2 Sugar, 3 Eggs, 4 Bread, 5 Cake

Intermediate Rule Table (after mining)
ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM RULEM CONF  SUP
1     3     0     0     0     0     0     0     3     2     100   50
3     1     0     0     0     0     0     0     3     2     66.67 50
2     3     0     0     0     0     0     0     3     2     66.67 50
3     2     0     0     0     0     0     0     3     2     66.67 50
2     5     0     0     0     0     0     0     3     2     100   75
5     2     0     0     0     0     0     0     3     2     100   75
3     5     0     0     0     0     0     0     3     2     66.67 50
5     3     0     0     0     0     0     0     3     2     66.67 50
2     3     5     0     0     0     0     0     4     2     66.67 50
3     2     5     0     0     0     0     0     4     2     66.67 50
5     2     3     0     0     0     0     0     4     2     66.67 50
2     3     5     0     0     0     0     0     4     3     100   50
2     5     3     0     0     0     0     0     4     3     66.67 50
3     5     2     0     0     0     0     0     4     3     100   50

Final Rule Table (mapped back)
Rule Head    =>  Rule Body     Confidence(%)  Support(%)
Milk         =>  Eggs          100            50
Eggs         =>  Milk          67             50
Sugar        =>  Eggs          67             50
Eggs         =>  Sugar         67             50
Sugar        =>  Cake          100            75
Cake         =>  Sugar         100            75
Eggs         =>  Cake          67             50
Cake         =>  Eggs          67             50
Sugar        =>  Eggs, Cake    67             50
Eggs         =>  Sugar, Cake   67             50
Cake         =>  Sugar, Eggs   67             50
Sugar, Eggs  =>  Cake          100            50
Sugar, Cake  =>  Eggs          67             50
Eggs, Cake   =>  Sugar         100            50
2/13/2012 © Sharma Chakravarthy 100
SQL‐based Association rule mining
• SQL‐92
– K-way joins
– Subquery
– 3‐way joins
– 2 Group‐Bys
– Set‐oriented Apriori (an improvement of the K‐way join)
• SQL‐OR (uses table functions and other features: BLOBs, CLOBs, etc.)
– GatherJoin
– GatherPrune
– GatherCount
– Horizontal
– verticalTid
– SQL‐bodied functions
2/13/2012 © Sharma Chakravarthy 101
2/13/2012 102
SQL‐ OR based approaches
• Use UDFs (IBM DB2/UDB) and stored procedures (Oracle) along with some object‐relational constructs.
• Advantages:
– Flexibility in the way queries can be written.
– Can use complex data structures, a way to mimic the main‐memory algorithms.
• Disadvantages:
– Specific to a given RDBMS.
– Considerable effort to develop and test these procedures.
© Sharma Chakravarthy
Methodology for experiments
• Synthetic data sets, generated by using IBM’s data‐generator.
• Datasets are named TxxIyyDzzzK:
– xx denotes the average number of items per transaction.
– yy denotes the average size of the maximal potentially frequent itemsets.
– zzzK denotes the total number of transactions, in thousands.
– Example: T5I2D1000K.
• Tested on Oracle 8i and IBM DB2/UDB V6.1
• Each experiment has been performed 4 times in a row and average taken
• Most results are shown for three datasets – T5I2D500K, T5I2D1000K and T10I4D100K.
2/13/2012 © Sharma Chakravarthy 103
2/13/2012 104
VerticalTid (Vtid)
• Uses a different representation for the input data. Uses two procedures:– SaveTid
• To change the representation of the input data
– CountAndK• To use this changed representation for support counting
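The two procedures can be sketched with Python sets standing in for the CLOB tid-lists; the real SaveTid/CountAndK are stored procedures operating on CLOBs, so this is only an illustration of the idea on the slides' running example:

```python
# Dataset from the example slides: Tid -> set of items.
transactions = {1: {1, 3, 4}, 2: {2, 3, 5}, 3: {2, 3, 4, 5}, 4: {2, 5}}

def save_tid(transactions):
    """SaveTid: build one tid-list per item (vertical representation)."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def count_and_k(tidlists, candidate):
    """CountAndK: support = size of the intersection of the k tid-lists."""
    common = set.intersection(*(tidlists[i] for i in candidate))
    return len(common)

tidlists = save_tid(transactions)
support = count_and_k(tidlists, (2, 3, 5))   # tids {2,3} -> support 2
```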
© Sharma Chakravarthy
2/13/2012 105
VerticalTid (contd…)

Transaction (Tid, Item)
(1,1) (1,3) (1,4) (2,2) (2,3) (2,5) (3,2) (3,3) (3,4) (3,5) (4,2) (4,5)

SaveTid Procedure produces TidListTable
Item  TidList (Clob)  Count
1     "1"             1
2     "2,3,4"         3
3     "1,2,3"         3
4     "1,3"           2
5     "2,3,4"         3

F1
Item  Count
2     3
3     3
4     2
5     3

C2 (Item1, Item2): (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)

CountAnd2 Procedure
Item1  Item2  TidList1, TidList2   Count
2      3      "2,3,4", "1,2,3"     2
2      4      "2,3,4", "1,3"       1
2      5      "2,3,4", "2,3,4"     3
3      4      "1,2,3", "1,3"       2
3      5      "1,2,3", "2,3,4"     2
4      5      "1,3", "2,3,4"       1

F2
Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2
© Sharma Chakravarthy
2/13/2012 106
VerticalTid (contd…)

F2
Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

C3
Item1  Item2  Item3
2      3      5
3      4      5

TidListTable (from the SaveTid procedure)
Item  TidList (Clob)  Count
1     "1"             1
2     "2,3,4"         3
3     "1,2,3"         3
4     "1,3"           2
5     "2,3,4"         3

CountAnd3 Procedure
Item1  Item2  Item3  TidList1, TidList2, TidList3   Count
2      3      5      "2,3,4", "1,2,3", "2,3,4"      2
3      4      5      "1,2,3", "1,3", "2,3,4"        1
© Sharma Chakravarthy
2/13/2012 107
Experiment

Time taken (in secs) for mining different datasets
Step          T5I2D10K  T5I2D100K
TidListTable  22        102
Pass 1        0         0
Pass 2        3,148     55,487
Pass 3        8         175
Pass 4        1         11

Im_Vtid
© Sharma Chakravarthy
2/13/2012 108
Optimization
• In pass k, reduce the number of CLOBs passed to the CountAndK stored procedure:
1. Create CLOBs only for items whose count > minsup.
2. For pass 2, use the second-pass optimization of the K-way join.
3. For pass k (k > 2), the CountAndK stored procedure is modified to create the TidList for each frequent itemset and materialize it.
4. In pass k+1, use this materialized relation, Ck+1, and TidListTable for support counting.
• The CountAndK procedure now receives only 2 CLOBs (TidLists)
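The effect of materializing tid-lists for frequent itemsets can be sketched as follows; Python sets again stand in for the CLOBs, minsup = 2, and the data is the running example:

```python
def save_tid(transactions):
    """Build one tid-list (set) per item."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

transactions = {1: {1, 3, 4}, 2: {2, 3, 5}, 3: {2, 3, 4, 5}, 4: {2, 5}}
item_tids = save_tid(transactions)

# Pass 2: materialize the tid-lists of the frequent 2-itemsets (minsup = 2),
# mirroring the FComb relation kept by the modified CountAndK procedure.
fcomb2 = {}
f1 = [i for i, t in item_tids.items() if len(t) >= 2]
for a in f1:
    for b in f1:
        if a < b:
            tids = item_tids[a] & item_tids[b]
            if len(tids) >= 2:
                fcomb2[(a, b)] = tids

# Pass 3: each candidate now needs only one intersection of two lists
# (itemset tid-list + item tid-list), not three item-level intersections.
tids_235 = fcomb2[(2, 3)] & item_tids[5]   # -> {2, 3}, support 2
```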
© Sharma Chakravarthy
2/13/2012 109
Optimization
Modified CountAnd3 Stored Procedure
Item1  Item2  Item3  TidList1, TidList2, TidList3   TidList  Count
2      3      5      "2,3,4", "1,2,3", "2,3,4"      "2,3"    2
2      3      7      "1,2,3", "1,3", "1,3,4"        "1,3"    2

FComb3
Item1  Item2  Item3  TidList  Count
2      3      5      "2,3"    2
2      3      7      "1,3"    2

F3
Item1  Item2  Item3  Count
2      3      5      2
2      3      7      2

C4
Item1  Item2  Item3  Item4
2      3      5      7

Modified CountAnd4 Stored Procedure
Item1  Item2  Item3  Item4  TidList, TidList   Count
2      3      5      7      "2,3", "1,3,4"     2
© Sharma Chakravarthy
2/13/2012 110
Experiments
[Figure omitted: execution-time charts comparing the GJn and Vtid approaches]
© Sharma Chakravarthy
2/13/2012 111
Gather Join (Gjn)
• Differs in the way candidate itemsets are generated:
– Uses the CombinationK procedure for candidate itemset generation. CombinationK scans the input dataset, collects all the items of a transaction into a vector, and uses that vector to generate the candidate itemsets of length K.
– For support counting, a simple GROUP BY on the k items of the candidate itemset is sufficient to identify the frequent itemsets.
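A minimal sketch of these two steps, with itertools standing in for the CombinationK table function and a Counter standing in for the GROUP BY:

```python
from collections import Counter
from itertools import combinations

# Running example from the slides: Tid -> item vector.
transactions = {1: [1, 3, 4], 2: [2, 3, 5], 3: [2, 3, 4, 5], 4: [2, 5]}

def candidate_counts(transactions, k):
    """Emit all k-item combinations per transaction, then count them."""
    counts = Counter()
    for items in transactions.values():
        for combo in combinations(sorted(items), k):  # CombinationK
            counts[combo] += 1                        # the GROUP BY count
    return counts

c2 = candidate_counts(transactions, 2)
f2 = {c: n for c, n in c2.items() if n >= 2}          # min support = 2
```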
© Sharma Chakravarthy
2/13/2012 112
Gather Join (contd…)
• For Oracle implementation the CombinationK stored procedure has been modified.
– It scans the input datasets and collects all items bought in a transaction in a Vector and uses it for generation of candidate itemsets.
– ItemList table is not materialized.
© Sharma Chakravarthy
2/13/2012 113
Experiments
could not run for D500K and D1000K
© Sharma Chakravarthy
2/13/2012 114
Optimization
• Goal: reduce the number of candidate itemsets that are generated.
• Fact: the frequent itemsets of length k-1 are sufficient to generate all candidate itemsets of length k.
• Pass k (k > 2): materialize the tuples that contain frequent itemsets and use only these tuples, instead of the input table, to generate Ck+1. The CombinationK stored procedure is modified to insert the corresponding transaction id along with the item combinations.
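The pruned-input idea can be sketched as follows, with plain Python standing in for the modified CombinationK procedure and the SQL counting; the data is the running example with minsup = 2:

```python
from collections import Counter
from itertools import combinations

transactions = {1: [1, 3, 4], 2: [2, 3, 5], 3: [2, 3, 4, 5], 4: [2, 5]}
minsup = 2

# Pass 2: count all pairs, then materialize the (Tid, pair) tuples whose
# pair is frequent -- this plays the role of the FComb2 relation.
pair_counts = Counter()
for tid, items in transactions.items():
    for p in combinations(sorted(items), 2):
        pair_counts[p] += 1
f2 = {p for p, n in pair_counts.items() if n >= minsup}
fcomb2 = [(tid, p) for tid, items in transactions.items()
          for p in combinations(sorted(items), 2) if p in f2]

# Pass 3: extend only the materialized frequent tuples, not the whole input.
triple_counts = Counter()
for tid, (a, b) in fcomb2:
    for c in transactions[tid]:
        if c > b:                       # keep item order, avoid duplicates
            triple_counts[(a, b, c)] += 1
f3 = {t for t, n in triple_counts.items() if n >= minsup}
```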
© Sharma Chakravarthy
2/13/2012 115
Optimization

Transaction (Tid, Iid)
(1,1) (1,3) (1,4) (2,2) (2,3) (2,5) (3,2) (3,3) (3,4) (3,5) (4,2) (4,5)

C2 (Modified Combination2)
Tid  Item1  Item2
1    1      3
1    1      4
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      4
3    2      5
3    3      4
3    3      5
3    4      5
4    2      5

F2
Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

FComb2 (C2 tuples whose item pair is frequent)
Tid  Item1  Item2
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      5
3    3      4
3    3      5
4    2      5
© Sharma Chakravarthy
2/13/2012 116
Optimization

FComb2
Tid  Item1  Item2
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      5
3    3      4
3    3      5
4    2      5

Pruned input (Tid, Item)
1  3
1  4
2  2
2  3
2  5
3  2
3  3
3  4
3  5

C3 (Modified Combination3)
Tid  Item1  Item2  Item3
2    2      3      5
3    2      3      4
3    2      3      5
3    3      4      5

F3
Item1  Item2  Item3  Count
2      3      5      2
© Sharma Chakravarthy
2/13/2012 117
Experiments
© Sharma Chakravarthy
2/13/2012 118
Gather Count (Gcnt)
• Similar to Gjn, except that in pass 2 this approach uses a 2-dimensional array to store the counts of all two-item combinations and outputs only those combinations whose support > minsup.
• Because of memory constraints, the same is not possible for the other passes. As the C2 table is not materialized, pruning cannot be done for pass 3.
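Pass 2 of this approach can be sketched with a plain 2-D list standing in for the in-memory array:

```python
from itertools import combinations

# Running example: Tid -> item vector; 5 distinct items, minsup = 2.
transactions = {1: [1, 3, 4], 2: [2, 3, 5], 3: [2, 3, 4, 5], 4: [2, 5]}
n_items, minsup = 5, 2

# The 2-D array indexed by (item a, item b) with a < b.
counts = [[0] * (n_items + 1) for _ in range(n_items + 1)]
for items in transactions.values():
    for a, b in combinations(sorted(items), 2):
        counts[a][b] += 1

# Only pairs meeting minsup are ever written out; C2 is never materialized.
f2 = [(a, b, counts[a][b]) for a in range(1, n_items + 1)
      for b in range(a + 1, n_items + 1) if counts[a][b] >= minsup]
```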
© Sharma Chakravarthy
2/13/2012 119
Experiments (Gcnt)
© Sharma Chakravarthy
2/13/2012 120
Optimization
• As Gather Count is very similar to Gather Join approach, the same optimization is applicable to Gather Count approach also.
• For pass 3 onwards, the Im_Gcnt approach uses the modified CombinationK procedures.
© Sharma Chakravarthy
2/13/2012 121
Experiments
© Sharma Chakravarthy
2/13/2012 122
Experiment
© Sharma Chakravarthy
2/13/2012 123
Conclusion: SQL‐OR
© Sharma Chakravarthy
2/13/2012 124
IM_Vtid
Conclusion: SQL‐OR
© Sharma Chakravarthy
2/13/2012 125
Conclusion: SQL‐OR
• Ranking of the naïve approaches:
1. Gather Count
2. Gather Join
3. Vertical Tid
• Though the SQL‐OR based approaches seem very simple in the way support counting is done, processing CLOBs is time consuming.
• The optimizations reduce the number of CLOBs that need to be processed and hence are very effective.
© Sharma Chakravarthy
2/13/2012 126
Metadata Table
The choice of approach is recorded as metadata based on:
• the cardinality of the data
• the underlying RDBMS
• whether extra space can be used
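A sketch of how such metadata could drive the choice; the entries mirror the SQL-OR summary table on the next slide, and the function and fallback are illustrative, not a shipped optimizer:

```python
# Hypothetical metadata lookup: (DBMS, extra-space allowed) -> best approach,
# as observed in the SQL-OR experiments.
BEST_SQL_OR = {
    ("DB2", False): "Gjn",
    ("Oracle", True): "IM_Gcnt",
    ("Oracle", False): "Gcnt",
}

def choose_approach(dbms, extra_space):
    # Fall back to a plain SQL-92 K-way join when no entry applies.
    return BEST_SQL_OR.get((dbms, extra_space), "Kwj")

choice = choose_approach("Oracle", extra_space=True)   # -> "IM_Gcnt"
```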
© Sharma Chakravarthy
2/13/2012 127
Metadata Table

Summary Table for SQL-OR based Approaches (datasets T5I2DzzzK, zzz = 10K to 1000K; supports S = 0.10%, 0.15%, 0.20%)

              Extra Space   No Extra Space
IBM DB2/UDB   -NA-          Gjn
Oracle        IM_Gcnt       Gcnt

The same approach won for every dataset size and support value tested.
© Sharma Chakravarthy
SQL‐92 based approaches
Implementation on Oracle and IBM DB2/UDB
Experiments
2/13/2012 © Sharma Chakravarthy 129
2/13/2012 130
Experiments
© Sharma Chakravarthy
2/13/2012 131
Second Pass Optimization on Pruned Input (SpoPi)
© Sharma Chakravarthy
2/13/2012 132
Reuse of Item Combinations on Pruned Input
© Sharma Chakravarthy
2/13/2012 133
Reuse of Item Combinations and Second Pass Optimization
© Sharma Chakravarthy
2/13/2012 134
Combination of all Optimizations
© Sharma Chakravarthy
2/13/2012 135
Comparison of SQL based approaches
© Sharma Chakravarthy
Results
Trends in Oracle for SQL-92 based approaches

Table Name  Ranking  Supp = 0.2%  Supp = 0.15%  Supp = 0.1%
T5I2D100K   First    RicSpo       RicSpo        Kwj
            Second   All          All           RicSpo
            Last     RicPi        RicPi         RicPi
T5I2D500K   First    RicSpo       RicSpo        Spo
            Second   Spo          Spo           RicSpo
            Last     RicPi        RicPi         RicPi

Table Name  Ranking  Supp = 2.0%  Supp = 1.0%  Supp = 0.75%  Supp = 0.33%
T10I4D100K  First    All          RicSpo       RicSpo        Ric
            Second   Pi           All          All           Spo
            Last     Ric          RicPi        RicPi         RicSpo
2/13/2012 © Sharma Chakravarthy 136
Results
Trends in IBM DB2/UDB for SQL-92 based approaches

Table Name  Ranking  Supp = 0.2%  Supp = 0.15%  Supp = 0.1%
T5I2D100K   First    RicSpo       Spo           RicSpo
            Second   Spo          RicSpo        All
            Last     RicPi        SpoPi         SpoPi
T5I2D500K   First    Spo          Spo           Spo
            Second   RicSpo       RicSpo       RicSpo
            Last     SpoPi        SpoPi         SpoPi

Table Name  Ranking  Supp = 2.0%  Supp = 1.0%  Supp = 0.75%
T10I4D100K  First    Spo          RicSpo       RicSpo
            Second   RicSpo       All          All
            Last     Ric          Kwj          Kwj
2/13/2012 © Sharma Chakravarthy 137
2/13/2012 138
Metadata Table

Summary Table for SQL-92 based Approaches

T5I2DzzzK   IBM DB2/UDB               Oracle                    Support Value
            Extra     No Extra        Extra     No Extra
10K         RicSpo    Spo             RicSpo    Spo             S = 0.20 %
            RicSpo    Spo             RicSpo    Spo             S = 0.15 %
            RicSpo    Spo             Spo       Spo             S = 0.10 %
50K         RicSpo    Spo             RicSpo    Spo             S = 0.20 %
            Spo       Spo             RicSpo    Spo             S = 0.15 %
            Spo       Spo             Spo       Spo             S = 0.10 %
100K        RicSpo    Spo             RicSpo    Spo             S = 0.20 %
            Spo       Spo             RicSpo    Spo             S = 0.15 %
            Spo       Spo             Spo       Spo             S = 0.10 %
500K        Spo       Spo             RicSpo    Spo             S = 0.20 %
            Spo       Spo             RicSpo    Spo             S = 0.15 %
            Spo       Spo             Spo       Spo             S = 0.10 %
1000K       RicSpo    Spo             RicSpo    Spo             S = 0.20 %
            Spo       Spo             RicSpo    Spo             S = 0.15 %
            Spo       Spo             Kwj       Spo             S = 0.10 %
© Sharma Chakravarthy
Thank You !!!
2/13/2012 © Sharma Chakravarthy 139
2/13/2012 140
Discussion
© Sharma Chakravarthy
References
• Thuraisingham, B., A Primer for Understanding and Applying Data Mining. IEEE, 2000. Vol. 2, No.1: p. 28-31.
• Thomas, S., Architectures and optimizations for integrating Data Mining algorithms with Database Systems, PhD thesis, CISE department 1998, University of Florida: Gainesville.
• Agrawal, R., T. Imielinski, and A. Swami. Mining Association Rules between sets of items in large databases. in ACM SIGMOD International Conference on the Management of Data. 1993. Washington, D.C.
• Agrawal, R. and R. Srikant. Fast Algorithms for mining association rules. in 20th Int'l Conference on Very Large Databases (VLDB). 1994.
• Savasere, A., E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. in 21st Int'l Conf. on Very Large Databases (VLDB). 1995. Zurich, Switzerland.
• Shenoy, P., et al. Turbo-charging Vertical Mining of Large Databases. in ACM SIGMOD Int'l Conference on Management of Data. 2000. Dallas.
• Han, J., J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. in ACM SIGMOD Int'l Conference on Management of Data. 2000. Dallas.
• Houtsma, M. and A. Swami. Set-Oriented Mining for Association Rules in Relational Databases. in 11th International Conference on Data Engineering (ICDE). 1995.
2/13/2012 © Sharma Chakravarthy 141
References
• Han, J., et al. DMQL: A data mining query language for relational database. in ACM SIGMOD workshop on research issues on data mining and knowledge discovery. 1996. Montreal.
• Meo, R., G. Psaila, and S. Ceri. A New SQL-like Operator for Mining Association Rules. in Proceedings of the 22nd VLDB Conference. 1996. Mumbai, India.
• Agrawal, R. and K. Shim, Developing tightly-coupled Data Mining Applications on a Relational Database System. 1995, IBM Almaden Research Center: San Jose, California.
• Sarawagi, S., S. Thomas, and R. Agrawal. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. in ACM SIGMOD Int'l Conference on Management of Data. 1998. Seattle, Washington.
• Dudgikar, M., A Layered Optimizer for Mining Association Rules over RDBMS, in CSE Department. 2000, University of Florida: Gainesville.
• Thomas. S, and Chakravarthy. S. Performance evaluation and optimization of join queries for association rule mining. Proc. of the First International Conference on Data Warehousing and Knowledge Discovery, DaWaK '99, Florence, Italy, August 1999
• R. Balachandran, S. Padmanabhan, and S. Chakravarthy, “Enhanced DB-Subdue: Supporting Subtle Aspects of Graph Mining Using a Relational Approach”, To appear in PAKDD, Singapore, April 2006 (Short paper)
• A. Srinivasan, S. Sreshta, and S. Chakravarthy, “Discovery of Significant Intervals in Sequential Data”, in the Proc. of 1st ADBIS Workshop on Data Mining and Knowledge Discovery, Tallinn, Estonia, September 2005, pp. 87-98.
2/13/2012 © Sharma Chakravarthy 142
References
• H. Kona, S. Chakravarthy, and A. Arora, “SQL-Based Approach to Incremental Association Rule Mining”, in the Proc. of 1st ADBIS Workshop on Data Mining and Knowledge Discovery, Tallinn, Estonia, September 2005, pp. 11-24
• Sharma Chakravarthy, Ramji Beera, and Ram Balachandran, DB-Subdue: Database Approach to Graph Mining, In PAKDD conference, Sydney, May 2004.
• P. Mishra and S. Chakravarthy, “Performance Evaluation of SQL-OR Variants for Association Rule Mining”, in Proc. Of DaWaK (Data warehousing and Knowledge Discovery), September 2003, Prague.
• P. Mishra and S. Chakravarthy, “Performance Evaluation and Analysis of K-way join variants for Association Rule Mining”, in Proceedings of BNCOD 2003, Sheffield, UK, July 2003, 95-114.
• M. Dudgikar, S. Chakravarthy, R. Liuzi, and L. Wong, “A Layered Optimizer for Mining Association Rules over Relational Database Management Systems”, 2003 International Conference on Artificial Intelligence (IC-AI'2003), June 2002, Monte Carlo Resort, Las Vegas, Nevada
• S. Chakravarthy and H. Zhang, Visualization of Association Rules over RDBMSs, in Proc. of ACM SAC 2003, Melbourne, FL, March 2003 (Multi-media and Visualization Track).
• Mr. Srihari Padmanabhan, “Relational Database Approach to Graph Mining and Hierarchical Reduction”, Fall 2005 http://itlab.uta.edu/itlabweb/students/sharma/theses/pad05ms.pdf
• Mr. Sunit Sreshta, “SQL-Based Approach to Significant Interval Discovery in Time-Series Data”, Summer 2005 http://itlab.uta.edu/itlabweb/students/sharma/theses/shr05ms.pdf
2/13/2012 © Sharma Chakravarthy 143
References
• R. Balachandran, “Relational Approach to Modeling and Implementing Subtle Aspects of Graph Mining”, Fall 2003. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-41.pdf
• Ms. A. Krishnamurthy, “Significant Interval and Episode Discovery in Time-Series Data”, Fall 2003. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-39.pdf
• Mr. P. Mishra, “Performance Evaluation and Analysis of SQL-Based Approaches for Association Rule Mining”, Fall 2002. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-3.pdf
• Mr. Hongen Zhang, “Mining and Visualization of Association Rules in Relational DBMSs'', Summer 2000. http://itlab.uta.edu/sharma/People/ThesisWeb/etd.pdf
• S. Thomas and S. Chakravarthy, “Incremental Mining of Constrained Associations'', in Proc. of High Performance Computing (HiPC), Bangalore, India, Dec. 2000.
2/13/2012 © Sharma Chakravarthy 144