Scalable Mining For Classification Rules in Relational Databases
Presented by: Nadav Grossaug
Min Wang, Bala Iyer, Jeffrey Scott Vitter
Abstract
• Problem: the ever-increasing size of training sets
• MIND (MINing in Database) classifier
• Can be implemented easily in SQL
• Other classifiers need O(N) space in memory
• MIND scales well over:
  • I/O
  • # of processors
Overview
• Introduction
• Algorithm
• Database Implementation
• Performance
• Experimental Results
• Conclusions
Introduction - Classification Problem

(Figure: an example decision tree built from the DETAIL table by the classifier; it tests "Age <= 30" and "salary <= 62K", and its leaves label records as safe or risky.)
Introduction - Scalability In Classification

Importance of scalability:
• Use a very large training set – data is not memory resident.
• Number of CPUs – better usage of resources.
Introduction - Scalability In Classification

Properties of MIND:
• Scalable in memory
• Scalable in CPU
• Uses SQL
• Easy to implement

Assumptions:
• Attribute values are discrete
• We focus on the growth phase (no pruning)
The Algorithm - Data Structure

Data is kept in the DETAIL table:
DETAIL(attr1, attr2, …, class, leaf_num)
• attri = the i-th attribute
• class = the class label
• leaf_num = the number of the leaf the example belongs to (this can be computed from the known tree)
The Algorithm - gini Index

S - data set
C - number of classes
pi - relative frequency of class i in S

gini index:
gini(S) = 1 - Σi pi²
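The definition above can be sketched directly (a minimal illustration; the function name and the counts-based input format are mine, not from the paper):

```python
# gini(S) = 1 - sum_i p_i^2, where p_i is the relative frequency
# of class i in S; here S is summarized by its per-class counts.
def gini(class_counts):
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

print(gini([5, 5]))    # maximally mixed two-class set -> 0.5
print(gini([10, 0]))   # pure set -> 0.0
```

A pure leaf scores 0; the more evenly the classes are mixed, the higher the score, which is why the algorithm minimizes it when choosing splits.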
The Algorithm

GrowTree(DETAIL table):
  Initialize tree T and put all records of DETAIL in the root;
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each split value with respect to attribute i;
    for each non-STOP leaf do
      get the overall best split for it;
    partition the records and grow the tree for one more level according to the best splits;
    mark all small or pure leaves as STOP nodes;
  return T;
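The loop above can be sketched in memory (the toy data and all names are illustrative, not from the paper; MIND itself evaluates these steps with SQL over the DETAIL table):

```python
# Level-by-level tree growth with binary splits "attr <= v",
# assuming numeric attributes. Records are ((attr0, attr1, ...), class).
from collections import Counter

def gini(counts):
    """gini(S) = 1 - sum_i p_i^2 over the class counts of S."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_split(records, attr):
    """Best (weighted gini, value) for a split 'attr <= value' on one leaf."""
    values = sorted({r[0][attr] for r in records})
    n = len(records)
    best = (float("inf"), None)
    for v in values[:-1]:                    # last value gives an empty right side
        left = Counter(c for x, c in records if x[attr] <= v)
        right = Counter(c for x, c in records if x[attr] > v)
        g = (sum(left.values()) * gini(left) + sum(right.values()) * gini(right)) / n
        if g < best[0]:
            best = (g, v)
    return best

def grow_tree(records, n_attrs, min_size=1):
    """Grow one level per iteration; leaves numbered heap-style (children 2k+1, 2k+2)."""
    leaves, stop, splits = {0: records}, set(), {}
    while any(k not in stop for k in leaves):
        next_leaves = {}
        for leaf, recs in leaves.items():
            if leaf in stop:
                next_leaves[leaf] = recs
                continue
            if len(recs) <= min_size or len({c for _, c in recs}) == 1:
                stop.add(leaf)               # mark small or pure leaves as STOP
                next_leaves[leaf] = recs
                continue
            cands = [(g, a, v) for a in range(n_attrs)
                     for g, v in [best_split(recs, a)] if v is not None]
            if not cands:                    # no usable split value
                stop.add(leaf)
                next_leaves[leaf] = recs
                continue
            g, a, v = min(cands)             # overall best split for this leaf
            splits[leaf] = (a, v)
            next_leaves[2 * leaf + 1] = [r for r in recs if r[0][a] <= v]
            next_leaves[2 * leaf + 2] = [r for r in recs if r[0][a] > v]
        leaves = next_leaves
    return splits, leaves

# (age, salary) -> risk label, loosely following the example tree
recs = [((25, 50), "risky"), ((30, 60), "risky"), ((25, 70), "safe"),
        ((40, 50), "safe"), ((40, 70), "safe"), ((35, 80), "safe")]
splits, leaves = grow_tree(recs, n_attrs=2)
print(splits)   # leaf_num -> (attribute index, split value)
```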
Database Implementation - Dimension Table

For each attribute i and each level of the tree:

INSERT INTO DIMi
SELECT leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri

Size of DIMi = #leaves × #distinct values of attri × #classes
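On a relational engine this is a single GROUP BY pass; here is a runnable sketch on SQLite (the toy DETAIL rows, the cnt alias, and the -1 STOP marker are my illustration, not the paper's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DETAIL (attr1 INT, attr2 INT, class TEXT, leaf_num INT)")
conn.executemany("INSERT INTO DETAIL VALUES (?,?,?,?)", [
    (1, 10, "safe", 0), (1, 20, "risky", 0),
    (2, 10, "safe", 0), (2, 20, "safe", 0),
])
# One pass over DETAIL builds the per-leaf class histogram for attr1.
conn.execute("""
    CREATE TABLE DIM1 AS
    SELECT leaf_num, class, attr1, COUNT(*) AS cnt
    FROM DETAIL
    WHERE leaf_num <> -1            -- -1 stands in for the STOP marker
    GROUP BY leaf_num, class, attr1
""")
rows = conn.execute(
    "SELECT leaf_num, class, attr1, cnt FROM DIM1 ORDER BY attr1, class"
).fetchall()
print(rows)
```

Because DIMi only holds one row per (leaf, class, attribute value) triple, it is tiny compared to DETAIL, which is what makes the rest of the per-level work cheap.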
Database Implementation - Dimension Table SQL

INSERT INTO DIM1
SELECT leaf_num, class, attr1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2

• • •
Database Implementation - UP/DOWN Split

For each attribute i we find all possible split points:

INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attri <= d1.attri
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class

(DOWN is built analogously with d2.attri > d1.attri.)
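The cumulative-count idea can be sketched with a plain self-join on SQLite (the paper's outer join additionally keeps (value, class) pairs with no matching row; this simplified sketch and its toy data are my illustration and omit that):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DIM1 (leaf_num INT, class TEXT, attr1 INT, cnt INT)")
conn.executemany("INSERT INTO DIM1 VALUES (?,?,?,?)", [
    (0, "safe", 1, 1), (0, "safe", 2, 2), (0, "risky", 1, 1),
])
# UP(leaf, v, class) = number of records of that class with attr1 <= v,
# i.e. the class histogram of the left side of a split at v.
conn.execute("""
    CREATE TABLE UP AS
    SELECT d1.leaf_num, d1.attr1, d1.class, SUM(d2.cnt) AS cnt
    FROM DIM1 d1, DIM1 d2
    WHERE d1.leaf_num = d2.leaf_num
      AND d2.attr1 <= d1.attr1
      AND d1.class = d2.class
    GROUP BY d1.leaf_num, d1.attr1, d1.class
""")
rows = conn.execute("SELECT attr1, class, cnt FROM UP ORDER BY attr1, class").fetchall()
print(rows)
```

Note the join runs over the small DIMi table, not over DETAIL, so evaluating every candidate split point stays cheap.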
Database Implementation - Class View

Create a view for each class k and attribute i:

CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k

(An analogous Ck_DOWN view is defined over DOWN.)
Database Implementation - GINI Value

Create a view holding all gini values:

CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, f_gini
FROM C1_UP u1, …, Cc_UP uc, C1_DOWN d1, …, Cc_DOWN dc
WHERE u1.attri = … = uc.attri = d1.attri = … = dc.attri
  AND u1.leaf_num = … = uc.leaf_num = d1.leaf_num = … = dc.leaf_num

(f_gini stands for the gini expression computed from the up/down counts u1.count, …, dc.count.)
Database Implementation - MIN GINI Value

Create a table of the minimum gini value for each attribute i:

INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - Best Split

Create a view over MIN_GINI for the best split:

CREATE VIEW BEST_SPLIT(leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)
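Both MIN_GINI and BEST_SPLIT rely on the same correlated-subquery pattern: keep the row whose gini equals the per-leaf minimum. A runnable sketch on SQLite, with made-up gini values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GINI_VALUE (leaf_num INT, attr1 INT, gini REAL)")
conn.executemany("INSERT INTO GINI_VALUE VALUES (?,?,?)", [
    (0, 1, 0.40), (0, 2, 0.22),
    (1, 1, 0.10), (1, 2, 0.30),
])
# For each leaf, select the split value whose gini is the leaf's minimum.
rows = conn.execute("""
    SELECT a.leaf_num, a.attr1, a.gini
    FROM GINI_VALUE a
    WHERE a.gini = (SELECT MIN(gini)
                    FROM GINI_VALUE b
                    WHERE a.leaf_num = b.leaf_num)
    ORDER BY a.leaf_num
""").fetchall()
print(rows)
```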
Database Implementation - Partitioning

• Build new nodes by splitting old nodes according to the BEST_SPLIT values.
• Assign the correct node to each record: the leaf_num is computed by a function.
• No need to UPDATE the data or the DB.
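A sketch of that last point: leaf_num is computed from the known tree by a function instead of being stored and updated in DETAIL (the heap-style child numbering and all names here are my illustration):

```python
# splits: leaf_num -> (attribute index, split value);
# children of node k are numbered 2k+1 (attr <= value) and 2k+2 (attr > value).
def leaf_num(record, splits):
    node = 0
    while node in splits:            # walk down until we reach a leaf
        attr, val = splits[node]
        node = 2 * node + 1 if record[attr] <= val else 2 * node + 2
    return node

splits = {0: (0, 30)}                # root splits on attribute 0 at value 30
print(leaf_num((25, 50), splits))    # goes left  -> leaf 1
print(leaf_num((40, 50), splits))    # goes right -> leaf 2
```

Because the function is deterministic given the current tree, re-deriving leaf_num at each level replaces what would otherwise be an UPDATE over the entire DETAIL table.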
Performance

• I/O cost of MIND: dominated by a fixed number of sequential scans of the DETAIL table per tree level; the DIMi tables are small and stay in memory.
• I/O cost of SPRINT: dominated by reading and rewriting its attribute lists at every level, which costs substantially more I/O.
Experimental Results

(Figures: normalized time to finish building the tree; normalized time to build the tree per example.)
Experimental Results

(Figures: normalized time to build the tree per number of processors; time to build the tree by training set size.)
Conclusions
• MIND works over a DB
• MIND works well because:
  – MIND rephrases classification as a DB problem
  – MIND avoids UPDATEs to the DETAIL table
  – Parallelism and scaling are achieved through the RDBMS
  – MIND uses a user-defined function to gain performance in the DIMi creation