Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints
Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino
Outline
- Motivations
  - Knowledge Discovery from Databases (KDD), Inductive Databases
  - Constraint-Based Mining
- Incremental Constraint Evaluation
  - Association Rule Mining
  - Incremental Algorithms
  - Constraint properties
- Item Dependent Constraints (IDC), Context Dependent Constraints (CDC)
- Incremental Algorithms for IDC and CDC
- Performance results and Conclusions
Motivations: KDD process and Inductive Databases (IDB)
- The KDD process consists of the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Inductive Databases were proposed by Mannila and Imielinski [CACM'96] as a support for KDD
- KDD is an interactive and iterative process
- Inductive Databases contain both data and inductive generalizations (e.g. patterns, models) extracted from the data
- Users can query the inductive database with an advanced, ad hoc data mining query language (constraint-based queries)
Motivations: Constraint-Based Mining and Incrementality
Why constraints?
- They can be pushed into the pattern computation, pruning the search space
- They give the user a tool to express her interests (both in data and in knowledge)
In IDBs, constraint-based queries are very often a refinement of previous ones:
- explorative process
- reconciling background and extracted knowledge
Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK'99]
A Generic Mining Language
R = Q( ). A very generic constraint-based mining query requests:
- the extraction, from a source table T,
- from the groups of the database (grouping constraints G),
- of sets of items (itemsets), on some schema I,
- satisfying some user-defined constraints (mining constraints M);
- the number of such groups must be sufficient (user-defined statistical evaluation measures, such as support).
In our case R contains association rules.
An Example
R = Q(purchase, customer, product, price>100, support_count>=2)

The purchase source table:

transaction  customer  product       date     price  quantity
1            1001      hiking boots  12/7/98  140    1
1            1001      ski pants     12/7/98  180    1
3            1001      jacket        17/7/98  300    1
2            2256      col shirt     12/7/98   25    2
2            2256      ski pants     13/7/98  180    1
2            2256      jacket        13/7/98  300    1
4            3441      col shirt     13/7/98   25    3
5            3441      jacket        20/8/98  300    2
R, the result of the mining query:

itemset              support_count
{jacket}             3
{ski pants}          2
{jacket, ski pants}  2

body       head       confidence  frequency
ski pants  jacket     1           2/3
jacket     ski pants  2/3         2/3
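The example query above can be sketched in code. This is an illustrative reconstruction, not the authors' actual query engine: the function name `frequent_itemsets` and the data layout are assumptions, and the mining constraint (price > 100) is hard-coded for brevity.

```python
from itertools import combinations

# The 'purchase' source table from the slide:
# (transaction, customer, product, date, price, quantity)
purchase = [
    (1, 1001, "hiking boots", "12/7/98", 140, 1),
    (1, 1001, "ski pants",    "12/7/98", 180, 1),
    (3, 1001, "jacket",       "17/7/98", 300, 1),
    (2, 2256, "col shirt",    "12/7/98",  25, 2),
    (2, 2256, "ski pants",    "13/7/98", 180, 1),
    (2, 2256, "jacket",       "13/7/98", 300, 1),
    (4, 3441, "col shirt",    "13/7/98",  25, 3),
    (5, 3441, "jacket",       "20/8/98", 300, 2),
]

def frequent_itemsets(table, min_support_count):
    # Group by customer, keeping only items that satisfy the
    # mining constraint price > 100.
    groups = {}
    for _, cust, product, _, price, _ in table:
        if price > 100:
            groups.setdefault(cust, set()).add(product)
    # Count, for every candidate itemset, how many groups contain it.
    counts = {}
    for items in groups.values():
        for k in range(1, len(items) + 1):
            for subset in combinations(sorted(items), k):
                counts[subset] = counts.get(subset, 0) + 1
    # Keep only itemsets whose support count reaches the threshold.
    return {s: c for s, c in counts.items() if c >= min_support_count}

R = frequent_itemsets(purchase, 2)
print(R)
```

Running this reproduces the slide's result: {jacket} with support 3, {ski pants} and {jacket, ski pants} with support 2; {hiking boots} is pruned because only one customer group contains it.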
Incremental Algorithms
- We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results
- We identified two classes of query constraints: item dependent (IDC) and context dependent (CDC)
- We propose two newly developed incremental algorithms which allow the exploitation of past results in the two cases (IDC and CDC)
Relationships between two queries
- Query equivalence: R1 = R2, so no computation is needed [FQAS'04]
- Query containment [this paper]:
  - Inclusion: R2 ⊆ R1 and the common elements have the same statistical measures (R2 = C(R1))
  - Dominance: R2 ⊆ R1 but the common elements do not have the same statistical measures (R2 ≠ C(R1))
We can speed up the execution of a new query using the results of previous queries. Which previous queries? How can we recognize inclusion or dominance between two constraint-based queries?
IDC vs CDC

transaction  customer  product       date     price  quantity
1            1001      ski pants     12/7/98  140    1
1            1001      hiking boots  12/7/98  180    1
2            1001      jacket        17/7/98  300    2
2            2256      col shirt     12/7/98   25    2
2            2256      ski pants     13/7/98  140    2
3            2256      jacket        13/7/98  300    1
4            2256      col shirt     13/7/98   25    3
4            2256      jacket        20/8/98  300    2

CDC: qty > 1
Item Dependent Constraints (IDC):
- are functionally dependent on the item extracted
- are satisfied for a given itemset either for all the groups in the database or for none
- if an itemset is common to R1 and R2, it has the same support: inclusion

Context Dependent Constraints (CDC):
- depend on the transactions in the database
- might be satisfied for a given itemset only for some groups in the database
- an itemset common to R1 and R2 might not have the same support: dominance
IDC: price > 150
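The distinction can be checked numerically on the table above. In this hypothetical sketch (the helper `support` and the grouping by transaction are assumptions for illustration), an IDC leaves the support of a surviving itemset unchanged, while a CDC can change it:

```python
# Rows of the slide's second table: (transaction, customer, product, price, quantity)
rows = [
    (1, 1001, "ski pants",    140, 1),
    (1, 1001, "hiking boots", 180, 1),
    (2, 1001, "jacket",       300, 2),
    (2, 2256, "col shirt",     25, 2),
    (2, 2256, "ski pants",    140, 2),
    (3, 2256, "jacket",       300, 1),
    (4, 2256, "col shirt",     25, 3),
    (4, 2256, "jacket",       300, 2),
]

def support(item, pred):
    # Number of transactions (groups) containing `item` via a row satisfying pred.
    return len({t for (t, c, p, price, qty) in rows if p == item and pred(price, qty)})

base = support("jacket", lambda price, qty: True)         # no extra constraint
idc  = support("jacket", lambda price, qty: price > 150)  # IDC: price > 150
cdc  = support("jacket", lambda price, qty: qty > 1)      # CDC: qty > 1

print(base, idc, cdc)
```

Price is functionally dependent on the item, so the IDC either keeps all of jacket's transactions or none of them (support stays 3: inclusion). Quantity varies per transaction, so the CDC drops transaction 3 but not 2 and 4, changing jacket's support (dominance).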
Incremental Algorithm for IDC

Previous query Q1, constraint: price > 5. Current query Q2, constraint: price > 10.

Item Domain Table:

item  price  category
A     12     hi-tech
B     14     hi-tech
C     8      housing

Item C belongs to a row that does not satisfy the new IDC constraint: fail.

Rules in memory (R1):

BODY  HEAD  SUPP  CONF
A     B     2     1
A     C     2     …
…     …     …     …

Delete from R1 all rules containing item C. The result is R2 (R2 = P(R1)):

BODY  HEAD  SUPP  CONF
A     B     2     1
…     …     …     …
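The IDC step can be sketched as a pure in-memory filter over R1, with no database scan. The names `item_domain` and `incremental_idc` are illustrative, not the authors' code, and the confidence of the A → C rule is left as `None` because the slide does not show it:

```python
# Item domain table from the slide: price is functionally dependent on the item.
item_domain = {"A": {"price": 12, "category": "hi-tech"},
               "B": {"price": 14, "category": "hi-tech"},
               "C": {"price": 8,  "category": "housing"}}

R1 = [  # (body, head, supp, conf) -- previous result, constraint price > 5
    (("A",), ("B",), 2, 1.0),
    (("A",), ("C",), 2, None),
]

def incremental_idc(R1, constraint):
    # Items whose domain row fails the new IDC constraint.
    failing = {i for i, row in item_domain.items() if not constraint(row)}
    # Keep only rules that mention no failing item (inclusion: measures unchanged).
    return [r for r in R1 if not (set(r[0]) | set(r[1])) & failing]

R2 = incremental_idc(R1, lambda row: row["price"] > 10)  # new, tighter IDC
print(R2)
```

Only item C (price 8) fails the new constraint, so the rule A → C is deleted and A → B survives with its support and confidence untouched.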
Incremental Algorithm for CDC

Previous query Q1, constraint: qty > 5, with result R1 in memory (rules with BODY, HEAD, SUPP, CONF). Current query Q2, constraint: qty > 10.

Steps:
1. build the BHF from the rules in memory
2. read the DB
3. find the groups in which the new constraints are satisfied, containing items belonging to the BHF
4. update the support counters in the BHF
5. return the new result R2 (BODY, HEAD, SUPP, CONF)
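A hedged sketch of this loop, using plain dicts in place of the authors' BHF and toy data that is not from the paper: the candidates come from R1, and their counters are rebuilt from the groups that satisfy the new constraint.

```python
# Toy database: (group, item, qty). Not from the paper.
rows = [
    (1, "a", 12), (1, "f", 12), (1, "g", 12),
    (2, "a", 12), (2, "f", 12), (2, "g", 6),
    (3, "a", 12), (3, "f", 3),
]

candidates = [(("a",), ("f",))]  # body -> head pairs taken from R1

def recount(candidates, constraint):
    # Re-read the DB, keeping only rows that satisfy the new CDC.
    groups = {}
    for g, item, qty in rows:
        if constraint(qty):
            groups.setdefault(g, set()).add(item)
    # Update the support counters of each candidate rule.
    result = []
    for body, head in candidates:
        body_supp = sum(set(body) <= items for items in groups.values())
        rule_supp = sum(set(body) | set(head) <= items for items in groups.values())
        conf = rule_supp / body_supp if body_supp else 0.0
        result.append((body, head, rule_supp, conf))
    return result

R2 = recount(candidates, lambda q: q > 10)  # current query: qty > 10
print(R2)
```

Under qty > 10, group 3 loses item f, so the rule a → f drops to rule support 2 over body support 3: the candidate set came from R1, but the measures had to be recounted, which is exactly why CDC gives dominance rather than inclusion.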
Body-Head Forest (BHF)
- the body (head) tree contains itemsets which are candidates for the body (head) part of a rule
- an itemset is represented as a single path in the tree, and vice versa
- each path in the body (head) tree is associated with a counter representing the body (rule) support

Example: body path a with counter 4; head path f, g with counter 3.
Rule: a ⇒ f, g with rule support = 3 and confidence = 3/4.
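A simplified stand-in for the BHF (the real structure stores shared prefix trees; here each path is just a tuple key with a counter, which is enough to show how support and confidence are read off the counters):

```python
# Counter on a body path = support of the body itemset.
body_counters = {("a",): 4}
# Counter on a head path attached to a body = support of the whole rule.
head_counters = {(("a",), ("f", "g")): 3}

def rule_stats(body, head):
    rule_supp = head_counters[(body, head)]  # head-path counter: rule support
    conf = rule_supp / body_counters[body]   # divided by the body-path counter
    return rule_supp, conf

supp, conf = rule_stats(("a",), ("f", "g"))  # the rule a => f, g from the slide
print(supp, conf)
```

This reproduces the slide's numbers: rule support 3 and confidence 3/4, read directly from the two counters without re-scanning the database.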
Experiments (1): ID vs CD algorithm
- ID algorithm: execution time vs constraint selectivity (a); execution time vs volume of the previous result (b)
- CD algorithm: the same measures (c), (d)
Experiments (2): CARE vs Incremental
Execution time measured against:
- cardinality of the previous result
- support threshold
- selectivity of constraints
Conclusions and future work
- We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
- The first algorithm deals with item dependent constraints, the second with context dependent ones.
- We evaluated the incremental algorithms on a fairly large dataset; the results show that the approach drastically reduces execution time.
- An interesting direction for future research: the integration of condensed representations with these incremental techniques.
The end. Questions?