Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints
Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino
Outline
- Motivations
  - Knowledge Discovery from Databases (KDD), Inductive Databases
  - Constraint-Based Mining
- Incremental Constraint Evaluation
  - Association Rule Mining
  - Incremental Algorithms
  - Constraint properties
- Item Dependent Constraints (IDC), Context Dependent Constraints (CDC)
- Incremental Algorithms for IDC and CDC
- Performance results and Conclusions
Motivations: KDD process and Inductive Databases (IDB)
- The KDD process consists of the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
- Inductive Databases were proposed by Mannila and Imielinski [CACM'96] as a support for KDD
- KDD is an interactive and iterative process
- Inductive Databases contain both data and inductive generalizations (e.g. patterns, models) extracted from the data
- Users can query the inductive database with an advanced, ad hoc data mining query language (constraint-based queries)
Motivations: Constraint-Based Mining and Incrementality
Why constraints?
- They can be pushed into the pattern computation, pruning the search space
- They give the user a tool to express her interests (both in data and in knowledge)
In IDBs, constraint-based queries are very often a refinement of previous ones:
- explorative process
- reconciling background and extracted knowledge
Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK'99]
A Generic Mining Language
R = Q( ). A very generic constraint-based mining query requests:
- the extraction, from a source table T,
- from the groups of the database (grouping constraints G),
- of sets of items (itemsets), on some schema I,
- satisfying some user-defined constraints (mining constraints M);
- the number of such groups must be sufficient (user-defined statistical evaluation measures, such as support).
In our case R contains association rules.
An Example
R = Q(purchase, customer, product, price>100, support_count>=2)

The purchase source table:

transaction  customer  product       date     price  quantity
1            1001      hiking boots  12/7/98  140    1
1            1001      ski pants     12/7/98  180    1
3            1001      jacket        17/7/98  300    1
2            2256      col shirt     12/7/98   25    2
2            2256      ski pants     13/7/98  180    1
2            2256      jacket        13/7/98  300    1
4            3441      col shirt     13/7/98   25    3
5            3441      jacket        20/8/98  300    2
R, the result of the mining query:

itemset              support_count
{jacket}             3
{ski pants}          2
{jacket, ski pants}  2

body       head       confidence  frequency
ski pants  jacket     1           2/3
jacket     ski pants  2/3         2/3
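The example query above can be sketched in code. This is an illustrative reconstruction, not the authors' actual query engine: the function name `frequent_itemsets` and the data layout are assumptions, and the mining constraint (price > 100) is hard-coded for brevity.

```python
from itertools import combinations

# The 'purchase' source table from the slide:
# (transaction, customer, product, date, price, quantity)
purchase = [
    (1, 1001, "hiking boots", "12/7/98", 140, 1),
    (1, 1001, "ski pants",    "12/7/98", 180, 1),
    (3, 1001, "jacket",       "17/7/98", 300, 1),
    (2, 2256, "col shirt",    "12/7/98",  25, 2),
    (2, 2256, "ski pants",    "13/7/98", 180, 1),
    (2, 2256, "jacket",       "13/7/98", 300, 1),
    (4, 3441, "col shirt",    "13/7/98",  25, 3),
    (5, 3441, "jacket",       "20/8/98", 300, 2),
]

def frequent_itemsets(table, min_support_count):
    # Group by customer, keeping only items that satisfy the
    # mining constraint price > 100.
    groups = {}
    for _, cust, product, _, price, _ in table:
        if price > 100:
            groups.setdefault(cust, set()).add(product)
    # Count, for every candidate itemset, how many groups contain it.
    counts = {}
    for items in groups.values():
        for k in range(1, len(items) + 1):
            for subset in combinations(sorted(items), k):
                counts[subset] = counts.get(subset, 0) + 1
    # Keep only itemsets whose support count reaches the threshold.
    return {s: c for s, c in counts.items() if c >= min_support_count}

R = frequent_itemsets(purchase, 2)
print(R)
```

Running this reproduces the slide's result: {jacket} with support 3, {ski pants} and {jacket, ski pants} with support 2; {hiking boots} is pruned because only one customer group contains it.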
Incremental Algorithms
- We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results
- We identified two classes of query constraints: item dependent (IDC) and context dependent (CDC)
- We propose two newly developed incremental algorithms which allow the exploitation of past results in the two cases (IDC and CDC)
Relationships between two queries
- Query equivalence: R1 = R2, so no computation is needed [FQAS'04]
- Query containment [this paper]:
  - Inclusion: R2 ⊆ R1 and the common elements have the same statistical measures (R2 = C(R1))
  - Dominance: R2 ⊆ R1 but the common elements do not have the same statistical measures (R2 ≠ C(R1))
We can speed up the execution of a new query using the results of previous queries. Which previous queries? How can we recognize inclusion or dominance between two constraint-based queries?
IDC vs CDC

transaction  customer  product       date     price  quantity
1            1001      ski pants     12/7/98  140    1
1            1001      hiking boots  12/7/98  180    1
2            1001      jacket        17/7/98  300    2
2            2256      col shirt     12/7/98   25    2
2            2256      ski pants     13/7/98  140    2
3            2256      jacket        13/7/98  300    1
4            2256      col shirt     13/7/98   25    3
4            2256      jacket        20/8/98  300    2

CDC: qty > 1
Item Dependent Constraints (IDC):
- are functionally dependent on the item extracted
- are satisfied for a given itemset either for all the groups in the database or for none
- if an itemset is common to R1 and R2, it has the same support: inclusion

Context Dependent Constraints (CDC):
- depend on the transactions in the database
- might be satisfied for a given itemset only for some groups in the database
- an itemset common to R1 and R2 might not have the same support: dominance
IDC: price > 150
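The distinction can be checked numerically on the table above. In this hypothetical sketch (the helper `support` and the grouping by transaction are assumptions for illustration), an IDC leaves the support of a surviving itemset unchanged, while a CDC can change it:

```python
# Rows of the slide's second table: (transaction, customer, product, price, quantity)
rows = [
    (1, 1001, "ski pants",    140, 1),
    (1, 1001, "hiking boots", 180, 1),
    (2, 1001, "jacket",       300, 2),
    (2, 2256, "col shirt",     25, 2),
    (2, 2256, "ski pants",    140, 2),
    (3, 2256, "jacket",       300, 1),
    (4, 2256, "col shirt",     25, 3),
    (4, 2256, "jacket",       300, 2),
]

def support(item, pred):
    # Number of transactions (groups) containing `item` via a row satisfying pred.
    return len({t for (t, c, p, price, qty) in rows if p == item and pred(price, qty)})

base = support("jacket", lambda price, qty: True)         # no extra constraint
idc  = support("jacket", lambda price, qty: price > 150)  # IDC: price > 150
cdc  = support("jacket", lambda price, qty: qty > 1)      # CDC: qty > 1

print(base, idc, cdc)
```

Price is functionally dependent on the item, so the IDC either keeps all of jacket's transactions or none of them (support stays 3: inclusion). Quantity varies per transaction, so the CDC drops transaction 3 but not 2 and 4, changing jacket's support (dominance).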
Incremental Algorithm for IDC

Previous query Q1, constraint: price > 5. Current query Q2, constraint: price > 10.

Item Domain Table:

item  price  category
A     12     hi-tech
B     14     hi-tech
C     8      housing

Item C belongs to a row that does not satisfy the new IDC constraint: fail.

Rules in memory (R1):

BODY  HEAD  SUPP  CONF
A     B     2     1
A     C     2     …
…     …     …     …

Delete from R1 all rules containing item C. The result is R2 (R2 = P(R1)):

BODY  HEAD  SUPP  CONF
A     B     2     1
…     …     …     …
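The IDC step can be sketched as a pure in-memory filter over R1, with no database scan. The names `item_domain` and `incremental_idc` are illustrative, not the authors' code, and the confidence of the A → C rule is left as `None` because the slide does not show it:

```python
# Item domain table from the slide: price is functionally dependent on the item.
item_domain = {"A": {"price": 12, "category": "hi-tech"},
               "B": {"price": 14, "category": "hi-tech"},
               "C": {"price": 8,  "category": "housing"}}

R1 = [  # (body, head, supp, conf) -- previous result, constraint price > 5
    (("A",), ("B",), 2, 1.0),
    (("A",), ("C",), 2, None),
]

def incremental_idc(R1, constraint):
    # Items whose domain row fails the new IDC constraint.
    failing = {i for i, row in item_domain.items() if not constraint(row)}
    # Keep only rules that mention no failing item (inclusion: measures unchanged).
    return [r for r in R1 if not (set(r[0]) | set(r[1])) & failing]

R2 = incremental_idc(R1, lambda row: row["price"] > 10)  # new, tighter IDC
print(R2)
```

Only item C (price 8) fails the new constraint, so the rule A → C is deleted and A → B survives with its support and confidence untouched.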
Incremental Algorithm for CDC

Previous query Q1, constraint: qty > 5, with result R1 in memory (rules with BODY, HEAD, SUPP, CONF). Current query Q2, constraint: qty > 10.

Steps:
1. build the BHF from the rules in memory
2. read the DB
3. find the groups in which the new constraints are satisfied, containing items belonging to the BHF
4. update the support counters in the BHF
5. return the new result R2 (BODY, HEAD, SUPP, CONF)
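A hedged sketch of this loop, using plain dicts in place of the authors' BHF and toy data that is not from the paper: the candidates come from R1, and their counters are rebuilt from the groups that satisfy the new constraint.

```python
# Toy database: (group, item, qty). Not from the paper.
rows = [
    (1, "a", 12), (1, "f", 12), (1, "g", 12),
    (2, "a", 12), (2, "f", 12), (2, "g", 6),
    (3, "a", 12), (3, "f", 3),
]

candidates = [(("a",), ("f",))]  # body -> head pairs taken from R1

def recount(candidates, constraint):
    # Re-read the DB, keeping only rows that satisfy the new CDC.
    groups = {}
    for g, item, qty in rows:
        if constraint(qty):
            groups.setdefault(g, set()).add(item)
    # Update the support counters of each candidate rule.
    result = []
    for body, head in candidates:
        body_supp = sum(set(body) <= items for items in groups.values())
        rule_supp = sum(set(body) | set(head) <= items for items in groups.values())
        conf = rule_supp / body_supp if body_supp else 0.0
        result.append((body, head, rule_supp, conf))
    return result

R2 = recount(candidates, lambda q: q > 10)  # current query: qty > 10
print(R2)
```

Under qty > 10, group 3 loses item f, so the rule a → f drops to rule support 2 over body support 3: the candidate set came from R1, but the measures had to be recounted, which is exactly why CDC gives dominance rather than inclusion.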
Body-Head Forest (BHF)
- the body (head) tree contains itemsets which are candidates for the body (head) part of a rule
- an itemset is represented as a single path in the tree, and vice versa
- each path in the body (head) tree is associated with a counter representing the body (rule) support

Example: body path a with counter 4; head path f, g with counter 3.
Rule: a ⇒ f, g with rule support = 3 and confidence = 3/4.
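A simplified stand-in for the BHF (the real structure stores shared prefix trees; here each path is just a tuple key with a counter, which is enough to show how support and confidence are read off the counters):

```python
# Counter on a body path = support of the body itemset.
body_counters = {("a",): 4}
# Counter on a head path attached to a body = support of the whole rule.
head_counters = {(("a",), ("f", "g")): 3}

def rule_stats(body, head):
    rule_supp = head_counters[(body, head)]  # head-path counter: rule support
    conf = rule_supp / body_counters[body]   # divided by the body-path counter
    return rule_supp, conf

supp, conf = rule_stats(("a",), ("f", "g"))  # the rule a => f, g from the slide
print(supp, conf)
```

This reproduces the slide's numbers: rule support 3 and confidence 3/4, read directly from the two counters without re-scanning the database.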
Experiments (1): ID vs CD algorithm
- ID algorithm: execution time vs constraint selectivity (a); execution time vs volume of the previous result (b)
- CD algorithm: the same measures (c), (d)
Experiments (2): CARE vs Incremental
Execution time measured against:
- cardinality of the previous result
- support threshold
- selectivity of constraints
Conclusions and future work
- We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
- The first algorithm deals with item dependent constraints, the second with context dependent ones.
- We evaluated the incremental algorithms on a fairly large dataset; the results show that the approach drastically reduces execution time.
- An interesting direction for future research: the integration of condensed representations with these incremental techniques.
The end. Questions?