Upload
nguyenlien
View
213
Download
0
Embed Size (px)
Citation preview
Outline for this session
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
2 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Outline
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
3 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Two applicative problems
Log analysis
A log contains Web pages accessed by every visitor during asession. What are the pages that are frequently loaded during asame session?
Basket analysis
A shop knows what products were bought together. Which onesare frequently bought together?
4 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Two applicative problems
Log analysis
A log contains Web pages accessed by every visitor during asession. What are the pages that are frequently loaded during asame session?
Basket analysis
A shop knows what products were bought together. Which onesare frequently bought together?
4 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Itemset: definition
Definition
Given a set of attributes A, an itemset X is a subset of attributes,i. e., X ⊆ A.
Input:
a1 a2 . . . ano1 d1,1 d1,2 . . . d1,n
o2 d2,1 d2,2 . . . d2,n...
......
. . ....
om dm,1 dm,2 . . . dm,n
where di ,j ∈ {true,false}
Question
How many itemsets are there?
2|A|.
5 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Itemset: definition
Definition
Given a set of attributes A, an itemset X is a subset of attributes,i. e., X ⊆ A.
Input:
a1 a2 . . . ano1 d1,1 d1,2 . . . d1,n
o2 d2,1 d2,2 . . . d2,n...
......
. . ....
om dm,1 dm,2 . . . dm,n
where di ,j ∈ {true,false}
Question
How many itemsets are there?2|A|.
5 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Support: definition
Definition
Given D, a set O of objects described with Boolean attributes inA, the support of an itemset X ⊆ A in D is:
{o ∈ O | ∀x ∈ X , do,x = true}
6 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency: definition
Definition (absolute frequency)
Given D, a set O of objects described with Boolean attributes inA, the absolute frequency of an itemset X ⊆ A in D is:
|{o ∈ O | ∀x ∈ X , do,x = true}|
The relative frequency of the itemset X is the estimated jointprobability of having all attributes in X .
7 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency: definition
Definition (relative frequency)
Given D, a set O of objects described with Boolean attributes inA, the absolute frequency of an itemset X ⊆ A in D is:
|{o ∈ O | ∀x ∈ X , do,x = true}||O|
The relative frequency of the itemset X is the estimated jointprobability of having all attributes in X .
7 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency: definition
Definition (relative frequency)
Given D, a set O of objects described with Boolean attributes inA, the absolute frequency of an itemset X ⊆ A in D is:
|{o ∈ O | ∀x ∈ X , do,x = true}||O|
The relative frequency of the itemset X is the estimated jointprobability of having all attributes in X .
7 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequent itemset mining
Definition
Given D, a set of objects described with Boolean attributes in Aand µ ∈ N, listing every itemset whose frequency is at least µ.
Input:
a1 a2 . . . ano1 d1,1 d1,2 . . . d1,n
o2 d2,1 d2,2 . . . d2,n...
......
. . ....
om dm,1 dm,2 . . . dm,n
where di ,j ∈ {true,false}
and a minimal frequency µ ∈ N.
8 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequent itemset mining
Definition
Given D, a set of objects described with Boolean attributes in Aand µ ∈ N, listing every itemset whose frequency is at least µ.
Output: every X ⊆ A such that at least µ objects have allattributes in X .
8 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequent itemset mining: illustration
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
9 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequent itemset mining: illustration
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The frequent itemsets are ∅ (4), {a1} (2),{a2} (3), {a3} (2) and {a1, a2} (2).
9 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Completeness
Learning the parameters of a clustering or a classification model isglobally modeling the data. Since every object influences theoutput, the optimal parameters are usually unreachable within areasonable time and those tasks are solved in an approximate way.
In contrast, local patterns, such as itemsets, are “anomalies” in thedata. Often, those anomalies can be completely listed.
Anyway, in the worst case, listing the frequent itemsets takesO(2|A|) time and counting them is #P-complete.
10 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Completeness
Learning the parameters of a clustering or a classification model isglobally modeling the data. Since every object influences theoutput, the optimal parameters are usually unreachable within areasonable time and those tasks are solved in an approximate way.
In contrast, local patterns, such as itemsets, are “anomalies” in thedata. Often, those anomalies can be completely listed.
Anyway, in the worst case, listing the frequent itemsets takesO(2|A|) time and counting them is #P-complete.
10 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Completeness
Learning the parameters of a clustering or a classification model isglobally modeling the data. Since every object influences theoutput, the optimal parameters are usually unreachable within areasonable time and those tasks are solved in an approximate way.
In contrast, local patterns, such as itemsets, are “anomalies” in thedata. Often, those anomalies can be completely listed.
Anyway, in the worst case, listing the frequent itemsets takesO(2|A|) time and counting them is #P-complete.
10 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying data:
{d ∈ D | q(d ,D)}where:
D is a dataset (tuples),
q is a query.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying patterns:
{X ∈ P | Q(X ,D)}where:
D is the dataset,
P is the pattern space,
Q is an inductive query.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying the frequent itemsets:
{X ∈ P | Q(X ,D)}where:
D is the dataset,
P is the pattern space,
Q is an inductive query.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying the frequent itemsets:
{X ∈ P | Q(X ,D)}where:
D is a set O of objects described with Boolean attributes in A,
P is the pattern space,
Q is an inductive query.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying the frequent itemsets:
{X ∈ P | Q(X ,D)}where:
D is a set O of objects described with Boolean attributes in A,
P is 2A,
Q is an inductive query.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying the frequent itemsets:
{X ∈ P | Q(X ,D)}where:
D is a set O of objects described with Boolean attributes in A,
P is 2A,
Q is (X ,D) 7→ |{o ∈ O | ∀x ∈ X , do,x = true}| ≥ µ.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Inductive database vision
Querying the frequent itemsets:
{X ∈ P | Q(X ,D)}where:
D is a set O of objects described with Boolean attributes in A,
P is 2A,
Q is (X ,D) 7→ f (X ,D) ≥ µ.
11 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency constraint
The frequency constraint is a joint probability. It:
filters the relevant itemsets, defined as the subsets of attributeswhose joint probabilities are high enough;
decreases the (exponential) size of the output (a frequent itemsetremains frequent if the minimal frequency is lowered);
decreases the (exponential) running time.
In practice, high minimal frequencies are to be tested first.
13 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency constraint
The frequency constraint is a joint probability. It:
filters the relevant itemsets, defined as the subsets of attributeswhose joint probabilities are high enough;
decreases the (exponential) size of the output (a frequent itemsetremains frequent if the minimal frequency is lowered);
decreases the (exponential) running time.
In practice, high minimal frequencies are to be tested first.
13 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency constraint
The frequency constraint is a joint probability. It:
filters the relevant itemsets, defined as the subsets of attributeswhose joint probabilities are high enough;
decreases the (exponential) size of the output (a frequent itemsetremains frequent if the minimal frequency is lowered);
decreases the (exponential) running time.
In practice, high minimal frequencies are to be tested first.
13 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency constraint
The frequency constraint is a joint probability. It:
filters the relevant itemsets, defined as the subsets of attributeswhose joint probabilities are high enough;
decreases the (exponential) size of the output (a frequent itemsetremains frequent if the minimal frequency is lowered);
decreases the (exponential) running time.
In practice, high minimal frequencies are to be tested first.
13 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Frequent Itemset Mining
Frequency constraint
The frequency constraint is a joint probability. It:
filters the relevant itemsets, defined as the subsets of attributeswhose joint probabilities are high enough;
decreases the (exponential) size of the output (a frequent itemsetremains frequent if the minimal frequency is lowered);
decreases the (exponential) running time.
In practice, high minimal frequencies are to be tested first.
13 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Outline
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
14 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Redundancy of information
Theorem
Every subset of a frequent itemset is a frequent itemset.
To remove some redundant itemsets, the frequent itemsets can beconstrained to be:
maximal the largest (w.r.t. ⊆) frequent itemsets;
closed the largest (w.r.t. ⊆) frequent itemsets among thosehaving a same support.
Other condensed representations of the frequent itemsets exist: thefree itemsets, the non-derivable itemsets, etc.
15 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Redundancy of information
Theorem
Every subset of a frequent itemset is a frequent itemset.
To remove some redundant itemsets, the frequent itemsets can beconstrained to be:
maximal the largest (w.r.t. ⊆) frequent itemsets;
closed the largest (w.r.t. ⊆) frequent itemsets among thosehaving a same support.
Other condensed representations of the frequent itemsets exist: thefree itemsets, the non-derivable itemsets, etc.
15 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Redundancy of information
Theorem
Every subset of a frequent itemset is a frequent itemset.
To remove some redundant itemsets, the frequent itemsets can beconstrained to be:
maximal the largest (w.r.t. ⊆) frequent itemsets;
closed the largest (w.r.t. ⊆) frequent itemsets among thosehaving a same support.
Other condensed representations of the frequent itemsets exist: thefree itemsets, the non-derivable itemsets, etc.
15 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality and closedness
Definition
Given the set F of frequent itemsets, X ∈ F is maximal if andonly if ∀Y ∈ F ,X 6⊂ Y .
Definition
Given the set F of frequent itemsets, X ∈ F is closed if and onlyif ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D).
16 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality and closedness
Definition
Given the set F of frequent itemsets, X ∈ F is maximal if andonly if ∀Y ∈ F ,X 6⊂ Y .
Definition
Given the set F of frequent itemsets, X ∈ F is closed if and onlyif ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D).
16 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Support equivalence classes
17 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Inclusions
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F , (X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D))};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
Theorem
M⊆ C ⊆ F .
18 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Inclusions
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F , (X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D))};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
Theorem
M⊆ C ⊆ F .
18 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent itemset mining: illustration
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The frequent itemsets are ∅ (4), {a1} (2),{a2} (3), {a3} (2) and {a1, a2} (2);
Among those frequent itemsets, the closedones are in red.
Among the closed itemsets, the maximalones are in blue.
19 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent itemset mining: illustration
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The frequent itemsets are ∅ (4), {a1} (2),{a2} (3), {a3} (2) and {a1, a2} (2);
Among those frequent itemsets, the closedones are in red.
Among the closed itemsets, the maximalones are in blue.
19 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent itemset mining: illustration
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The frequent itemsets are ∅ (4), {a1} (2),{a2} (3), {a3} (2) and {a1, a2} (2);
Among those frequent itemsets, the closedones are in red.
Among the closed itemsets, the maximalones are in blue.
19 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
M is a condensed representation of F
M⊆ F ;
.
20 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
M is a condensed representation of FM⊆ F ;
.
20 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
M is a condensed representation of FM⊆ F ;
F can be expressed as a function of M only.
20 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximality: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
M = {X ∈ F | ∀Y ∈ F ,X 6⊂ Y }.
M is a condensed representation of FM⊆ F ;
F = ∪X∈M2X .
20 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximal itemsets and frequencies
Theorem
Given a minimal frequency µ ∈ N, the related set M ⊆ 2A ofmaximal itemsets in a dataset D, their frequencies, and X ⊆ A:
f (X ,D) < µ if ∀M ∈M, X 6⊆ M;
f (X ,D) ≥ max{M∈M|X⊆M}
f (M,D) otherwise.
The maximal itemsets and their frequencies carry little informationabout the frequencies of the other frequent itemsets: only lowerbounds, usually close to the minimal frequency.
21 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Maximal itemsets and frequencies
Theorem
Given a minimal frequency µ ∈ N, the related set M ⊆ 2A ofmaximal itemsets in a dataset D, their frequencies, and X ⊆ A:
f (X ,D) < µ if ∀M ∈M, X 6⊆ M;
f (X ,D) ≥ max{M∈M|X⊆M}
f (M,D) otherwise.
The maximal itemsets and their frequencies carry little informationabout the frequencies of the other frequent itemsets: only lowerbounds, usually close to the minimal frequency.
21 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving a lower bound of a frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The maximal itemsets are {a3} (2) and{a1, a2} (2).
From that information, it canonly be proved that f ({a2},D) ≥ 2.
The more frequent an itemset, the farther the lower bound of itsfrequency derived from the maximal itemsets and their frequencies.
22 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving a lower bound of a frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The maximal itemsets are {a3} (2) and{a1, a2} (2). From that information, it canonly be proved that f ({a2},D) ≥ 2.
The more frequent an itemset, the farther the lower bound of itsfrequency derived from the maximal itemsets and their frequencies.
22 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving a lower bound of a frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The maximal itemsets are {a3} (2) and{a1, a2} (2). From that information, it canonly be proved that f ({a2},D) ≥ 2.
The more frequent an itemset, the farther the lower bound of itsfrequency derived from the maximal itemsets and their frequencies.
22 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
The frequency frontier (µ = 3)
23 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closedness: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D)}.
C is a condensed representation of F
C ⊆ F ;
.
24 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closedness: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D)}.
C is a condensed representation of FC ⊆ F ;
.
24 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closedness: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D)}.
C is a condensed representation of FC ⊆ F ;
F can be expressed as a function of C only.
24 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closedness: a condensed representation
Given D, a set of objects described with Boolean attributes in A,and a minimal frequency µ ∈ N, let:
F = {X ⊆ A | f (X ,D) ≥ µ};
C = {X ∈ F | ∀Y ∈ F ,X ⊂ Y ⇒ f (X ,D) 6= f (Y ,D)}.
C is a condensed representation of FC ⊆ F ;
F = ∪X∈C2X .
24 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closed itemsets and frequencies
Theorem
Given a minimal frequency µ ∈ N, the related set C ⊆ 2A ofclosed itemsets in a dataset D, their frequencies, and X ⊆ A:
f (X ,D) < µ if ∀C ∈ C, X 6⊆ C;
f (X ,D) = max{C∈C|X⊆C}
f (C ,D) otherwise.
The closed itemsets and their frequencies carry all the informationabout the frequencies of the other frequent itemsets.
25 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Closed itemsets and frequencies
Theorem
Given a minimal frequency µ ∈ N, the related set C ⊆ 2A ofclosed itemsets in a dataset D, their frequencies, and X ⊆ A:
f (X ,D) < µ if ∀C ∈ C, X 6⊆ C;
f (X ,D) = max{C∈C|X⊆C}
f (C ,D) otherwise.
The closed itemsets and their frequencies carry all the informationabout the frequencies of the other frequent itemsets.
25 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving the frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The closed itemsets are ∅ (4), {a2} (3),{a3} (2) and {a1, a2} (2).
From thatinformation, it can be proved thatf ({a1},D) = 2.
The frequency of a frequent itemset is a valuable piece ofinformation to assess its quality.
26 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving the frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The closed itemsets are ∅ (4), {a2} (3),{a3} (2) and {a1, a2} (2). From thatinformation, it can be proved thatf ({a1},D) = 2.
The frequency of a frequent itemset is a valuable piece ofinformation to assess its quality.
26 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Retrieving the frequency
Let 2 the minimal absolute frequency or, equivalently, 50% theminimal relative frequency.
a1 a2 a3
o1 × × ×o2 × ×o3 ×o4 ×
The closed itemsets are ∅ (4), {a2} (3),{a3} (2) and {a1, a2} (2). From thatinformation, it can be proved thatf ({a1},D) = 2.
The frequency of a frequent itemset is a valuable piece ofinformation to assess its quality.
26 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent vs. maximal vs. closed
Maximality vs. closedness
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
Every unclosed itemset is less informative than the closed itemsetwith the same support.
Conclusion
If the itemsets are to be directly interpreted, only list closed item-sets: unclosed itemsets do not bring any additional information.
27 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent vs. maximal vs. closed
Maximality vs. closedness
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
Every unclosed itemset is less informative than the closed itemsetwith the same support.
Conclusion
If the itemsets are to be directly interpreted, only list closed item-sets: unclosed itemsets do not bring any additional information.
27 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Frequent vs. maximal vs. closed
Maximality vs. closedness
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
Every unclosed itemset is less informative than the closed itemsetwith the same support.
Conclusion
If the itemsets are to be directly interpreted, only list closed item-sets: unclosed itemsets do not bring any additional information.
27 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Closed and Maximal Itemset Mining
Summary
Frequent itemsets are subsets of Boolean attributes that enoughobjects share;
In the worst case, the number of frequent itemsets is exponential inthe number of attributes, and so is the time to list them;
Maximality condenses a lot the frequent itemsets but theinformation about their frequencies is lost;
The maximal itemset are the largest frequent itemsets;
Closedness condenses little the frequent itemsets but theinformation about their frequencies is kept.
The closed itemsets are the largest frequent itemsets among thosehaving a same support.
28 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Outline
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
29 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Assessing a collection of itemsets
Itemset mining is an unsupervised task: it is about discoveringhidden correlations between any number of Boolean attributes. Asa consequence, there are only intrinsic measures of the quality ofan itemset or a collection of itemsets.
However itemsets can be mined in a supervised context too. Forexample, given classified objects, how much an itemsetdiscriminate a particular class can be a quality measure. There aretens of ways to define the discriminating power of an itemset.
30 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Assessing a collection of itemsets
Itemset mining is an unsupervised task: it is about discoveringhidden correlations between any number of Boolean attributes. Asa consequence, there are only intrinsic measures of the quality ofan itemset or a collection of itemsets.
However itemsets can be mined in a supervised context too. Forexample, given classified objects, how much an itemsetdiscriminate a particular class can be a quality measure. There aretens of ways to define the discriminating power of an itemset.
30 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Lift
The lift quantifies how expected (lift close to 1) or unexpected (liftsignificantly different from 1) the frequency of an itemset given thefrequencies of its individual attributes. It is the ratio between therelative frequency of the itemset and its expected relative frequencyunder the assumption that its attributes are independent.
Definition
Given D, a set O of objects described with Boolean attributes in
A, the lift of an itemset X ⊆ A is |O||X |−1f (X ,D)∏a∈X f ({a},D) .
Frequent itemset mining can only discover high-lift itemsetsindicating positively correlated attributes.
31 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Lift
The lift quantifies how expected (lift close to 1) or unexpected (liftsignificantly different from 1) the frequency of an itemset given thefrequencies of its individual attributes. It is the ratio between therelative frequency of the itemset and its expected relative frequencyunder the assumption that its attributes are independent.
Definition
Given D, a set O of objects described with Boolean attributes in
A, the lift of an itemset X ⊆ A is |O||X |−1f (X ,D)∏a∈X f ({a},D) .
Frequent itemset mining can only discover high-lift itemsetsindicating positively correlated attributes.
31 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Lift
The lift quantifies how expected (lift close to 1) or unexpected (liftsignificantly different from 1) the frequency of an itemset given thefrequencies of its individual attributes. It is the ratio between therelative frequency of the itemset and its expected relative frequencyunder the assumption that its attributes are independent.
Definition
Given D, a set O of objects described with Boolean attributes in
A, the lift of an itemset X ⊆ A is |O||X |−1f (X ,D)∏a∈X f ({a},D) .
Frequent itemset mining can only discover high-lift itemsetsindicating positively correlated attributes.
31 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Correlation
Frequent itemsets (especially under a minimal lift constraint)support the discovery of correlations between any number ofBoolean attributes.
32 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Statistical measures
Swap randomization randomizes the dataset D while preservingthe numbers of true values per object and perattribute. The itemsets in D that are significantly lessfrequent in the randomized dataset are interesting.
Compression ratio is how informative an itemset X (what increaseswith |X |f (X ,D) and with how infrequent theinvolved objects/attributes) divided by its descriptionlength (what increases with its |X |+ f (X ,D)).
33 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Assessing Itemsets
Statistical measures
Swap randomization randomizes the dataset D while preservingthe numbers of true values per object and perattribute. The itemsets in D that are significantly lessfrequent in the randomized dataset are interesting.
Compression ratio is how informative an itemset X (what increaseswith |X |f (X ,D) and with how infrequent theinvolved objects/attributes) divided by its descriptionlength (what increases with its |X |+ f (X ,D)).
33 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Outline
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
34 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Constraining the itemsets
To filter the most relevant itemsets and, often, fasten theirextraction, some algorithms allow to specify additional constraintsthe output itemsets must satisfy.
High utility itemset mining
Given D, a set of objects described with Boolean attributes in A,and a utility function u : A → R, listing every itemset X ⊆ Awith f (X ,D)
∑x∈X u(x) exceeding a given threshold.
Getting closer to querying inductive databases.
35 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Constraining the itemsets
To filter the most relevant itemsets and, often, fasten theirextraction, some algorithms allow to specify additional constraintsthe output itemsets must satisfy.
High utility itemset mining
Given D, a set of objects described with Boolean attributes in A,and a utility function u : A → R, listing every itemset X ⊆ Awith f (X ,D)
∑x∈X u(x) exceeding a given threshold.
Getting closer to querying inductive databases.
35 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Constraining the itemsets
To filter the most relevant itemsets and, often, fasten theirextraction, some algorithms allow to specify additional constraintsthe output itemsets must satisfy.
High utility itemset mining
Given D, a set of objects described with Boolean attributes in A,and a utility function u : A → R, listing every itemset X ⊆ Awith f (X ,D)
∑x∈X u(x) exceeding a given threshold.
Getting closer to querying inductive databases.
35 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Mining cliques
a
b
c
d
a b c d
a × × ×b × ×c × × ×d × × × ×
(Maximal) cliques are (closed) itemsets whose support include thesame vertices (and possibly some more).
36 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Mining cliques
a
b
c
d
a b c d
a × × ×b × ×c × × ×d × × × ×
(Maximal) cliques are (closed) itemsets whose support include thesame vertices (and possibly some more).
36 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Mining cliques
a
b
c
d
a b c d
a × × ×b × ×c × × ×d × × × ×
(Maximal) cliques are (closed) itemsets whose support include thesame vertices (and possibly some more).
36 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Noise tolerance
True values that are erroneously false in the dataset fragment the“real” itemsets and their supports.
Noise-tolerant itemsets accept, in their supports, objects having afew false values for some attributes in the itemset. To be sensible,noise tolerance should be per object and per attribute.
The (noise-tolerant) itemsets can be heuristically grown oragglomerated in a post-processing step.
37 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Noise tolerance
True values that are erroneously false in the dataset fragment the“real” itemsets and their supports.
Noise-tolerant itemsets accept, in their supports, objects having afew false values for some attributes in the itemset. To be sensible,noise tolerance should be per object and per attribute.
The (noise-tolerant) itemsets can be heuristically grown oragglomerated in a post-processing step.
37 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Noise tolerance
True values that are erroneously false in the dataset fragment the“real” itemsets and their supports.
Noise-tolerant itemsets accept, in their supports, objects having afew false values for some attributes in the itemset. To be sensible,noise tolerance should be per object and per attribute.
The (noise-tolerant) itemsets can be heuristically grown oragglomerated in a post-processing step.
37 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Collections of more complex objects
Frequent (closed/maximal) patterns are easily defined incollections of words, sequences of itemsets or (labeled) graphs.
Itemsets are easily generalized to n-ary (rather than binary)relations too.
38 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Collections of more complex objects
Frequent (closed/maximal) patterns are easily defined incollections of words, sequences of itemsets or (labeled) graphs.
Itemsets are easily generalized to n-ary (rather than binary)relations too.
38 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Large individual objects
Frequent patterns can be defined as repetitions in a large word, alarge sequence of itemsets or a large (labeled) graph.
Different definitions of the frequency have been proposed. Forsome of them, sub-patterns of frequent patterns can be infrequentand searching for closed/maximal patterns makes little sense.
39 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Beyond Frequent Itemsets
Large individual objects
Frequent patterns can be defined as repetitions in a large word, alarge sequence of itemsets or a large (labeled) graph.
Different definitions of the frequency have been proposed. Forsome of them, sub-patterns of frequent patterns can be infrequentand searching for closed/maximal patterns makes little sense.
39 / 41Loıc Cerf Mineracao de Dados Aplicada
N
Case study
Outline
1 Frequent Itemset Mining
2 Closed and Maximal Itemset Mining
3 Assessing Itemsets
4 Beyond Frequent Itemsets
5 Case study
40 / 41Loıc Cerf Mineracao de Dados Aplicada
N