Mineração de Dados Aplicada

Itemset Mining

Loïc Cerf

October 11th, 2017

DCC – ICEx – UFMG

Outline for this session

1 Frequent Itemset Mining

2 Closed and Maximal Itemset Mining

3 Assessing Itemsets

4 Beyond Frequent Itemsets

5 Case study


Frequent Itemset Mining


Two applicative problems

Log analysis

A log contains the Web pages accessed by every visitor during a session. Which pages are frequently loaded during the same session?

Basket analysis

A shop knows which products were bought together. Which ones are frequently bought together?


Itemset: definition

Definition

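Given a set of attributes $A$, an itemset $X$ is a subset of attributes, i.e., $X \subseteq A$.

Input:

$$
\begin{array}{c|cccc}
 & a_1 & a_2 & \cdots & a_n \\ \hline
o_1 & d_{1,1} & d_{1,2} & \cdots & d_{1,n} \\
o_2 & d_{2,1} & d_{2,2} & \cdots & d_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
o_m & d_{m,1} & d_{m,2} & \cdots & d_{m,n}
\end{array}
$$

where $d_{i,j} \in \{\text{true}, \text{false}\}$.

Question

How many itemsets are there?

$2^{|A|}$.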
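The count can be checked by enumerating every subset of a small attribute set. The sketch below is only an illustration; the function name `all_itemsets` and the three attribute names are assumptions, not part of the slides.

```python
from itertools import combinations


def all_itemsets(attributes):
    """Yield every subset of `attributes` as a frozenset, smallest first."""
    items = sorted(attributes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# Hypothetical attribute set with |A| = 3.
attributes = {"a1", "a2", "a3"}
itemsets = list(all_itemsets(attributes))
print(len(itemsets), 2 ** len(attributes))
```

The empty itemset is one of the enumerated subsets, which is why it shows up among the frequent itemsets later on.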


Support: definition

Definition

Given $D$, a set $O$ of objects described with Boolean attributes in $A$, the support of an itemset $X \subseteq A$ in $D$ is:

$$\{o \in O \mid \forall x \in X, d_{o,x} = \text{true}\}$$


Frequency: definition

Definition (absolute frequency)

Given $D$, a set $O$ of objects described with Boolean attributes in $A$, the absolute frequency of an itemset $X \subseteq A$ in $D$ is:

$$|\{o \in O \mid \forall x \in X, d_{o,x} = \text{true}\}|$$


Definition (relative frequency)

Given $D$, a set $O$ of objects described with Boolean attributes in $A$, the relative frequency of an itemset $X \subseteq A$ in $D$ is:

$$\frac{|\{o \in O \mid \forall x \in X, d_{o,x} = \text{true}\}|}{|O|}$$

The relative frequency of the itemset $X$ is the estimated joint probability of having all attributes in $X$.
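The three definitions (support, absolute frequency, relative frequency) translate almost literally into code. The following is a minimal sketch: the dataset is represented as a dictionary mapping each object to the set of attributes that are true for it, and the basket data and function names are hypothetical.

```python
def support(itemset, dataset):
    """Objects for which every attribute of `itemset` is true."""
    return {o for o, attrs in dataset.items() if set(itemset) <= attrs}


def absolute_frequency(itemset, dataset):
    """Size of the support."""
    return len(support(itemset, dataset))


def relative_frequency(itemset, dataset):
    """Absolute frequency divided by the number of objects |O|."""
    return absolute_frequency(itemset, dataset) / len(dataset)


# Hypothetical baskets: object -> attributes that are true.
baskets = {
    "b1": {"bread", "butter", "milk"},
    "b2": {"bread", "butter"},
    "b3": {"milk"},
}

print(support({"bread", "butter"}, baskets))
print(absolute_frequency({"bread", "butter"}, baskets))
print(relative_frequency({"bread", "butter"}, baskets))
```

Note that the empty itemset is supported by every object, so its absolute frequency is always $|O|$.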


Frequent itemset mining

Definition

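Given $D$, a set of objects described with Boolean attributes in $A$, and $\mu \in \mathbb{N}$, frequent itemset mining consists in listing every itemset whose frequency is at least $\mu$.

Input:

$$
\begin{array}{c|cccc}
 & a_1 & a_2 & \cdots & a_n \\ \hline
o_1 & d_{1,1} & d_{1,2} & \cdots & d_{1,n} \\
o_2 & d_{2,1} & d_{2,2} & \cdots & d_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
o_m & d_{m,1} & d_{m,2} & \cdots & d_{m,n}
\end{array}
$$

where $d_{i,j} \in \{\text{true}, \text{false}\}$

and a minimal frequency $\mu \in \mathbb{N}$.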


Output: every $X \subseteq A$ such that at least $\mu$ objects have all attributes in $X$.


Frequent itemset mining: illustration

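Let 2 be the minimal absolute frequency or, equivalently, 50% the minimal relative frequency.

$$
\begin{array}{c|ccc}
 & a_1 & a_2 & a_3 \\ \hline
o_1 & \times & \times & \times \\
o_2 & \times & \times & \\
o_3 & & \times & \\
o_4 & & & \times
\end{array}
$$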


The frequent itemsets are $\emptyset$ (4), $\{a_1\}$ (2), $\{a_2\}$ (3), $\{a_3\}$ (2) and $\{a_1, a_2\}$ (2).
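The listed result can be reproduced with a brute-force enumeration of all $2^{|A|}$ itemsets. This is only a sketch, not an efficient miner; the function names are assumptions, and the rows chosen for o3 and o4 are one assignment consistent with the frequencies stated above.

```python
from itertools import combinations


def frequency(itemset, dataset):
    """Absolute frequency: number of objects having all attributes of itemset."""
    return sum(1 for attrs in dataset.values() if itemset <= attrs)


def frequent_itemsets(dataset, attributes, mu):
    """Map every itemset with absolute frequency >= mu to that frequency."""
    result = {}
    items = sorted(attributes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            x = frozenset(combo)
            f = frequency(x, dataset)
            if f >= mu:
                result[x] = f
    return result


# Toy dataset of the illustration (o3/o4 rows are an assumed assignment).
dataset = {
    "o1": {"a1", "a2", "a3"},
    "o2": {"a1", "a2"},
    "o3": {"a2"},
    "o4": {"a3"},
}

frequent = frequent_itemsets(dataset, {"a1", "a2", "a3"}, mu=2)
for x, f in frequent.items():
    print(sorted(x), f)
```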


Completeness

Learning the parameters of a clustering or a classification model amounts to globally modeling the data. Since every object influences the output, the optimal parameters are usually unreachable within a reasonable time and those tasks are solved in an approximate way.

In contrast, local patterns, such as itemsets, are “anomalies” in the data. Often, those anomalies can be completely listed.

Nevertheless, in the worst case, listing the frequent itemsets takes $O(2^{|A|})$ time and counting them is #P-complete.


Inductive database vision

Querying data:

$$\{d \in D \mid q(d, D)\}$$

where:

D is a dataset (tuples),

q is a query.


Querying patterns:

$$\{X \in P \mid Q(X, D)\}$$

where:

D is the dataset,

P is the pattern space,

Q is an inductive query.


Querying the frequent itemsets:

$$\{X \in P \mid Q(X, D)\}$$

where:

$D$ is a set $O$ of objects described with Boolean attributes in $A$,

$P$ is $2^A$,

$Q$ is $(X, D) \mapsto |\{o \in O \mid \forall x \in X, d_{o,x} = \text{true}\}| \geq \mu$.


Equivalently, writing $f(X, D)$ for the absolute frequency of $X$ in $D$, $Q$ is $(X, D) \mapsto f(X, D) \geq \mu$.
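The inductive-database view separates the pattern space from the query predicate, which can be mirrored in code with a generic filter. The sketch below is an illustration under assumed names (`power_set`, `inductive_query`) and a hypothetical two-attribute dataset.

```python
from itertools import combinations


def power_set(attributes):
    """The pattern space P = 2^A, as a list of frozensets."""
    items = sorted(attributes)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def f(itemset, dataset):
    """Absolute frequency of itemset in dataset."""
    return sum(1 for attrs in dataset.values() if itemset <= attrs)


def inductive_query(pattern_space, predicate, dataset):
    """Return {X in P | Q(X, D)} for an arbitrary predicate Q."""
    return [x for x in pattern_space if predicate(x, dataset)]


# Hypothetical dataset: object -> attributes that are true.
dataset = {"o1": {"a", "b"}, "o2": {"a"}, "o3": {"b"}}
mu = 2

# Q is (X, D) -> f(X, D) >= mu.
answer = inductive_query(power_set({"a", "b"}),
                         lambda x, d: f(x, d) >= mu,
                         dataset)
print([sorted(x) for x in answer])
```

Swapping the lambda for another predicate changes the query without touching the enumeration of the pattern space.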


Frequency frontier (µ = 3)

[Figure: frequency frontier for µ = 3]


Frequency constraint

The frequency constraint is a threshold on a joint probability. It:

filters the relevant itemsets, defined as the subsets of attributes whose joint probabilities are high enough;

decreases the (exponential) size of the output (a frequent itemset remains frequent if the minimal frequency is lowered);

decreases the (exponential) running time.

In practice, high minimal frequencies are to be tested first.
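The effect of the threshold on the output size can be observed by counting the frequent itemsets for decreasing values of $\mu$. The sketch below uses a four-object toy dataset (an assumption chosen for illustration) and a hypothetical helper `count_frequent`.

```python
from itertools import combinations

# Assumed toy dataset: object -> attributes that are true.
dataset = {
    "o1": {"a1", "a2", "a3"},
    "o2": {"a1", "a2"},
    "o3": {"a2"},
    "o4": {"a3"},
}
attributes = ["a1", "a2", "a3"]


def count_frequent(mu):
    """Number of itemsets whose absolute frequency is at least mu."""
    count = 0
    for k in range(len(attributes) + 1):
        for combo in combinations(attributes, k):
            freq = sum(1 for attrs in dataset.values() if set(combo) <= attrs)
            if freq >= mu:
                count += 1
    return count


# Try high thresholds first, then lower them.
sizes = {mu: count_frequent(mu) for mu in (4, 3, 2, 1)}
print(sizes)
```

Each lowering of $\mu$ can only add itemsets to the output, so starting high gives small, quickly computed results before committing to a longer run.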


Closed and Maximal Itemset Mining


Redundancy of information

Theorem

Every subset of a frequent itemset is a frequent itemset.

To remove some redundant itemsets, the frequent itemsets can be constrained to be:

maximal: the largest (w.r.t. ⊆) frequent itemsets;

closed: the largest (w.r.t. ⊆) frequent itemsets among those having the same support.

Other condensed representations of the frequent itemsets exist: the free itemsets, the non-derivable itemsets, etc.
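The theorem above (every subset of a frequent itemset is frequent) also means that an itemset with an infrequent subset cannot be frequent, which allows a level-wise search to skip candidates. The sketch below follows the style of the Apriori algorithm, which the slides do not name; the function names and the toy dataset are assumptions.

```python
from itertools import combinations


def frequency(itemset, dataset):
    """Absolute frequency of itemset in dataset."""
    return sum(1 for attrs in dataset.values() if itemset <= attrs)


def levelwise_frequent_itemsets(dataset, attributes, mu):
    """Level-wise search: size-(k+1) candidates are built only from
    frequent size-k itemsets, all of whose size-k subsets are frequent."""
    frequent = {}
    if len(dataset) < mu:
        return frequent  # even the empty itemset is infrequent
    frequent[frozenset()] = len(dataset)
    level = [frozenset([a]) for a in sorted(attributes)]
    while level:
        kept = []
        for x in level:
            f = frequency(x, dataset)
            if f >= mu:
                frequent[x] = f
                kept.append(x)
        kept_set = set(kept)
        candidates = set()
        for x, y in combinations(kept, 2):
            union = x | y
            # Keep the union only if it is one attribute larger and
            # every subset obtained by dropping one attribute is frequent.
            if len(union) == len(x) + 1 and all(
                union - {a} in kept_set for a in union
            ):
                candidates.add(union)
        level = sorted(candidates, key=sorted)
    return frequent


# Assumed toy dataset: object -> attributes that are true.
dataset = {
    "o1": {"a1", "a2", "a3"},
    "o2": {"a1", "a2"},
    "o3": {"a2"},
    "o4": {"a3"},
}
result = levelwise_frequent_itemsets(dataset, {"a1", "a2", "a3"}, mu=2)
print({tuple(sorted(x)): f for x, f in result.items()})
```

On this dataset the itemset $\{a_1, a_2, a_3\}$ is never counted, because two of its size-2 subsets are already infrequent.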


Maximality and closedness

Definition

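Given the set $\mathcal{F}$ of frequent itemsets, $X \in \mathcal{F}$ is maximal if and only if $\forall Y \in \mathcal{F}, X \not\subset Y$.

Definition

Given the set $\mathcal{F}$ of frequent itemsets, $X \in \mathcal{F}$ is closed if and only if $\forall Y \in \mathcal{F}, X \subset Y \Rightarrow f(X, D) \neq f(Y, D)$.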


Support equivalence classes

[Figure: support equivalence classes]


Inclusions

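Given $D$, a set of objects described with Boolean attributes in $A$, and a minimal frequency $\mu \in \mathbb{N}$, let:

$\mathcal{F} = \{X \subseteq A \mid f(X, D) \geq \mu\}$;

$\mathcal{C} = \{X \in \mathcal{F} \mid \forall Y \in \mathcal{F}, (X \subset Y \Rightarrow f(X, D) \neq f(Y, D))\}$;

$\mathcal{M} = \{X \in \mathcal{F} \mid \forall Y \in \mathcal{F}, X \not\subset Y\}$.

Theorem

$\mathcal{M} \subseteq \mathcal{C} \subseteq \mathcal{F}$.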


Frequent itemset mining: illustration

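Let 2 be the minimal absolute frequency or, equivalently, 50% the minimal relative frequency.

$$
\begin{array}{c|ccc}
 & a_1 & a_2 & a_3 \\ \hline
o_1 & \times & \times & \times \\
o_2 & \times & \times & \\
o_3 & & \times & \\
o_4 & & & \times
\end{array}
$$

The frequent itemsets are $\emptyset$ (4), $\{a_1\}$ (2), $\{a_2\}$ (3), $\{a_3\}$ (2) and $\{a_1, a_2\}$ (2);

Among those frequent itemsets, the closed ones are $\emptyset$, $\{a_2\}$, $\{a_3\}$ and $\{a_1, a_2\}$.

Among the closed itemsets, the maximal ones are $\{a_3\}$ and $\{a_1, a_2\}$.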
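The definitions of closedness and maximality can be applied directly to the frequent itemsets of this illustration. The sketch below takes the frequencies listed above as input; the function names are assumptions.

```python
def closed_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset of equal frequency."""
    return {
        x for x, f in frequent.items()
        if all(not (x < y) or f != g for y, g in frequent.items())
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset at all."""
    return {x for x in frequent if not any(x < y for y in frequent)}


# Frequent itemsets of the illustration with their absolute frequencies.
frequent = {
    frozenset(): 4,
    frozenset({"a1"}): 2,
    frozenset({"a2"}): 3,
    frozenset({"a3"}): 2,
    frozenset({"a1", "a2"}): 2,
}

closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
print(sorted(sorted(x) for x in closed))
print(sorted(sorted(x) for x in maximal))
```

Here $\{a_1\}$ is the only frequent itemset that is not closed, because its superset $\{a_1, a_2\}$ has the same frequency, and the inclusions between maximal, closed and frequent itemsets hold as the theorem states.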


Maximality: a condensed representation

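Given $D$, a set of objects described with Boolean attributes in $A$, and a minimal frequency $\mu \in \mathbb{N}$, let:

$\mathcal{F} = \{X \subseteq A \mid f(X, D) \geq \mu\}$;

$\mathcal{M} = \{X \in \mathcal{F} \mid \forall Y \in \mathcal{F}, X \not\subset Y\}$.

$\mathcal{M}$ is a condensed representation of $\mathcal{F}$:

$\mathcal{M} \subseteq \mathcal{F}$;

$\mathcal{F}$ can be expressed as a function of $\mathcal{M}$ only: $\mathcal{F} = \bigcup_{X \in \mathcal{M}} 2^X$.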
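The union of power sets can be computed explicitly to recover every frequent itemset from the maximal ones. The sketch below uses the maximal itemsets of the running example, $\{a_3\}$ and $\{a_1, a_2\}$; the function name `downward_closure` is an assumption.

```python
from itertools import combinations


def downward_closure(maximal):
    """Union of the power sets 2^X over every X in `maximal`."""
    result = set()
    for m in maximal:
        items = sorted(m)
        for size in range(len(items) + 1):
            for combo in combinations(items, size):
                result.add(frozenset(combo))
    return result


# Maximal itemsets of the running example (minimal absolute frequency 2).
maximal = {frozenset({"a3"}), frozenset({"a1", "a2"})}
recovered = downward_closure(maximal)
print(sorted(sorted(x) for x in recovered))
```

Only the itemsets are recovered this way, not their frequencies.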


Maximal itemsets and frequencies

Theorem

Given a minimal frequency $\mu \in \mathbb{N}$, the related set $\mathcal{M} \subseteq 2^A$ of maximal itemsets in a dataset $D$, their frequencies, and $X \subseteq A$:

$$
\begin{cases}
f(X, D) < \mu & \text{if } \forall M \in \mathcal{M}, X \not\subseteq M; \\
f(X, D) \geq \max_{\{M \in \mathcal{M} \mid X \subseteq M\}} f(M, D) & \text{otherwise.}
\end{cases}
$$

The maximal itemsets and their frequencies carry little information about the frequencies of the other frequent itemsets: only lower bounds, usually close to the minimal frequency.


Retrieving a lower bound of a frequency

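Let 2 be the minimal absolute frequency or, equivalently, 50% the minimal relative frequency.

$$
\begin{array}{c|ccc}
 & a_1 & a_2 & a_3 \\ \hline
o_1 & \times & \times & \times \\
o_2 & \times & \times & \\
o_3 & & \times & \\
o_4 & & & \times
\end{array}
$$

The maximal itemsets are $\{a_3\}$ (2) and $\{a_1, a_2\}$ (2). From that information, it can only be proved that $f(\{a_2\}, D) \geq 2$, whereas the actual frequency of $\{a_2\}$ is 3.

The more frequent an itemset, the farther from its actual frequency the lower bound derived from the maximal itemsets and their frequencies tends to be.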
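The lower bound can be derived mechanically from the maximal itemsets and their frequencies. The sketch below is an illustration with an assumed function name; it returns `None` when no maximal itemset contains the queried itemset, meaning its frequency is below the minimal frequency.

```python
def frequency_lower_bound(itemset, maximal_frequencies):
    """Best lower bound on f(itemset, D) derivable from the maximal
    itemsets, or None if the itemset is infrequent."""
    bounds = [f for m, f in maximal_frequencies.items() if itemset <= m]
    return max(bounds) if bounds else None


# Maximal itemsets of the example with their absolute frequencies.
maximal_frequencies = {
    frozenset({"a3"}): 2,
    frozenset({"a1", "a2"}): 2,
}

bound_a2 = frequency_lower_bound(frozenset({"a2"}), maximal_frequencies)
bound_a2_a3 = frequency_lower_bound(frozenset({"a2", "a3"}), maximal_frequencies)
print(bound_a2, bound_a2_a3)
```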


The frequency frontier (µ = 3)

[Figure: the frequency frontier for µ = 3]


Closedness: a condensed representation

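Given $D$, a set of objects described with Boolean attributes in $A$, and a minimal frequency $\mu \in \mathbb{N}$, let:

$\mathcal{F} = \{X \subseteq A \mid f(X, D) \geq \mu\}$;

$\mathcal{C} = \{X \in \mathcal{F} \mid \forall Y \in \mathcal{F}, X \subset Y \Rightarrow f(X, D) \neq f(Y, D)\}$.

$\mathcal{C}$ is a condensed representation of $\mathcal{F}$:

$\mathcal{C} \subseteq \mathcal{F}$;

$\mathcal{F}$ can be expressed as a function of $\mathcal{C}$ only: $\mathcal{F} = \bigcup_{X \in \mathcal{C}} 2^X$.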


Closed itemsets and frequencies

Theorem

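Given a minimal frequency $\mu \in \mathbb{N}$, the related set $\mathcal{C} \subseteq 2^A$ of closed itemsets in a dataset $D$, their frequencies, and $X \subseteq A$:

$$
\begin{cases}
f(X, D) < \mu & \text{if } \forall C \in \mathcal{C}, X \not\subseteq C; \\
f(X, D) = \max_{\{C \in \mathcal{C} \mid X \subseteq C\}} f(C, D) & \text{otherwise.}
\end{cases}
$$

The closed itemsets and their frequencies carry all the information about the frequencies of the other frequent itemsets.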


Retrieving the frequency

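Let 2 be the minimal absolute frequency or, equivalently, 50% the minimal relative frequency.

$$
\begin{array}{c|ccc}
 & a_1 & a_2 & a_3 \\ \hline
o_1 & \times & \times & \times \\
o_2 & \times & \times & \\
o_3 & & \times & \\
o_4 & & & \times
\end{array}
$$

The closed itemsets are $\emptyset$ (4), $\{a_2\}$ (3), $\{a_3\}$ (2) and $\{a_1, a_2\}$ (2). From that information, it can be proved that $f(\{a_1\}, D) = 2$.

The frequency of a frequent itemset is a valuable piece of information to assess its quality.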
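The exact frequency of any frequent itemset is the largest frequency among the closed itemsets that contain it. The sketch below applies that rule to the closed itemsets of the example; the function name is an assumption, and `None` stands for "frequency below the minimal frequency".

```python
def frequency_from_closed(itemset, closed_frequencies):
    """Exact frequency of a frequent itemset, recovered from the closed
    itemsets; None if no closed itemset contains it (infrequent)."""
    values = [f for c, f in closed_frequencies.items() if itemset <= c]
    return max(values) if values else None


# Closed itemsets of the example with their absolute frequencies.
closed_frequencies = {
    frozenset(): 4,
    frozenset({"a2"}): 3,
    frozenset({"a3"}): 2,
    frozenset({"a1", "a2"}): 2,
}

f_a1 = frequency_from_closed(frozenset({"a1"}), closed_frequencies)
f_a2 = frequency_from_closed(frozenset({"a2"}), closed_frequencies)
f_a1_a3 = frequency_from_closed(frozenset({"a1", "a3"}), closed_frequencies)
print(f_a1, f_a2, f_a1_a3)
```

The maximum is needed because a small itemset such as $\{a_2\}$ is contained in several closed itemsets with different frequencies.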


Frequent vs. maximal vs. closed

Maximality vs. closedness

Maximality strongly condenses the frequent itemsets but the information about their frequencies is lost;

Closedness condenses the frequent itemsets only a little but the information about their frequencies is kept.

Every unclosed itemset is less informative than the closed itemset with the same support.

Conclusion

If the itemsets are to be directly interpreted, only list closed itemsets: unclosed itemsets do not bring any additional information.


Summary

Frequent itemsets are subsets of Boolean attributes that enough objects share;

In the worst case, the number of frequent itemsets is exponential in the number of attributes, and so is the time to list them;

Maximality strongly condenses the frequent itemsets but the information about their frequencies is lost;

The maximal itemsets are the largest frequent itemsets;

Closedness condenses the frequent itemsets only a little but the information about their frequencies is kept.

The closed itemsets are the largest frequent itemsets among those having the same support.


Assessing Itemsets


Assessing a collection of itemsets

Itemset mining is an unsupervised task: it is about discovering hidden correlations between any number of Boolean attributes. As a consequence, there are only intrinsic measures of the quality of an itemset or a collection of itemsets.

However, itemsets can be mined in a supervised context too. For example, given classified objects, how well an itemset discriminates a particular class can be a quality measure. There are dozens of ways to define the discriminating power of an itemset.


Lift

The lift quantifies how expected (lift close to 1) or unexpected (lift significantly different from 1) the frequency of an itemset is, given the frequencies of its individual attributes. It is the ratio between the relative frequency of the itemset and its expected relative frequency under the assumption that its attributes are independent.

Definition

Given $D$, a set $O$ of objects described with Boolean attributes in $A$, the lift of an itemset $X \subseteq A$ is:

$$\frac{|O|^{|X|-1} f(X, D)}{\prod_{a \in X} f(\{a\}, D)}$$

Frequent itemset mining can only discover high-lift itemsets, i.e., those indicating positively correlated attributes.
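The lift formula can be evaluated directly from absolute frequencies. The sketch below is an illustration on an assumed four-object toy dataset; the function names are hypothetical, and the code assumes every attribute of the itemset is true for at least one object (otherwise the denominator is zero).

```python
from math import prod


def frequency(itemset, dataset):
    """Absolute frequency of itemset in dataset."""
    return sum(1 for attrs in dataset.values() if set(itemset) <= attrs)


def lift(itemset, dataset):
    """|O|^(|X|-1) * f(X, D) / product of f({a}, D) over a in X."""
    n = len(dataset)
    numerator = n ** (len(itemset) - 1) * frequency(itemset, dataset)
    denominator = prod(frequency({a}, dataset) for a in itemset)
    return numerator / denominator


# Assumed toy dataset: object -> attributes that are true.
dataset = {
    "o1": {"a1", "a2", "a3"},
    "o2": {"a1", "a2"},
    "o3": {"a2"},
    "o4": {"a3"},
}

lift_a1_a2 = lift({"a1", "a2"}, dataset)  # 4 * 2 / (2 * 3)
lift_a1_a3 = lift({"a1", "a3"}, dataset)  # 4 * 1 / (2 * 2)
print(lift_a1_a2, lift_a1_a3)
```

On this data $\{a_1, a_2\}$ has a lift above 1 (its attributes co-occur more than independence would predict), while $\{a_1, a_3\}$ has a lift of exactly 1.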


Correlation

Frequent itemsets (especially under a minimal lift constraint) support the discovery of correlations between any number of Boolean attributes.


Statistical measures

Swap randomization randomizes the dataset D while preserving the numbers of true values per object and per attribute. The itemsets in D that are significantly less frequent in the randomized dataset are interesting.

Compression ratio is how informative an itemset X is (which increases with |X| f(X, D) and with how infrequent the involved objects/attributes are) divided by its description length (which increases with |X| + f(X, D)).
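Swap randomization can be sketched as follows. This is only an illustration under stated assumptions: the dataset is a list of sets of true attributes, a fixed number of swap attempts is used (`n_swaps`, a hypothetical parameter), and the significance is estimated with a simple empirical p-value. The function names are illustrative too.

```python
import random


def frequency(itemset, rows):
    """Number of objects having every attribute of the itemset."""
    return sum(1 for row in rows if itemset <= row)


def swap_randomize(dataset, n_swaps, rng):
    """Return a randomized copy of the dataset with the same number of
    true values per object and per attribute."""
    rows = [set(obj) for obj in dataset]
    cells = [(i, a) for i, row in enumerate(rows) for a in row]
    if len(cells) < 2:
        return rows
    for _ in range(n_swaps):  # attempts; some of them are rejected
        k1, k2 = rng.sample(range(len(cells)), 2)
        (i, a), (j, b) = cells[k1], cells[k2]
        # Swap (i, a), (j, b) into (i, b), (j, a) only if both targets are false.
        if a not in rows[j] and b not in rows[i]:
            rows[i].remove(a)
            rows[i].add(b)
            rows[j].remove(b)
            rows[j].add(a)
            cells[k1], cells[k2] = (i, b), (j, a)
    return rows


def empirical_p_value(itemset, dataset, n_samples=200, n_swaps=1000, seed=0):
    """Fraction of randomized datasets where the itemset is at least as
    frequent as in the original dataset (with add-one smoothing)."""
    rng = random.Random(seed)
    observed = frequency(itemset, dataset)
    at_least = sum(
        frequency(itemset, swap_randomize(dataset, n_swaps, rng)) >= observed
        for _ in range(n_samples)
    )
    return (at_least + 1) / (n_samples + 1)
```

A small p-value suggests that the frequency of the itemset is not explained by the row and column margins alone. How many swaps are enough for the chain to mix is left open here.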


Beyond Frequent Itemsets


Constraining the itemsets

To filter the most relevant itemsets and, often, speed up their extraction, some algorithms allow specifying additional constraints that the output itemsets must satisfy.

High utility itemset mining

Given D, a set of objects described with Boolean attributes in A, and a utility function u : A → R, list every itemset X ⊆ A with $f(X, D) \sum_{x \in X} u(x)$ exceeding a given threshold.

Such constraints bring itemset mining closer to querying inductive databases.
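The high-utility definition above can be sketched by brute force in Python. The sketch assumes that the dataset is a list of sets of true attributes and that the utility function is a dictionary covering every attribute of A; the names `high_utility_itemsets`, `utility` and `threshold` are illustrative. It enumerates all 2^|A| - 1 non-empty itemsets, so it is only meant for tiny examples.

```python
from itertools import combinations


def frequency(itemset, dataset):
    """Number of objects having every attribute of the itemset."""
    return sum(1 for obj in dataset if itemset <= obj)


def high_utility_itemsets(dataset, utility, threshold):
    """Map every itemset X with f(X, D) * sum of u(x) > threshold to that score."""
    attributes = sorted(utility)
    result = {}
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            itemset = frozenset(combo)
            score = frequency(itemset, dataset) * sum(utility[a] for a in itemset)
            if score > threshold:
                result[itemset] = score
    return result


# Toy data (hypothetical).
dataset = [{"pen", "ink"}, {"pen", "ink"}, {"pen"}, {"paper"}]
utility = {"pen": 1.0, "ink": 5.0, "paper": 0.5}

for itemset, score in high_utility_itemsets(dataset, utility, threshold=5).items():
    print(sorted(itemset), score)
```

In this toy example {pen} scores 3 and is rejected, whereas its superset {pen, ink} scores 12 and is kept: the constraint is not anti-monotone, so the pruning used for a minimal frequency does not directly apply.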


Mining cliques

[Figure: undirected graph on the vertices a, b, c and d]

    a   b   c   d
a   ×       ×   ×
b       ×       ×
c   ×       ×   ×
d   ×   ×   ×   ×

(Maximal) cliques are (closed) itemsets whose supports include the vertices of the itemset itself (and possibly some more).
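The link between cliques and itemsets can be checked on a small example. The sketch below assumes that the adjacency matrix of the slide is symmetric with a true diagonal, which gives the edges a-c, a-d, c-d and b-d; each vertex is both an object (a row) and an attribute (a column). All function names are illustrative.

```python
from itertools import combinations

# Assumed adjacency matrix: each vertex maps to its neighbours plus itself.
adjacency = {
    "a": {"a", "c", "d"},
    "b": {"b", "d"},
    "c": {"a", "c", "d"},
    "d": {"a", "b", "c", "d"},
}


def support(itemset):
    """Vertices (objects) whose row contains every vertex of the itemset."""
    return {v for v, row in adjacency.items() if itemset <= row}


def closure(itemset):
    """Attributes shared by all the rows in the support of the itemset."""
    return set.intersection(*(adjacency[v] for v in support(itemset)))


def cliques():
    """Non-empty itemsets whose support includes the itemset itself."""
    vertices = sorted(adjacency)
    found = []
    for size in range(1, len(vertices) + 1):
        for combo in combinations(vertices, size):
            itemset = set(combo)
            if itemset <= support(itemset):
                found.append(itemset)
    return found


def maximal_cliques():
    """Cliques that are not strictly included in another clique."""
    all_cliques = cliques()
    return [c for c in all_cliques if not any(c < other for other in all_cliques)]


for clique in maximal_cliques():
    print(sorted(clique), "closed:", closure(clique) == clique)
```

On this graph, the maximal cliques are {b, d} and {a, c, d}, and both are closed itemsets of the adjacency matrix.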


Noise tolerance

True values that are erroneously false in the dataset fragment the "real" itemsets and their supports.

Noise-tolerant itemsets accept, in their supports, objects having a few false values for some attributes in the itemset. To be sensible, noise tolerance should be per object and per attribute.

The (noise-tolerant) itemsets can be heuristically grown or agglomerated in a post-processing step.
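One possible reading of "per object and per attribute" is sketched below: a pair (set of objects, itemset) is accepted if every object misses at most a few attributes of the itemset and every attribute is missing from at most a few of the objects. The two absolute thresholds and the function name are assumptions made for illustration; the slides do not fix a particular definition.

```python
def is_noise_tolerant_pattern(objects, itemset, dataset,
                              max_false_per_object, max_false_per_attribute):
    """Check the noise tolerance of (objects, itemset).

    dataset maps each object to the set of its true attributes.
    """
    # Per-object tolerance: no object may miss too many attributes of the itemset.
    for obj in objects:
        if len(itemset - dataset[obj]) > max_false_per_object:
            return False
    # Per-attribute tolerance: no attribute may be false on too many objects.
    for attribute in itemset:
        missing = sum(1 for obj in objects if attribute not in dataset[obj])
        if missing > max_false_per_attribute:
            return False
    return True


# Toy dataset (hypothetical): only o1 supports {a, b, c} exactly.
dataset = {
    "o1": {"a", "b", "c"},
    "o2": {"a", "b"},
    "o3": {"a", "c"},
    "o4": {"d"},
}
itemset = {"a", "b", "c"}

print(is_noise_tolerant_pattern({"o1", "o2", "o3"}, itemset, dataset, 1, 1))
print(is_noise_tolerant_pattern({"o1", "o2", "o3", "o4"}, itemset, dataset, 1, 1))
```

With one tolerated false value per object and per attribute, the support grows from {o1} to {o1, o2, o3}, whereas o4, which has none of the three attributes, stays out.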


Collections of more complex objects

Frequent (closed/maximal) patterns are easily defined in collections of words, sequences of itemsets or (labeled) graphs.

Itemsets are easily generalized to n-ary (rather than binary) relations too.
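For a collection of words, one simple instance can be sketched in Python. It assumes that a pattern is a contiguous substring and that its frequency is the number of words containing it at least once; other pattern languages (subsequences with gaps, for instance) are possible. The function names are illustrative.

```python
from collections import Counter


def frequent_substrings(words, min_frequency):
    """Map every substring occurring in at least min_frequency words to
    the number of words containing it."""
    counts = Counter()
    for word in words:
        # A set, so that each substring counts once per word.
        substrings = {
            word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)
        }
        counts.update(substrings)
    return {s: c for s, c in counts.items() if c >= min_frequency}


def maximal_patterns(frequent):
    """Frequent substrings that are not part of a longer frequent substring."""
    return {s for s in frequent if not any(s != t and s in t for t in frequent)}


words = ["abc", "abd", "bc"]
frequent = frequent_substrings(words, min_frequency=2)
print(frequent)
print(maximal_patterns(frequent))
```

As with itemsets, every substring of a frequent substring is frequent under this definition, so the maximal patterns summarize the whole collection of frequent ones.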


Large individual objects

Frequent patterns can be defined as repetitions in a large word, a large sequence of itemsets or a large (labeled) graph.

Different definitions of the frequency have been proposed. For some of them, sub-patterns of frequent patterns can be infrequent and searching for closed/maximal patterns makes little sense.
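The loss of anti-monotonicity can be observed with one such definition, chosen here only for illustration: the frequency of a pattern in a large word is its number of embeddings as a subsequence (gaps allowed). The function name `count_embeddings` is hypothetical.

```python
def count_embeddings(pattern, word):
    """Number of ways to embed pattern in word as a subsequence (gaps allowed)."""
    # ways[j] is the number of embeddings of pattern[:j] in the prefix read so far.
    ways = [1] + [0] * len(pattern)
    for symbol in word:
        # Go backwards so that each symbol of word is used at most once per embedding.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


word = "abbb"
print(count_embeddings("a", word))   # 1
print(count_embeddings("ab", word))  # 3
```

With a minimal frequency of 2, "ab" is frequent in "abbb" although its sub-pattern "a" is not. Level-wise pruning and the usual notions of closed or maximal patterns therefore lose their meaning under this definition.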


Case study


License

© 2012–2017 Loïc Cerf

These slides are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
