Datamining Methods: Mining Association Rules and Sequential Patterns
Institut für Scientific Computing - Universität Wien
P. Brezany


Page 1: Datamining Methods Mining Association Rules and Sequential Patterns


Datamining Methods

Mining Association Rules and Sequential Patterns

Page 2: Datamining Methods Mining Association Rules and Sequential Patterns


KDD (Knowledge Discovery in Databases) Process

Figure: the KDD process. Operational Databases are cleaned, collected, and summarized into a Data Warehouse; Data Preparation produces the Training Data; Data Mining produces a Model (Patterns); Verification & Evaluation follows.

Page 3: Datamining Methods Mining Association Rules and Sequential Patterns


Mining Association Rules

• Association rule mining finds interesting association or correlation relationships among a large set of data items.

• This can help in many business decision-making processes: store layout, catalog design, and customer segmentation based on buying patterns. Another important field: medical applications.

• Market basket analysis is a typical example of association rule mining.

• How can we find association rules from large amounts of data? Which association rules are the most interesting? How can we help or guide the mining procedures?

Page 4: Datamining Methods Mining Association Rules and Sequential Patterns


Informal Introduction

• Given a set of database transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items (literals). The intuitive meaning of the rule: transactions in the database which contain the items in X tend to also contain the items in Y.

• Example: 98% of customers who purchase tires and auto accessories also buy some automotive services; here 98% is called the confidence of the rule. The support of the rule is the percentage of transactions that contain both X and Y.

• The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.

Page 5: Datamining Methods Mining Association Rules and Sequential Patterns


Basic Concepts

Let J = {i1, i2, ..., im} be a set of items. Typically, the items are identifiers of individual articles (products), e.g., bar codes.

Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J.

Let A be a set of items: a transaction T is said to contain A if and only if A ⊆ T.

An association rule is an implication of the form A ⇒ B, where A ⊂ J, B ⊂ J, and A ∩ B = ∅.

The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is the probability P(A ∪ B).

Page 6: Datamining Methods Mining Association Rules and Sequential Patterns


Basic Concepts (Cont.)

The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B, i.e., the conditional probability P(B|A).

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.

Page 7: Datamining Methods Mining Association Rules and Sequential Patterns


Basic Concepts - Example

transaction   purchased items
1             bread, coffee, milk, cake
2             coffee, milk, cake
3             bread, butter, coffee, milk
4             milk, cake
5             bread, cake
6             bread

X = {coffee, milk}
R = {coffee, cake, milk}

support of X = 3 of 6 = 50%
support of R = 2 of 6 = 33%

Support of "milk, coffee" ⇒ "cake" equals the support of R = 33%

Confidence of "milk, coffee" ⇒ "cake" = 2 of 3 = 67% [= support(R)/support(X)]
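These numbers can be reproduced with a few lines of code. The sketch below is illustrative only; the helper functions are my own names, not from any library.

```python
# Illustrative sketch: support and confidence over the six example transactions.
# `support` and `confidence` are hypothetical helpers, not a library API.

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

transactions = [
    {"bread", "coffee", "milk", "cake"},
    {"coffee", "milk", "cake"},
    {"bread", "butter", "coffee", "milk"},
    {"milk", "cake"},
    {"bread", "cake"},
    {"bread"},
]

X = {"coffee", "milk"}
R = {"coffee", "cake", "milk"}
print(support(X, transactions))                          # 0.5
print(round(support(R, transactions), 2))                # 0.33
print(round(confidence(X, {"cake"}, transactions), 2))   # 0.67
```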

Page 8: Datamining Methods Mining Association Rules and Sequential Patterns


Basic Concepts (Cont.)

An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.

Page 9: Datamining Methods Mining Association Rules and Sequential Patterns


Association Rule Classification

• Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. For example:

computer ⇒ financial_management_software [support = 2%, confidence = 60%]

If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. For example:

age(X, "30..39") ∧ income(X, "42K..48K") ⇒ buys(X, "high resolution TV")

Note that the quantitative attributes, age and income, have been discretized.

Page 10: Datamining Methods Mining Association Rules and Sequential Patterns


Association Rule Classification (Cont.)

• Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:

buys(X, "computer") ⇒ buys(X, "financial management software")

The above rule refers to only one dimension, buys.

If a rule references two or more dimensions, such as buys, time_of_transaction, and customer_category, then it is a multidimensional association rule. The second rule on the previous slide is a 3-dimensional association rule, since it involves three dimensions: age, income, and buys.

Page 11: Datamining Methods Mining Association Rules and Sequential Patterns


Association Rule Classification (Cont.)

• Based on the levels of abstraction involved in the rule set: Suppose that a set of association rules mined includes:

age(X, "30..39") ⇒ buys(X, "laptop computer")
age(X, "30..39") ⇒ buys(X, "computer")

In the above rules, the items bought are referenced at different levels of abstraction. (E.g., "computer" is a higher-level abstraction of "laptop computer".) Such rules are called multilevel association rules.

Single-level association rules refer to one abstraction level only.

Page 12: Datamining Methods Mining Association Rules and Sequential Patterns


Mining Single-Dimensional Boolean Association Rules from Transactional Databases

This is the simplest form of association rules (used in market basket analysis). We present Apriori, a basic algorithm for finding frequent itemsets. Its name reflects the fact that it uses prior knowledge of frequent itemset properties (explained later). Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

First, the set of frequent 1-itemsets, L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. The Apriori property is used to reduce the search space.

Page 13: Datamining Methods Mining Association Rules and Sequential Patterns


The Apriori Property

All nonempty subsets of a frequent itemset must also be frequent.

If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup.

How is the Apriori property used in the algorithm?

To understand this, let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions. These steps are explained on the next slides.

Page 14: Datamining Methods Mining Association Rules and Sequential Patterns


The Apriori Algorithm – the Join Step

To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted by Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second-to-last item in l1).

Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.

The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition (l1[k-1] < l2[k-1]) simply ensures that no duplicates are generated. The resulting itemset: l1[1] l1[2] ... l1[k-1] l2[k-1].
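The join condition above translates directly into code. A minimal sketch, assuming itemsets are represented as lexicographically sorted tuples (the function name is mine, not from any library):

```python
# Sketch of the Apriori join step: merge pairs of (k-1)-itemsets that agree
# on their first k-2 items; l1[-1] < l2[-1] prevents duplicate candidates.

def apriori_join(L_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    return [l1 + (l2[-1],)
            for l1 in L_prev for l2 in L_prev
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]]

# L2 from the worked AllElectronics example in these slides:
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = apriori_join(L2)
print(C3)
```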

Page 15: Datamining Methods Mining Association Rules and Sequential Patterns


The Apriori Algorithm – the Join Step (2)

Illustration by an example:

p ∈ Lk-1 = (1 2 3)
q ∈ Lk-1 = (1 2 4)      Join result in Ck: (1 2 3 4)

Each frequent k-itemset p is always extended by the last item of all frequent itemsets q which have the same first k-1 items as p.

Page 16: Datamining Methods Mining Association Rules and Sequential Patterns


The Apriori Algorithm – the Prune Step

Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck.

A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, and so this could involve heavy computation.

To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck.

The above subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
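The prune step can be sketched in a few lines. For brevity this sketch uses a plain set membership test rather than the hash tree mentioned above; the function name is mine:

```python
from itertools import combinations

def apriori_prune(candidates, L_prev):
    """Drop candidates that have an infrequent (k-1)-subset (Apriori property)."""
    frequent = set(L_prev)
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, len(c) - 1))]

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = [("I1", "I2", "I3"), ("I1", "I2", "I5"), ("I1", "I3", "I5"),
      ("I2", "I3", "I4"), ("I2", "I3", "I5"), ("I2", "I4", "I5")]
print(apriori_prune(C3, L2))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```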

Page 17: Datamining Methods Mining Association Rules and Sequential Patterns


The Apriori Algorithm - Example

Let's look at a concrete example of Apriori, based on the AllElectronics transaction database D, shown below. There are nine transactions in this database, i.e., |D| = 9. We use the next figure to illustrate the finding of frequent itemsets in D.

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Page 18: Datamining Methods Mining Association Rules and Sequential Patterns


Generation of Ck and Lk (min. supp. count = 2)

Scan D for the count of each candidate, giving C1:

Itemset  Sup. count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

Compare each candidate support count with the minimum support count, giving L1 (identical to C1 here, since every candidate qualifies).

Generate the C2 candidates from L1 and scan D for their counts:

Itemset   Sup. count
{I1,I2}   4
{I1,I3}   4
{I1,I4}   1
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
{I3,I4}   0
{I3,I5}   1
{I4,I5}   0

Comparing with the minimum support count gives L2:

Itemset   Sup. count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2

Page 19: Datamining Methods Mining Association Rules and Sequential Patterns


Generation of Ck and Lk (min. supp. count = 2)

Generate the C3 candidates from L2:

Itemset
{I1,I2,I3}
{I1,I2,I5}

Scan D for their counts, then compare with the minimum support count, giving L3:

Itemset      Sup. count
{I1,I2,I3}   2
{I1,I2,I5}   2

Page 20: Datamining Methods Mining Association Rules and Sequential Patterns


Algorithm Application Description

1. In the 1st iteration, each item is a member of C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2. Suppose that the minimum transaction support count is 2 (min_sup = 2/9 = 22%). L1 can then be determined.

3. C2 = L1 ⋈ L1.

4. The transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in the last figure.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Page 21: Datamining Methods Mining Association Rules and Sequential Patterns


Algorithm Application Description (2)

6. The generation of C3 = L2 ⋈ L2 is detailed in the next figure. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

8. C4 = L3 ⋈ L3; after the pruning, C4 = Ø.

Page 22: Datamining Methods Mining Association Rules and Sequential Patterns


Example: Generating C3 from L2

1. Join: C3 = L2 ⋈ L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} ⋈ {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.

2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent. The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3}, {I2,I3}, and they are all members of L2; therefore, keep {I1,I2,I3} in C3. The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5}, {I2,I5}, and they are all members of L2; therefore, keep {I1,I2,I5} in C3. By the same analysis, remove the remaining four candidate 3-itemsets from C3.

3. Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.
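The join and prune steps can be combined into a complete run over the nine-transaction database. This is a minimal sketch of my own, not the original pseudocode; itemsets are sorted tuples and the function name is hypothetical:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Minimal level-wise Apriori sketch: returns a dict mapping k to the
    set of frequent k-itemsets, each a lexicographically sorted tuple."""
    items = sorted({i for t in transactions for i in t})
    L = {1: {(i,) for i in items
             if sum(i in t for t in transactions) >= min_count}}
    k = 1
    while L[k]:
        prev = sorted(L[k])
        # join: merge k-itemsets that agree on their first k-1 items
        C = [p + (q[-1],) for p in prev for q in prev
             if p[:-1] == q[:-1] and p[-1] < q[-1]]
        # prune: every k-subset of a candidate must already be frequent
        C = [c for c in C if all(s in L[k] for s in combinations(c, k))]
        # count the surviving candidates with one scan of the database
        L[k + 1] = {c for c in C
                    if sum(set(c) <= t for t in transactions) >= min_count}
        k += 1
    del L[k]  # the last level is empty
    return L

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
L = apriori(D, min_count=2)
print(sorted(L[2]))
print(sorted(L[3]))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```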

Page 23: Datamining Methods Mining Association Rules and Sequential Patterns


Generating Association Rules from Frequent Items

We generate strong association rules, i.e., rules that satisfy both minimum support and minimum confidence:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.

Page 24: Datamining Methods Mining Association Rules and Sequential Patterns


Generating Association Rules from Frequent Items (Cont.)

Based on the equation on the previous slide, association rules can be generated as follows:

- For each frequent itemset l, generate all nonempty subsets of l.

- For every nonempty subset s of l, output the rule "s ⇒ (l - s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Page 25: Datamining Methods Mining Association Rules and Sequential Patterns


Generating Association Rules - Example

Suppose that the transactional data for AllElectronics contain the frequent itemset l = {I1,I2,I5}. The resulting rules are:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, since these are the only ones generated that are strong.
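These confidences can be checked mechanically. The sketch below hard-codes the support counts from the worked example; the function name is mine, not from any library:

```python
from itertools import combinations

# Support counts taken from the worked AllElectronics example.
support_count = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
}

def strong_rules(l, min_conf):
    """All rules s => (l - s) from frequent itemset l with conf >= min_conf."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((tuple(sorted(s)), tuple(sorted(l - s)), conf))
    return rules

for ante, cons, conf in strong_rules({"I1", "I2", "I5"}, min_conf=0.70):
    print(ante, "=>", cons, f"{conf:.0%}")
```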

Page 26: Datamining Methods Mining Association Rules and Sequential Patterns


Multilevel (Generalized) Association Rules

For many applications, it is difficult to find strong associationsamong data items at low or primitive levels of abstraction dueto sparsity of data in multidimensional space.

Strong associations discovered at high concept levels may represent common sense knowledge. However, what may represent common sense to one user may seem novel to another.

Therefore, data mining systems should provide capabilities tomine association rules at multiple levels of abstraction and traverse easily among different abstraction spaces.

Page 27: Datamining Methods Mining Association Rules and Sequential Patterns


Multilevel (Generalized) Association Rules - Example

Suppose we are given the following task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID.

TID   Items purchased
T1    IBM desktop computer, Sony b/w printer
T2    Microsoft educational software, Microsoft financial software
T3    Logitech mouse computer accessory, Ergoway wrist pad accessory
T4    IBM desktop computer, Microsoft financial software
T5    IBM desktop computer
...   ...

Table: Transactions

Page 28: Datamining Methods Mining Association Rules and Sequential Patterns


A Concept Hierarchy for our Example

Figure: a concept hierarchy for our example. Level 0 is the root, "all". Level 1: computer, software, printer, computer accessory. Level 2: desktop and laptop (under computer); educational and financial (under software); color and b/w (under printer); wrist pad and mouse (under computer accessory). Level 3: the brands, e.g., IBM, Microsoft, HP, Sony, Ergoway, Logitech.

Page 29: Datamining Methods Mining Association Rules and Sequential Patterns


Example (Cont.)

The items in Table Transactions are at the lowest level of the concept hierarchy. It is difficult to find interesting purchase patterns at such a raw or primitive level of data. If, e.g., "IBM desktop computer" or "Sony b/w printer" each occurs in a very small fraction of the transactions, then it may be difficult to find strong associations involving such items. In other words, it is unlikely that the itemset "{IBM desktop computer, Sony b/w printer}" will satisfy minimum support.

Itemsets containing generalized items, such as "{IBM desktop computer, b/w printer}" and "{computer, printer}", are more likely to have minimum support.

Rules generated from association rule mining with concept hierarchies are called multiple-level, multilevel, or generalized association rules.

Page 30: Datamining Methods Mining Association Rules and Sequential Patterns


Parallel Formulation of Association Rules

• Need:
  – Huge Transaction Datasets (10s of TB)
  – Large Number of Candidates

• Data Distribution:
  – Partition the Transaction Database, or
  – Partition the Candidates, or
  – Both

Page 31: Datamining Methods Mining Association Rules and Sequential Patterns


Parallel Association Rules: Count Distribution (CD)

• Each processor has the complete candidate hash tree.

• Each processor updates its hash tree with local data.

• Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.

• Multiple database scans per iteration are required if the hash tree is too big for memory.

Page 32: Datamining Methods Mining Association Rules and Sequential Patterns


CD: Illustration

Figure: each of the three processors P0, P1, P2 holds N/p transactions and a complete copy of the candidate hash tree ({1,2}, {1,3}, {2,3}, {3,4}, {5,8}) with its locally computed counts; a global reduction then sums the local counts into global candidate counts.

Page 33: Datamining Methods Mining Association Rules and Sequential Patterns


Parallel Association Rules: Data Distribution (DD)

• Candidate set is partitioned among the processors.
• Once local data has been partitioned, it is broadcast to all other processors.
• High communication cost due to data movement.
• Redundant work due to multiple traversals of the hash trees.

Page 34: Datamining Methods Mining Association Rules and Sequential Patterns


DD: Illustration

Figure: all-to-all broadcast. The candidates are partitioned among the processors ({1,2} and {1,3} on P0, {2,3} and {3,4} on P1, {5,8} on P2, each with its count). Each processor also holds N/p local transactions; a data broadcast makes the remote data available so that each processor can count its own candidates.

Page 35: Datamining Methods Mining Association Rules and Sequential Patterns


Predictive Model Markup Language – PMML and Visualization

Page 36: Datamining Methods Mining Association Rules and Sequential Patterns


Predictive Model Markup Language - PMML

• Markup language (XML) to describe data mining models

• PMML describes:
  – the inputs to data mining models
  – the transformations used to prepare the data for mining
  – the parameters which define the models themselves

Page 37: Datamining Methods Mining Association Rules and Sequential Patterns


PMML 2.1 – Association Rules (1)

1. Model attributes (1)

<xs:element name="AssociationModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      <xs:element ref="MiningSchema" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Item" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Itemset" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="AssociationRule" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    …

Page 38: Datamining Methods Mining Association Rules and Sequential Patterns


PMML 2.1 – Association Rules (2)

1. Model attributes (2)

    <xs:attribute name="modelName" type="xs:string" />
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
    <xs:attribute name="algorithmName" type="xs:string" />
    <xs:attribute name="numberOfTransactions" type="INT-NUMBER" use="required" />
    <xs:attribute name="maxNumberOfItemsPerTA" type="INT-NUMBER" />
    <xs:attribute name="avgNumberOfItemsPerTA" type="REAL-NUMBER" />
    <xs:attribute name="minimumSupport" type="PROB-NUMBER" use="required" />
    <xs:attribute name="minimumConfidence" type="PROB-NUMBER" use="required" />
    <xs:attribute name="lengthLimit" type="INT-NUMBER" />
    <xs:attribute name="numberOfItems" type="INT-NUMBER" use="required" />
    <xs:attribute name="numberOfItemsets" type="INT-NUMBER" use="required" />
    <xs:attribute name="numberOfRules" type="INT-NUMBER" use="required" />
  </xs:complexType>
</xs:element>

Page 39: Datamining Methods Mining Association Rules and Sequential Patterns


PMML 2.1 – Association Rules (3)

2. Items

<xs:element name="Item">
  <xs:complexType>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="value" type="xs:string" use="required" />
    <xs:attribute name="mappedValue" type="xs:string" />
    <xs:attribute name="weight" type="REAL-NUMBER" />
  </xs:complexType>
</xs:element>

Page 40: Datamining Methods Mining Association Rules and Sequential Patterns


PMML 2.1 – Association Rules (4)

3. ItemSets

<xs:element name="Itemset">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="ItemRef" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="support" type="PROB-NUMBER" />
    <xs:attribute name="numberOfItems" type="INT-NUMBER" />
  </xs:complexType>
</xs:element>

Page 41: Datamining Methods Mining Association Rules and Sequential Patterns


PMML 2.1 – Association Rules (5)

4. AssociationRules

<xs:element name="AssociationRule">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="support" type="PROB-NUMBER" use="required" />
    <xs:attribute name="confidence" type="PROB-NUMBER" use="required" />
    <xs:attribute name="antecedent" type="xs:string" use="required" />
    <xs:attribute name="consequent" type="xs:string" use="required" />
  </xs:complexType>
</xs:element>

Page 42: Datamining Methods Mining Association Rules and Sequential Patterns


PMML example model for AssociationRules (1)

<?xml version="1.0" ?>
<PMML version="2.1" >
  <DataDictionary numberOfFields="2" >
    <DataField name="transaction" optype="categorical" />
    <DataField name="item" optype="categorical" />
  </DataDictionary>
  <AssociationModel functionName="associationRules"
      numberOfTransactions="4" numberOfItems="4"
      minimumSupport="0.6" minimumConfidence="0.3"
      numberOfItemsets="7" numberOfRules="3">
    <MiningSchema>
      <MiningField name="transaction"/>
      <MiningField name="item"/>
    </MiningSchema>

Page 43: Datamining Methods Mining Association Rules and Sequential Patterns


PMML example model for AssociationRules (2)

    <!-- four items - input data -->
    <Item id="1" value="PC" />
    <Item id="2" value="Monitor" />
    <Item id="3" value="Printer" />
    <Item id="4" value="Notebook" />

    <!-- three frequent 1-itemsets -->
    <Itemset id="1" support="1.0" numberOfItems="1">
      <ItemRef itemRef="1" />
    </Itemset>
    <Itemset id="2" support="1.0" numberOfItems="1">
      <ItemRef itemRef="2" />
    </Itemset>
    <Itemset id="3" support="1.0" numberOfItems="1">
      <ItemRef itemRef="3" />
    </Itemset>

Page 44: Datamining Methods Mining Association Rules and Sequential Patterns


PMML example model for AssociationRules (3)

    <!-- three frequent 2-itemsets -->
    <Itemset id="4" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="2" />
    </Itemset>
    <Itemset id="5" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="3" />
    </Itemset>
    <Itemset id="6" support="1.0" numberOfItems="2">
      <ItemRef itemRef="2" />
      <ItemRef itemRef="3" />
    </Itemset>

Page 45: Datamining Methods Mining Association Rules and Sequential Patterns


PMML example model for AssociationRules (4)

    <!-- one frequent 3-itemset -->
    <Itemset id="7" support="0.9" numberOfItems="3">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="2" />
      <ItemRef itemRef="3" />
    </Itemset>

    <!-- three rules satisfy the requirements - the output -->
    <AssociationRule support="0.9" confidence="0.85"
                     antecedent="4" consequent="3" />
    <AssociationRule support="0.9" confidence="0.75"
                     antecedent="1" consequent="6" />
    <AssociationRule support="0.9" confidence="0.70"
                     antecedent="6" consequent="1" />
  </AssociationModel>
</PMML>
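A model like this can be consumed with any XML toolkit. An illustrative sketch using Python's standard library, with a trimmed copy of the example model (only the rules kept, and no namespace, since the example declares none):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example PMML AssociationModel above.
pmml = """<PMML version="2.1">
  <AssociationModel functionName="associationRules"
      numberOfTransactions="4" numberOfItems="4"
      minimumSupport="0.6" minimumConfidence="0.3"
      numberOfItemsets="7" numberOfRules="3">
    <AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3" />
    <AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6" />
    <AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1" />
  </AssociationModel>
</PMML>"""

model = ET.fromstring(pmml).find("AssociationModel")
# antecedent/consequent reference Itemset ids in the full model
rules = [(r.get("antecedent"), r.get("consequent"),
          float(r.get("support")), float(r.get("confidence")))
         for r in model.findall("AssociationRule")]
print(rules)
```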

Page 46: Datamining Methods Mining Association Rules and Sequential Patterns


Visualization of Association Rules (1)

1. Table Format

Antecedent        Consequent        Support   Confidence
PC, Monitor       Printer           90%       85%
PC                Printer, Monitor  90%       75%
Printer, Monitor  PC                80%       70%

Page 47: Datamining Methods Mining Association Rules and Sequential Patterns


Visualization of Association Rules (2)

2. Directed Graph

Figure: a directed graph in which each rule is drawn as edges leading from its antecedent items to its consequent items: PC, Monitor → Printer; PC → Printer, Monitor; Printer, Monitor → PC.

Page 48: Datamining Methods Mining Association Rules and Sequential Patterns


Visualization of Association Rules (3)

3. 3-D Visualisation

Page 49: Datamining Methods Mining Association Rules and Sequential Patterns


Mining Sequential Patterns

(Mining Sequential Associations)

Page 50: Datamining Methods Mining Association Rules and Sequential Patterns


Mining Sequential Patterns

• Discovering sequential patterns is a relatively new data mining problem.

• The input data is a set of sequences, called data-sequences.

• Each data-sequence is a list of transactions, where each transaction is a set of items. Typically, there is a transaction time associated with each transaction.

• A sequential pattern also consists of a list of sets of items.

• The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.

Page 51: Datamining Methods Mining Association Rules and Sequential Patterns


Application Examples

• Book club
Each data-sequence may correspond to all book selections of a customer, and each transaction corresponds to the books selected by the customer in one order. A sequential pattern may be: "5% of customers bought 'Foundation', then 'Foundation and Empire', and then 'Second Foundation'".

The data-sequence of a customer who bought some other books in between these books still contains this sequential pattern.

• Medical domain
A data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered could be used in disease research to help identify symptoms that precede certain diseases.

Page 52: Datamining Methods Mining Association Rules and Sequential Patterns


Discovering Sequential Associations

Figure: two objects (Object 1 and Object 2) with their event occurrences plotted along a timeline (ticks 10 to 50).

Given: A set of objects with associated event occurrences.

Page 53: Datamining Methods Mining Association Rules and Sequential Patterns


Problem Statement

We are given a database D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction time. We do not consider the quantities of items bought in a transaction: each item is a binary variable representing whether the item was bought or not.

A sequence is an ordered list of itemsets. We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by <s1 s2 ... sn>, where sj is an itemset.

A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Page 54: Datamining Methods Mining Association Rules and Sequential Patterns


Problem Statement (2)

For example, <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6), and (8) ⊆ (8). However, the sequence <(3) (5)> is not contained in <(3 5)> (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together.

In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.

Customer sequence - the itemset list of a customer's transactions ordered by increasing transaction time: <itemset(T1) itemset(T2) ... itemset(Tn)>
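The containment test defined above is easy to express in code; a greedy left-to-right scan suffices. This is a sketch, and the function name is mine:

```python
def contains(big, small):
    """True if sequence `small` (a list of itemsets) is contained in `big`:
    each element of `small` must be a subset of an element of `big` at a
    strictly increasing position, found by a greedy left-to-right scan."""
    j = 0
    for elem in small:
        while j < len(big) and not set(elem) <= set(big[j]):
            j += 1
        if j == len(big):
            return False
        j += 1  # subsequent elements must match strictly later positions
    return True

seq = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(contains(seq, [{3}, {4, 5}, {8}]))   # True
print(contains([{3, 5}], [{3}, {5}]))      # False
print(contains([{3}, {5}], [{3, 5}]))      # False
```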

Page 55: Datamining Methods Mining Association Rules and Sequential Patterns


Problem Statement (3)

A customer supports a sequence s if s is contained in the customer sequence for this customer.

The support for a sequence is defined as the fraction of total customers who support this sequence.

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such sequence represents a sequential pattern.

We call a sequence satisfying the minimum support constraint a large sequence.

See the next example.

Page 56: Datamining Methods Mining Association Rules and Sequential Patterns


Example

Database sorted by customer Id and transaction time:

Customer Id   Transaction Time   Items Bought
1             June 25 '00        30
1             June 30 '00        90
2             June 10 '00        10, 20
2             June 15 '00        30
2             June 20 '00        40, 60, 70
3             June 25 '00        30, 50, 70
4             June 25 '00        30
4             June 30 '00        40, 70
4             July 25 '00        90
5             June 12 '00        90

Customer-sequence version of the database:

Customer Id   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Page 57: Datamining Methods Mining Association Rules and Sequential Patterns


Example (2)

With the minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, <(30) (90)> and <(30) (40 70)>, are maximal among those satisfying the support constraint, and are the desired sequential patterns. <(30) (90)> is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern <(30) (90)>, since we are looking for patterns that are not necessarily contiguous. <(30) (40 70)> is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern since (40 70) is a subset of (40 60 70).

E.g., the sequence <(10 20) (30)> does not have minimum support; it is supported only by customer 2. The sequences <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)>, and <(40 70)> have minimum support, but they are not maximal; therefore, they are not in the answer.
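These support counts can be checked against the customer-sequence database. A sketch, assuming the greedy containment test from earlier (all names are mine):

```python
# Verifying sequence support over the example customer-sequence database.

def contains(big, small):
    """True if sequence `small` (a list of itemsets) is contained in `big`."""
    j = 0
    for elem in small:
        while j < len(big) and not set(elem) <= set(big[j]):
            j += 1
        if j == len(big):
            return False
        j += 1
    return True

customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def seq_support(pattern):
    """Number of customers whose sequence contains `pattern`."""
    return sum(contains(seq, pattern) for seq in customer_sequences.values())

print(seq_support([{30}, {90}]))      # 2  (customers 1 and 4)
print(seq_support([{30}, {40, 70}]))  # 2  (customers 2 and 4)
print(seq_support([{10, 20}, {30}]))  # 1  (customer 2 only)
```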