Market Basket Analysis

MARKET BASKET ANALYSIS

LEARNING OBJECTIVES:

• EXPLAIN WHAT ASSOCIATION RULES AND ITEM SETS ARE.

• DESCRIBE THE BASIC PROCESS FOR MARKET BASKET ANALYSIS.

• KNOW WHEN TO APPLY MARKET BASKET ANALYSIS.

• UNDERSTAND THE STRENGTHS AND WEAKNESSES OF MARKET BASKET ANALYSIS.

By Obakeng Brian Pheelwane & Marc Berman – Group 14

ASSOCIATION RULES

• Association discovery aims to find items whose presence implies the presence of other items in the same transaction.

• Association rules are in the form:

If <Left-Hand Side (LHS)> then <Right-Hand Side (RHS)>

• To indicate the validity and importance of a rule, each rule has two parameters:

• Support Factor

• Confidence Factor.

• RHS usually has one item; LHS has one or more items.

EXAMPLES OF ASSOCIATION RULES

In a database of transactions involving two items (X, Y) in a department store, an example association rule is:

If X then Y

That is, 70% of the customers who buy X also buy Y, and X and Y are bought together in 20% of all purchases made at the department store.

Therefore, the rule has a 70% confidence factor and a 20% support factor.

CONFIDENCE AND SUPPORT FACTORS

• Assume we have an association rule indicated as LHS -> RHS.

• T is the total number of cases in the database.

• X is the number of cases covered by the LHS.

• Y is the number of cases covered by the RHS.

• X ∩ Y is the number of cases covered by both the LHS and the RHS, indicated by the overlapping area in Figure 1.

Figure 1: Confidence and support factors visualised

Figure 2: Confidence and support factor formula

• The confidence factor is the number of cases covered by both the LHS and the RHS, divided by the number of cases covered by the LHS.

• The support factor is the number of cases covered by both the LHS and the RHS, divided by the total number of cases in the database.
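
The two definitions above amount to confidence = (cases covered by both sides) / X and support = (cases covered by both sides) / T. A minimal Python sketch, assuming each transaction is stored as a set of item names; the function name `support_and_confidence` and the toy baskets are illustrative and do not come from the slides:

```python
def support_and_confidence(transactions, lhs, rhs):
    """Return (support, confidence) for the rule: If lhs then rhs."""
    lhs, rhs = set(lhs), set(rhs)
    total = len(transactions)                                       # T
    lhs_count = sum(1 for t in transactions if lhs <= t)            # X
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)   # both LHS and RHS
    support = both_count / total if total else 0.0
    confidence = both_count / lhs_count if lhs_count else 0.0
    return support, confidence


# Hypothetical toy data: five baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence = support_and_confidence(transactions, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here bread appears in four of the five baskets and bread with milk in three, so the rule "If bread then milk" has support 3/5 and confidence 3/4.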

THE BASIC PROCESS OF MARKET BASKET ANALYSIS

1. Choosing the right item set.

• The objective is to define a set of items such that, when association rules are formed among them, some of the rules have a meaningful interpretation and may prove useful.

• Several methods for generating the right item sets:

• Use a taxonomy to pick the right level of detail, ranging from general items to specific items (see Figure 3)

• Use virtual items (see Figures 4 & 5)

• A combination of both

• The taxonomy and virtual items (to be prepared by the users or domain expert) become the means to assist users to choose the right item set during the exploration to find useful rules.

Figure 3: Taxonomies start with the most general and move to increasing detail.
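
Choosing a taxonomy level can be pictured as rolling specific items up to their parent categories before rules are generated. The sketch below assumes a one-level taxonomy stored as a dictionary; the product names, the `TAXONOMY` mapping, and the `generalise` helper are all hypothetical:

```python
# Hypothetical taxonomy: specific product -> more general category.
TAXONOMY = {
    "cola 330ml": "soft drink",
    "lemonade 1l": "soft drink",
    "cheddar": "cheese",
    "brie": "cheese",
}


def generalise(transaction, taxonomy, keep_specific=frozenset()):
    """Replace each item by its parent category unless it is in keep_specific.

    Items without a taxonomy entry are left unchanged.
    """
    return {
        item if item in keep_specific else taxonomy.get(item, item)
        for item in transaction
    }


basket = {"cola 330ml", "brie", "cheddar", "newspaper"}

# Everything rolled up to the general level.
print(sorted(generalise(basket, TAXONOMY)))

# Keep one item of interest at the detailed level.
print(sorted(generalise(basket, TAXONOMY, keep_specific={"brie"})))
```

The `keep_specific` argument reflects the idea of mixing levels: most items are analysed at the general level while a few items of interest stay detailed.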

Figure 4: This is an example of poor choice of virtual items since the rules are likely to be redundant.

The problem with this choice is that the rules merely repeat the definition of the virtual items.

Figure 5: This is an example of a good choice of virtual items, though one must be careful not to totally encompass the items used for analysis, as this would create redundancy again.

BASIC PROCESS CONTINUED

2. Generating rules:

• The rule generation process involves generating the co-occurrence matrix, counting the frequencies of co-occurrence between n items in the item set.

• To generate a rule of n items of the form:

If X1, X2,…,X(n-1) Then Xn

A co-occurrence matrix of n items is required.
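
Counting co-occurrences can be sketched as follows. Rather than a dense n-dimensional matrix, this version assumes a sparse representation (a `Counter` keyed by sorted item tuples), which holds the same counts; the function name and the toy baskets are illustrative only:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(transactions, n=2):
    """Count how often each combination of n items appears in the same transaction."""
    counts = Counter()
    for transaction in transactions:
        # Sorting gives each combination a single canonical key.
        for combo in combinations(sorted(transaction), n):
            counts[combo] += 1
    return counts


# Hypothetical toy data: four baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

pairs = cooccurrence_counts(transactions, n=2)
for combo, count in sorted(pairs.items()):
    print(combo, count)
```

Calling the same function with `n=3` yields the counts needed for rules with two items on the LHS and one on the RHS.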

Number of items on LHS    Total number of combinations
1                         100
2                         4,950
3                         161,700
4                         3,921,225
5                         75,287,520
6                         1,192,052,400
7                         16,007,560,800
8                         186,087,894,300

Figure 6: Rule generation is a computationally expensive process, especially when a large data set is present.
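
The counts in the table above match the number of ways to choose k items from a set of 100, so they can be reproduced with the binomial coefficient. The item-set size of 100 is an inference from the first row, not something the slides state explicitly:

```python
from math import comb

N_ITEMS = 100  # assumption: inferred from the first row of the table


def combination_counts(n_items, max_lhs):
    """Map each LHS size k to the number of k-item combinations of n_items."""
    return {k: comb(n_items, k) for k in range(1, max_lhs + 1)}


counts = combination_counts(N_ITEMS, 8)
for k, total in counts.items():
    print(f"{k}  {total:,}")
```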

BASIC PROCESS CONTINUED

3. Identifying useful rules that are unknown, valid and actionable.

• First, specify threshold values for the confidence factor and support factor, so that the rule generation algorithm automatically filters out rules that are not supported by the data.

• Second, human judgement is required to assess the interestingness, validity and actionability of the rules that have passed through the automatic filter.
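
The automatic part of this step can be sketched as a simple threshold filter. The `Rule` record, the candidate rules, their factors, and the threshold values below are all made up for illustration; the human review described in the second bullet is not something code can replace:

```python
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: frozenset
    rhs: str
    support: float
    confidence: float


def filter_rules(rules, min_support, min_confidence):
    """Keep only the rules that meet both thresholds."""
    return [
        rule for rule in rules
        if rule.support >= min_support and rule.confidence >= min_confidence
    ]


# Hypothetical candidate rules with made-up factors.
candidates = [
    Rule(frozenset({"bread"}), "milk", support=0.20, confidence=0.70),
    Rule(frozenset({"bread", "jam"}), "butter", support=0.01, confidence=0.90),
    Rule(frozenset({"milk"}), "newspaper", support=0.15, confidence=0.10),
]

kept = filter_rules(candidates, min_support=0.05, min_confidence=0.50)
for rule in kept:
    print(sorted(rule.lhs), "->", rule.rhs)
```

With these illustrative thresholds, the second rule is dropped for low support and the third for low confidence, leaving only the first for human review.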

WHEN TO APPLY MARKET BASKET ANALYSIS

• Problems that consist of well-defined items that group together in potentially interesting ways.

• Time-series problems that can be adapted for market basket analysis by relatively simple data transformations.
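
The slides do not say which transformation is meant. One possible option, shown here purely as an assumption, is to group timestamped events into fixed time windows per customer and treat each window as a basket; the event tuples, the window length, and the helper name are hypothetical:

```python
from collections import defaultdict


def events_to_baskets(events, window):
    """Group (customer, time, item) events into one basket per customer and time window."""
    baskets = defaultdict(set)
    for customer, time, item in events:
        # Integer division assigns each event to a fixed-length window.
        baskets[(customer, time // window)].add(item)
    return dict(baskets)


# Hypothetical events: (customer id, day number, item purchased).
events = [
    ("c1", 1, "printer"),
    ("c1", 5, "ink"),
    ("c1", 40, "paper"),
    ("c2", 3, "printer"),
]

baskets = events_to_baskets(events, window=30)
for key, items in sorted(baskets.items()):
    print(key, sorted(items))
```

The resulting baskets can then be fed to the same rule generation process as ordinary transactions.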

STRENGTHS AND WEAKNESSES

Strengths:

• Clear and understandable results

• Supports undirected data mining

• Works on variable-length data

• Simple computational process

Weaknesses:

• Computation increases exponentially as the problem size grows.

• Limited support for attributes on the data.

• Difficult to determine the right number of items.

• It discounts rare items.

DISSOCIATION RULES

• Dissociation rules are similar to association rules, except that a negation ("NOT") can be applied to an item. An example of a dissociation rule is:

• If X and not Y then Z.

Problems with dissociation rules:

1. Doubling the number of items significantly increases the runtime.

2. The size of each transaction grows because it includes inverted items.

3. The process tends to produce rules in which all items are inverted, because the frequency of the inverted items is usually much larger.
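
The first two problems can be seen in a small sketch that adds an inverted item for every item missing from a transaction. The "not <item>" naming scheme, the helper name, and the toy data are assumptions made for illustration:

```python
def add_inverted_items(transactions, items):
    """Add "not <item>" to each transaction for every listed item it lacks."""
    return [
        set(transaction) | {f"not {item}" for item in items if item not in transaction}
        for transaction in transactions
    ]


items = ["X", "Y", "Z"]
transactions = [{"X", "Z"}, {"Y"}, {"X", "Y", "Z"}]

expanded = add_inverted_items(transactions, items)
for before, after in zip(transactions, expanded):
    print(sorted(before), "->", sorted(after))
```

After the transformation the vocabulary holds up to twice as many items, and every transaction contains one entry per original item, whether it was bought or not.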

WHAT WE HAVE LEARNED:

• We have learned about association and dissociation rules.

• How to generate more specialised items using taxonomy and virtual items.

• When to apply Market Basket Analysis

• Finally, the strengths and weaknesses of Market Basket Analysis

REVIEW QUESTIONS

1. Discuss the similarities and differences between a decision rule and an association rule in terms of rule structure and how it is used.

Decision rule (Separate-and-conquer)

Decision rules are closely related to decision trees: the terminal nodes of a tree can be grouped into rules. Decision rule learning attempts to find a partial solution for one part of the problem at a time, looking for the optimal solution to the problem.

How it is used:

- One partial solution in each step

Association rule

An association rule does not have a target. Association rule mining finds all rules that exist in the data: it attempts to find the full set of solutions to a problem, looking for the optimal solution to the problem.

How it is used:

- Multiple combinations in each step

2. Discuss the due caution one should have when applying association rules. Relate your explanation to the definition of data mining: Data mining is a process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial decisions.

Due caution with association mining

A huge number of association rules can be generated from a data set.

Before applying the rules, one should examine them in terms of validity, interestingness and actionability.

First, one can filter out the majority of the rules that are not supported by the data. This can be done automatically by setting threshold values for the confidence and support factors. This is one aspect of the validity check, and it needs minimal human involvement.

The focus is then on avoiding trivial rules and inexplicable rules that have passed through the automatic filter. These rules either reproduce common knowledge or result from flukes in the data. They are therefore not actionable.

REVIEW QUESTIONS

3. Compare the model selection process in predictive modelling with the similar process in market basket analysis. Answer the following questions in your comparison:

i. What is a model?

A model in a predictive modelling task is one built to make predictions for unseen data.

E.g. the trained model is used to make a positive or negative diagnosis about a disease for a new patient.

A model in market basket analysis takes the form of a set of rules that describe the associations between attributes; the rules are not meant for prediction.

ii. How do the model selection processes differ?

The model selection process in predictive modelling is guided by maximising a measure determined during the problem definition step, so it can be carried out objectively.

The model selection process in the market basket analysis is more subjective, although a few measures can be used to reduce the set of candidate rules.
