Data Science and Big Data Analytics
Chapter 5: Advanced Analytical Theory and Methods: Association Rules
Charles Tappert
Seidenberg School of CSIS, Pace University
Chapter Sections
5.1 Overview
5.2 Apriori Algorithm
5.3 Evaluation of Candidate Rules
5.4 Applications of Association Rules
5.5 Example: Transactions in a Grocery Store
5.6 Validation and Testing
5.7 Diagnostics
5.1 Overview
Association rules method
  Unsupervised learning method
  Descriptive (not predictive) method
  Used to find hidden relationships in data
  The relationships are represented as rules
Questions association rules might answer
  Which products tend to be purchased together?
  What products do similar customers tend to buy?
5.1 Overview
Example – general logic of association rules
5.1 Overview
Rules have the form X -> Y
  When X is observed, Y is also observed
Itemset
  Collection of items or entities
  k-itemset = {item 1, item 2,…,item k}
  Examples
    Items purchased in one transaction
    Set of hyperlinks clicked by a user in one session
5.1 Overview – Apriori Algorithm
Apriori is one of the earliest and most fundamental algorithms for generating association rules
Given an itemset L, the support of L is the percentage of transactions that contain L
Frequent itemset: a set of items that appear together “often enough”
Minimum support defines “often enough” (as a percentage of transactions)
Apriori property: if an itemset is frequent, then any subset of it is also frequent
5.1 Overview – Apriori Algorithm
If {B,C,D} is frequent, then all of its subsets are frequent
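As a minimal sketch of the support calculation and the subset property above, the base R snippet below uses a hypothetical five-transaction database. The item names, the `support` helper, and the 0.5 threshold are illustrative assumptions and are not part of the Groceries example.

```r
# Hypothetical toy transaction database (illustrative only)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# Support = fraction of transactions that contain every item of the itemset
support <- function(itemset, db) {
  mean(vapply(db, function(tx) all(itemset %in% tx), logical(1)))
}

min_support <- 0.5  # assumed threshold for this sketch

s_pair  <- support(c("bread", "milk"), transactions)
s_bread <- support("bread", transactions)
s_milk  <- support("milk", transactions)

# The pair is frequent, and so is each of its subsets
is_frequent <- c(pair = s_pair, bread = s_bread, milk = s_milk) >= min_support
print(is_frequent)
```

Every transaction that contains {bread, milk} also contains {bread} and {milk}, so a subset's support can never be lower than the support of the full itemset.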
5.2 Apriori Algorithm
Frequent = meets minimum support
Bottom-up iterative algorithm
  Identify the frequent (min support) 1-itemsets
  Frequent 1-itemsets are paired into 2-itemsets, and the frequent 2-itemsets are identified, etc.
Definitions for next slide
  D = transaction database
  d = minimum support threshold
  N = maximum length of itemset (optional parameter)
  Ck = set of candidate k-itemsets
  Lk = set of k-itemsets with minimum support
5.2 Apriori Algorithm
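The bottom-up iteration can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the arules implementation: the toy database `D`, the function name `apriori_sketch`, and the helpers `support` and `is_member` are hypothetical, with `min_support` standing in for d and `max_len` for N.

```r
# Hypothetical toy database D (illustrative only)
D <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C", "D")
)

# Fraction of transactions that contain every item of the itemset
support <- function(itemset, db) {
  mean(vapply(db, function(tx) all(itemset %in% tx), logical(1)))
}

# TRUE if itemset s equals one of the itemsets in `sets`
is_member <- function(s, sets) {
  any(vapply(sets, function(f) setequal(f, s), logical(1)))
}

apriori_sketch <- function(db, min_support, max_len = Inf) {
  # L1: frequent 1-itemsets
  items <- sort(unique(unlist(db)))
  Lk <- Filter(function(s) support(s, db) >= min_support, as.list(items))
  frequent <- list()
  k <- 1
  while (length(Lk) > 0 && k <= max_len) {
    frequent <- c(frequent, Lk)
    # Join step: merge pairs of frequent k-itemsets into (k+1)-item candidates
    C_next <- list()
    if (length(Lk) >= 2) {
      for (pair in combn(length(Lk), 2, simplify = FALSE)) {
        u <- sort(union(Lk[[pair[1]]], Lk[[pair[2]]]))
        if (length(u) == k + 1) C_next <- c(C_next, list(u))
      }
    }
    C_next <- unique(C_next)
    # Prune step: every k-item subset of a candidate must itself be frequent
    C_next <- Filter(function(cand) {
      all(vapply(seq_along(cand),
                 function(i) is_member(cand[-i], Lk), logical(1)))
    }, C_next)
    # Keep only the candidates that meet minimum support
    Lk <- Filter(function(s) support(s, db) >= min_support, C_next)
    k <- k + 1
  }
  frequent
}

freq <- apriori_sketch(D, min_support = 0.5)
for (s in freq) cat("{", paste(s, collapse = ", "), "}\n")
```

The prune step is where the Apriori property pays off: a candidate with any infrequent subset is discarded before its support is ever counted against the database.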
5.3 Evaluation of Candidate Rules
Confidence
  Frequent itemsets can form candidate rules
  Confidence measures the certainty of a rule: Confidence(X -> Y) = Support(X ∧ Y) / Support(X)
  Minimum confidence: a predefined threshold that a rule must meet
Problem with confidence
  Given a rule X -> Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y
  It does not consider the consequent (Y) on its own, so it cannot tell whether a rule contains a true implication
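A minimal sketch of the confidence calculation in base R, borrowing the milk and eggs counts from the lift example in this section. The variable names and the 0.5 minimum confidence are assumptions made for illustration.

```r
n           <- 1000  # total transactions
n_milk      <- 500   # transactions containing {milk}
n_milk_eggs <- 300   # transactions containing {milk, eggs}

support_x  <- n_milk / n
support_xy <- n_milk_eggs / n

# Confidence(milk -> eggs) = Support(milk and eggs) / Support(milk)
confidence_milk_eggs <- support_xy / support_x

min_confidence <- 0.5  # hypothetical threshold
passes <- confidence_milk_eggs >= min_confidence
print(c(confidence = confidence_milk_eggs, passes = passes))
```

Note that the number of transactions containing eggs never enters the calculation, which is the weakness that lift and leverage address.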
5.3 Evaluation of Candidate Rules
Lift
  Lift measures how many times more often X and Y occur together than expected if they were statistically independent
  Lift(X -> Y) = Support(X ∧ Y) / (Support(X) * Support(Y))
  Lift = 1 if X and Y are statistically independent
  Lift > 1 indicates the degree of usefulness of the rule
  Example: in 1000 transactions,
    If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk -> eggs) = 0.3/(0.5*0.4) = 1.5
    If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk -> bread) = 0.4/(0.5*0.4) = 2.0
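The two lift values can be reproduced with a short base R sketch. The `lift` helper and its argument names are hypothetical; the counts are the ones given in the example.

```r
# Lift from raw counts: n_xy = transactions with X and Y, n_x with X,
# n_y with Y, n = total number of transactions
lift <- function(n_xy, n_x, n_y, n) {
  (n_xy / n) / ((n_x / n) * (n_y / n))
}

lift_milk_eggs  <- lift(n_xy = 300, n_x = 500, n_y = 400, n = 1000)
lift_milk_bread <- lift(n_xy = 400, n_x = 500, n_y = 400, n = 1000)

print(c(milk_eggs = lift_milk_eggs, milk_bread = lift_milk_bread))
```

Both values exceed 1, and the larger lift for milk -> bread suggests the stronger association of the two rules.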
5.3 Evaluation of Candidate Rules
Leverage
  Leverage measures the difference between the probability of X and Y appearing together and the probability expected if they were statistically independent
  Leverage(X -> Y) = Support(X ∧ Y) - Support(X) * Support(Y)
  Leverage = 0 if X and Y are statistically independent
  Leverage > 0 indicates the degree of usefulness of the rule
  Example: in 1000 transactions,
    If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk -> eggs) = 0.3 - 0.5*0.4 = 0.1
    If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk -> bread) = 0.4 - 0.5*0.4 = 0.2
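A matching base R sketch for leverage, again using the counts from the example. The `leverage` helper is a hypothetical name introduced for illustration.

```r
# Leverage from raw counts: observed joint support minus the joint support
# expected under statistical independence
leverage <- function(n_xy, n_x, n_y, n) {
  (n_xy / n) - (n_x / n) * (n_y / n)
}

leverage_milk_eggs  <- leverage(n_xy = 300, n_x = 500, n_y = 400, n = 1000)
leverage_milk_bread <- leverage(n_xy = 400, n_x = 500, n_y = 400, n = 1000)

print(c(milk_eggs = leverage_milk_eggs, milk_bread = leverage_milk_bread))
```

Lift is a ratio and leverage is a difference, so the two measures rank these rules the same way here but are reported on different scales.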
5.4 Applications of Association Rules
The term market basket analysis refers to a specific implementation of association rules
  For better merchandising: products to include in or exclude from inventory each month
  Placement of products within related product categories
Association rules are also used for
  Recommender systems: Amazon, Netflix
  Clickstream analysis from web usage log files
    For example, website visitors to page X click on links A, B, C more often than on links D, E, F
5.5 Example: Grocery Store Transactions
5.5.1 The Groceries Dataset
Install the packages via the menu: Packages -> Install -> arules, arulesViz
The next line then appears on the console; do not enter it yourself
> install.packages(c("arules", "arulesViz"))
> library('arules')
> library('arulesViz')
> data(Groceries)
> summary(Groceries) # indicates 9835 rows
The class of the Groceries dataset is transactions, containing 3 slots
  1. transactionInfo # data frame with vectors having the length of the transactions
  2. itemInfo # data frame storing item labels
  3. data # binary incidence matrix of labels in transactions
> Groceries@itemInfo[1:10,] # first 10 items
> apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", ")) # items in transactions 10 to 20
5.5 Example: Grocery Store Transactions
5.5.2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below runs each iteration separately.
Assume a minimum support threshold of 0.02, so an itemset must appear in at least 197 of the 9835 transactions (0.02 * 9835 = 196.7). This yields 122 frequent itemsets in total.
First, get itemsets of length 1
> itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 59 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
Second, get itemsets of length 2
> itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 61 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
Third, get itemsets of length 3
> itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 2 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10
5.5 Example: Grocery Store Transactions
5.5.3 Rule Generation and Visualization
The Apriori algorithm will now generate rules.
Set minimum support threshold to 0.001 (allows more rules, presumably for the scatterplot) and minimum confidence threshold to 0.6 to generate 2,918 rules.
> rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules"))
> summary(rules) # finds 2918 rules
> plot(rules) # displays scatterplot
The scatterplot shows that the highest lift occurs at a low support and a low confidence.
5.5 Example: Grocery Store Transactions
5.5.3 Rule Generation and Visualization
Get scatterplot matrix to compare the support, confidence, and lift of the 2918 rules
> plot(rules@quality) # displays scatterplot matrix
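Lift is proportional to confidence, with several linear groupings.
Since Lift = Support(X ∧ Y)/(Support(X) * Support(Y)) and Confidence = Support(X ∧ Y)/Support(X), it follows that Lift = Confidence/Support(Y). For rules that share the same Y, Support(Y) is constant, so lift is proportional to confidence and the slope of each linear trend is the reciprocal of Support(Y).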
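The relationship can be checked numerically with the milk and eggs counts from Section 5.3. This base R sketch uses hypothetical variable names; it is only a consistency check, not part of the Groceries analysis.

```r
n <- 1000
support_x  <- 500 / n  # {milk}
support_y  <- 400 / n  # {eggs}
support_xy <- 300 / n  # {milk, eggs}

confidence <- support_xy / support_x
lift_direct          <- support_xy / (support_x * support_y)
lift_from_confidence <- confidence / support_y

# Slope of the lift-versus-confidence line for rules with this consequent
slope <- 1 / support_y

print(c(direct = lift_direct, via_confidence = lift_from_confidence, slope = slope))
```

Each distinct consequent Y contributes its own slope, which is why the scatterplot matrix shows several separate linear groupings.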
5.5 Example: Grocery Store Transactions
5.5.3 Rule Generation and Visualization
Compute 1/Support(Y), which is the slope of each linear grouping
> slope<-sort(round(rules@quality$lift/rules@quality$confidence,2))
Display the number of times each slope appears among the rules
> unlist(lapply(split(slope,f=slope),length))
Display the top 10 rules sorted by lift
> inspect(head(sort(rules,by="lift"),10))
Rule {Instant food products, soda} -> {hamburger meat} has the highest lift of 19 (page 154)
5.5 Example: Grocery Store Transactions
5.5.3 Rule Generation and Visualization
Find the rules with confidence above 0.9
> confidentRules<-rules[quality(rules)$confidence>0.9]
> confidentRules # set of 127 rules
Plot a matrix-based visualization of the LHS vs. RHS of the rules
> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE))
The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds
5.5 Example: Grocery Store Transactions
5.5.3 Rule Generation and Visualization
Visualize the top 5 rules with the highest lift.
> highLiftRules<-head(sort(rules,by="lift"),5)
> plot(highLiftRules,method="graph",control=list(type="items"))
In the graph, the arrow always points from an item on the LHS to an item on the RHS.
For example, the arrows that connect ham, processed cheese, and white bread suggest the rule
{ham, processed cheese} -> {white bread}
The size of each circle indicates support and its shade represents lift
5.6 Validation and Testing
Frequent itemsets and high-confidence rules are found using pre-specified minimum support and minimum confidence levels
Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones
However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions
E.g., rules like {paper} -> {pencil} are not interesting/meaningful
Incorporating subjective knowledge requires domain experts
Good rules provide valuable insights for institutions to improve their business operations
5.7 Diagnostics
Although minimum support is pre-specified in phases 3 and 4 of the Data Analytics Lifecycle, this level can be adjusted to target the desired range for the number of rules
Variants and improvements of Apriori are available
For large datasets the Apriori algorithm can be computationally expensive; approaches that improve efficiency include:
Partitioning
Sampling
Transaction reduction
Hash-based itemset counting
Dynamic itemset counting
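To make one of these ideas concrete, here is a minimal base R sketch of transaction reduction. It assumes the standard formulation: a transaction that contains no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be skipped in later passes. The toy database `D`, the assumed frequent 2-itemsets `L2`, and the helper `contains_any` are all hypothetical.

```r
# Hypothetical toy database (illustrative only)
D <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("D"),
  c("B", "C"),
  c("A", "D")
)

# Assumed result of the k = 2 pass for this sketch
L2 <- list(c("A", "B"), c("B", "C"))

# TRUE if the transaction contains at least one of the given itemsets
contains_any <- function(tx, itemsets) {
  any(vapply(itemsets, function(s) all(s %in% tx), logical(1)))
}

# Keep only transactions that can still contribute to frequent 3-itemsets
D_reduced <- Filter(function(tx) contains_any(tx, L2), D)

cat("Transactions before:", length(D), "after:", length(D_reduced), "\n")
```

Later passes then scan `D_reduced` instead of `D`, which reduces the work per iteration without changing which itemsets are found to be frequent.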