Download pptx - Data Science and Big Data Analytics Chap 5: Adv Analytical Theory and Methods: Association Rules Charles Tappert Seidenberg School of CSIS, Pace University

Data Science and Big Data Analytics

Chap 5: Adv Analytical Theory and Methods: Association Rules

Charles TappertSeidenberg School of CSIS, Pace

University

Chapter Sections

5.1 Overview 5.2 Apriori Algorithm 5.3 Evaluation of Candidate Rules 5.4 Example: Transactions in a Grocery

Store 5.5 Validation and Testing 5.6 Diagnostics

5.1 Overview

Association rules method Unsupervised learning method Descriptive (not predictive) method Used to find hidden relationships in data The relationships are represented as rules

Questions association rules might answer Which products tend to be purchased

together What products do similar customers tend

to buy

5.1 Overview

Example – general logic of association rules

5.1 Overview

Rules have the form X -> Y When X is observed, Y is also observed

Itemset Collection of items or entities k-itemset = {item 1, item 2,…,item k} Examples

Items purchased in one transaction Set of hyperlinks clicked by a user in one

session

5.1 Overview – Apriori Algorithm

Apriori is the most fundamental algorithm Given itemset L, support of L is the percent of

transactions that contain L Frequent itemset – items appear together

“often enough” Minimum support defines “often enough” (%

transactions) If an itemset is frequent, then any subset is

frequent

5.1 Overview – Apriori Algorithm

If {B,C,D} frequent, then all subsets frequent

5.2 Apriori AlgorithmFrequent = minimum

support Bottom-up iterative algorithm Identify the frequent (min support) 1-

itemsets Frequent 1-itemsets are paired into 2-

itemsets, and the frequent 2-itemsets are identified, etc.

Definitions for next slide D = transaction database d = minimum support threshold N = maximum length of itemset (optional

parameter) Ck = set of candidate k-itemsets Lk = set of k-itemsets with minimum support

5.2 Apriori Algorithm

5.3 Evaluation of Candidate Rules

Confidence Frequent itemsets can form candidate

rules Confidence measures the certainty of a

rule

Minimum confidence – predefined threshold

Problem with confidence Given a rule X->Y, confidence considers only

the antecedent (X) and the co-occurrence of X and Y

Cannot tell if a rule contains true implication

5.3 Evaluation of Candidate RulesLift

Lift measures how much more often X and Y occur together than expected if statistically independent

Lift = 1 if X and Y are statistically independent Lift > 1 indicates the degree of usefulness of

the rule Example – in 1000 transactions,

If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5

If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0

5.3 Evaluation of Candidate Rules

Leverage Leverage measures the difference in the

probability of X and Y appearing together compared to statistical independence

Leverage = 0 if X and Y are statistically independent

Leverage > 0 indicates degree of usefulness of rule

Example – in 1000 transactions, If {milk, eggs} appears in 300, {milk} in 500, and

{eggs} in 400, then Leverage(milk->eggs) = 0.3 - 0.5*0.4 = 0.1

If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage (milk->bread) = 0.4 - 0.5*0.4 = 0.2

5.4 Applications of Association Rules

The term market basket analysis refers to a specific implementation of association rules For better merchandising – products to

include/exclude from inventory each month Placement of products within related products

Association rules also used for Recommender systems – Amazon, Netflix Clickstream analysis from web usage log files

Website visitors to page X click on links A,B,C more than on links D,E,F

5.5 Example: Grocery Store Transactions

5.5.1 The Groceries Dataset

Packages -> Install -> arules, arulesViz # don’t enter next line> install.packages(c("arules", "arulesViz")) # appears on console> library('arules')> library('arulesViz')> data(Groceries)> summary(Groceries) # indicates 9835 rows

Class of dataset Groceries is transactions, containing 3 slots 1. transactionInfo # data frame with vectors having length of transactions

2. itemInfo # data frame storing item labels

3. data # binary evidence matrix of labels in transactions

> Groceries@itemInfo[1:10,]> apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", "))

>


5.5.2 Frequent Itemset GenerationTo illustrate the Apriori algorithm, the code below does each iteration separately.

Assume minimum support threshold = 0.02 (0.02 * 9853 = 198 items), get 122 itemsets total

First, get itemsets of length 1> itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets"))> summary(itemsets) # found 59 itemsets> inspect(head(sort(itemsets,by="support"),10)) # lists top 10

Second, get itemsets of length 2> itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets"))> summary(itemsets) # found 61 itemsets> inspect(head(sort(itemsets,by="support"),10)) # lists top 10

Third, get itemsets of length 3> itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets"))> summary(itemsets) # found 2 itemsets> inspect(head(sort(itemsets,by="support"),10)) # lists top 10

> summary(itemsets) # found 59 itemsets> inspect(head(sort(itemsets,by="support"),10)) # lists top 10 supported items


5.5.3 Rule Generation and Visualization

The Apriori algorithm will now generate rules.

Set minimum support threshold to 0.001 (allows more rules, presumably for the scatterplot) and minimum confidence threshold to 0.6 to generate 2,918 rules.

> rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules"))> summary(rules) # finds 2918 rules> plot(rules) # displays scatterplot

The scatterplot shows that the highest lift occurs at a low support and a low confidence.



> plot(rules)> plot(rules)



Get scatterplot matrix to compare the support, confidence, and lift of the 2918 rules

> plot(rules@quality) # displays scatterplot matrix

Lift is proportional to confidence with several linear groupings.Note that Lift = Confidence/Support(Y), so when support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).






Compute the 1/Support(Y) which is the slope

> slope<-sort(round(rules@quality$lift/rules@quality$confidence,2))

Display the number of times each slope appears in dataset

> unlist(lapply(split(slope,f=slope),length))

Display the top 10 rules sorted by lift

> inspect(head(sort(rules,by="lift"),10))

Rule {Instant food products, soda} -> {hamburger meat} has the highest lift of 19 (page 154)



Find the rules with confidence above 0.9

> confidentRules<-rules[quality(rules)$confidence>0.9] > confidentRules # set of 127 rules

Plot a matrix-based visualization of the LHS v RHS of rules> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE))

The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds






Visualize the top 5 rules with the highest lift.

> highLiftRules<-head(sort(rules,by="lift"),5) > plot(highLiftRules,method="graph",control=list(type="items"))

In the graph, the arrow always points from an item on the LHS to an item on the RHS.

For example, the arrows that connects ham, processed cheese, and white bread suggest the rule

{ham, processed cheese} -> {white bread}

Size of circle indicates support and shade represents lift



5.6 Validation and Testing

The frequent and high confidence itemsets are found by pre-specified minimum support and minimum confidence levels

Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones

However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions

E.g., rules like {paper} -> {pencil} are not interesting/meaningful

Incorporating subjective knowledge requires domain experts

Good rules provide valuable insights for institutions to improve their business operations

5.7 Diagnostics

Although minimum support is pre-specified in phases 3&4, this level can be adjusted to target the range of the number of rules – variants/improvements of Apriori are available

For large datasets the Apriori algorithm can be computationally expensive – efficiency improvements

Partitioning

Sampling

Transaction reduction

Hash-based itemset counting

Dynamic itemset counting