Pattern Mining
Knowledge Discovery and Data Mining 2
Roman Kern and Heimo Gursch
ISDS, TU Graz
2019-10-31
Roman Kern and Heimo Gursch (ISDS, TU Graz) Pattern Mining 2019-10-31 1 / 44
Outline
1 Introduction
2 Apriori Algorithm
3 FP-Growth Algorithm
4 Sequence Pattern Mining
5 Alternatives & Remarks
6 Tools
Introduction to Pattern Mining
What & Why
Pattern Mining Intro
Definition of KDD
KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
→ But so far we have not directly looked at how to mine these patterns.
Pattern Mining Example
Famous example for pattern mining
Many supermarket/grocery chains have loyalty cards
→ gives the store the ability to analyse purchase behaviour (and increase sales, for marketing, etc.)
Benefits
- Which items have been bought together
- Hidden relationships: probability of purchasing an item given another item is already part of the purchase
- … or temporal patterns
Pattern Mining Example
Transactions in a grocery
All purchases are organised in transactions
Each transaction contains a list of purchased items
For example:

Id  Items
0   Bread, Milk
1   Bread, Diapers, Beer, Eggs
2   Milk, Diapers, Beer, Coke
3   Bread, Milk, Diapers, Beer
4   Bread, Milk, Diapers, Coke
Pattern Mining Example
Ways of analysing the transaction data
Frequent itemsets
- Which items are often bought together?
- → association analysis, also called itemset mining
Frequent associations
- If one itemset occurs, which itemset will most probably occur alongside it?
- Find "if A then B" associations
- → association rule mining
Pattern Mining Example
Result
A grocery store chain in the Midwest of the US found that on Thursdays men tend to buy beer if they bought diapers as well.

Association Rule
{diapers → beer}

Note #1: The grocery store could make use of this knowledge to place diapers and beer close to each other and to make sure neither of these products has a discount on Thursdays.
Note #2: Not clear whether this is actually true.
Pattern Mining Basics
Formal description: items and transactions
I = {I1, I2, ..., In} - the set of all available items (e.g. the grocery goods - the inventory)
D = {t1, t2, ..., tm} - the transactions, where each transaction consists of a set of items (tx ⊆ I)

Note #1: This organisation of transactions is sometimes referred to as the horizontal data format.
Note #2: One could see each item as a feature and thus each transaction as a feature vector with zeros for the items not contained in the transaction.
Pattern Mining Basics
Formal description: itemset and association rule
X = {I1, I2, ..., Ii} - an itemset, just a collection of items, a subset of I
A frequent itemset is an itemset that is found often in the transactions
X → Y - an association rule
- X and Y are itemsets
- X is called premise, antecedent, or left-hand side (LHS)
- Y is called conclusion, consequent, or right-hand side (RHS)
A frequent association rule is an association rule that holds for many transactions, i.e., X → Y is found often.
Pattern Mining Basics
Formal description: support of an itemset
How frequent is a frequent itemset?
→ The support of an itemset is the probability of finding the itemset in the transactions.

Support(X) = P(X) = P(I1, I2, ..., Ii)
Support(X) = |{t ∈ D | X ⊆ t}| / |D|

The relative frequency of the itemset (the relative number of transactions containing all items from the itemset)

Note: Some algorithms take the total number of transactions instead of the relative frequency
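The support formula above can be sketched in a few lines of Python, using the grocery transactions from the earlier example (a minimal illustration, not from the slides):

```python
# Support(X) = |{t in D : X subset of t}| / |D| - relative frequency of itemset X.
# Transactions follow the grocery example from the earlier slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the whole itemset."""
    hits = sum(1 for t in transactions if itemset <= t)  # X subset of t
    return hits / len(transactions)

print(support({"Diapers", "Beer"}, transactions))  # 3 of 5 transactions -> 0.6
```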
Pattern Mining Basics
Formal description: support of an association rule
How frequent is a frequent association rule?
→ The support of an association rule is the probability of finding the association rule to be true in the transactions.

Support(X → Y) = P(X ∪ Y) = P(IX1, IX2, ..., IXn, IY1, IY2, ..., IYm)
Support(X → Y) = |{t ∈ D | X ∪ Y ⊆ t}| / |D|

… equals the support of the joint itemset
Pattern Mining Basics
Formal description: confidence
How relevant is a (frequent) association rule?
→ The confidence is the probability of finding the conclusion Y given that the premise X is there.

Conf(X → Y) = P(IY1, IY2, ..., IYm | IX1, IX2, ..., IXn)
Conf(X → Y) = Support(X ∪ Y) / Support(X)
Conf(X → Y) = |{t ∈ D | X ∪ Y ⊆ t}| / |{t ∈ D | X ⊆ t}|

The confidence reflects how often one finds Y in transactions that contain X
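The confidence formula can likewise be computed directly from the transaction counts; a minimal sketch on the grocery transactions (illustrative, not from the slides):

```python
# Conf(X -> Y) = Support(X union Y) / Support(X), computed directly from counts.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def confidence(premise, conclusion, transactions):
    """How often conclusion Y appears in the transactions that contain premise X."""
    both = sum(1 for t in transactions if premise | conclusion <= t)
    prem = sum(1 for t in transactions if premise <= t)
    return both / prem

# Of the 4 transactions containing Diapers, 3 also contain Beer.
print(confidence({"Diapers"}, {"Beer"}, transactions))  # 0.75
```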
Pattern Mining Basics
Association Rule Mining Goal
Find all frequent, relevant association rules, i.e. all rules with sufficient support and sufficient confidence

What is it useful for?
Market basket analysis, query completion, click stream analysis, genome analysis, …
Preprocessing for other data mining tasks (e.g., selecting the most frequent parts of a dataset).
Pattern Mining Basics
Simple solution (for association analysis)
1. Pick any combination of items
2. Test for sufficient support
3. Goto 1 (until all combinations have been tried)

Not an optimal solution, due to the 2^n combinations for n items

Two improvements are presented next:
→ Apriori Algorithm
→ FP-Growth Algorithm
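The brute-force enumeration can be sketched as follows (illustrative only; the exponential candidate count is exactly what Apriori and FP-Growth avoid):

```python
from itertools import combinations

def brute_force_frequent(transactions, min_support):
    """Naive approach: test every one of the 2^n - 1 non-empty item combinations."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            supp = sum(1 for t in transactions if s <= t) / len(transactions)
            if supp >= min_support:
                result[s] = supp
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
frequent = brute_force_frequent(transactions, min_support=0.6)
print(len(frequent))  # 8 frequent itemsets (4 single items, 4 pairs)
```

Even for these 6 distinct items, 63 combinations have to be tested; the count doubles with every additional item.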
The Apriori Algorithm
… and the Apriori principle
Apriori Algorithm
Question
Can the number of candidates be reduced below the 2^n combinations for n items?
Can some candidates be ignored?

Apriori algorithm
Simple and efficient algorithm to find frequent itemsets
Can also be applied to find association rules

Input
The transactions
The (minimal) support threshold
For association rules additionally: the (minimal) confidence threshold

R. Agrawal, T. Imieliński, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," ACM SIGMOD Record, vol. 22, no. 2, ACM, 1993.
Apriori Algorithm
Apriori principle
- If an itemset is frequent, then all of its subsets are frequent as well
- If an itemset is infrequent, then all of its supersets are infrequent as well

Note: This can be used to reduce the number of candidates considerably
Apriori Algorithm
Apriori basic approach for frequent itemsets
1. Find all frequent itemsets of size k = 1 and add them to the frequent itemsets F1 (i.e., check the support of single items)
2. k = k + 1
3. Generate the candidates Ck of size k from the known frequent itemsets Fk−1 of size k − 1
4. Prune the candidates in Ck containing infrequent sub-itemsets (apriori principle)
5. For all candidates in Ck
   - Calculate the support count
   - If the support is large enough move the candidate to Fk; if not delete it
6. Goto 2 (until no entries for Fk are found)
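The level-wise search above can be sketched in Python; function and variable names (`apriori`, `frequent`, `candidates`) are illustrative, not from the original paper:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets (level-wise search)."""
    n_min = min_support * len(transactions)
    # F1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= n_min}
    result = dict(frequent)
    k = 2
    while frequent:
        # Ck: join frequent (k-1)-itemsets into candidates of size k
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune candidates with an infrequent (k-1)-subset (apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # count the support of the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= n_min}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
freq = apriori(transactions, min_support=0.6)
print(freq[frozenset({"Diapers", "Beer"})])  # 3
```

On the grocery data with a support threshold of 0.6, the pruning step eliminates, e.g., all triples containing the infrequent pair {Bread, Beer} without ever counting them.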
Apriori Algorithm
Basis for frequent association rule mining
Frequent association rule mining is done on the frequent itemsets and not on the transactions directly, thus reducing the search space considerably.
A set with k items has 2^k − 2 candidates for association rules, i.e., all subsets but the empty and the full set can serve as premise and conclusion.
Confidence is monotonic only on the same itemset
- Conf(ABC → D) ≥ Conf(AB → CD) ≥ Conf(A → BCD)
but not in general
- Conf(ABC → D) can be larger or smaller than Conf(AB → D)
Apriori Algorithm
Apriori basic approach for frequent association rules
1. Take the frequent itemsets of size k ≥ 2
2. Start by generating association rules with conclusion sets Hm of size m = 1
   - Hm of size m = 1 is generated by picking the items out of the frequent itemset one at a time, one after the other.
3. m = m + 1
4. Generate the conclusion candidates of size m by merging two rules from the previous step
   - New premise: intersection of the old premises
   - New conclusion: union of the old conclusions
   - I.e., one item moves from the premise to the conclusion
5. Prune the candidates containing infrequent sub-itemsets (apriori principle)
6. For all candidates
   - Calculate the confidence
   - If the confidence is large enough move the rule to Hm; if not delete it
7. Goto 3 (until no entries for Hm are found)
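The rule-generation step can be sketched more simply by testing every premise/conclusion split of each frequent itemset (a simplification of the level-wise Hm merging described in the slides; by the apriori principle every subset of a frequent itemset is itself in the frequent-itemset table, so its support is available):

```python
from itertools import combinations

def association_rules(freq, min_conf):
    """freq: {frozenset: support count} of ALL frequent itemsets (incl. subsets).
    Returns (premise, conclusion, confidence) triples with sufficient confidence.
    Simplified: tries every split instead of the level-wise merging of the slides."""
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for premise in map(frozenset, combinations(itemset, r)):
                conf = count / freq[premise]  # Support(X union Y) / Support(X)
                if conf >= min_conf:
                    rules.append((premise, itemset - premise, conf))
    return rules

# Toy frequent-itemset table (support counts out of 5 transactions)
freq = {
    frozenset({"Diapers"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Diapers", "Beer"}): 3,
}
for premise, conclusion, conf in association_rules(freq, min_conf=0.7):
    print(sorted(premise), "->", sorted(conclusion), round(conf, 2))
```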
The FP-Growth Algorithm
… more efficient than Apriori
FP-Growth Algorithm
Overview
Completely different approach from the Apriori algorithm
Follows a divide-and-conquer approach
Lower complexity than Apriori for finding frequent itemsets, often by several orders of magnitude
- Reduces the scans through the transaction database to two

Main idea: transform the data set into an efficient data structure, the FP-Tree
1. Build the Frequent Pattern Tree (FP-Tree)
2. Mine the FP-Tree for frequent itemsets by
   - creating conditional FP-Trees on the fly
   - recursively mining the conditional FP-Trees for frequent itemsets
FP-Growth Algorithm
The FP-Tree
Contains all information from the transactions
Allows efficient retrieval of frequent itemsets
Consists of a tree
- Each node represents an item
- Each node has a count: the number of transactions whose path from the root passes through the node
Consists of linked lists
- For each frequent item there is a head with a count
- Links to all nodes with the same item
FP-Growth Algorithm
Building the FP-Tree
1. Scan through all transactions, count each item, and generate the table Ix → countx of the item supports given by their counts
2. Scan through all transactions and
   1. Sort the items in the transaction by descending support count
   2. Throw away infrequent items (countx < minSupport)
   3. Iterate over the items and match them against the tree (starting at the root node)
      - if the current item is found as a child of the current node, increase its count
      - if the current item is not found, create a new child (branch)
      - repeat for all items in the transaction
   4. Repeat for all transactions
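The two-pass construction can be sketched as follows (class and variable names are illustrative; ties in the support ordering are broken alphabetically here, a choice the slides leave open):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support_count):
    counts = defaultdict(int)
    for t in transactions:                 # pass 1: item support counts
        for item in t:
            counts[item] += 1
    root = FPNode(None, None)
    header = defaultdict(list)             # item -> all nodes holding that item
    for t in transactions:                 # pass 2: insert pruned, sorted transactions
        items = sorted((i for i in t if counts[i] >= min_support_count),
                       key=lambda i: (-counts[i], i))  # descending support
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                # transactions sharing this prefix
    return root, header

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
root, header = build_fp_tree(transactions, min_support_count=3)
print(root.children["Bread"].count)  # 4 transactions start with Bread
```

The `header` dict plays the role of the linked lists: it lets the mining step reach every occurrence of an item without traversing the whole tree.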
FP-Growth Algorithm
Given a set of transactions (1), sort the items by frequency in the data set and prune infrequent items (4), then create the FP-Tree with the tree itself (root on the left) and a linked list for each of the items (head on the top)
FP-Growth Algorithm
Mine the FP-Tree for frequent itemsets
1. Start with single-item itemsets I, ordered by increasing support
2. Create a conditional FP-Tree for the selected itemset I
   - Remove all items with less support from the tree
   - Remove all transactions in which the selected item is not present
3. Recursively mine the FP-Tree:
   - If the tree has only one branch, all combinations of the items in that branch are frequent itemsets
   - If the tree has more than one branch:
     - Build a new conditional FP-Tree (new recursion) by adding the item with the least support to itemset I
   - Recursively mine by going to 3 until no new items are in the conditional tree
   - Goto 2 until all items are done
FP-Growth Algorithm
Complete FP-tree (top left), element e removed (top right), and conditional tree for e (bottom left).
Sequence Pattern Mining
Mining patterns in temporal data
Sequence Pattern Mining
Initial definition
Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified min_support threshold, sequential pattern mining is to find all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min_support.
- R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 1995 Int'l Conf. Data Eng. (ICDE '95), pp. 3-14, Mar. 1995.

Note #1: This definition can then be expanded to include all sorts of constraints.
Note #2: Each element in a sequence is an itemset.
Sequence Pattern Mining
Applications & Use-Cases
Shopping cart analysis
- e.g. common shopping sequences
DNA sequence analysis
(Web) log-file analysis
- Common access patterns
Derive features for subsequent analysis
Sequence Pattern Mining
Transaction database vs. sequence database

ID  Itemset
1   a, b, c
2   b, c
3   a, e, f, g
4   c, f, g

ID  Sequence
1   <a (abc) (ac) d (cf)>
2   <(ad) c (bc) (ae)>
3   <(ef) (ab) (df) c b>
4   <e g (af) c b c>
Sequence Pattern Mining
Subsequence & Super-sequence
A sequence is an ordered list of elements, denoted <x1 x2 ... xk>
Given two sequences α = <a1 a2 ... an> and β = <b1 b2 ... bm>
α is called a subsequence of β, denoted α ⊑ β,
- if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn
β is then a super-sequence of α
- e.g. α = <(ab) d> and β = <(abc) (de)>

Note #1: A database transaction (single sequence) is defined to contain a sequence if that sequence is a subsequence of it.
Note #2: Frequently found subsequences (within a transaction database) are called sequential patterns.
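The containment test from the definition can be sketched in a few lines; a sequence is modelled as a list of item sets (a minimal illustration):

```python
def is_subsequence(alpha, beta):
    """alpha subseq of beta: each element (itemset) of alpha is contained in some
    element of beta, with the order preserved. Greedy left-to-right matching suffices."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later in beta
    return True

# Example from the slide: <(ab) d> is a subsequence of <(abc) (de)>
alpha = [{"a", "b"}, {"d"}]
beta = [{"a", "b", "c"}, {"d", "e"}]
print(is_subsequence(alpha, beta))  # True
```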
Sequence Pattern Mining
Initial approaches to sequence pattern mining relied on the apriori principle
- If a sequence S is not frequent, then none of the super-sequences of S is frequent
Generalized Sequential Pattern Mining (GSP) is one example of such algorithms
Disadvantages:
- Requires multiple scans of the database
- Potentially many candidates
FreeSpan - Algorithm
FreeSpan - Frequent Pattern-Projected Sequential Pattern Mining
Input: minimal support, sequence database
Output: all sequences with a frequency equal to or higher than the minimal support

Main idea: recursively reduce (project) the database instead of scanning the whole database multiple times
Projection
- Reduced version of the database, consisting of just a subset of the sequences
- Each sequence in the projected database contains the projected pattern
Keep the items sorted
- All frequent items are kept in a fixed order
- from the most frequent to the least frequent
Similar divide-and-conquer approach as in FP-Growth, but uses an ordered matrix data structure instead of a tree.
PrefixSpan - Algorithm
PrefixSpan - Idea
Problem in FreeSpan: the projected database only shrinks in each iteration by the removal of infrequent items
- Only consider frequent prefixes, and not all prefixes, to make it shrink faster

PrefixSpan - Algorithm
1. Calculate the support for each item; count it once per sequence, not per occurrence within a sequence
2. For each item in increasing order of support: make the item a one-element prefix pi and do
3. Generate a projected database for pi
4. Find the frequent items f having the current prefix pi and extend pi with f to pi+1
5. Goto 3 for each extended prefix pi+1, until pi+1 is empty
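The projection idea can be sketched for the simplified case of sequences of single items rather than itemsets (so the prefix and projection stay short; this is not the full algorithm from the paper):

```python
def prefixspan(db, min_support, prefix=()):
    """Simplified PrefixSpan for sequences of single items.
    db: list of item sequences; returns {pattern tuple: support count}."""
    result = {}
    counts = {}
    for seq in db:                      # count each item once per sequence
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, count in sorted(counts.items()):
        if count < min_support:
            continue
        pattern = prefix + (item,)
        result[pattern] = count
        # project: keep only the suffix after the first occurrence of item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        result.update(prefixspan(projected, min_support, pattern))
    return result

db = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
patterns = prefixspan(db, min_support=2)
print(patterns[("a", "c")])  # "a then c" occurs in all 3 sequences
```

Each recursion works on a strictly smaller projected database, which is the point of the prefix-projection idea.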
Alternatives & Remarks
Remarks
Problems of frequent association rules
Rules with high confidence, but also high support
- e.g. use a significance test
Confidence is relative to the frequency of the premise
- Hard to compare multiple association rules
Redundant association rules
Many frequent itemsets (often even more than transactions)
Remarks
How to deal with numeric attributes?
Binning (discretization) on predefined ranges
Distribution-based binning
Extend the algorithm to allow numeric attributes, e.g. Approximate Association Rule Mining

Frequent closed pattern mining
A pattern is called closed if no proper superset has the same support
Produces a smaller number of patterns
Similarly, maximal pattern mining (no frequent superset exists)
Further Algorithms
Frequent tree mining
Frequent graph mining
Constraint-based mining
Approximate pattern mining
- e.g. picking a single representative for a group of patterns
- … by invoking a clustering algorithm
Mining cyclic or periodic patterns
J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Min. Knowl. Discov., vol. 15, no. 1, pp. 55–86, Jan. 2007.
Connection to ML
Connection to other tasks in machine learning
Frequent pattern based classification
- Mine the relation between features/items and the label
- … useful for feature selection/engineering
Frequent pattern based clustering
- Clustering in high dimensions is a challenge, e.g. text
- … use frequent patterns as a dimensionality reduction
Evaluation & Tools
… for pattern mining
Evaluation
Evaluation
Usually many frequent itemsets and rules are found.
- Which ones are the interesting ones?
- Which ones are spurious?
Support & Confidence
- Support and confidence are defined on the premise that the occurrences of the elements are statistically independent of each other.
- This is not true in the general case!
- If the occurrence frequencies are mismatched, these measures tend to fail.
Evaluation
Contingency table

        Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0    N

f11  support count of X and Y
f10  support count of X and ¬Y
f01  support count of ¬X and Y
f00  support count of ¬X and ¬Y
f+1 & f+0  column sums
f1+ & f0+  row sums
Evaluation
Problems with Confidence

        Coffee  ¬Coffee
Tea       15       5      20
¬Tea      75       5      80
          90      10     100

Association Rule: Tea → Coffee
Conf(Tea → Coffee) = 15/20 = 0.75
But P(Coffee) = 0.9, meaning that knowing a person drinks tea reduces the probability that the person drinks coffee considerably
P(Coffee | ¬Tea) = 75/80 = 0.9375

What rules to look for?
Conf(X → Y) should be high, so that the rule is more likely to be true than false
Conf(X → Y) > Support(Y): otherwise the rule will be misleading, since having item X actually reduces the chance of having item Y in the same transaction
There are loads of such measures!
Evaluation
Lift and Interest

Lift = P(Y|X) / P(Y)        Interest = P(X, Y) / (P(X) P(Y))

Lift is used for rules
Interest is used for itemsets
Also called Interest Factor (IF) or Interest (I)
Both measure the ratio of the support of a pattern (numerator) against its baseline support (denominator) under the independence assumption.
Lift = 1 implies statistical independence
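Using the tea/coffee contingency table from the confidence example, the lift works out as follows (a quick numeric check):

```python
# Lift for the tea/coffee contingency table from the confidence example.
n_total = 100
n_tea = 20
n_tea_and_coffee = 15
n_coffee = 90

conf = n_tea_and_coffee / n_tea      # P(Coffee | Tea) = 0.75
lift = conf / (n_coffee / n_total)   # 0.75 / 0.9, i.e. below 1
print(round(lift, 4))
```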
Evaluation
Problems with Confidence

        Coffee  ¬Coffee
Tea       15       5      20
¬Tea      75       5      80
          90      10     100

Association Rule: Tea → Coffee
Conf(Tea → Coffee) = 15/20 = 0.75
But P(Coffee) = 0.9
Lift(Tea → Coffee) = 0.75/0.9 ≈ 0.8333
Lift is smaller than 1, so there is a slightly negative relationship between tea drinkers and coffee drinkers.
Tools for Pattern Mining
Java Tools
SPMF - http://www.philippe-fournier-viger.com/spmf/

Python Tools
PyFIM - http://www.borgelt.net/pyfim.html
PrefixSpan - https://pypi.org/project/prefixspan/

R Tools
arules - https://cran.r-project.org/web/packages/arules/index.html
arulesSequences - https://cran.r-project.org/web/packages/arulesSequences/index.html

Spark (Scala, Java, Python & R)
Spark ML - https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html
The End
Next: Deep Learning

Recommended book: P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar: Introduction to Data Mining. Pearson, Second Edition, 2018. ISBN 9780134545943. https://www-users.cs.umn.edu/~kumar001/dmbook/index.php - Chapter 5 "Association Analysis: Basic Concepts and Algorithms" is available on the web page
Christian Borgelt's slides: http://www.borgelt.net/teach/fpm/slides.html (516 slides)
http://www.is.informatik.uni-duisburg.de/courses/im_ss09/folien/MiningSequentialPatterns.ppt