Constraint Mining of Frequent Patterns in Long Sequences Presented by Yaron Gonen


Outline
- Introduction
- Problem definition and motivation
- Previous work
- The CAMLS algorithm
  - Overview
  - Main contributions
- Results
- Future work

Frequent Item-sets: The Market-Basket Model
- A set of items, e.g., the stuff sold in a supermarket.
- A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
- Support for item-set I = the number of baskets containing all items in I (usually given as a percentage).
- Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets.
- Simplest question: find all frequent item-sets.

Example
[Example slide: a set of items and a list of baskets; the table itself is not reproduced here.]
Minimum support = 0.6 (2 baskets)
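To make the support definition concrete, here is a minimal Python sketch; the item names and the three baskets are made up for illustration and are not the example from the slide:

```python
# Hypothetical baskets; each basket is a set of items.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

min_sup = 0.6
print(support({"beer", "diapers"}, baskets))  # ~0.67 -> frequent
print(support({"milk", "bread"}, baskets))    # ~0.33 -> not frequent
```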

Application (1)
- Items: products at a supermarket.
- Baskets: the set of products a customer bought at one time.
- Example: many people buy beer and diapers together.
  - Place beer next to diapers to increase both sales.
  - Run a sale on diapers and raise the price of beer.

Application (2) (Counter-Intuitive)
- Items: species of plants.
- Baskets: each basket represents an attribute; a basket contains the items (plants) that have that attribute.
- Frequent sets may indicate similarity between plants.

Scale of the Problem
- Costco sells more than 120k different items and has 57m members (from Wikipedia).
- Botany has identified about 350k extant species of plants.

The Naïve Algorithm
- Generate all possible itemsets.
- Check their support.

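A minimal sketch of this naïve enumeration, assuming baskets are represented as Python sets (the function name and the toy baskets are hypothetical):

```python
from itertools import chain, combinations

def naive_frequent_itemsets(baskets, min_sup):
    """Enumerate every non-empty itemset over all items and keep the
    frequent ones; exponential in the number of items."""
    items = sorted(set().union(*baskets))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    n = len(baskets)
    frequent = {}
    for cand in candidates:
        sup = sum(set(cand) <= b for b in baskets) / n
        if sup >= min_sup:
            frequent[cand] = sup
    return frequent

baskets = [{"a", "c", "d"}, {"b", "c", "d"}, {"a", "c", "d"}]
print(naive_frequent_itemsets(baskets, 0.6))
# itemsets such as ('c',), ('c', 'd') and ('a', 'c', 'd') come out frequent
```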

The Apriori Property
- All nonempty subsets of a frequent itemset must also be frequent.

The Apriori Algorithm
- Find frequent 1-itemsets.
- Merge and prune to generate the candidates of the next size (here is where the Apriori property is used).
- Go through the whole DB to count support; candidates above min support become frequent itemsets.
- Repeat while there are candidates; end when no candidates remain.
- The DB is scanned once per level, i.e., as many times as the length of the largest itemset.

Vertical Format
- Index on items.
- Calculating support is fast.
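A rough sketch of the level-wise loop combined with the vertical-format idea, not the exact algorithm on the slide: candidates are merged on a shared prefix, pruned with the Apriori property, and their support is obtained by intersecting basket-id sets (an absolute min_count threshold is assumed for simplicity):

```python
from collections import defaultdict
from itertools import combinations

def apriori(baskets, min_count):
    """Level-wise Apriori sketch with a vertical index: each itemset is
    mapped to the ids of the baskets containing it, so support counting
    is a set intersection instead of another pass over the DB."""
    tids = defaultdict(set)                 # item -> basket ids
    for bid, basket in enumerate(baskets):
        for item in basket:
            tids[item].add(bid)

    level = {(item,): ids for item, ids in tids.items()
             if len(ids) >= min_count}      # frequent 1-itemsets
    frequent = dict(level)
    k = 2
    while level:
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:            # same (k-2)-prefix: merge
                cand = a + (b[-1],)
                # Apriori pruning: every (k-1)-subset must be frequent.
                if all(cand[:i] + cand[i + 1:] in level for i in range(k)):
                    ids = level[a] & level[b]   # support via intersection
                    if len(ids) >= min_count:
                        next_level[cand] = ids
        level = next_level
        frequent.update(level)
        k += 1
    return {itemset: len(ids) for itemset, ids in frequent.items()}

baskets = [{"a", "c", "d"}, {"b", "c", "d"}, {"a", "c", "d"}]
print(apriori(baskets, 2))
# e.g. ('c', 'd') has support 3 and ('a', 'c', 'd') has support 2
```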

Frequent Sequences: Taking It to the Next Level
- A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time.

[Timeline figure: consecutive events in a sequence separated by gaps such as 2 weeks and 5 days.]

Support
- Subsequence: a sequence all of whose events are subsets of another sequence's events, in the same order (but not necessarily consecutive).
- Support for subsequence s = the number of sequences containing s (usually given as a percentage).
- Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences.
- Simplest question: find all frequent subsequences.

Notations
- Items are letters: a, b, ...
- Events are parenthesized: (ab), (bdf), ... except for events with a single item.
- Sequences are surrounded by angle brackets.
- Every sequence has an identifier, sid.

Example
[Table of four example sequences, sid 1-4; the sequences themselves are not reproduced here.]
Frequent Sequences (minSup = 0.5)
[The frequent subsequences of the example, with supports 4, 3, 4, 2 and 2, are not reproduced here.]

Motivation
- Customer shopping patterns
- Stock market fluctuations
- Weblog click-stream analysis
- Symptoms of a disease
- DNA sequence analysis
- Weather forecast
- Machine anti-aging
- Many more

Much Harder than Frequent Item-sets!
- There are 2^(m·n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence.

The Apriori Property
- If a sequence is not frequent, then any sequence that contains it cannot be frequent.

Constraints
- Problems:
  - Too many frequent sequences.
  - Most frequent sequences are not useful.
- Solution: remove them.
- Constraints are a way to define usefulness.
- The trick: do so while mining.

Previous Work
- GSP (Srikant and Agrawal, 1996): generation-and-test, Apriori-based approach.
- SPADE (Zaki, 2001): generation-and-test, Apriori-based approach; uses equivalence classes for memory optimization; uses a vertical-format DB.
- PrefixSpan (Pei, 2004): no candidate generation; uses a DB-projection method.

Why a New Algorithm?
- Huge sets of candidate sequences / projected DBs are generated.
- Multiple scans of the database are needed.
- Inefficient for mining long sequential patterns.
- No exploitation of domain-specific properties.
- Weak support for constraints.

The CAMLS Algorithm
- Constraint-based Apriori algorithm for Mining Long Sequences.
- Designed especially for efficient mining of long sequences.
- Outperforms SPADE and PrefixSpan on both synthetic and real data.

The CAMLS Algorithm
- Makes a logical distinction between two types of constraints:
  - Intra-event: not time-related (e.g., mutually exclusive items).
  - Inter-event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other).

Event-wise Constraints
- An event must/must not contain a specific item.
- Two items cannot occur at the same time.
- max_event_length: an event cannot contain more than a fixed number of items.

Sequence-wise Constraints
- max_sequence_length: a sequence cannot contain more than a fixed number of events.
- max_gap: too long a time between events dismisses the pattern.
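A sketch of how such constraints might be expressed as simple predicates; the helper names are made up for illustration, and maxGap is handled separately later:

```python
# Intra-event constraints look only at a single event (an itemset);
# inter-event constraints look at a whole candidate sequence.

def max_event_length(limit):
    return lambda event: len(event) <= limit

def mutually_exclusive(item_a, item_b):
    return lambda event: not {item_a, item_b} <= set(event)

def max_sequence_length(limit):
    return lambda sequence: len(sequence) <= limit

event_ok = mutually_exclusive("beer", "diapers")
print(event_ok({"beer", "milk"}))                                # True
print(event_ok({"beer", "diapers", "milk"}))                     # False
print(max_sequence_length(3)((frozenset("a"), frozenset("b"))))  # True
```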

CAMLS Overview
[Pipeline diagram: input sequences and constraints (minSup, maxGap, ...) -> event-wise phase -> frequent events + occurrence index -> sequence-wise phase -> output frequent sequences.]

What Do We Get?
The best of both worlds:
- Far fewer candidates are generated.
- Support checking is fast.
- Worst case: works like SPADE.
- Tradeoff: uses a bit more memory (for storing the frequent item-sets).

Event-wise Phase
- Input: sequence database and constraints.
- Output: frequent events + occurrence index.
- Use Apriori or FP-Growth to find frequent itemsets (both with minor modifications).

Event-wise

    L1 = all frequent items
    for (k = 2; L(k-1) is not empty; k++) do
        generateCandidates(L(k-1))
        Lk = pruneCandidates()
        L = L ∪ Lk
    end for

Notes on the pseudocode:
- generateCandidates: if two frequent (k-1)-events have the same prefix, merge them to form a new candidate. (Example soon!)
- pruneCandidates: prune, calculate the support count and create the occurrence index.

Occurrence Index
- A compact representation of all occurrences of a sequence.
- Structure: a list of sids, each associated with a list of eids.
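A sketch of how such an occurrence index might be built from a vertical (event, eid, sid) representation; the rows are the example database shown just below, and the function name is hypothetical:

```python
from collections import defaultdict

# (event, eid, sid) rows of the example database.
rows = [
    (frozenset("acd"), 0, 1), (frozenset("bcd"), 5, 1), (frozenset("b"), 10, 1),
    (frozenset("a"), 0, 2), (frozenset("c"), 4, 2), (frozenset("bd"), 8, 2),
    (frozenset("cde"), 0, 3), (frozenset("e"), 7, 3), (frozenset("acd"), 11, 3),
]

def occurrence_index(itemset, rows):
    """sid -> sorted eids of the events that contain `itemset`."""
    index = defaultdict(list)
    for event, eid, sid in rows:
        if itemset <= event:
            index[sid].append(eid)
    return {sid: sorted(eids) for sid, eids in index.items()}

print(occurrence_index(frozenset("acd"), rows))  # {1: [0], 3: [11]}   -> support 2
print(occurrence_index(frozenset("b"), rows))    # {1: [5, 10], 2: [8]} -> support 2
```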

[Diagram: an occurrence index, in which each sid points to its list of eids.]

Event-wise Example (Using Apriori)

event   eid   sid
(acd)    0     1
(bcd)    5     1
b       10     1
a        0     2
c        4     2
(bd)     8     2
(cde)    0     3
e        7     3
(acd)   11     3

minSup = 2
- All frequent items: a:3, b:2, c:3, d:3
- Candidates: (ab), (ac), (ad), (bc), ...
- Support count: (ac):2, (ad):2, (bd):2, (cd):2
- Candidates: (abc), (abd), (acd), ...
- Support count: (acd):2
- No more candidates!

Sequence-wise Phase
- Input: frequent events + occurrence index, constraints.
- Output: all frequent sequences.
- Similar to GSP's and SPADE's candidate generation, except that the frequent itemsets are used as seeds.

Sequence-wise

    L1 = all frequent 1-sequences
    for (k = 2; L(k-1) is not empty; k++) do
        generateCandidates(L(k-1))
        Lk = pruneAndSupCalc()
        L = L ∪ Lk
    end for

(Elaboration on the next two slides.)

Sequence-wise Candidate Generation
- If two frequent k-sequences s1 and s2 share a common (k-1)-prefix and s1 is a generator, we form a new candidate.

[Example: two k-sequences and the candidate formed from them; the sequences themselves are not reproduced here.]
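Since the concrete sequences of the example were not preserved, here is a hypothetical sketch of the prefix join for sequence extension only (the generator check and CAMLS's exact join rules are omitted):

```python
def join(s1, s2):
    """If two k-sequences share their first k-1 events, append the last
    event of s2 to s1 to form a (k+1)-sequence candidate."""
    if len(s1) == len(s2) and s1[:-1] == s2[:-1]:
        return s1 + (s2[-1],)
    return None

s1 = (frozenset("acd"), frozenset("b"))
s2 = (frozenset("acd"), frozenset("c"))
print(join(s1, s2))   # a 3-sequence candidate: <(acd) b c>
```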

Sequence-wise Pruning
- Keep a radix-ordered list of the sequences pruned in the current iteration.
- In the same iteration it is possible for one k-sequence to contain another k-sequence.
- With a new candidate:
  - Check the pruned list for a contained subsequence: very fast!
  - Test for frequency.
  - Add to the pruned list if needed.
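A sketch of that pruning loop; a plain list stands in for the radix-ordered structure, and support_of is a placeholder for the occurrence-index-based support calculation described next:

```python
def is_subseq(s, t):
    """Greedy containment test for sequences of frozensets."""
    it = iter(t)
    return all(any(e <= f for f in it) for e in s)

def process_candidate(candidate, pruned, support_of, min_sup):
    """Cheap containment test against already-pruned sequences first;
    support counting only if that test passes."""
    if any(is_subseq(p, candidate) for p in pruned):
        return False                      # contains a pruned subsequence -> prune
    if support_of(candidate) < min_sup:
        pruned.append(candidate)          # remember it for later candidates
        return False
    return True                           # keep as a frequent k-sequence
```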

Support Calculation
- A simple intersection operation between the occurrence indexes of the forming sequences.
- Once the new occurrence index is formed, the support calculation is trivial.
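A simplified sketch of such a join for extending a sequence by one event; the exact bookkeeping in CAMLS may differ, and the two indexes reused here are the ones computed for (acd) and b in the event-wise example:

```python
def extend_index(prefix_index, event_index):
    """Occurrence index of 'prefix followed later by event': for each sid
    in both indexes, keep the eids of the event that occur after the
    earliest occurrence of the prefix."""
    joined = {}
    for sid, prefix_eids in prefix_index.items():
        later = [e for e in event_index.get(sid, []) if e > min(prefix_eids)]
        if later:
            joined[sid] = later
    return joined

acd = {1: [0], 3: [11]}        # occurrence index of (acd)
b   = {1: [5, 10], 2: [8]}     # occurrence index of b
print(extend_index(acd, b))    # {1: [5, 10]} -> support of <(acd) b> is 1
```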

The maxGap Constraint
- maxGap is a special kind of constraint:
  - Data dependent.
  - The Apriori property is not applicable.
- The occurrence index enables a fast maxGap check.
- A frequent sequence that does not satisfy maxGap is flagged as a non-generator.
- Example: assume a sequence in which a is followed by b is frequent, but the gap between a and b exceeds maxGap. Frequent super-sequences that place other events between a and b may still satisfy maxGap for every pair of consecutive events. The violating sequence is therefore a non-generator, but it is kept so that those super-sequences are not pruned.
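A sketch of a maxGap check on a single occurrence, assuming eids are time stamps; the flagging and bookkeeping details of CAMLS are omitted:

```python
def within_max_gap(eids, max_gap):
    """True if every pair of consecutive matched events in this occurrence
    is at most max_gap time units apart."""
    return all(b - a <= max_gap for a, b in zip(eids, eids[1:]))

# With maxGap = 5: an occurrence at times [0, 8] violates the constraint,
# but an occurrence at times [0, 4, 8] (an extra event in between) does not.
print(within_max_gap([0, 8], 5))      # False -> flagged as non-generator
print(within_max_gap([0, 4, 8], 5))   # True
```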

Sequence-wise Example

Original DB:

event   eid   sid
(acd)    0     1
(bcd)    5     1
b       10     1
a        0     2
c        4     2
(bd)     8     2
(cde)    0     3
e        7     3
(acd)   11     3

Event-wise phase (minSup = 2, maxGap = 5):
[Figure: the frequent events, each marked "g" (generator) with its support: 3, 2, 3, 3, 2, 2, 2, 2, 2.]

Candidate generation:

[Figure: the candidate 2-sequences with their supports (all 2). In the walk-through, one candidate is added to the pruned list; a super-sequence of it is therefore pruned without support counting; another candidate does not pass maxGap and is therefore not a generator.]

[The final frequent sequence of the example, with support 2.] No more candidates!

Evaluation (1): Machine Anti-Aging
- How can sequence mining help?
  - Data collected from a machine is a sequence.
  - Discover typical behavior leading to failure.
  - Monitor the machine and alert before failure.
- Domain: light intensity per wavelength (continuous).
- Pre-processing: discretization; meta-features (maxDisc, maxWL, isBurned).
- Synm stands for a synthetic database simulating the machine behavior with m meta-features.

Evaluation (2)
- Real stock data values.
- Rn stands for stock data (10 different stocks) for n days.

CAMLS Compared with PrefixSpan

CAMLS Compared with SPADE and PrefixSpan

So, What's CAMLS's Contribution?
- Constraint distinction: easy implementation.
- Two phases.
- Handling of the maxGap constraint.
- The occurrence index data structure.
- A fast new pruning method.

Future Research
- Main issue: closed sequences.
- More constraints (aspiring to regexp constraints).

Thank You!