Assessing the Quantitative Significance of Sequential Patterns
Yifan Cao, Dr. Byron Gao
5/31/11-8/1/11
Project Statement
• We seek to find a method to quantitatively describe the significance of general sequential patterns
What is “significant” or “interesting?”
Name Bread Milk Cereal Eggs
Alvin 1 0 0 1
Brett 1 1 1 1
Chris 1 1 1 1
David 1 0 0 1
Edgar 1 1 1 0
What makes a pattern an interesting one?
Naive answer: depth and length
P-Values
• P-value(Pattern p) = Probability(p occurs naturally at least as often as it does in our data)
• A smaller p-value means a more significant pattern
Why would we care?
• Almost all significance measures deal with non-sequential data
• Those dealing with sequential data are incredibly data-specific
• Distinguishes patterns that matter from patterns that are merely products of the data set’s structure
Sequential vs. Non-sequential Data
– Examples of Non-sequential Data:
• Groceries purchased
• Facebook friends
• Top 5 favorite exotic fruits and vegetables
– Examples of Sequential Data:
• Words
• DNA Sequences
• Number of hours you sleep per night
– Unclear/Could be both:
• Products purchased on Amazon (student prime!)
• Books read
Structural Differences
– Non-sequential Data:
• Easily expressed as a matrix of supports
• No problems with subsets having different sizes
• Easy to construct similar data sets thru randomization
– Sequential Data:
• Cannot be expressed as a 2-D matrix of supports
• Subsets of different lengths are problematic for matrix
• Cannot carry out randomization on a matrix of items
Name Bread Milk Cereal Eggs
Alvin 1 0 0 1
Brett 1 1 1 1
Chris 1 1 1 1
David 1 0 0 1
Edgar 1 1 1 0
Name Bread Milk Cereal Eggs
Alvin 1 0 1 0
Brett 1 1 1 1
Chris 1 1 1 1
David 1 0 0 1
Edgar 1 1 0 1
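The two tables above have identical row and column totals, which is what a margin-preserving swap randomization would produce. The slides do not name the procedure, so the following is only a minimal sketch under that assumption; `swap_randomize` and `baskets` are hypothetical names introduced for illustration.

```python
import random

def swap_randomize(matrix, attempts, seed=None):
    """Return a shuffled copy of a 0/1 matrix with the same row and column sums."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # Only a 2x2 "checkerboard" (1 0 / 0 1 or 0 1 / 1 0) can be flipped
        # without changing any row sum or column sum.
        if (m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1]
                and m[r1][c1] != m[r1][c2]):
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m

# Rows: Alvin, Brett, Chris, David, Edgar. Columns: Bread, Milk, Cereal, Eggs.
baskets = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
shuffled = swap_randomize(baskets, attempts=100, seed=0)
```

Not every attempt results in a swap, so `attempts` is an upper bound on the number of flips rather than an exact count.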
Solution? Think Simpler!
• We’re looking for a method for general sequential patterns
• Proposal:
– Randomize the ordering of items in each sequence
– Obtain a probability of a pattern occurring for each sequence
– Use such probabilities to generate a distribution for total number of pattern occurrences
Computing p-values
• For each sequence in the data set, find the probability that if its ordering is randomized, the pattern will occur
• With each sequence having a probability of containing a given pattern, construct the overall distribution of times said pattern occurs in the data set
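One way to build the overall distribution from per-sequence probabilities is sketched below. It assumes the sequences are randomized independently of one another, so the total count is a sum of independent yes/no events; that assumption, the function names, and the example probabilities are illustrative and not taken from the slides.

```python
def occurrence_distribution(probs):
    """dist[k] = probability that exactly k sequences contain the pattern."""
    dist = [1.0]  # with no sequences, the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this sequence does not contain the pattern
            nxt[k + 1] += mass * p     # this sequence contains the pattern
        dist = nxt
    return dist

def p_value(probs, observed):
    """Probability of seeing the pattern in at least `observed` sequences."""
    return sum(occurrence_distribution(probs)[observed:])

# Hypothetical per-sequence probabilities, for illustration only.
example = p_value([0.5, 0.5], observed=2)
```

The upper-tail sum matches the definition of the p-value given earlier: the probability that the pattern occurs at least as often as it does in the data.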
ABCDE ABCEF EDCFG
ABCBA EDFGH HABCD
Sequence Dictionary
ABCDE A: 1, B: 1, C: 1, D: 1, E: 1
ABCEF A: 1, B: 1, C: 1, E: 1, F: 1
EDCFG C: 1, D: 1, E: 1, F: 1, G: 1
ABCBA A: 2, B: 2, C: 1
EDFGH D: 1, E: 1, F: 1, G: 1, H: 1
HABCD A: 1, B: 1, C: 1, D: 1, H: 1
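The dictionary for each sequence is just a count of how often each item appears. A minimal sketch using Python's standard `collections.Counter` (the variable names are illustrative):

```python
from collections import Counter

sequences = ["ABCDE", "ABCEF", "EDCFG", "ABCBA", "EDFGH", "HABCD"]

# Map each sequence to its item counts, e.g. "ABCBA" -> A: 2, B: 2, C: 1.
dictionaries = {seq: Counter(seq) for seq in sequences}
```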
• Use combinatorics to analyze and compute the probability that a random ordering of a given sequence will contain pattern P
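• N = # of unique orderings = $\frac{(\text{sequence length})!}{\prod_i (\text{dictionary value}_i)!}$
• For ABCDE: $N = \frac{5!}{1!\,1!\,1!\,1!\,1!} = 120$. For ABCBA: $N = \frac{5!}{2!\,2!\,1!} = 30$
• M = (sequence length – pattern length + 1) $\cdot \frac{(\text{surplus length})!}{\prod_j (\text{surplus dictionary value}_j)!}$
• For P=ABC and sequence ABCBA: $M = (3)\left(\frac{2!}{1!\,1!}\right) = 6$
• So the probability of ABCBA containing pattern P=ABC is M/N = 6/30 = 1/5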
Sequence length   Dictionary values
5                 1, 1, 1, 1, 1
5                 2, 2, 1

Surplus length    Surplus dictionary values
2                 1, 1
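The M/N calculation can be sketched in code as follows. This is an illustrative reading of the slides, not the authors' implementation: the surplus is taken to be the items left over after removing one copy of the pattern, and the function names are hypothetical. The count M is exact only when the pattern can fit into the sequence at most once; when the surplus could form a second copy, M over-counts.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def orderings(counts):
    """Number of distinct orderings of a multiset given its item counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def containment_probability(sequence, pattern):
    """M/N: chance that a random ordering of `sequence` contains `pattern`."""
    seq_counts = Counter(sequence)
    pat_counts = Counter(pattern)
    # The pattern cannot occur if the sequence lacks any required item.
    if any(seq_counts[item] < need for item, need in pat_counts.items()):
        return Fraction(0)
    surplus = seq_counts - pat_counts          # items left after one pattern copy
    n = orderings(seq_counts)                  # N: all unique orderings
    positions = len(sequence) - len(pattern) + 1
    m = positions * orderings(surplus)         # M: orderings containing the pattern
    return Fraction(m, n)

print(containment_probability("ABCBA", "ABC"))  # prints 1/5
```

Using `Fraction` keeps the result exact, which makes it easy to compare against hand calculations such as the 1/5 above.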
Advantages
• All work is probabilistic, so finding p-values is a very fast operation
• Longer patterns’ significance can be built off of shorter patterns’ significance
• Allows large, comprehensive sets of patterns to be judged for significance
• Could lead to a significance-based closed-frequent pattern finding algorithm
Related Works
• Randomization of real-valued matrices for assessing the significance of data mining results by Markus Ojala
• Ranking Sequential Patterns with Respect to Significance by Robert Gwadera
• Frequent Pattern Mining with Uncertain Data by Charu Aggarwal
Further Study
• Dealing with patterns occurring multiple times within one sequence
• Modifying significance calculation to allow for more flexibility while maintaining overall structure of data
• Algorithmic applications, especially in closed-frequent types of pattern finding algorithms
In Conclusion
• Our method provides great accessibility to the field of sequential patterns
• Combinatoric approach means it runs very fast
• Significance calculation approach is highly scalable for huge sets of patterns
Thank you for listening!