Assessing the Quantitative Significance of Sequential Patterns Yifan Cao, Dr. Byron Gao 5/31/11-8/1/11

Assessing the Quantitative Significance of Sequential Patterns

Yifan Cao, Dr. Byron Gao
5/31/11-8/1/11

Project Statement

• We seek to find a method to quantitatively describe the significance of general sequential patterns

What is “significant” or “interesting”?

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     0       1
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     1       0

What makes a pattern an interesting one?

Naive answer: depth and length

P-Values

• P-value(Pattern p) = Probability(p occurs naturally at least as often as it does in our data)

• A smaller p-value means a more significant pattern

Why would we care?

• Almost all significance measures deal with non-sequential data

• Those dealing with sequential data are incredibly data-specific

• A significance measure separates patterns that matter from those that are merely products of the data set’s structure

Sequential vs. Non-sequential Data

– Examples of Non-sequential Data:
  • Groceries purchased
  • Facebook friends
  • Top 5 favorite exotic fruits and vegetables

– Examples of Sequential Data:
  • Words
  • DNA Sequences
  • Number of hours you sleep per night

– Unclear/Could be both:
  • Products purchased on Amazon (student prime!)
  • Books read

Structural Differences

– Non-sequential Data:
  • Easily expressed as a matrix of supports
  • No problems with subsets having different sizes
  • Easy to construct similar data sets through randomization

– Sequential Data:
  • Cannot be expressed as a 2-D matrix of supports
  • Subsets of different lengths are problematic for a matrix
  • Cannot carry out randomization on a matrix of items

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     0       1
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     1       0

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     1       0
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     0       1
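
The two matrices above have the same row and column totals, which is what a margin-preserving randomization of a 0/1 matrix produces. The slides do not name the procedure used, so the following is only a minimal sketch of one common approach (swap randomization), offered as an assumption. The function name `swap_randomize` and its parameters are hypothetical.

```python
import random

def swap_randomize(matrix, attempts, seed=None):
    """Return a shuffled copy of a 0/1 matrix with the same row and column sums."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # Swap only when the 2x2 submatrix is a "checkerboard":
        # ones on one diagonal, zeros on the other.
        if m[r1][c1] == m[r2][c2] == 1 and m[r1][c2] == m[r2][c1] == 0:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m

# Rows: Alvin, Brett, Chris, David, Edgar; columns: Bread, Milk, Cereal, Eggs.
original = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
shuffled = swap_randomize(original, attempts=1000, seed=0)
```

Each accepted swap moves two ones within the same pair of rows and columns, so every shopper keeps the same basket size and every product keeps the same number of buyers.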

Solution? Think Simpler!

• We’re looking for a method for general sequential patterns

• Proposal:
  – Randomize the ordering of items in each sequence
  – Obtain a probability of a pattern occurring for each sequence
  – Use such probabilities to generate a distribution for total number of pattern occurrences

Computing p-values

• For each sequence in the data set, find the probability that if its ordering is randomized, the pattern will occur

• With each sequence having a probability of containing a given pattern, construct the overall distribution of times said pattern occurs in the data set
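
As a rough illustration of the second step, the sketch below builds the distribution of the number of sequences that contain a pattern, given one containment probability per sequence. It assumes the sequences are independent, which the slides do not state, and it counts at most one occurrence per sequence. The function names `count_distribution` and `p_value` and the input probabilities are hypothetical.

```python
def count_distribution(probabilities):
    """Distribution of how many sequences contain the pattern.

    dist[k] is the probability that exactly k sequences contain it,
    assuming each sequence contains it independently with its own probability.
    """
    dist = [1.0]  # with zero sequences, the count is 0 with certainty
    for p in probabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this sequence does not contain the pattern
            nxt[k + 1] += mass * p     # this sequence contains the pattern
        dist = nxt
    return dist

def p_value(probabilities, observed):
    """Probability of seeing the pattern in at least `observed` sequences."""
    return sum(count_distribution(probabilities)[observed:])

# Illustrative per-sequence probabilities for six sequences (assumed inputs).
example_probabilities = [1 / 20, 1 / 20, 0.0, 1 / 5, 0.0, 1 / 20]
example_p = p_value(example_probabilities, observed=4)
```

With these inputs only four sequences can contain the pattern at all, so the p-value for an observed count of 4 is simply the product of the four nonzero probabilities, (1/20)^3 × 1/5.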

ABCDE ABCEF EDCFG

ABCBA EDFGH HABCD

Sequence  Dictionary
ABCDE     A: 1, B: 1, C: 1, D: 1, E: 1
ABCEF     A: 1, B: 1, C: 1, E: 1, F: 1
EDCFG     C: 1, D: 1, E: 1, F: 1, G: 1
ABCBA     A: 2, B: 2, C: 1
EDFGH     D: 1, E: 1, F: 1, G: 1, H: 1
HABCD     A: 1, B: 1, C: 1, D: 1, H: 1
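
The dictionaries above are item counts per sequence. A minimal sketch of building them, assuming each sequence is a string of single-character items (the variable names are illustrative):

```python
from collections import Counter

sequences = ["ABCDE", "ABCEF", "EDCFG", "ABCBA", "EDFGH", "HABCD"]

# Map each sequence to its item counts, sorted by item for readability.
dictionaries = {s: dict(sorted(Counter(s).items())) for s in sequences}
```

Because a random reordering never changes the item counts, the dictionary is all that is needed to reason about every possible ordering of a sequence.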

• Use combinatorics to analyze and compute the probability that a random ordering of a given sequence will contain pattern P

• N = number of unique orderings = $\binom{\text{sequence length}}{\text{dictionary values}}$ (a multinomial coefficient)
• For ABCDE: $\binom{5}{1,1,1,1,1} = 120$. For ABCBA: $\binom{5}{2,2,1} = 30$
• M = (sequence length − pattern length + 1) $\times \binom{\text{surplus length}}{\text{surplus dictionary values}}$
• For P=ABC and sequence ABCBA: M = (3) $\binom{2}{1,1} = 3 \times 2 = 6$
• So the probability of ABCBA containing pattern P=ABC is M/N = 6/30 = 1/5
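
The M/N calculation can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' implementation: N is taken to be the multinomial count of distinct orderings, and the surplus is the set of items left over after removing one copy of the pattern. The function names `orderings` and `containment_probability` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def orderings(counts):
    """Number of distinct orderings of a multiset (multinomial coefficient)."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def containment_probability(sequence, pattern):
    """M/N: chance that a random ordering of `sequence` contains `pattern`.

    Exact only when the pattern can appear at most once in any ordering;
    otherwise M counts some orderings more than once.
    """
    seq_counts = Counter(sequence)
    pat_counts = Counter(pattern)
    # The pattern cannot occur if the sequence lacks any required item.
    if any(seq_counts[item] < k for item, k in pat_counts.items()):
        return Fraction(0)
    surplus = seq_counts - pat_counts  # items left after removing the pattern
    n = orderings(seq_counts)
    m = (len(sequence) - len(pattern) + 1) * orderings(surplus)
    return Fraction(m, n)

prob_abcba = containment_probability("ABCBA", "ABC")
```

Using `Fraction` keeps the result exact, so the ABCBA example reproduces the 1/5 given above. Patterns that can occur more than once in a single ordering are overcounted by M, which matches the open problem listed under Further Study.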

Advantages

• All work is probabilistic, so finding p-values is a very fast operation

• Longer patterns’ significance can be built off of shorter patterns’ significance

• Allows large, comprehensive sets of patterns to be judged for significance

• Could lead to a significance-based closed-frequent pattern finding algorithm

Related Works

• Randomization of real-valued matrices for assessing the significance of data mining results by Markus Ojala

• Ranking Sequential Patterns with Respect to Significance by Robert Gwadera

• Frequent Pattern Mining with Uncertain Data by Charu Aggarwal

Further Study

• Dealing with patterns occurring multiple times within one sequence

• Modifying significance calculation to allow for more flexibility while maintaining overall structure of data

• Algorithmic applications, especially in closed-frequent types of pattern finding algorithms

In Conclusion

• Our method provides great accessibility to the field of sequential patterns

• Combinatoric approach means it runs very fast

• Significance calculation approach is highly scalable for huge sets of patterns

Thank you for listening!
