Assessing the Quantitative Significance of Sequential Patterns Yifan Cao, Dr. Byron Gao 5/31/11-8/1/11




Project Statement

• We seek to find a method to quantitatively describe the significance of general sequential patterns

What is “significant” or “interesting”?

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     0       1
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     1       0

What makes a pattern an interesting one?

Naive answer: depth and length


P-Values

• P-value(Pattern p) = Probability(p occurs by chance at least as often as it does in our data)

• A smaller p-value means a more significant pattern


Why would we care?

• Almost all significance measures deal with non-sequential data

• Those dealing with sequential data are incredibly data-specific

• A significance measure separates patterns that matter from patterns that are merely products of the data set’s structure

Sequential vs. Non-sequential Data

– Examples of Non-sequential Data:
  • Groceries purchased
  • Facebook friends
  • Top 5 favorite exotic fruits and vegetables

– Examples of Sequential Data:
  • Words
  • DNA Sequences
  • Number of hours you sleep per night

– Unclear/Could be both:
  • Products purchased on Amazon (student prime!)
  • Books read

Structural Differences

– Non-sequential Data:
  • Easily expressed as a matrix of supports
  • No problems with subsets having different sizes
  • Easy to construct similar data sets through randomization

– Sequential Data:
  • Cannot be expressed as a 2-D matrix of supports
  • Subsets of different lengths are problematic for a matrix
  • Cannot carry out randomization on a matrix of items

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     0       1
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     1       0

Name   Bread  Milk  Cereal  Eggs
Alvin  1      0     1       0
Brett  1      1     1       1
Chris  1      1     1       1
David  1      0     0       1
Edgar  1      1     0       1
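The two tables above have identical row sums and column sums, and differ only in the Cereal and Eggs entries for Alvin and Edgar. That is consistent with swap randomization, a standard way to build "similar" 0/1 data sets. The slides do not name the procedure, so the sketch below is an assumption about what is being illustrated; the function name `swap_randomize` and the variable `groceries` are hypothetical.

```python
import random

def swap_randomize(matrix, attempts, seed=None):
    """Return a copy of a 0/1 matrix after `attempts` random swap attempts.

    Each successful swap keeps every row sum and every column sum unchanged.
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A swap is only possible on a "checkerboard" 2x2 submatrix:
        #   1 0        0 1
        #   0 1   or   1 0
        checkerboard = (
            m[r1][c1] == m[r2][c2]
            and m[r1][c2] == m[r2][c1]
            and m[r1][c1] != m[r1][c2]
        )
        if checkerboard:
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m

# Rows: Alvin, Brett, Chris, David, Edgar. Columns: Bread, Milk, Cereal, Eggs.
groceries = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
randomized = swap_randomize(groceries, attempts=100, seed=0)
```

Because every swap preserves the margins, any pattern that survives in most randomized copies can be attributed to the data set's structure rather than to a genuine association.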


Solution? Think Simpler!

• We’re looking for a method for general sequential patterns

• Proposal:
  – Randomize the ordering of items in each sequence
  – Obtain a probability of a pattern occurring for each sequence
  – Use such probabilities to generate a distribution for total number of pattern occurrences
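For short sequences, the per-sequence probability in the proposal can be checked by brute force. The following is a minimal sketch, assuming each item is a single character and that a pattern "occurs" when it appears as a contiguous run; the function name `brute_force_probability` is hypothetical, and the enumeration is only practical for sequences of a handful of items.

```python
from fractions import Fraction
from itertools import permutations

def brute_force_probability(sequence, pattern):
    """Fraction of distinct orderings of `sequence` that contain `pattern`."""
    # Collapse duplicate permutations so each distinct ordering counts once.
    orderings = set(permutations(sequence))
    hits = sum(1 for ordering in orderings if pattern in "".join(ordering))
    return Fraction(hits, len(orderings))

print(brute_force_probability("ABCBA", "ABC"))
```

This is useful mainly as a reference against which a faster closed-form calculation can be validated.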


Computing p-values

• For each sequence in the data set, find the probability that if its ordering is randomized, the pattern will occur

• With each sequence having a probability of containing a given pattern, construct the overall distribution of times said pattern occurs in the data set
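The second step can be sketched as follows. Assuming the sequences are randomized independently, the total number of sequences containing the pattern is a sum of independent yes/no trials with different success probabilities, and its distribution can be built up one sequence at a time. The function names and the example probabilities are illustrative assumptions, not values from the slides.

```python
def count_distribution(probabilities):
    """dist[k] = probability that exactly k sequences contain the pattern,
    assuming each sequence is randomized independently of the others."""
    dist = [1.0]  # with no sequences, the count is 0 with certainty
    for p in probabilities:
        updated = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            updated[k] += mass * (1 - p)   # this sequence lacks the pattern
            updated[k + 1] += mass * p     # this sequence contains the pattern
        dist = updated
    return dist

def p_value(probabilities, observed):
    """Probability of at least `observed` occurrences under randomization."""
    return sum(count_distribution(probabilities)[observed:])

# Hypothetical per-sequence probabilities for a six-sequence data set.
example_probabilities = [0.05, 0.05, 0.0, 0.2, 0.0, 0.05]
print(p_value(example_probabilities, observed=4))
```

The update takes time quadratic in the number of sequences, which is what keeps the calculation cheap compared with repeatedly shuffling the whole data set.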


ABCDE ABCEF EDCFG

ABCBA EDFGH HABCD

Sequence  Dictionary
ABCDE     A: 1, B: 1, C: 1, D: 1, E: 1
ABCEF     A: 1, B: 1, C: 1, E: 1, F: 1
EDCFG     C: 1, D: 1, E: 1, F: 1, G: 1
ABCBA     A: 2, B: 2, C: 1
EDFGH     D: 1, E: 1, F: 1, G: 1, H: 1
HABCD     A: 1, B: 1, C: 1, D: 1, H: 1
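The dictionary for each sequence is just a count of how often each item appears. A minimal sketch using Python's standard `collections.Counter`; the variable names are hypothetical.

```python
from collections import Counter

sequences = ["ABCDE", "ABCEF", "EDCFG", "ABCBA", "EDFGH", "HABCD"]

# Map each sequence to its item counts, with items in alphabetical order.
dictionaries = {seq: dict(sorted(Counter(seq).items())) for seq in sequences}

for seq, counts in dictionaries.items():
    print(seq, counts)
```

Only these counts, not the original ordering, are needed to describe the set of possible reorderings of a sequence.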


• Use combinatorics to analyze and compute the probability that a random ordering of a given sequence will contain pattern P

• N = number of unique orderings = $\binom{\text{sequence length}}{\text{dictionary values}}$ (a multinomial coefficient)

• For ABCDE: $N = \binom{5}{1,1,1,1,1} = 120$. For ABCBA: $N = \binom{5}{2,2,1} = 30$

• M = (sequence length − pattern length + 1) × $\binom{\text{surplus length}}{\text{surplus dictionary values}}$

• For P = ABC and sequence ABCBA: $M = 3 \cdot \binom{2}{1,1} = 6$

• So the probability of ABCBA containing pattern P = ABC is M/N = 6/30 = 1/5
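The M/N calculation can be sketched in a few lines. This assumes single-character items and a contiguous pattern, and the function names are hypothetical. Note that M counts (position, surplus arrangement) pairs, so the ratio is exact only when the pattern can appear at most once in any reordering, as with ABC in ABCBA; when it can appear several times the ratio overcounts, which is the issue listed under Further Study.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def multinomial(counts):
    """Number of distinct orderings of a multiset with the given item counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

def containment_probability(sequence, pattern):
    """M/N: chance that a random reordering of `sequence` contains `pattern`.

    Overcounts if the pattern can occur more than once in a reordering.
    """
    seq_counts = Counter(sequence)
    pat_counts = Counter(pattern)
    # The pattern cannot occur if the sequence lacks any of its items.
    if any(seq_counts[item] < n for item, n in pat_counts.items()):
        return Fraction(0)
    surplus = seq_counts - pat_counts            # items left over after the pattern
    n_orderings = multinomial(seq_counts.values())
    positions = len(sequence) - len(pattern) + 1
    m = positions * multinomial(surplus.values())
    return Fraction(m, n_orderings)

print(containment_probability("ABCBA", "ABC"))  # 1/5, matching the slide
```

Since only factorials of small integers are involved, the cost per sequence does not depend on how many orderings exist.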


Advantages

• All work is probabilistic, so finding p-values is a very fast operation

• Longer patterns’ significance can be built off of shorter patterns’ significance

• Allows large, comprehensive sets of patterns to be judged for significance

• Could lead to a significance-based closed-frequent pattern finding algorithm


Related Works

• Randomization of real-valued matrices for assessing the significance of data mining results by Markus Ojala

• Ranking Sequential Patterns with Respect to Significance by Robert Gwadera

• Frequent Pattern Mining with Uncertain Data by Charu Aggarwal


Further Study

• Dealing with patterns occurring multiple times within one sequence

• Modifying significance calculation to allow for more flexibility while maintaining overall structure of data

• Algorithmic applications, especially in closed-frequent types of pattern finding algorithms


In Conclusion

• Our method makes significance testing accessible for general sequential patterns

• The combinatorial approach means it runs very fast

• The significance calculation is highly scalable for huge sets of patterns


Thank you for listening!