MS Sequence Clustering



What is it?

• We know clustering, especially EM (Expectation Maximization)

• Now, what is a sequence?

– A series of discrete events (states), usually finite

• Education path: high school, work, college, professional school, graduate school, community college

• Set of URLs, or parameters, at Amazon

• DNA (A, G, C, and T)


What does the algorithm do

• It is a hybrid of sequence analysis and clustering

• It is used to analyze a population of cases that contains sequence data and group those cases into clusters

• For example, at Amazon,

– If we only care about what is ordered, that could be an ordinary clustering problem

– If we care about which pages customers visit before purchasing (or not purchasing), that is a sequence clustering problem
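To make the distinction concrete, here is a minimal sketch in Python. The two browsing sessions and the helper names `page_set` and `transitions` are hypothetical, made up for illustration. Both sessions touch the same pages, so a view that ignores order cannot tell them apart, while a view based on page-to-page transitions can.

```python
from collections import Counter

# Two hypothetical sessions that visit the same pages in a different order.
session_a = ["home", "search", "product", "cart"]
session_b = ["home", "cart", "product", "search"]

def page_set(session):
    """Order-free view: which pages were visited at all."""
    return frozenset(session)

def transitions(session):
    """Order-aware view: counts of (current page, next page) pairs."""
    return Counter(zip(session, session[1:]))

same_pages = page_set(session_a) == page_set(session_b)
same_paths = transitions(session_a) == transitions(session_b)
print(same_pages, same_paths)
```

The first comparison is what ordinary clustering would see; the second is the extra information that sequence clustering uses.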


Amazon Example

• The company has click information for each customer profile.

• By using the Microsoft Sequence Clustering algorithm on this data, the company can find groups, or clusters, of customers who have similar patterns or sequences of clicks.

• The company can then use these clusters to analyze how users move through the Web site, to identify which pages are most closely related to the sale of a particular product, and to predict which pages are most likely to be visited next.


How the Algorithm Works

• One of the input columns that the Microsoft Sequence Clustering algorithm uses is a nested table that contains sequence data.

• This data is a series of state transitions of individual cases in a dataset, such as product purchases or Web clicks.

• To determine which sequence columns to treat as input columns for clustering, the algorithm measures the differences, or distances, between all the possible sequences in the dataset.

• After the algorithm measures these distances, it can use the sequence column as an input for the EM method of clustering.


Markov Chain

• Having the Markov property means that,

– Given the present state, future states are independent of the past states.

– Future states will be reached through a probabilistic process instead of a deterministic one

– P(x_{i+1} = G | x_i = A) = 0.15 says that, given the current state A, the probability of the next state being G is 0.15
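The properties above can be sketched as a small first-order chain in Python. Only the value P(G | A) = 0.15 comes from the slide; every other number in `TRANSITIONS`, and the helper names `next_state` and `path_probability`, are assumptions chosen for illustration.

```python
import random

# Hypothetical transition probabilities; only P(G | A) = 0.15 is from the slide.
TRANSITIONS = {
    "A": {"A": 0.60, "C": 0.10, "G": 0.15, "T": 0.15},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def next_state(current, rng):
    """Draw the next state using only the current state (Markov property)."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def path_probability(path):
    """Probability of the transitions along `path`, given its first state."""
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= TRANSITIONS[cur][nxt]
    return prob

rng = random.Random(0)  # fixed seed so the random walk is repeatable
walk = ["A"]
for _ in range(9):
    walk.append(next_state(walk[-1], rng))

print("".join(walk))
print(path_probability(["A", "G"]))
```

`next_state` never looks further back than the current state, and the draw is random rather than fixed, which are the two bullet points above in code form.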


The order of the chain

• An nth-order Markov chain over k states is equivalent to a first-order (1st-order) Markov chain over k^n states.

• For example, a 2nd-order chain over A, C, G, T is the same as a 1st-order chain over the 16 pairs AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
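The equivalence can be checked with a short Python sketch. The function names `composite_states` and `to_first_order` are hypothetical; the idea is simply that each composite state is a window of n consecutive original states.

```python
from itertools import product

def composite_states(states, order):
    """All k**order composite states for an order-`order` chain over `states`."""
    return ["".join(combo) for combo in product(states, repeat=order)]

def to_first_order(sequence, order):
    """Rewrite a sequence as overlapping windows of length `order`."""
    return [sequence[i:i + order] for i in range(len(sequence) - order + 1)]

pairs = composite_states("ACGT", 2)
print(len(pairs), pairs)
print(to_first_order("ACGTA", 2))
```

With k = 4 and n = 2 this enumerates 4^2 = 16 composite states, in the same order as the list on the slide.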


State Transition Matrix

• States are

– Finite

– Not too large in number

– Non-redundant

• If M is the number of states, the state transition matrix is an M × M matrix
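As an illustration, an M × M transition matrix can be estimated from observed sequences by counting transitions and normalizing each row. This is a minimal sketch, not the Microsoft implementation; the function name `transition_matrix`, the sample sequences, and the uniform fallback for unseen source states are all assumptions.

```python
def transition_matrix(sequences, states):
    """Estimate an M x M matrix where entry [i][j] is P(next = j | current = i)."""
    index = {s: i for i, s in enumerate(states)}
    m = len(states)
    counts = [[0] * m for _ in range(m)]
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[index[cur]][index[nxt]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Assumption: a state never seen as a source gets a uniform row.
            matrix.append([1.0 / m] * m)
    return matrix

states = ["A", "C", "G", "T"]
matrix = transition_matrix(["ACGT", "AACG", "TTAC"], states)
for state, row in zip(states, matrix):
    print(state, row)
```

The matrix grows with the square of the number of states, which is one reason to keep the state set small and non-redundant.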


Clustering with Markov Chain

1. Create clusters at random

2. Map each cluster to a Markov chain

3. Assign each case to a few clusters based on how well it fits and on cut-off numbers

4. Calibrate the clusters

5. Repeat steps 3 and 4 until convergence
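The five steps above can be sketched as EM for a mixture of first-order Markov chains. This is an illustrative Python sketch under stated assumptions, not Microsoft's implementation: the function name `fit_markov_mixture`, the smoothing parameter `alpha`, the convergence tolerance, and the sample data are all made up, every sequence is assumed to be non-empty, and each case is given a soft membership in every cluster rather than being cut off to "a few".

```python
import math
import random

def fit_markov_mixture(sequences, states, n_clusters, n_iter=50, seed=0, alpha=1.0):
    """EM for a mixture of first-order Markov chains (illustrative sketch)."""
    rng = random.Random(seed)
    idx = {s: i for i, s in enumerate(states)}
    m = len(states)
    data = [[idx[s] for s in seq] for seq in sequences]

    # Step 1: random soft assignment of each case to the clusters.
    resp = []
    for _ in data:
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])

    prev_ll = -math.inf
    for _ in range(n_iter):
        # Steps 2 and 4: fit one chain per cluster from weighted, smoothed counts.
        weights, starts, trans = [], [], []
        for k in range(n_clusters):
            start = [alpha] * m
            count = [[alpha] * m for _ in range(m)]
            for seq, r in zip(data, resp):
                start[seq[0]] += r[k]
                for a, b in zip(seq, seq[1:]):
                    count[a][b] += r[k]
            mass = sum(r[k] for r in resp)
            weights.append((mass + alpha) / (len(data) + alpha * n_clusters))
            start_total = sum(start)
            starts.append([x / start_total for x in start])
            trans.append([[x / sum(row) for x in row] for row in count])

        # Step 3: recompute each case's membership from how well each chain fits.
        ll = 0.0
        for n, seq in enumerate(data):
            logp = []
            for k in range(n_clusters):
                lp = math.log(weights[k]) + math.log(starts[k][seq[0]])
                lp += sum(math.log(trans[k][a][b]) for a, b in zip(seq, seq[1:]))
                logp.append(lp)
            top = max(logp)
            norm = top + math.log(sum(math.exp(lp - top) for lp in logp))
            resp[n] = [math.exp(lp - norm) for lp in logp]
            ll += norm

        # Step 5: stop once the log-likelihood no longer improves.
        if ll - prev_ll < 1e-6:
            break
        prev_ll = ll
    return resp, trans

# Hypothetical data: three alternating sequences and three "run" sequences.
sequences = ["ABABABAB", "BABABABA", "ABABAB", "AAAABBBB", "AAABBBBB", "AABBBBBB"]
resp, trans = fit_markov_mixture(sequences, states=["A", "B"], n_clusters=2)
for seq, r in zip(sequences, resp):
    print(seq, [round(x, 3) for x in r])
```

Each row of `resp` is one case's membership across the clusters, and each entry of `trans` is the calibrated transition matrix of one cluster. Like any EM procedure, the result depends on the random start, so different seeds can give different clusterings.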


Number of Clusters

• Sequence clustering may use more clusters than non-sequence clustering because the meaning of each cluster is more easily understood.


Sequence Clustering Viewer

• Cluster Diagram,

• Cluster Profiles,

• Cluster Characteristics,

• Cluster Discrimination, and

• State Transitions.


Cluster Diagram Tab

• The layout in the diagram represents the relationships of the clusters, where similar clusters are grouped close together. By default, the shade of the node color represents the density of all cases in the cluster—the darker the node, the more cases it contains.


Cluster Profiles Tab

• The Cluster Profiles tab displays the sequences that exist in each cluster. The clusters are listed in individual columns to the right of the States column.


Cluster Characteristics Tab

• The Cluster Characteristics tab summarizes the transitions between states in a cluster, with bars describing the importance of the attribute value for the selected cluster.


Cluster Discrimination Tab

• With the Cluster Discrimination tab, you can compare two clusters, to determine which models favor which clusters. The tab contains four columns: Variables, Values, Cluster 1, and Cluster 2. If the cluster favors a specific model, a blue bar appears in the Cluster 1 or Cluster 2 column in the row of the corresponding model in the Variables column. The longer the blue bar, the more the model favors the cluster.


State Transitions Tab

• On the State Transitions tab, you can select a cluster and browse through its state transitions. Each node represents a state of the model. A line represents the transition between states, and each node is based on the probability of a transition. The background color represents the frequency of the node in the cluster.
