SAS Homework 4 Review Clustering and Segmentation

MIS2502 Data Analytics

• Using the AAEM.DUNGAREE data set
• Explore the data set: SALESTOT and STOREID
• Assign the ID role to STOREID
• Set the SALESTOT role to Rejected
• Add a Cluster node (Explore)
• In Properties select Internal Standardization => Standardize
• Run and Evaluate
• Change Properties Segment Max to 6
• Run and Evaluate
• Add a Segment Profile node (Assess)
• Run and Evaluate

Set Up

• Retail: looking for patterns in sales of types of jeans by store

Data Source - Edit Variables

Data Source – Explore

Note scale

Add Cluster Node, Standardize

Segments, Automatic: note root mean square std deviation

Change Number of Clusters to 6

Segments, Max 6: note root mean square std deviation

Segment Profile Node

Segment Profiles: red outline is the overall distribution

Questions

How do the SALESTOT and STOREID distributions differ from the other variables’ distributions (look at the histograms of each one)?

Assign STOREID a model role of ID and SALESTOT a model role of Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level.

Based on the variable descriptions on page 1 and your answer to part

Why do you think that the variable SALESTOT should be rejected?

Add a Cluster node to the diagram workspace and connect it to the Input Data node.

Select the Cluster node and select Internal Standardization => Standardization. Why is it important to standardize your inputs? (Hint: look at the range of the scales on the x axis of the histograms.)

Run the diagram from the Cluster node and examine the results. How many clusters are created?

What might be a problem with having so many clusters?

What is the highest root mean squared standard deviation among the clusters? Two hints:

• Look at the Mean Statistics window.
• The root mean squared standard deviation means basically the same thing as the sum of squares error: both measure how spread out the stores within a cluster are, so a lower value means a tighter cluster.
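To make the second hint concrete, here is a minimal Python sketch of a root mean squared standard deviation for one cluster. It assumes the statistic is the square root of the average within-cluster sample variance across the input variables; the exact divisor SAS uses may differ. The function name `rms_std` and the store values are made up for illustration and do not come from AAEM.DUNGAREE.

```python
from math import sqrt
from statistics import variance


def rms_std(cluster_rows):
    """Root mean squared standard deviation of one cluster.

    cluster_rows: list of stores, each a list of standardized input values.
    Assumption: average the per-variable sample variances, then take the root.
    """
    columns = list(zip(*cluster_rows))              # one tuple per input variable
    variances = [variance(col) for col in columns]  # sample variance (n - 1 divisor)
    return sqrt(sum(variances) / len(variances))


# Made-up standardized values for two hypothetical clusters of three stores.
tight_cluster = [[0.1, -0.2], [0.2, -0.1], [0.0, -0.3]]
loose_cluster = [[1.5, -1.0], [-0.5, 2.0], [0.5, 0.5]]

print(round(rms_std(tight_cluster), 3))
print(round(rms_std(loose_cluster), 3))
```

The cluster whose stores sit close together gets the smaller value, which is why a high number in the Mean Statistics window flags a loosely packed cluster.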

Distribution of STOREID

Distribution of SALESTOT

• The SALESTOT distribution does tell you that there are a handful of stores selling well below average

• These 2 variables aren’t useful for the product mix analysis.

Why Standardize?

• Note the difference in the range of numbers on the x axis
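A small Python sketch of what standardizing does to variables on different scales. It assumes z-score standardization (subtract the mean, divide by the standard deviation); the variable names and sales figures are invented for illustration and are not taken from the data set.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; an assumption
    return [(x - mu) / sigma for x in values]


# Hypothetical unit sales for five stores (made-up numbers).
large_scale = [2000, 2200, 1800, 2100, 1900]  # a high-volume jeans type
small_scale = [80, 120, 100, 90, 110]         # a low-volume jeans type

z_large = standardize(large_scale)
z_small = standardize(small_scale)

print([round(z, 2) for z in z_large])
print([round(z, 2) for z in z_small])
```

Before standardizing, a difference of a few hundred units in the high-volume variable would swamp any difference in the low-volume one when distances between stores are computed. After standardizing, both variables have mean 0 and standard deviation 1, so each contributes on a comparable scale.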

Segment Profile Node

Reading a Histogram

Look at the distribution in total, and then the individual bars. For this distribution you would say that for this segment, they sell fewer original jeans than average, and in a narrower range with less variability (not part of the question). Overall you can say this because the segment distribution is to the left of, and 'tighter' than, the overall distribution.

1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red) you can make some observations about what makes this segment different with regard to Original Jeans sold.

2) Note that you have 8 ranges of standardized sales volumes on the x axis for the overall average (the red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.

3) Note that for ranges 3, 4 and 5, the overall average (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23% and 31% respectively). You get this by reading the y axis.

4) Now look at the specific segment distribution (blue). For this segment approximately 86% of the stores sell within volume ranges 3 and 4.

5) Conclusion: Overall, this segment has more stores selling original jeans in lower volume ranges than the overall average. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
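The percentages quoted in the steps above can be checked with a few lines of Python. This is only an arithmetic sketch: the three overall shares and the 86% segment share are the values read off the chart in the text, and the variable names are illustrative.

```python
# Percent of stores in each volume range, read from the red (overall) bars.
overall_share = {3: 11, 4: 23, 5: 31}

# Percent of this segment's stores in ranges 3 and 4, read from the blue bars.
segment_share_3_4 = 86

overall_3_to_5 = sum(overall_share.values())        # ranges 3, 4 and 5 together
overall_3_4 = overall_share[3] + overall_share[4]   # ranges 3 and 4 only

print("Overall, ranges 3-5:", overall_3_to_5, "%")
print("Overall, ranges 3-4:", overall_3_4, "%")
print("Segment, ranges 3-4:", segment_share_3_4, "%")
```

Comparing like with like, ranges 3 and 4 hold 34% of stores overall but about 86% of the stores in this segment, which is the basis for saying the segment is concentrated in the lower volume ranges.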

Segment Profiles, Original: red outline is the overall distribution

In Class

Answer the questions about this output:

1. How many distinct customer groups (segments) are there?

2. Explain how the customers in cluster 1 are different from cluster 2.

3. What aspect of the customer data most differentiates cluster 1 from cluster 3?

4. Which cluster has the highest cohesion? In practical terms, what does that mean?

In Class – Evaluating Clustering Output

5. Is the root mean squared standard deviation of these clusters higher or lower than they were in the three cluster scenario? Why?

6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?

7. Which scenario (#1 or #2) has higher cohesion among its clusters?

8. Which scenario (#1 or #2) has higher separation between its clusters?
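Cohesion and separation can be illustrated with a short Python sketch. It assumes cohesion is measured by the within-cluster sum of squared distances to the centroid (lower means more cohesive) and separation by the distance between centroids (higher means better separated). The two clusters and their points are made up and are not the in-class output.

```python
from math import dist


def centroid(points):
    """Mean of each coordinate across the points in a cluster."""
    return [sum(coord) / len(points) for coord in zip(*points)]


def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    center = centroid(points)
    return sum(dist(p, center) ** 2 for p in points)


# Two hypothetical clusters of standardized customer records.
clusters = {
    "A": [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]],
    "B": [[3.0, 3.0], [3.5, 2.5], [2.5, 3.5]],
}

cohesion = {name: sse(points) for name, points in clusters.items()}
separation = dist(centroid(clusters["A"]), centroid(clusters["B"]))

print({name: round(value, 3) for name, value in cohesion.items()})
print(round(separation, 3))
```

Here cluster A has the lower sum of squares, so it is the more cohesive of the two. Splitting data into more clusters usually lowers the within-cluster spread but also tends to bring neighboring centroids closer together, which is the trade-off questions 5 through 8 ask about.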
