SAS HOMEWORK 4 REVIEW CLUSTERING AND SEGMENTATION MIS2502 Data Analytics

Page 1: SAS Homework 4 Review Clustering and Segmentation

SAS HOMEWORK 4 REVIEW CLUSTERING AND SEGMENTATION

MIS2502 Data Analytics

Page 2: SAS Homework 4 Review Clustering and Segmentation

SAS Homework 4 Review Clustering and Segmentation

• Using AAEM.DUNGAREE Data Set
• Explore data set: SALESTOT and STOREID
• Assign ID to STOREID
• SALESTOT Role – Rejected
• Add a Cluster node (Explore)
• In Properties select Internal Standardization => Standardize
• Run and Evaluate
• Change Properties Segment Max to 6
• Run and Evaluate
• Add a Segment Profile node (Assess)
• Run and Evaluate
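The steps above can be sketched outside of SAS as a rough illustration. This is a minimal sketch, not what the Cluster node does internally: it uses plain k-means with a deterministic farthest-point start, the store rows are made up, and the column names ORIGINAL and FASHION are assumptions chosen for illustration. Only STOREID and SALESTOT come from the slides.

```python
import math

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest(row, centroids):
    """Index of the centroid closest to this row (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(row, centroids[j]))

def kmeans(rows, k, max_iter=100):
    # Deterministic start (an assumption of this sketch): the first row, then
    # repeatedly the row farthest from the centroids chosen so far.
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(math.dist(r, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(row, centroids) for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up stores. STOREID is only an identifier and SALESTOT is rejected,
# so neither is passed to the clustering step.
stores = [
    {"STOREID": 1, "ORIGINAL": 1800, "FASHION": 60, "SALESTOT": 1860},
    {"STOREID": 2, "ORIGINAL": 1900, "FASHION": 70, "SALESTOT": 1970},
    {"STOREID": 3, "ORIGINAL": 1850, "FASHION": 65, "SALESTOT": 1915},
    {"STOREID": 4, "ORIGINAL": 900, "FASHION": 140, "SALESTOT": 1040},
    {"STOREID": 5, "ORIGINAL": 950, "FASHION": 150, "SALESTOT": 1100},
    {"STOREID": 6, "ORIGINAL": 1000, "FASHION": 145, "SALESTOT": 1145},
]
inputs = standardize([[s["ORIGINAL"], s["FASHION"]] for s in stores])
labels, centroids = kmeans(inputs, k=2)
segments = {s["STOREID"]: lab for s, lab in zip(stores, labels)}
```

Here `segments` maps each STOREID to a segment number, which is roughly the information the Segment Profile node works from.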

Page 3: SAS Homework 4 Review Clustering and Segmentation

Set Up

• Retail: looking for patterns in sales of types of jeans by store

Page 4: SAS Homework 4 Review Clustering and Segmentation

Data Source - Edit Variables

Page 5: SAS Homework 4 Review Clustering and Segmentation

Data Source – Explore

Note scale

Page 6: SAS Homework 4 Review Clustering and Segmentation

Add Cluster Node, Standardize

Page 7: SAS Homework 4 Review Clustering and Segmentation

Segments, Automatic

Note root mean square std deviation

Page 8: SAS Homework 4 Review Clustering and Segmentation

Change Number of Clusters to 6

Page 9: SAS Homework 4 Review Clustering and Segmentation

Segments, Max 6

Note root mean square std deviation

Page 10: SAS Homework 4 Review Clustering and Segmentation

Segment Profile Node

Page 11: SAS Homework 4 Review Clustering and Segmentation

Segment Profiles

Red outline is the overall distribution

Page 12: SAS Homework 4 Review Clustering and Segmentation

Questions

How do the SALESTOT and STOREID distributions differ from the other variables’ distributions (look at the histograms of each one)?

Assign STOREID a model role of ID and SALESTOT a model role of Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level.

Based on the variable descriptions on page 1 and your answer to part

Why do you think that the variable SALESTOT should be rejected?

Add a Cluster node to the diagram workspace and connect it to the Input Data node.

Select the Cluster node and set Internal Standardization to Standardization. Why is it important to standardize your inputs? (Hint: look at the range of the scales on the X axis of the histograms.)

Run the diagram from the Cluster node and examine the results. How many clusters are created?

What might be a problem with having so many clusters?

What is the highest root mean squared standard deviation among the clusters? Two hints:

• Look at the Mean Statistics window.
• The root mean squared standard deviation means basically the same thing as the sum of squares error: both measure how spread out the stores in a cluster are around the cluster center.
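To make the second hint concrete, here is a small sketch of how the two quantities relate. It assumes the usual definition of a cluster's root mean squared standard deviation, namely the square root of the cluster's sum of squares error divided by (number of variables × (number of stores − 1)). The two clusters are made-up standardized values, not output from the homework.

```python
import math

def sum_of_squares_error(cluster):
    """Total squared distance from each point to the cluster centroid."""
    n = len(cluster)
    centroid = [sum(col) / n for col in zip(*cluster)]
    return sum((x - c) ** 2 for row in cluster for x, c in zip(row, centroid))

def rms_std(cluster):
    """Root mean squared standard deviation: SSE rescaled by variables and stores."""
    n = len(cluster)      # stores in the cluster
    p = len(cluster[0])   # input variables
    return math.sqrt(sum_of_squares_error(cluster) / (p * (n - 1)))

# Hypothetical clusters of standardized sales values (two variables each).
tight_cluster = [(0.9, -1.0), (1.0, -0.8), (1.1, -0.9)]
loose_cluster = [(-2.0, 0.0), (0.0, 2.0), (2.0, -2.0)]

tight_rms = rms_std(tight_cluster)
loose_rms = rms_std(loose_cluster)
```

The cluster with the larger sum of squares error also has the larger root mean squared standard deviation, so in both cases a lower value means a tighter, more cohesive cluster.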

Page 13: SAS Homework 4 Review Clustering and Segmentation

Distribution of STOREID

Page 14: SAS Homework 4 Review Clustering and Segmentation

Distribution of SALESTOT

• It does tell you that a handful of stores are selling well below average

• These 2 variables aren’t useful for the product mix analysis.

Page 15: SAS Homework 4 Review Clustering and Segmentation

Why Standardize?

• Note difference in range of numbers on x axis
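A small sketch of why the range difference matters. Clustering relies on distances between stores, and on the raw scale the variable with the widest range supplies almost all of the distance. The numbers below are hypothetical unit sales for two jeans types, not values from the DUNGAREE data, and the helper names are made up for this example.

```python
import statistics

# Hypothetical unit sales for four stores: (high-volume type, low-volume type).
sales = [(1800, 60), (1850, 150), (900, 62), (950, 148)]

def share_of_first_variable(a, b):
    """Fraction of the squared Euclidean distance between a and b that
    comes from the first variable."""
    d0 = (a[0] - b[0]) ** 2
    d1 = (a[1] - b[1]) ** 2
    return d0 / (d0 + d1)

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds))
            for row in rows]

# Compare the first and last store before and after standardizing.
raw_share = share_of_first_variable(sales[0], sales[3])
z = standardize(sales)
z_share = share_of_first_variable(z[0], z[3])
```

On the raw values nearly all of the distance between the two stores comes from the first variable. After standardizing, both variables contribute on a comparable footing, so the low-volume jeans type is no longer ignored.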

Page 16: SAS Homework 4 Review Clustering and Segmentation

Segment Profile Node

Page 17: SAS Homework 4 Review Clustering and Segmentation

Reading a Histogram

Look at the distribution in total, and then at the individual bars. For this segment, you would say that the stores sell fewer original jeans than average, and in a narrower range, that is, with less variability (not part of the question). You can say this because the segment's distribution is to the left of, and 'tighter' than, the overall distribution.

1) The red bars are the distribution of Original Jeans sales over all segments. By comparing the specific segment distribution (blue) to the overall distribution (red), you can make some observations about what makes this segment different with regard to Original Jeans sold.

2) Note that there are 8 ranges of standardized sales volumes on the x axis for the overall distribution (red). These are ordered from lowest (on the left) to highest (on the right). We established this earlier when looking at the individual segments.

3) Note that for ranges 3, 4, and 5, the overall distribution (red) shows that roughly 65% of stores sell in these volume ranges (11%, 23%, and 31% respectively). You get this by reading the Y axis.

4) Now look at the specific segment distribution (blue). For this segment, approximately 86% of the stores sell within volume ranges 3 and 4.

5) Conclusion: Overall, this segment has more stores selling original jeans in the lower volume ranges than the overall distribution does. Therefore, for this segment we can say that the stores sell fewer Original Jeans than average.
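The arithmetic behind this comparison can be written out as a short sketch. It uses only the percentages quoted on this slide; the variable names are made up for illustration.

```python
# Percent of all stores (red) in volume ranges 3, 4 and 5, read off the Y axis.
overall = {3: 11, 4: 23, 5: 31}

# Percent of this segment's stores (blue) in ranges 3 and 4 combined.
segment_share_ranges_3_4 = 86

overall_share_ranges_3_to_5 = sum(overall.values())
overall_share_ranges_3_4 = overall[3] + overall[4]

# How much more concentrated the segment is in the two lower ranges.
difference = segment_share_ranges_3_4 - overall_share_ranges_3_4
```

Overall, only 34% of stores fall in ranges 3 and 4, against about 86% for this segment. That gap is what places the blue distribution to the left of the red one.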

Page 18: SAS Homework 4 Review Clustering and Segmentation

Segment Profiles

Red outline is the overall distribution

Original

Page 19: SAS Homework 4 Review Clustering and Segmentation

In Class

Answer the questions about this output:

1. How many distinct customer groups (segments) are there?
2. Explain how the customers in cluster 1 are different from those in cluster 2.
3. What aspect of the customer data most differentiates cluster 1 from cluster 3?
4. Which cluster has the highest cohesion? In practical terms, what does that mean?

Page 20: SAS Homework 4 Review Clustering and Segmentation

In Class – Evaluating Clustering Output

5. Is the root mean squared standard deviation of these clusters higher or lower than it was in the three cluster scenario? Why?

6. Is the distance to the nearest cluster higher or lower than in the three cluster scenario? Why?

7. Which scenario (#1 or #2) has higher cohesion among its clusters?

8. Which scenario (#1 or #2) has higher separation between its clusters?
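When thinking about questions 5 through 8, it can help to compute cohesion and separation by hand on a toy example. The sketch below uses made-up points and hand-assigned clusters, not the in-class output. Cohesion is measured as the total sum of squares error (lower is more cohesive) and separation as the distance between the two closest cluster centers (higher is more separated).

```python
import math
from itertools import combinations

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def total_sse(clusters):
    """Cohesion: squared distance of every point to its own cluster center."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum((x - c) ** 2 for point in cluster for x, c in zip(point, center))
    return total

def min_separation(clusters):
    """Separation: distance between the two closest cluster centers."""
    centers = [centroid(c) for c in clusters]
    return min(math.dist(a, b) for a, b in combinations(centers, 2))

# The same eight hypothetical points, grouped two different ways.
few_clusters = [
    [(0, 0), (0, 1), (4, 0), (4, 1)],
    [(10, 0), (10, 1), (14, 0), (14, 1)],
]
many_clusters = [
    [(0, 0), (0, 1)],
    [(4, 0), (4, 1)],
    [(10, 0), (10, 1)],
    [(14, 0), (14, 1)],
]

sse_few, sse_many = total_sse(few_clusters), total_sse(many_clusters)
sep_few, sep_many = min_separation(few_clusters), min_separation(many_clusters)
```

In this toy case, splitting into more clusters lowers the sum of squares error (tighter clusters) but also brings the nearest cluster centers closer together. Whether the same holds for the in-class output has to be read from its own statistics.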