227
Copyright © SAS Institute Inc. All rights reserved. Monte Carlo K-Means Clustering SAS Enterprise Miner Donald K. Wedding, PhD Director of Data Science Sprint Corporation [email protected]

Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Monte Carlo K-Means Clustering SAS Enterprise Miner

Donald K. Wedding, PhD

Director of Data Science

Sprint Corporation

[email protected]

Page 2: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

What Is Clustering?

Page 3: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

3

K-Means Clustering

• Technique can be used on other data such as CUSTOMER data

• K-Means clustering allows for grouping multiple variables simultaneously

• More sophisticated treatment of customers than is possible from simple segmentation

3

Page 4: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

4

K-Means Clustering Clusters based on AGE and INCOME

4

How many clusters do you see?

Page 5: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

5

K-Means Clustering Visual Inspection “proc eyeball”

5

There are FOUR clusters.

Page 6: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

6

K-Means Clustering

A bank might use these clusters for “cross sell”

• Recent Graduates : Overdraft Protection

• Peak Income : Mortgage, Heloc , Investment Account

• Retired : Trust Fund, Retirement Account, Estate Planning

• Unemployed : Unprofitable – “Choose to Lose”

6

Page 7: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

What Affects Cluster Quality?

Page 8: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

8

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 9: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

9

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 10: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

How Many Clusters?

Page 11: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

11

How Many Clusters: Example

Given the Following Data Points

• Find the cluster centers for N=2 Clusters

• Find the cluster centers for N=3 Clusters

• Find the cluster centers for N=4 Clusters

Page 12: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

12

How Many Clusters: Example

Page 13: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

13

How Many Clusters: Example 2 Clusters

13

Page 14: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

14

How Many Clusters: Example 2 Clusters

14

Page 15: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

15

How Many Clusters: Example 2 Clusters

15

Page 16: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

16

How Many Clusters: Example 2 Clusters

16

Page 17: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

17

How Many Clusters: Example 2 Clusters

17

Page 18: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

18

How Many Clusters: Example 2 Clusters

18

Page 19: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

19

How Many Clusters: Example 2 Clusters

19

Page 20: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

20

How Many Clusters: Example 3 Clusters

20

Page 21: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

21

How Many Clusters: Example 3 Clusters

21

Page 22: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

22

How Many Clusters: Example 3 Clusters

22

Page 23: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

23

How Many Clusters: Example 3 Clusters

23

Page 24: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

24

How Many Clusters: Example 3 Clusters

24

Page 25: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

25

How Many Clusters: Example 3 Clusters

25

Page 26: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

26

How Many Clusters: Example 3 Clusters

26

Page 27: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

27

How Many Clusters: Example 4 Clusters

27

Page 28: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

28

How Many Clusters: Example 4 Clusters

28

Page 29: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

29

How Many Clusters: Example 4 Clusters

29

Page 30: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

30

How Many Clusters: Example 4 Clusters

30

Page 31: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

31

How Many Clusters: Example 4 Clusters

31

Page 32: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

32

How Many Clusters: Example 4 Clusters

32

Page 33: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

33

How Many Clusters: Example 4 Clusters

33

Page 34: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

34

How Many Clusters: Example 4 Clusters

34

Page 35: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

35

Summary

Given the Following Data Points

• Find the cluster centers for N=2 Clusters

• Find the cluster centers for N=3 Clusters

• Find the cluster centers for N=4 Clusters

Page 36: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

36

Summary

36

Page 37: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

37

K-Means Clustering Clusters based on AGE and INCOME

37

How many clusters do you see?

Page 38: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

38

38

Page 39: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

39

39

Page 40: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

40

40

Page 41: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

41

41

Page 42: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

42

Page 43: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

43

Page 44: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

44

44

Page 45: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

45

Page 46: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

46

Page 47: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

47

Page 48: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

48

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 49: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

49

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 50: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

What Are The Center Points?

Page 51: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

51

Cluster Seeds: Example

Given the Following Data Points

• Find the cluster centers for N=3 Clusters

• Find the cluster centers using Starting Point “A”

• Find the cluster centers using Starting Point “B”

Page 52: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

52

Cluster Seeds: Example

Page 53: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

53

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

53

Page 54: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

54

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

54

Page 55: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

55

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

55

Page 56: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

56

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

56

Page 57: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

57

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

57

Page 58: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

58

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

58

Page 59: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

59

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

59

Page 60: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

60

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “A”

60

Page 61: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

61

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

61

Page 62: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

62

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

62

Page 63: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

63

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

63

Page 64: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

64

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

64

Page 65: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

65

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

65

Page 66: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

66

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

66

Page 67: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

67

Cluster Starting Points “Seeds” 3 Clusters: Starting Point “B”

67

Page 68: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

68

Summary

Page 69: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

69

Summary

Page 70: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

70

K-Means Clustering Clusters based on AGE and INCOME

70

How many clusters do you see?

Page 71: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

71

71

Page 72: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

72

72

Page 73: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

73

73

Page 74: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

74

74

Page 75: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

75

75

Page 76: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

76

76

Page 77: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

77

Summary

Page 78: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

78

Summary

Page 79: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

79

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 80: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

80

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Page 81: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Approximate The Number of Clusters

Page 82: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Diagram 4300

Page 83: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

83

Page 84: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

84

How Many Clusters?

Set the number of clusters to”automatic”

Set the Following Parameters: • Preliminary Max = 50

Assume that initially there might be as many as 50 clusters

• Minimum = 2 When complete, there will be at least 2 clusters.

• Final Maximum = 20 When complete, there will be no more than 20 clusters.

Page 85: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

85

How Many Clusters?

• SAS Enterprise Miner allows user to “guess” at the number of clusters within a RANGE (example: at least 2 and at most 20 is default)

• SAS Enterprise Miner will estimate the optimal number of clusters

• Optimal number of clusters will vary depending upon clustering parameters.

• STEP1: Narrow the “Search Range” by clustering using multiple parameters

Page 86: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

86

How Many Clusters?

Measurement of cluster distances • Average • Centroid • Ward (Default)

Page 87: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

87

Cluster Selection Methods SAS Enterprise Miner

• Average

Calculate the average distance from every point in one cluster to every point in another cluster

• Centroid

Find the distance from one cluster center point to another cluster center point

• Ward (Default Method)

Cluster measurement is based on the ANOVA sum of squares of the two clusters

Page 88: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

88

How Many Clusters?

How are Initial Clusters Centers Chosen? • First “n” Records • MacQueen Drifting • Full Replacement • Princomp • Partial Replacement (Default)

Page 89: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

89

Cluster Seed Selection SAS Enterprise Miner

• First “N” Records Method

• Use the first “N” records in the list as seeds

• Partial Replacement Method (Default)

• Select “N” records that are far away from each other

• Full Replacement Method

• Select “N” records that are very far away from each other by looking for outliers.

• Principal Component Method

• Select “N” evenly spaced records along the first Principal Component Vector

• MacQueen “Drifting” Method

• Use the first “N” records in the list as seeds. Assign records to clusters one by one and recomputes center after each record is assigned aka “drifting”.

Page 90: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

90

Approximate Number Of Clusters Diagram 4300

Page 91: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

91

Page 92: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

92

Page 93: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

93

Page 94: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

94

Page 95: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

95

Example 1: Random Seeds – Synthetic Data SAS Program to generate synthetic data

95

• Program creates 1000 data points with two values: X,Y

• 200 points centered at (3,3)

• 200 points centered at (5,5)

• 200 points centered at (4,6)

• 200 points centered at (6,4)

• 200 points centered at (4,4)

• Each X and Y value has noise added to

• Normally distributed random number

• Random number is multiplied by a weight of 0.5

Page 96: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

96

Example 1: Random Seeds – Synthetic Data %let COUNT = 200 %let WEIGHT = 0.5; %let SEED = 1; %let INFILE = INFILE; %let OUTFILE = RANDOM_DATA; data &INFILE.; do I = 1 to &COUNT.; X = 3.0; Y = 3.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 5.0; Y = 5.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 6.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 6.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; end; drop I; run; data &OUTFILE.; set &INFILE.; X = X + &WEIGHT.*NOISE_X; Y = Y + &WEIGHT.*NOISE_Y; drop NOISE_X; drop NOISE_Y; run;

Page 97: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

97

Noise Level 0.5

Page 98: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

98

Page 99: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

99

Random Seeds – Shuffle Cards %let SEED = 1; %let INFILE = RANDOM_DATA; %let TEMPFILE = TEMPFILE; %let OUTFILE = SORTED_DATA; data &TEMPFILE.; set &INFILE.; SORT = ranuni( &SEED. ); run; proc sort data=&TEMPFILE.; by SORT; run; data &OUTFILE.; set &TEMPFILE.; drop SORT; run; proc print data=&OUTFILE.(obs=5); run;

Page 100: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

100

Random Seeds – Shuffle Cards %let SEED = 1; %let INFILE = RANDOM_DATA; %let TEMPFILE = TEMPFILE; %let OUTFILE = SORTED_DATA; data &TEMPFILE.; set &INFILE.; SORT = ranuni( &SEED. ); run; proc sort data=&TEMPFILE.; by SORT; run; data &OUTFILE.; set &TEMPFILE.; drop SORT; run; proc print data=&OUTFILE.(obs=5); run;

Random Number Seed: Changing this value will cause the list of data points to be put in a different order (“shuffled”)

Page 101: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

101

Page 102: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

102

How Many Clusters?

Page 103: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

103

Ward / First = 7 clusters

Page 104: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

104

Ward / MacQueen = 3 clusters

Page 105: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

105

Ward / Full = 5 clusters

Page 106: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

106

Ward / Princomp = 5 clusters

Page 107: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

107

Ward / Partial = 3 clusters

Page 108: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

108

Page 109: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

109

Average / First = 5 clusters

Page 110: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

110

Average / MacQueen = 5 clusters

Page 111: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

111

Average / Full = 7 clusters

Page 112: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

112

Average / Princomp = 5 clusters

Page 113: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

113

Average / Partial = 5 clusters

Page 114: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

114

Page 115: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

115

Centroid / First = 4 clusters

Page 116: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

116

Centroid / MacQueen = 5 clusters

Page 117: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

117

Centroid / Full = 6 clusters

Page 118: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

118

Centroid / Princomp = 5 clusters

Page 119: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

119

Centroid / Partial = 6 clusters

Page 120: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

120

How Many Clusters?

Cluster Ward Average Centroid

First 7 5 4

MacQueen 3 5 5

Full 5 7 6

Princomp 5 5 5

Partial 3 5 6

Page 121: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

121

How Many Clusters?

Number of Cluster Count

3 clusters 2

4 clusters 1

5 clusters 8

6 clusters 2

7 clusters 2

Page 122: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

122

How Many Clusters?

Number of Cluster Count

3 clusters 2

4 clusters 1

5 clusters 8

6 clusters 2

7 clusters 2

Page 123: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

123

123

Page 124: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

124

124

Page 125: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

125

Page 126: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

126

Page 127: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

127

How Many Clusters?

Number of Cluster Count

3 clusters 2

4 clusters 1

5 clusters 8

6 clusters 2

7 clusters 2

Page 128: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

128

How Many Clusters?

The Number of Clusters Found Depends Upon

• Cluster Starting Points

• Clustering Method

Certain Numbers occur more frequently than others

• Trial and Error suggests 3 to 7 Clusters

• Probably 5 Clusters is optimal

Page 129: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Starting Points Affect Clusters “Your Mileage May Vary”

Page 130: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Diagram 4400

Page 131: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

131

Different Seed Selection Methods: Diagram 4400

Page 132: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

132

Random Seeds – Synthetic Data %let COUNT = 200 %let WEIGHT = 0.2; %let SEED = 1; %let INFILE = INFILE; %let OUTFILE = RANDOM_DATA; data &INFILE.; do I = 1 to &COUNT.; X = 3.0; Y = 3.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 5.0; Y = 5.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 6.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 6.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; end; drop I; run; data &OUTFILE.; set &INFILE.; X = X + &WEIGHT.*NOISE_X; Y = Y + &WEIGHT.*NOISE_Y; drop NOISE_X; drop NOISE_Y; run;

Page 133: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

133

Random Seeds – Synthetic Data %let COUNT = 200 %let WEIGHT = 0.2; %let SEED = 1; %let INFILE = INFILE; %let OUTFILE = RANDOM_DATA; data &INFILE.; do I = 1 to &COUNT.; X = 3.0; Y = 3.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 5.0; Y = 5.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 6.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 6.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; X = 4.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output; end; drop I; run; data &OUTFILE.; set &INFILE.; X = X + &WEIGHT.*NOISE_X; Y = Y + &WEIGHT.*NOISE_Y; drop NOISE_X; drop NOISE_Y; run;

Changed “Noise Weight” • From 0.5 • To 0.2

Less “Noise” was introduced. Clusters will be more “defined”

Page 134: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

134

Different Seed Selection Methods

Page 135: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

135

Different Seed Selection Methods

Page 136: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

136

Random Seeds – Shuffle Cards %let SEED = 2; %let INFILE = RANDOM_DATA; %let TEMPFILE = TEMPFILE; %let OUTFILE = SORTED_DATA; data &TEMPFILE.; set &INFILE.; SORT = ranuni( &SEED. ); run; proc sort data=&TEMPFILE.; by SORT; run; data &OUTFILE.; set &TEMPFILE.; drop SORT; run; proc print data=&OUTFILE.(obs=5); run;

Page 137: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

137

Random Seeds – Shuffle Cards %let SEED = 2; %let INFILE = RANDOM_DATA; %let TEMPFILE = TEMPFILE; %let OUTFILE = SORTED_DATA; data &TEMPFILE.; set &INFILE.; SORT = ranuni( &SEED. ); run; proc sort data=&TEMPFILE.; by SORT; run; data &OUTFILE.; set &TEMPFILE.; drop SORT; run; proc print data=&OUTFILE.(obs=5); run;

Random Number Seed: Changing this value will cause the list of data points to be put in a different order (“shuffled”)

Page 138: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

138

Different Seed Selection Methods

Page 139: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

139

Different Seed Selection Methods

Set the numbers to exactly 5 clusters

Use first 5 data points as cluster seeds. • Repeat for “Partial” • Repeat for “Full” • Repeat for “MacQueen” • Repeat for “Princomp”

Page 140: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

140

Different Seed Selection Methods

Page 141: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

141

First “N” Selection Method

Page 142: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

142

MacQueen Selection Method

Page 143: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

143

Full Selection Method

Page 144: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

144

Partial Selection Method

Page 145: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

145

Princomp Selection Method

Page 146: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

146

146

Page 147: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

147

147

Page 148: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

148

148

Page 149: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

149

What Are The Cluster Centers?

Different Starting Points and Settings Can Yield Different Results

• Occasionally sub-optimal clusters are found

• Usually the same optimal clusters are found regardless of starting points and settings

Five different settings

• 2 of 5 have sub optimal Clusters

• 3 of 5 have optimal cluster

- Even sub-optimal Clusters have some similarity to optimal clusters

Page 150: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Monte Carlo Clustering

Page 151: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Monte Carlo Macros

Page 152: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

152

Monte Carlo Clustering

Cluster data repeatedly:

• Use different methods for determining starting points

• Use different clustering methods

After each clustering algorithm finishes:

• After each iteration, record the number of clusters

• After each iteration, record the cluster centers

After numerous iterations:

• Determine the correct number of clusters

• Cluster the “Cluster Centers”

Page 153: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

153

SAS Macro: Sleep

• Macro will cause the SAS Program to “sleep” for a specified number of seconds.

• This gives the operating system time to write files to disk and prevents deadlocks.

Parameters

%SLEEP( HOWLONG );

- HowLong : How many seconds should the program “sleep”

Page 154: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

154

SAS Macro: Sleep

%macro SLEEP( HOWLONG ); data; time_slept=sleep(&HOWLONG.,1); run; %mend;

Page 155: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

155

SAS Macro: Save_Cluster_Info

• Stores the number of clusters and the cluster centers found by

• SAS Enterprise Miner Cluster Node

• SAS Enterprise Miner SOM/Kohonen Node

• Results are collected from Enterprise Miner nodes and appended to SAS data files

• Clusters with rare membership are deleted

Page 156: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

156

SAS Macro: Save_Cluster_Info

CENTERFILE : Cluster Centers from SAS Enterprise Miner

OUTFILE_CENTERS : File to store the Cluster Centers

OUTFILE_HOWMANY : File to store the number of Cluster Centers

TEMPFILE : Temporary File to hold data

HOWMANY : Name of the variable that will store the number of clusters

CUTOFFPCT : If a clusters has less than this percent of the records, delete it

HOWLONG : How many seconds to sleep between functions

%SAVE_CLUSTER_INFO( CENTERFILE, OUTFILE_CENTERS, OUTFILE_HOWMANY, TEMPFILE = TEMPFILE, HOWMANY = _HOWMANY_, CUTOFFPCT = 0.1, HOWLONG = 1 );

Page 157: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

157

SAS Macro: Save_Cluster_Info Sample Run: First run found “7” Clusters

157

How Many Clusters File:

Cluster Center File:

Obs _HOWMANY_ 1 7

Obs _HOWMANY_ X Y 1 7 3.03363 2.74416 2 7 4.16234 3.71581 3 7 6.03689 3.95565 4 7 4.00574 4.82218 5 7 2.71785 3.49664 6 7 3.96880 6.14890 7 7 5.12917 5.06041

Page 158: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

158

SAS Macro: Save_Cluster_Info Sample Run: Second Run found “5” Clusters

158

How Many Clusters File:

Cluster Center File:

Obs _HOWMANY_ X Y 1 7 3.03363 2.74416 2 7 4.16234 3.71581 3 7 6.03689 3.95565 4 7 4.00574 4.82218 5 7 2.71785 3.49664 6 7 3.96880 6.14890 7 7 5.12917 5.06041 8 5 2.90993 3.03597 9 5 4.06764 3.98039 10 5 6.01687 3.95821 11 5 5.02613 5.04958 12 5 3.94580 6.00564

Obs _HOWMANY_ 1 7 2 5

Page 159: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

159

SAS Macro: Save_Cluster_Info Sample Run: Third Run found “5” Clusters

159

How Many Clusters File:

Cluster Center File:

Obs _HOWMANY_ 1 7 2 5 3 5

Obs _HOWMANY_ X Y 1 7 3.03363 2.74416 2 7 4.16234 3.71581 3 7 6.03689 3.95565 4 7 4.00574 4.82218 5 7 2.71785 3.49664 6 7 3.96880 6.14890 7 7 5.12917 5.06041 8 5 2.90993 3.03597 9 5 4.06764 3.98039 10 5 6.01687 3.95821 11 5 5.02613 5.04958 12 5 3.94580 6.00564 13 5 6.02094 3.95189 14 5 3.95058 6.00808 15 5 5.05562 5.05453 16 5 4.07909 4.01951 17 5 2.92715 3.03838

Page 160: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

160

SAS Macro: Save_Cluster_Info (page 1 of 3)

%macro CLUSTER_SLEEP( HOWLONG ); data; time_slept=sleep(&HOWLONG.,1); run; %mend; %macro SAVE_CLUSTER_INFO( CENTERFILE, OUTFILE_CENTERS, OUTFILE_HOWMANY, TEMPFILE = TEMPFILE, HOWMANY = _HOWMANY_, CUTOFFPCT = 0.1, HOWLONG = 1 ); data &TEMPFILE.; set &CENTERFILE.; drop _RADIUS_; drop _CRIT_ _XCONV_ _FCONV_ _RMSSTD_ _NEAR_ _GAP_ _SEGMENT_; drop _CRIT_ _XCONV_ _FCONV_ SOM_SEGMENT _RMSSTD_ _NEAR_ _GAP_ SOM_DIMENSION1 SOM_DIMENSION2 SOM_ID; run;

Page 161: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

161

SAS Macro: Save_Cluster_Info (page 2 of 3) data; set &TEMPFILE.; retain &HOWMANY.; if _N_ = 1 then &HOWMANY. = 0; &HOWMANY. = &HOWMANY. + _FREQ_; call symput("HOWMANYCOUNT", &HOWMANY. ); run; data &TEMPFILE.; set &TEMPFILE.; if _FREQ_ / &HOWMANYCOUNT. * 100 < &CUTOFFPCT. then delete; run; data; set &TEMPFILE.; retain &HOWMANY.; if _N_ = 1 then &HOWMANY. = 0; &HOWMANY. = &HOWMANY. + 1; call symput("HOWMANYCOUNT", &HOWMANY. ); run; data &TEMPFILE.; length &HOWMANY. 8.; set &TEMPFILE.; &HOWMANY. = &HOWMANYCOUNT.; drop _FREQ_; run;

Page 162: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

162

SAS Macro: Save_Cluster_Info (page 3 of 3)

%cluster_sleep(&HOWLONG.); proc append data=&TEMPFILE. out=&OUTFILE_CENTERS. force; run; %cluster_sleep(&HOWLONG.); data &TEMPFILE.; set &TEMPFILE.(obs=1); keep &HOWMANY.; run; %cluster_sleep(&HOWLONG.); proc append data=&TEMPFILE. out=&OUTFILE_HOWMANY. force; run; %cluster_sleep(&HOWLONG.); %mend;

Page 163: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

%include the MACRO

Page 164: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

164

SAS Enterprise Miner Project Start Code

Page 165: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

165

SAS Enterprise Miner Project Start Code

Page 166: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

166

SAS Enterprise Miner Project Start Code

Page 167: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

167

SAS Enterprise Miner Project Start Code

Page 168: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

EXAMPLE-Using the Macro: Diagram 5100

Page 169: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

169

Cluster Node Data Collection

1. Use same Synthetic Data Program as Example 3 of Lecture 4. The Noise factor is set to 0.5

• 200 points centered at (3,3)

• 200 points centered at (5,5)

• 200 points centered at (4,6)

• 200 points centered at (6,4)

• 200 points centered at (4,4)

2. Use same “shuffle program” use SEED = -1

• A value of -1 causes the computer clock to be used as a “seed”.

• This results in a different random seed being used every time the program is executed.

Page 170: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

170

Cluster Node Data Collection

3. Cluster Node Settings

• Ward Clustering (but any method will do)

• Partial Replacement Cluster Seed (but any method will do)

• Automatic Cluster Selection

- Max 7: (Maximum Value from Example 3 of Lecture 4)

- Min 3: (Minimum Value from Example 3 of Lecture 4)

4. Save the Cluster Centers using the Save_Cluster_Info Macro.

Page 171: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

171

Cluster Node Data Collection

5. Rerun Numerous Times

• Shuffle

• Cluster

• Save_Cluster_Info

6. Cluster the Clusters Centers

Page 172: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

172

Cluster Node Data Collection Enterprise Miner Diagram

Page 173: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

173

Cluster Node Data Collection Enterprise Miner Diagram

Create synthetic data with the noise factor of 0.5

Page 174: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

174

Cluster Node Data Collection Enterprise Miner Diagram

Page 175: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

175

Cluster Node Data Collection Enterprise Miner Diagram

Shuffle the data with a random seed of -1

(tied to clock)

Page 176: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

176

Cluster Node Data Collection Enterprise Miner Diagram

Set the “Rerun” to “Yes” so that this code node will rerun every time and will reshuffle the data.

Page 177: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

177

Cluster Node Data Collection Enterprise Miner Diagram

%let SEED = -1; %let INFILE = &EM_IMPORT_DATA.; %let TEMPFILE = TEMPFILE; %let OUTFILE = SORTED_DATA; data &TEMPFILE.; set &INFILE.; SORT = ranuni( &SEED. ); run; proc sort data=&TEMPFILE.; by SORT; run; data &OUTFILE.; set &TEMPFILE.; drop SORT; run; proc print data=&OUTFILE.(obs=5); run;

Random Number Seed is set to -1. This ties the random number seed to the clock.

Every time this program is executed, the data will be in a different order.

Page 178: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

178

Cluster Node Data Collection Enterprise Miner Diagram

Create clusters using the data points that were shuffled in the

previous node.

Page 179: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

179

Cluster Node Data Collection Enterprise Miner Diagram

Number of clusters is set to “Automatic”. The MAX is set to “7” and the MIN is set to “3” because that was the range found in Lecture 4 Example 3. The Clustering Method is set to “Ward”, but “Average” or “Centroid could also be used.

Seed Intitialization is set to “Partial Replacement” but other methods could be used.

Page 180: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

180

Cluster Node Data Collection Enterprise Miner Diagram

Call the SAS Macro “Save_Cluster_Info” in order to save the results from the cluster node.

Page 181: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

181

Cluster Node Data Collection Enterprise Miner Diagram

%let INFILE = &EM_IMPORT_CLUSMEAN.; %let CENTERFILE = SGFLIB.y5100_CENTERFILE; %let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE; proc print data=&INFILE.; run; %save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. ); proc print data=&CENTERFILE.(obs=30); run; proc print data=&HOWMANYFILE.(obs=10); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &EM_EXPORT_TRAIN.; set &CENTERFILE.; run;

The Macro “Save_Cluster_Info” is called. The number of clusters is stored in a file called “yHOWMANYFILE” and the actual clusters are saved in a file called “yCENTERFILE”.

Page 182: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

182

Cluster Node Data Collection Enterprise Miner Diagram

%let INFILE = &EM_IMPORT_CLUSMEAN.; %let CENTERFILE = SGFLIB.y5100_CENTERFILE; %let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE; proc print data=&INFILE.; run; %save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. ); proc print data=&CENTERFILE.(obs=30); run; proc print data=&HOWMANYFILE.(obs=10); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &EM_EXPORT_TRAIN.; set &CENTERFILE.; run;

The Macro “Save_Cluster_Info” is called. The number of clusters is stored in a file called “yHOWMANYFILE” and the actual clusters are saved in a file called “yCENTERFILE”.

Page 183: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

183

Cluster Node Data Collection Enterprise Miner Diagram

This graphing box is not necessary. It is being used for illustration purposes to display the cluster center points.

Page 184: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

184

Cluster Node Data Collection Results: Run 1

Page 185: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

185

Cluster Node Data Collection Center Points After 1 Run

Page 186: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

186

Cluster Node Data Collection Results: Run 2

Page 187: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

187

Cluster Node Data Collection Center Points After 2 Runs

Page 188: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

188

Cluster Node Data Collection Results: Run 4

Page 189: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

189

Cluster Node Data Collection Center Points After 4 Runs

Page 190: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

190

Cluster Node Data Collection Enterprise Miner Diagram

After running the code TEN times, the graph suggests that the 5 cluster solution (red boxes) will place the centers in roughly the same places. The 7 cluster solution (blue boxes) will place the centers is roughly the same places (but these will be different from the red boxes).

Page 191: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

191

Cluster Node Data Collection Center Points After 10 Runs

Page 192: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Looping in SAS Enterprise Miner

Page 193: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

193

Automated Data Collection

Manually executing the Data Collection Program is time consuming

SAS Enterprise Miner has a “Looping” Structure to automate Cluster Data Collection

IMPORTANT: Occasionally when SAS Enterprise Miner is “Looping”, then a error might occur. This is usually a result of a file deadlock state. It does not matter. Just exit Enterprise Miner and start running it again if you wish. You might have already collected enough samples by that point in time, so rerunning may not be necessary.

Page 194: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

194

Automated Data Collection Enterprise Miner Diagram

Page 195: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

195

Automated Data Collection Enterprise Miner Diagram

START GROUP: Start of an Enterprise Miner “Loop”

The Nodes Inside the“Start”/”End” Group will execute multiple times

END GROUP: End of an Enterprise Miner “Loop”

Page 196: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

196

Automated Data Collection Enterprise Miner Diagram

START GROUP: Start of an Enterprise Miner “Loop”

END GROUP: End of an Enterprise Miner “Loop”

Page 197: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

197

Automated Data Collection Enterprise Miner Diagram

• Rerun = Yes Mode: • Index Informs SAS that it will loop “N”

number of time. Index Count: • The Number of times the loop will

execute. In this case the number will be “3” but the number can be set to a much higher value if a person plans to be away from their computer for a while.

Page 198: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

198

Automated Data Collection Enterprise Miner Diagram

Same “shuffle” data node as in previous example. The seed is -1 which means it is tied to the clock.

Rerun is set to “YES” just as before.

Page 199: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

199

Automated Data Collection Enterprise Miner Diagram

Same “Cluster Node” as in previous example. All settings should be the same as before.

Page 200: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

200

Automated Data Collection Enterprise Miner Diagram

Same “Save Results” Code Node as in the previous example. Note: “PROC PRINT” and other output PROCS won’t display inside of a loop.

Page 201: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

201

Automated Data Collection Enterprise Miner Diagram

Print the Results

Page 202: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

202

Automated Data Collection Enterprise Miner Diagram

%let CENTERFILE = SGFLIB.y5100_CENTERFILE; %let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE; proc print data=&CENTERFILE.(obs=100); run; proc print data=&HOWMANYFILE.(obs=100); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run;

Page 203: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

203

Automated Data Collection Enterprise Miner Diagram

After running 233 time, it is observed that

• 70% of the time, 5 clusters are found

• 26% of the time, 7 clusters are found

Note: Because of the nature of the random number generator, rerunning this model might yield slightly different results.

Page 204: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

204

Automated Data Collection Clusters = 5 Center Points

Page 205: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

205

Automated Data Collection Clusters = 7 Center Points

Page 206: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Cluster the Centers

Page 207: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

207

Cluster the Cluster Centers Enterprise Miner Diagram

Page 208: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

208

Cluster the Cluster Centers Enterprise Miner Diagram

Page 209: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

209

Automated Data Collection Enterprise Miner Diagram

*%let CENTERFILE = SGFLIB.y5100_CENTERFILE; *%let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE; %let CENTERFILE = SGFLIB.z5100_CENTERFILE; %let HOWMANYFILE = SGFLIB.z5100_HOWMANYFILE; %let OUTFILE = &EM_EXPORT_TRAIN.; proc print data=&CENTERFILE.(obs=100); run; proc print data=&HOWMANYFILE.(obs=100); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &OUTFILE.; set &CENTERFILE.; run;

Page 210: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

210

Automated Data Collection Enterprise Miner Diagram

*%let CENTERFILE = SGFLIB.y5100_CENTERFILE; *%let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE; %let CENTERFILE = SGFLIB.z5100_CENTERFILE; %let HOWMANYFILE = SGFLIB.z5100_HOWMANYFILE; %let OUTFILE = &EM_EXPORT_TRAIN.; proc print data=&CENTERFILE.(obs=100); run; proc print data=&HOWMANYFILE.(obs=100); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &OUTFILE.; set &CENTERFILE.; run;

For convenience, the program was already run 200+ times and the results were stored in the files: SGFLIB.z5100_CENTERFILE;

SGFLIB.z5100_HOWMANYFILE;

Page 211: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

211

Cluster the Cluster Centers Enterprise Miner Diagram

Page 212: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

212

Automated Data Collection Enterprise Miner Diagram

%let INFILE = &EM_IMPORT_DATA.; %let OUTFILE = &EM_EXPORT_TRAIN.; %let HOWMANY = 5; proc print data=&INFILE.(obs=100); run; data &OUTFILE.; set &INFILE.; if _HOWMANY_ = &HOWMANY.; drop _HOWMANY_; run;

Page 213: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

213

Automated Data Collection Enterprise Miner Diagram

%let INFILE = &EM_IMPORT_DATA.; %let OUTFILE = &EM_EXPORT_TRAIN.; %let HOWMANY = 5; proc print data=&INFILE.(obs=100); run; data &OUTFILE.; set &INFILE.; if _HOWMANY_ = &HOWMANY.; drop _HOWMANY_; run;

Only keep the CENTER POINTS for the times when 5 clusters were found.

Page 214: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

214

Example 3: Cluster the Cluster Centers Cluster of Center Points = 5

Page 215: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

215

Automated Data Collection Enterprise Miner Diagram

%let INFILE = &EM_IMPORT_DATA.; %let OUTFILE = &EM_EXPORT_TRAIN.; %let HOWMANY = 7; proc print data=&INFILE.(obs=100); run; data &OUTFILE.; set &INFILE.; if _HOWMANY_ = &HOWMANY.; drop _HOWMANY_; run;

Only keep the CENTER POINTS for the times when 7 clusters were found.

Page 216: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

216

Example 3: Cluster the Cluster Centers Cluster of Center Points = 7

Page 217: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

217

Cluster the Cluster Centers Applied to Original Data

Page 218: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

218

Example 3: Clusters = 5 Applied to Original Data

Page 219: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

219

Example 3: Clusters = 7 Applied to Original Data

Page 220: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Kohonen/SOM Clusters

Page 221: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

221

Automated Data Collection Enterprise Miner Diagram

Page 222: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

222

Automated Data Collection Enterprise Miner Diagram

Page 223: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

223

Automated Data Collection Enterprise Miner Diagram

Page 224: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

224

Automated Data Collection Enterprise Miner Diagram

Page 225: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

225

Cluster Node Data Collection Enterprise Miner Diagram

%let INFILE = &EM_LIB..&EM_METASOURCE_NODEID._OUTMEAN; %let CENTERFILE = SGFLIB.y6100_CENTERFILE; %let HOWMANYFILE = SGFLIB.y6100_HOWMANYFILE; proc print data=&INFILE.; run; %save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. ); proc print data=&CENTERFILE.(obs=30); run; proc print data=&HOWMANYFILE.(obs=10); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &EM_EXPORT_TRAIN.; set &CENTERFILE.; run;

Page 226: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

226

Cluster Node Data Collection Enterprise Miner Diagram

%let INFILE = &EM_LIB..&EM_METASOURCE_NODEID._OUTMEAN; %let CENTERFILE = SGFLIB.y6100_CENTERFILE; %let HOWMANYFILE = SGFLIB.y6100_HOWMANYFILE; proc print data=&INFILE.; run; %save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. ); proc print data=&CENTERFILE.(obs=30); run; proc print data=&HOWMANYFILE.(obs=10); run; proc freq data=&HOWMANYFILE.; table _HOWMANY_ /missing; run; data &EM_EXPORT_TRAIN.; set &CENTERFILE.; run;

SAS Enterprise Miner creates a file to hold the Kohonen/SOM center points. But they are not exported. Therefore, you need to go out and get them!

Page 227: Monte Carlo K-Means Clustering · Centroid . Find the distance from one cluster center point to another cluster center point

Copyright © SAS Inst itute Inc. A l l r ights reserved.

Questions?

227