Cluster Analysis

Grouping Cases or Variables

Clustering Cases

• Goal is to cluster cases into groups based on shared characteristics.

• Start out with each case being a one-case cluster.

• The clusters are located in k-dimensional space, where k is the number of variables.

• Compute the squared Euclidean distance between each case and every other case.

Squared Euclidean Distance

• The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (Xi) and the score on variable i for the other case (Yi):

$$d^2(X, Y) = \sum_{i=1}^{v} (X_i - Y_i)^2$$
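As a minimal sketch of the distance just defined, the function below sums the squared differences across variables for two cases. The function name and the example scores are illustrative assumptions, not values from the faculty data.

```python
def squared_euclidean(x, y):
    """Sum across variables of the squared difference between two cases."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Hypothetical scores on v = 3 variables for two cases.
case_x = [1.0, 2.0, 3.0]
case_y = [2.0, 4.0, 6.0]

# Differences are 1, 2, and 3, so the result is 1 + 4 + 9.
print(squared_euclidean(case_x, case_y))  # 14.0
```

In practice the variables are often standardized first, so that a variable measured in large units (such as salary) does not dominate the sum.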

Agglomerate

• The two cases closest to each other are agglomerated into a cluster.

• The distances between entities (clusters and cases) are recomputed.

• The two entities closest to each other are agglomerated.

• This continues until all cases end up in one cluster.
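The agglomeration steps above can be sketched in a few lines of plain Python. The slides do not say how the distance between two multi-case clusters is recomputed, so this sketch assumes average linkage (the mean squared Euclidean distance over all cross-cluster pairs of cases). The function names and the five example cases are hypothetical.

```python
from itertools import combinations


def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def average_linkage(cluster_a, cluster_b, cases):
    """Mean squared Euclidean distance over all cross-cluster pairs (assumed rule)."""
    total = sum(squared_euclidean(cases[i], cases[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(cases):
    """Return one (members_a, members_b, distance) tuple per stage."""
    clusters = [[i] for i in range(len(cases))]  # each case starts as a one-case cluster
    schedule = []
    while len(clusters) > 1:
        # Find the two entities (cases or clusters) closest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], cases),
        )
        distance = average_linkage(clusters[a], clusters[b], cases)
        schedule.append((clusters[a], clusters[b], distance))
        # Agglomerate them, then loop so that distances are recomputed.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule


# Hypothetical cases measured on k = 2 variables.
cases = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
schedule = agglomerate(cases)
for stage, (left, right, distance) in enumerate(schedule, start=1):
    print(stage, left, right, distance)
```

With n cases there are always n - 1 stages, and the last stage leaves every case in a single cluster. This brute-force version recomputes every distance at every stage, which is fine for illustration but slow for large data sets.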

What is the Correct Solution?

• You may have theoretical reasons to expect a certain k cluster solution.

• Look at that solution and see if it matches your expectations.

• Alternatively, you may try to make sense out of solutions at two or more levels of the analysis.

Faculty Salaries

• Subjects were faculty in Psychology at ECU.

• Variables were rank, experience, number of publications, course load, and salary.

• Data are at ClusterAnonFaculty.sav

• Also see the statistical output

Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Save

Proximity Matrix

• We did not request this, but if we had it would display a measure of dissimilarity for each pair of entities.

• The pair of cases with the smallest squared Euclidean distance is clustered.

         Cluster Combined
Stage    Cluster 1    Cluster 2    Coefficients
1        32           33           .000

• Look at the Agglomeration Schedule.

• Cases 32 and 33 are clustered. They are very similar (distance = 0.000).

Agglomeration Schedule

         Cluster Combined                          Stage Cluster First Appears
Stage    Cluster 1    Cluster 2    Coefficients    Cluster 1    Cluster 2       Next Stage
1        32           33           .000            0            0               9
2        41           42           .000            0            0               6
3        43           44           .000            0            0               6
4        37           38           .000            0            0               5
5        37           39           .001            4            0               7
6        41           43           .002            2            3               27

Stages 2 Through 5

• The agglomeration schedule shows that in Stage 2 cases 41 and 42 are clustered.

• In Stage 3 cases 43 and 44 are clustered.

• In Stage 4 cases 37 and 38 are clustered.

• In Stage 5 case 39 is added to the cluster that contains cases 37 and 38.

• And so on.
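One way to check a reading of the schedule is to replay its stages and list the members of each cluster. The sketch below uses the first six stages from the Agglomeration Schedule. It assumes, as the "Stage Cluster First Appears" columns suggest, that a merged cluster keeps the label shown in the Cluster 1 column. The function name `replay` is hypothetical.

```python
def replay(stages):
    """Map each cluster label to its member cases after the given stages."""
    members = {}
    for label_1, label_2 in stages:
        # A label that has not appeared in an earlier stage is a single case.
        group_1 = members.pop(label_1, [label_1])
        group_2 = members.pop(label_2, [label_2])
        # Assumption: the merged cluster keeps the Cluster 1 label.
        members[label_1] = sorted(group_1 + group_2)
    return members


# (Cluster 1, Cluster 2) for Stages 1 through 6 of the schedule.
stages = [(32, 33), (41, 42), (43, 44), (37, 38), (37, 39), (41, 43)]

clusters = replay(stages)
for label, cases in sorted(clusters.items()):
    print(label, cases)
```

After Stage 6 this gives three multi-case clusters: cases 32 and 33, cases 37 through 39, and cases 41 through 44. Every case not yet mentioned in the schedule is still a one-case cluster at that point.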

Vertical Icicle, Two Clusters

• Look at the top of the display (next slide).

• You can see two clusters
  – On the left, Boris through Willy
  – On the right, Deanna through Sunila

• The 2 cluster solution was adjuncts versus full time faculty.

Vertical Icicle, Three Clusters

• Look at the second-highest white bar in the icicle plot.

• Now there are three clusters
  – Adjuncts
  – Junior faculty (Deanna through Mickey)
  – Senior faculty (Lawrence through Roslyn)

Vertical Icicle, Four Clusters

• Look at the white bar furthest to the right.

• Now there are four clusters
  – Adjuncts
  – Junior faculty
  – The acting chair (Lawrence)
  – The rest of the senior faculty (Catalina through Roslyn)

The Dendrogram

• At the far right you can see the two cluster solution.

• The next step to the left shows the three cluster solution.

• The next step to the left shows the four cluster solution.

• And so on.

• Truncated and rotated dendrogram on next slide.

Compare Two Clusters

• The 2 cluster solution was adjuncts versus everybody else.

• Look at the t tests in the output.

• Adjuncts had lower rank, experience, number of publications, course load, and salary.
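To illustrate what such a comparison computes, here is a minimal sketch of a t statistic for the difference between two cluster means on one variable. The slides do not say which form of t test appears in the output, so the unequal-variance (Welch) form is an assumption here. The scores are made-up numbers in arbitrary units, not the faculty data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical scores on one variable for the two clusters.
adjuncts = [1.0, 2.0, 3.0]
full_time = [4.0, 6.0, 8.0]

t = welch_t(adjuncts, full_time)
print(round(t, 3))  # negative: the first cluster has the lower mean
```

Because the clusters were formed from these same variables, significant differences between clusters are expected. Tests like these describe the clusters rather than confirm a hypothesis.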

Compare Three Clusters

• Look at the ANOVAs and plots.

• The senior faculty had higher salary, experience, rank, and number of pubs.

Compare Four Clusters

• The acting chair had a higher salary and number of publications.

I Could Not Help Myself

• With these data on hand, I could not resist predicting salary from the other variables.

• Salary was well correlated with Rank, FTEs, Publications, and Experience.

• In the multiple regression, only Rank and FTEs had significant unique effects.

• The residuals suggest who was being overpaid and who underpaid.
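The idea behind the residuals can be shown with a deliberately simplified sketch. It fits a least-squares line with a single predictor rather than the multiple regression described above, and it uses made-up rank and salary values rather than the faculty data. A positive residual means the observed salary is above the predicted salary, and a negative residual means it is below.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical data: rank coded 1-4, salary in arbitrary units.
rank = [1.0, 2.0, 3.0, 4.0]
salary = [30.0, 42.0, 48.0, 60.0]

intercept, slope = fit_line(rank, salary)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(rank, salary)]

for r, res in zip(rank, residuals):
    label = "above predicted" if res > 0 else "below predicted"
    print(r, round(res, 2), label)
```

With several predictors the same logic applies, but the predicted value comes from all predictors at once.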

Split by Sex

• For men, the unique effect of number of publications was positive – more publications, higher salary.

• For women it was negative – more publications, lower salary.

• Curious.

Workaholism

• Aziz & Zickar (2005)

• Workaholics may be defined as those
  – High in work involvement,
  – High in drive to work, and
  – Low in work enjoyment.

• For each case, a score was obtained for each of these three dimensions.

The Three Cluster Solution

• Workaholics
  – High work involvement
  – High drive to work
  – Low work enjoyment

• Positively engaged workers
  – High work involvement
  – Medium drive to work
  – High work enjoyment

• Unengaged workers
  – Low work involvement
  – Low drive to work
  – Low work enjoyment

• Past research/theory indicated there should be six clusters, but the theorized six clusters were not obtained.

Clustering Variables

• FactBeer.sav

• The statistical output.

• Analyze, Classify, Hierarchical Cluster

Statistics

Plots

Method

Proximity Matrix

• Is simply the intercorrelation matrix.

• The two most correlated variables are Color and Aroma (r = .909) – they are clustered on the first step.

• Stage 2: Size and Alcohol (r = .904) are clustered.

• Stage 3: Taste added to the cluster that already contains Color and Aroma
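When variables rather than cases are clustered, the correlation serves as the measure of similarity, so the first step is simply to find the most highly correlated pair. The sketch below shows that step. It treats the raw Pearson r as the similarity, and the ratings are made-up values for three variable names borrowed from the example, not the FactBeer.sav data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two variables measured on the same cases."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings from five respondents.
ratings = {
    "color": [1, 2, 3, 4, 5],
    "aroma": [1, 2, 3, 5, 4],
    "cost": [5, 1, 4, 2, 3],
}

# The pair with the largest correlation is clustered first.
first_pair = max(
    combinations(ratings, 2),
    key=lambda pair: pearson_r(ratings[pair[0]], ratings[pair[1]]),
)
print(first_pair, pearson_r(ratings[first_pair[0]], ratings[first_pair[1]]))
```

Later stages proceed as with cases, except that the entities with the largest similarity, not the smallest distance, are agglomerated.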

Also See Other Tables & Plots

• Stage 4: Cost added to the cluster that already contains Size and Alcohol.

• Stage 5: The two clusters are combined
  – But they are not very similar (similarity coefficient = .038)
  – Now we have one cluster with six variables and one with one (Reputation)
