Slide 1
A Problem in Personnel Classification
This problem is from Phillip J. Rulon, David V. Tiedeman, Maurice Tatsuoka, and Charles R. Langmuir. Multivariate Statistics for Personnel Classification. 1967.
This sample data is for "World Airlines, a company employing over 50,000 persons and operating scheduled flights. This company naturally needs many men who can be assigned to a particular set of functions. The mechanics on the line who service the equipment of World Airlines form one of the groups we shall consider. A second group are the agents who deal with the passengers of the airline. A third group are the men in operations who coordinate airline activities.
The personnel officer of World Airlines has developed an Activity Preference Inventory for the use of the airline. The first section of this inventory contains 30 pairs of activities, each pair naming an indoor activity and an outdoor activity. One item is
_____ Billiards : Golf _____
The applicant for a job in World Airlines checks the activity he prefers. The score is the number of outdoor activities marked." (page 24)
The second section of the Activity Preference Inventory "contains 35 items. One activity of each pair is a solitary activity, the other convivial. An example is
_____ Solitaire : Bridge _____
The applicant's score is the number of convivial activities he prefers." (page 82)
The third section of the Activity Preference Inventory "contains 25 items. One activity of each pair is a liberal activity, the other a conservative activity. An example is
_____ Counseling : Advising _____
Discriminant Analysis
Slide 2 Discriminant Analysis
The applicant's score is the number of conservative activities he prefers." (page 153)
The Activity Preference Inventory was administered to 244 employees in the three job classifications who were successful and satisfied with their jobs. The dependent variable, JOBCLASS 'Job Classification', includes three job classifications: 1 - Passenger Agents, 2 - Mechanics, and 3 - Operations Control.
The purpose of the analysis is to develop a classification scheme based on scores on the Activity Preference Inventory to assign new employees to the different job groups.
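The analysis that follows is carried out in SPSS. For readers working outside SPSS, the same classification idea can be sketched with scikit-learn; the scores below are simulated placeholders, not the actual World Airlines data, so only the mechanics (not the results) carry over.

```python
# Sketch of a discriminant classifier on the three Activity Preference
# Inventory scales. The data values are simulated for illustration only;
# the real analysis uses the 244-case World Airlines file in SPSS.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical score ranges: OUTDOOR (0-30), CONVIV (0-35), CONSERV (0-25)
X = np.column_stack([
    rng.integers(0, 31, 244),
    rng.integers(0, 36, 244),
    rng.integers(0, 26, 244),
]).astype(float)
# 1 = Passenger Agents (85), 2 = Mechanics (93), 3 = Operations Control (66)
y = np.repeat([1, 2, 3], [85, 93, 66])

lda = LinearDiscriminantAnalysis().fit(X, y)
# With 3 groups and 3 predictors, at most min(3 - 1, 3) = 2 functions exist
print(lda.transform(X).shape[1])  # -> 2 discriminant functions
```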
A Problem in Personnel Classification (continued)
Slide 3
Stage One: Define the Research Problem
In this stage, the following issues are addressed:
• Relationship to be analyzed
• Specifying the dependent and independent variables
• Method for including independent variables
Relationship to be analyzed
We are interested in the relationship between scores on the three scales of the Activity Preference Inventory and the different job classifications.
Specifying the dependent and independent variables
The dependent variable is:
• JOBCLASS 'Job Classification'
The independent variables are:
• OUTDOOR 'Outdoor Activity Score'
• CONVIV 'Convivial Score'
• CONSERV 'Conservative Score'
Method for including independent variables
Since the purpose of this analysis is to articulate the relationship between the activity scores and job classification, direct entry of independent variables would be an appropriate method for selecting variables. However, I prefer to use a stepwise method in order to identify which predictors are statistically significant.
Slide 4
Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
• Missing data analysis
• Minimum sample size requirement: 20+ cases per independent variable
• Division of the sample: 20+ cases in each dependent variable group
Missing data analysis
There is no missing data in this data set.
Minimum sample size requirement: 20+ cases per independent variable
The data set contains 244 subjects and 3 independent variables. The ratio of 81 cases per independent variable exceeds the minimum sample size requirement.
Division of the sample: 20+ cases in each dependent variable group
There were 85 Passenger Agents, 93 Mechanics, and 66 Operations Control staff in the sample. There are more than 20 cases in each dependent variable group.
Slide 5
Stage 2: Develop the Analysis Plan: Measurement Issues
In this stage, the following issues are addressed:
• Incorporating nonmetric data with dummy variables
• Representing curvilinear effects with polynomials
• Representing interaction or moderator effects
Incorporating Nonmetric Data with Dummy Variables
None of the variables are nonmetric.
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear effects at this point in the analysis.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.
Slide 6
Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
• Nonmetric dependent variable and metric or dummy-coded independent variables
• Multivariate normality of metric independent variables: assess normality of individual variables
• Linear relationships among variables
• Assumption of equal dispersion for dependent variable groups
Nonmetric dependent variable and metric or dummy-coded independent variables
The dependent variable is nonmetric. All of the independent variables are metric.
Multivariate normality of metric independent variables
Since SPSS does not provide a direct test of multivariate normality, we assess the normality of the individual metric variables instead.
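Outside SPSS, a comparable variable-by-variable check can be sketched with the Shapiro-Wilk test in scipy; the scores below are simulated stand-ins for the inventory data, so the verdicts here do not reproduce the SPSS output.

```python
# Per-variable normality check, analogous to the SPSS normality script.
# The scores are simulated; the real test uses the 244 observed cases.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = {
    "OUTDOOR": rng.integers(0, 31, 244).astype(float),
    "CONVIV":  rng.integers(0, 36, 244).astype(float),
    "CONSERV": rng.integers(0, 26, 244).astype(float),
}
for name, x in scores.items():
    w, p = stats.shapiro(x)           # null hypothesis: x is normal
    verdict = "fails" if p < 0.05 else "passes"
    print(f"{name}: W = {w:.3f}, p = {p:.4f} -> {verdict} normality")
```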
Slide 7
Run the 'NormalityAssumptionAndTransformations' Script
Slide 8
Complete the 'Test for Assumption of Normality' Dialog Box
Tests of Normality
We find that all three of the independent variables fail the test of normality, and that none of the transformations induced normality in any of the variables. We should note the failure to meet the normality assumption for possible inclusion in our discussion of findings.
Slide 9
Linear relationships among variables
Since our dependent variable is not metric, we cannot use it to test for linearity of the independent variables. As an alternative, we can plot each metric independent variable against all other independent variables in a scatterplot matrix to look for patterns of nonlinear relationships. If one of the independent variables shows multiple nonlinear relationships to the other independent variables, we consider it a candidate for transformation.
Slide 10
Requesting a Scatterplot Matrix
Slide 11
Specifications for the Scatterplot Matrix
Slide 12
The Scatterplot Matrix
Blue fit lines were added to the scatterplot matrix to improve interpretability.
Having computed a scatterplot for all combinations of metric independent variables, we identify all of the variables that appear in any plot that shows a nonlinear trend. We will call these variables our nonlinear candidates. To identify which of the nonlinear candidates is producing the nonlinear pattern, we look at all of the plots for each of the candidate variables. The candidate variable that is not linear should show up in a nonlinear relationship in several plots with other linear variables. Hopefully, the form of the plot will suggest the power term to best represent the relationship, e.g. squared term, cubed term, etc.
None of the scatterplots show evidence of any nonlinear relationships.
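For reference, the same kind of scatterplot matrix can be produced outside SPSS with pandas and matplotlib; the data below are simulated, so the figure only illustrates the technique, not the World Airlines plots.

```python
# Scatterplot matrix of the three predictors to screen for nonlinear
# patterns (simulated data; the original analysis uses the SPSS dialogs).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "OUTDOOR": rng.integers(0, 31, 244),
    "CONVIV":  rng.integers(0, 36, 244),
    "CONSERV": rng.integers(0, 26, 244),
})
# One panel for each pairing of predictors; histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```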
Slide 13
Assumption of equal dispersion for dependent variable groups
Box's M statistic tests for homogeneity of dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we can request classification using separate-group dispersion matrices to see if this improves the model's accuracy rate.
Box's M statistic is produced by the SPSS discriminant procedure, so we will defer this question until we have obtained the discriminant analysis output.
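For readers who want to see the computation itself, Box's M and its chi-square approximation can be sketched in numpy/scipy as below. The groups are simulated, so the statistic illustrates the formula rather than reproducing the SPSS output.

```python
# Box's M test for equality of group covariance matrices, sketched in
# numpy/scipy (SPSS computes this inside the discriminant procedure).
import numpy as np
from scipy import stats

def box_m(groups):
    """groups: list of (n_i x p) arrays. Returns (chi-square, df, p-value)."""
    k = len(groups)                      # number of groups
    p = groups[0].shape[1]               # number of variables
    ns = np.array([g.shape[0] for g in groups])
    N = ns.sum()
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    # M compares the pooled log-determinant to the group log-determinants
    M = (N - k) * np.log(np.linalg.det(pooled)) - sum(
        (n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    # Box's correction factor for the chi-square approximation
    c = (sum(1.0 / (n - 1) for n in ns) - 1.0 / (N - k)) * \
        (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (k - 1))
    chi2 = M * (1 - c)
    df = p * (p + 1) * (k - 1) / 2
    return chi2, df, stats.chi2.sf(chi2, df)

rng = np.random.default_rng(3)
# Simulated groups with the World Airlines group sizes and 3 variables
groups = [rng.normal(size=(n, 3)) for n in (85, 93, 66)]
chi2, df, pval = box_m(groups)
print(f"Box's M chi-square = {chi2:.2f}, df = {df:.0f}, p = {pval:.4f}")
```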
Slide 14
Stage 4: Estimation of Discriminant Functions and Overall Fit: The Discriminant Functions
In this stage, the following issues are addressed:
• Compute the discriminant analysis
• Overall significance of the discriminant function(s)
Compute the discriminant analysis
The steps to obtain a discriminant analysis are detailed on the following screens.
Slide 15
Requesting a Discriminant Analysis
Slide 16
Specifying the Dependent Variable
Slide 17
Specifying the Independent Variables
Slide 18
Specifying Statistics to Include in the Output
Slide 19
Specifying the Stepwise Method for Selecting Variables
Slide 20
Specifying the Classification Requirement
Slide 21
Complete the Discriminant Analysis Request
Slide 22
Overall significance of the discriminant function(s) - 1
Our first task is to determine whether or not there is a statistically significant relationship between the independent variables and the dependent variable. We navigate to the section of output titled "Summary of Canonical Discriminant Functions" to locate the following outputs:
Recall that the maximum number of discriminant functions is equal to the number of groups in the dependent variable minus one, or the number of variables in the analysis, whichever is smaller. For this problem, the maximum number of discriminant functions is two.
Slide 23
Overall significance of the discriminant function(s) - 2
In the Wilks' Lambda table, SPSS successively tests models with an increasing number of functions. The first line of the table tests the null hypothesis that the mean discriminant scores for the two possible functions are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that there is at least one statistically significant function. Had the probability for this test been larger than 0.05, we would have concluded that there are no discriminant functions which separate the groups of the dependent variable.
The second line of the Wilks' Lambda table tests the null hypothesis that the mean discriminant scores for the second possible discriminant function are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that the second discriminant function, as well as the first, is statistically significant. Had the probability for this test been larger than 0.05, we would have concluded that there is only one discriminant function to separate the groups of the dependent variable.
Our conclusion from this output is that there are two statistically significant discriminant functions for this problem.
Slide 24
Stage 4: Estimation of Discriminant Functions and Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
• Assumption of equal dispersion for dependent variable groups
• Classification accuracy chance criteria
• Press's Q statistic
• Presence of outliers
Slide 25
Assumption of equal dispersion for dependent variable groups
In discriminant analysis, the best measure of overall fit is classification accuracy. The appropriateness of using the pooled covariance matrix in the classification phase is evaluated by Box's M statistic.
We examine the probability of Box's M to determine whether or not we meet the assumption of equal dispersion of the covariance matrices (a multivariate measure of variance). This test is very sensitive, so we select a conservative alpha value of 0.01. At that alpha level, we fail to reject the null hypothesis of homogeneity for this analysis.
Had we failed this test, our remedy would be to re-run the discriminant analysis requesting the use of separate covariance matrices in classification.
Slide 26
Classification accuracy chance criteria - 1
The classification matrix for this problem computed by SPSS is shown below:
Slide 27
Classification accuracy chance criteria - 2
Following the text, we compare the accuracy rate for the cross-validated sample (75.0%) to each of the by-chance accuracy rates.
In the table of Prior Probabilities for Groups, we see that the three groups contained .348, .381, and .270 of the sample of 244 cases used to derive the discriminant model.
The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample, in this case (0.348 x 0.348) + (0.381 x 0.381) + (0.270 x 0.270) = 0.339. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 0.339 = 0.424. Our model accuracy rate of 75% exceeds this standard.
The maximum chance criterion is the proportion of cases in the largest group, 38.1% in this problem. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 38.1% = 47.6%. Our model accuracy rate of 75% exceeds this standard.
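The two chance criteria reduce to a few lines of arithmetic; as a check, in Python:

```python
# By-chance accuracy benchmarks, using the prior probabilities
# reported by SPSS for the three groups (Agents, Mechanics, Operations).
props = [0.348, 0.381, 0.270]

proportional_chance = sum(p * p for p in props)   # sum of squared proportions
maximum_chance = max(props)                       # largest group's share

print(round(proportional_chance, 3))              # 0.339
print(round(1.25 * proportional_chance, 3))       # 0.424
print(round(1.25 * maximum_chance, 3))            # about 0.476
```

The cross-validated accuracy of 0.75 clears both 25%-inflated standards.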
Slide 28
Press's Q statistic
Substituting the values for this problem (244 cases, 183 correct classifications, and 3 groups) into the formula for Press's Q statistic, we obtain Q = [244 - (183 x 3)]^2 / [244 x (3 - 1)] = 190.6. This value exceeds the critical value of 6.63 (Text, page 305), so we conclude that the prediction accuracy is greater than that expected by chance.
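As a quick check of the arithmetic:

```python
# Press's Q from the values above: 244 cases, 183 correct, 3 groups.
N, n_correct, K = 244, 183, 3

press_q = (N - n_correct * K) ** 2 / (N * (K - 1))
print(press_q)          # 190.625, which rounds to the reported 190.6
print(press_q > 6.63)   # True: exceeds the chi-square critical value
```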
By all three criteria, we would interpret our model as having an accuracy above that expected by chance. Thus, this is a valuable or useful model that supports predictions of the dependent variable.
Slide 29
Presence of outliers - 1
SPSS prints Mahalanobis distance scores for each case in the table of Casewise Statistics, so we can use this as a basis for detecting outliers.
According to the SPSS Applications Guide, p. 227, cases with large values of the Mahalanobis distance from their group mean can be identified as outliers. For large samples from a multivariate normal distribution, the squared Mahalanobis distance from a case to its group mean is approximately distributed as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. The critical value of chi-square with 3 degrees of freedom (the stepwise procedure entered three variables into the function) and an alpha of 0.01 (we only want to detect major outliers) is 11.345.
We can request this figure from SPSS using the following compute command:
COMPUTE mahcutpt = IDF.CHISQ(0.99,3).
EXECUTE.
Here, 0.99 is the cumulative probability up to the significance level of interest and 3 is the number of degrees of freedom. SPSS will create a column of values in the data set that contains the desired value.
We scan the table of Casewise Statistics to identify any cases that have a squared Mahalanobis distance greater than 11.345 for the group to which the case is most likely to belong, i.e. under the column labeled 'Highest Group.'
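Outside SPSS, the same cutoff comes from the chi-square inverse CDF in scipy:

```python
# scipy equivalent of the SPSS command IDF.CHISQ(0.99, 3):
# the 0.99 quantile of a chi-square distribution with 3 df.
from scipy import stats

cutoff = stats.chi2.ppf(0.99, df=3)
print(round(cutoff, 3))   # 11.345
```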
Slide 30
Presence of outliers - 2
In this particular analysis, I find one case, number 23, with a large enough Mahalanobis distance to indicate that it is an outlier and might be considered for removal from the analysis. However, since there is only one case out of 244, it is not likely to make any difference, so we will forego re-running the analysis without this case.
Slide 31
Stage 5: Interpret the Results
In this section, we address the following issues:
• Number of functions to be interpreted
• Relationship of functions to categories of the dependent variable
• Assessing the contribution of predictor variables
• Impact of multicollinearity on solution
Number of functions to be interpreted
As indicated previously, there are two significant discriminant functions to be interpreted.
Slide 32
Role of functions in differentiating categories of the dependent variable
The combined-groups scatterplot enables us to link the discriminant functions to the categories of the dependent variable. I have modified the SPSS output by changing the symbols for the different points so that we can easily detect the group members. In addition, I have added reference lines at the zero value for each axis.
Analyzing this plot, we see that the first function differentiates Passenger Agents from Mechanics and Operations Control personnel. The second function differentiates Mechanics from Operations Control staff.
Slide 33
Assessing the contribution of predictor variables - 1
Identifying the statistically significant predictor variables
The summary table of variables entering and leaving the discriminant functions is shown below. We can see that all three independent variables are included in the analysis, in the order shown in the table. We would conclude that all three of the independent variables (Outdoor Activity Score, Convivial Score, and Conservative Score) make a statistically significant contribution to group membership on the dependent variable.
Slide 34
Assessing the contribution of predictor variables - 2
Importance of Variables and the Structure Matrix
To determine which predictor variables are more important in predicting group membership when we use a stepwise method of variable selection, we can simply look at the order in which the variables entered, as shown in the following table.
Slide 35
Assessing the contribution of predictor variables - 3
While we know which variables were important to the overall analysis, we are also concerned with which variables are important to which discriminant function. This information is provided by the structure matrix, which is a rotated correlation matrix containing the correlations between each of the independent variables and the discriminant function scores.
From the structure matrix, we see that two of the three variables entered into the functions (Convivial Score and Conservative Score) are the important variables in the first discriminant function, while Outdoor Activity Score is the important variable on the second function.
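To make the structure matrix concrete, here is a simplified sketch of how those correlations are formed, using total-sample correlations between each predictor and the discriminant scores (SPSS uses pooled within-groups correlations); the data are simulated, so the values do not reproduce the SPSS table.

```python
# Structure matrix sketch: correlations between each predictor and the
# two discriminant function scores (simulated, simplified version).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = rng.normal(size=(244, 3))
y = np.repeat([1, 2, 3], [85, 93, 66])
# Shift group means so two discriminant functions emerge
X[y == 1, 1] += 2.0
X[y == 3, 0] += 2.0

scores = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)
# One row per predictor, one column per discriminant function
structure = np.array([[np.corrcoef(X[:, j], scores[:, f])[0, 1]
                       for f in range(2)] for j in range(3)])
print(structure.shape)   # (3, 2)
```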
Slide 36
Assessing the contribution of predictor variables - 4
Comparing Group Means to Determine Direction of Relationships
If we examine the pattern of means for the three statistically significant variables for the three job classifications, we can provide a fuller discussion of the relationships between the independent variables, the dependent variable groups, and the discriminant functions.
Slide 37
Assessing the contribution of predictor variables - 5
The first discriminant function distinguishes Passenger Agents from Mechanics and Operations Control staff. The two variables that are important on the first function are convivial score and conservative score. Passenger agents had higher convivial scores and lower conservative scores than the other two groups.
Operations Control staff are distinguished from Mechanics by the second discriminant function which contains only a single variable, the outdoor activity Score. Mechanics had a higher average on the outdoor activity score than did Operations Control staff.
In sum, Passenger Agents are more outgoing (convivial) and more tolerant (less conservative) than Mechanics and Operations Control personnel. Mechanics differ from Operations Control personnel in their stronger preference for outdoor oriented activities.
Slide 38
Impact of Multicollinearity on solution
In SPSS discriminant analysis, multicollinearity is indicated by very small tolerance values for variables, e.g. less than 0.10 (0.10 is the size of the tolerance, not a significance value).
If we look at the table of Variables Not In The Analysis, we see that it did not print anything for step 3, indicating that all variables were in the analysis. Multicollinearity is not an issue in this problem.
Slide 39
Stage 6: Validate The Model
In this stage, we are normally concerned with the following issues:
• Conducting the Validation Analysis
• Generalizability of the Discriminant Model
Conducting the Validation Analysis
To validate the discriminant analysis, we can randomly divide our sample into two groups: a screening sample and a validation sample. The analysis is computed for the screening sample and used to predict membership on the dependent variable in the validation sample. If the model in the screening sample is valid, we would expect the accuracy rates for both samples to be about the same.
In the double cross-validation strategy, we reverse the designation of the screening and validation sample and re-run the analysis. We can then compare the discriminant functions derived for both samples. If the two sets of functions contain a very different set of variables, it indicates that the variables might have achieved significance because of the sample size and not because of the strength of the relationship. Our findings about these individual variables would be that the predictive utility of these variables is not generalizable.
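The double cross-validation strategy can be sketched as follows; the data are simulated, and the random 0/1 split plays the role of the selection variable computed in SPSS.

```python
# Double cross-validation sketch: split the sample in half, fit on each
# half in turn, and score the held-out half (simulated stand-in data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.normal(size=(244, 3))
y = np.repeat([1, 2, 3], [85, 93, 66])
# Shift group means so the groups are partly separable
X[y == 1, 1] += 2.0
X[y == 3, 0] += 2.0

split = rng.integers(0, 2, 244)     # random 0/1 selection variable
for screen in (0, 1):               # each half takes a turn as screening sample
    model = LinearDiscriminantAnalysis().fit(X[split == screen],
                                             y[split == screen])
    acc = model.score(X[split != screen], y[split != screen])
    print(f"screening half {screen}: validation accuracy = {acc:.3f}")
```

Comparing the two fitted models' entered variables and accuracy rates is then the generalizability check described above.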
Slide 40
Set the Starting Point for Random Number Generation
Slide 41
Compute the Variable to Randomly Split the Sample into Two Halves
Slide 42
Specify the Cases to Include in the First Screening Sample
Slide 43
Specify the Value of the Selection Variable for the First Validation Analysis
Slide 44
Specify the Value of the Selection Variable for the Second Validation Analysis
Slide 45
Generalizability of the Discriminant Model
We base our decisions about the generalizability of the discriminant model on a table comparing key outputs from the analysis of the full data set with those from each of the validation runs.
                                      Full Model   Split=0      Split=1
Number of Significant Functions       2            2            2
Cross-validated Accuracy              75.0%        74.8%        76.0%
Accuracy Rate for Validation Sample   --           77.7%        76.4%
Significant Coefficients (p < 0.05),  1. OUTDOOR   1. CONVIV    1. OUTDOOR
  in order of entry                   2. CONVIV    2. OUTDOOR   2. CONVIV
                                      3. CONSERV   3. CONSERV   3. CONSERV
In both of the validation analyses, two significant discriminant functions were found. The cross-validated accuracy rates and the accuracy rates for the validation samples were approximately the same. Both validation analyses included all three available independent variables, though the order of entry differed. The results of the validation analyses are similar to those for the model with the full data set. We can conclude that the model is generalizable.