Slide 1
A Problem in Personnel Classification
This problem is from Phillip J. Rulon, David V. Tiedeman, Maurice Tatsuoka, and Charles R. Langmuir. Multivariate Statistics for Personnel Classification. 1967.
This sample data is for "World Airlines, a company employing over 50,000 persons and operating scheduled flights. This company naturally needs many men who can be assigned to a particular set of functions. The mechanics on the line who service the equipment of World Airlines form one of the groups we shall consider. A second group are the agents who deal with the passengers of the airline. A third group are the men in operations who coordinate airline activities.
The personnel officer of World Airlines has developed an Activity Preference Inventory for the use of the airline. The first section of this inventory contains 30 pairs of activities, each pair naming an indoor activity and an outdoor activity. One item is
_____ Billiards : Golf _____
The applicant for a job in World Airlines checks the activity he prefers. The score is the number of outdoor activities marked." (page 24)
The second section of the Activity Preference Inventory "contains 35 items. One activity of each pair is a solitary activity, the other convivial. An example is
_____ Solitaire : Bridge _____
The applicant's score is the number of convivial activities he prefers." (page 82)
The third section of the Activity Preference Inventory "contains 25 items. One activity of each pair is a liberal activity, the other a conservative activity. An example is
_____ Counseling : Advising _____
Discriminant Analysis
Slide 2 Discriminant Analysis
The applicant's score is the number of conservative activities he prefers." (page 153)
The Activity Preference Inventory was administered to 244 employees in the three job classifications who were successful and satisfied with their jobs. The dependent variable, JOBCLASS 'Job Classification', includes three job classifications: 1 - Passenger Agents, 2 - Mechanics, and 3 - Operations Control.
The purpose of the analysis is to develop a classification scheme based on scores on the Activity Preference Inventory to assign new employees to the different job groups.
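The analysis that follows is carried out in SPSS. For readers working outside SPSS, the same classification idea can be sketched with scikit-learn; the scores below are simulated placeholders, not the actual World Airlines data, so only the mechanics (not the results) carry over.

```python
# Sketch of a discriminant classifier on the three Activity Preference
# Inventory scales. The data values are simulated for illustration only;
# the real analysis uses the 244-case World Airlines file in SPSS.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical score ranges: OUTDOOR (0-30), CONVIV (0-35), CONSERV (0-25)
X = np.column_stack([
    rng.integers(0, 31, 244),
    rng.integers(0, 36, 244),
    rng.integers(0, 26, 244),
]).astype(float)
# 1 = Passenger Agents (85), 2 = Mechanics (93), 3 = Operations Control (66)
y = np.repeat([1, 2, 3], [85, 93, 66])

lda = LinearDiscriminantAnalysis().fit(X, y)
# With 3 groups and 3 predictors, at most min(3 - 1, 3) = 2 functions exist
print(lda.transform(X).shape[1])  # -> 2 discriminant functions
```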
A Problem in Personnel Classification (continued)
Slide 3
Stage One: Define the Research Problem
In this stage, the following issues are addressed:
• Relationship to be analyzed
• Specifying the dependent and independent variables
• Method for including independent variables
Relationship to be analyzed
We are interested in the relationship between scores on the three scales of the Activity Preference Inventory and the different job classifications.
Specifying the dependent and independent variables
The dependent variable is:
• JOBCLASS 'Job Classification'
The independent variables are:
• OUTDOOR 'Outdoor Activity Score'
• CONVIV 'Convivial Score'
• CONSERV 'Conservative Score'
Method for including independent variables
Since the purpose of this analysis is to articulate the relationship between the activity scores and job classification, direct entry of independent variables would be an appropriate method for selecting variables. However, I prefer to use a stepwise method in order to identify which predictors are statistically significant.
Slide 4
Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
• Missing data analysis
• Minimum sample size requirement: 20+ cases per independent variable
• Division of the sample: 20+ cases in each dependent variable group
Missing data analysis
There is no missing data in this data set.
Minimum sample size requirement: 20+ cases per independent variable
The data set contains 244 subjects and 3 independent variables. The ratio of 81 cases per independent variable exceeds the minimum sample size requirement.
Division of the sample: 20+ cases in each dependent variable group
There were 85 Passenger Agents, 93 Mechanics, and 66 Operations Control staff in the sample. There are more than 20 cases in each dependent variable group.
Slide 5
Stage 2: Develop the Analysis Plan: Measurement Issues
In this stage, the following issues are addressed:
• Incorporating nonmetric data with dummy variables
• Representing curvilinear effects with polynomials
• Representing interaction or moderator effects
Incorporating Nonmetric Data with Dummy Variables
None of the variables are nonmetric.
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear effects at this point in the analysis.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.
Slide 6
Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
• Nonmetric dependent variable and metric or dummy-coded independent variables
• Multivariate normality of metric independent variables: assess normality of individual variables
• Linear relationships among variables
• Assumption of equal dispersion for dependent variable groups
Nonmetric dependent variable and metric or dummy-coded independent variables
The dependent variable is nonmetric. All of the independent variables are metric.
Multivariate normality of metric independent variables
Since SPSS does not provide a direct test of multivariate normality, we assess the normality of the individual metric variables instead.
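Outside SPSS, a comparable variable-by-variable check can be sketched with the Shapiro-Wilk test in scipy; the scores below are simulated stand-ins for the inventory data, so the verdicts here do not reproduce the SPSS output.

```python
# Per-variable normality check, analogous to the SPSS normality script.
# The scores are simulated; the real test uses the 244 observed cases.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = {
    "OUTDOOR": rng.integers(0, 31, 244).astype(float),
    "CONVIV":  rng.integers(0, 36, 244).astype(float),
    "CONSERV": rng.integers(0, 26, 244).astype(float),
}
for name, x in scores.items():
    w, p = stats.shapiro(x)           # null hypothesis: x is normal
    verdict = "fails" if p < 0.05 else "passes"
    print(f"{name}: W = {w:.3f}, p = {p:.4f} -> {verdict} normality")
```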
Slide 7
Run the 'NormalityAssumptionAndTransformations' Script
Slide 8
Complete the 'Test for Assumption of Normality' Dialog Box
Tests of Normality
We find that all three of the independent variables fail the test of normality, and that none of the transformations induced normality in any of the variables. We should note the failure to meet the normality assumption for possible inclusion in our discussion of findings.
Slide 9
Linear relationships among variables
Since our dependent variable is not metric, we cannot use it to test for linearity of the independent variables. As an alternative, we can plot each metric independent variable against all other independent variables in a scatterplot matrix to look for patterns of nonlinear relationships. If one of the independent variables shows multiple nonlinear relationships to the other independent variables, we consider it a candidate for transformation.
Slide 10
Requesting a Scatterplot Matrix
Slide 11
Specifications for the Scatterplot Matrix
Slide 12
The Scatterplot Matrix
Blue fit lines were added to the scatterplot matrix to improve interpretability.
Having computed a scatterplot for all combinations of metric independent variables, we identify all of the variables that appear in any plot that shows a nonlinear trend. We will call these variables our nonlinear candidates. To identify which of the nonlinear candidates is producing the nonlinear pattern, we look at all of the plots for each of the candidate variables. The candidate variable that is not linear should show up in a nonlinear relationship in several plots with other linear variables. Hopefully, the form of the plot will suggest the power term to best represent the relationship, e.g. squared term, cubed term, etc.
None of the scatterplots show evidence of any nonlinear relationships.
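For reference, the same kind of scatterplot matrix can be produced outside SPSS with pandas and matplotlib; the data below are simulated, so the figure only illustrates the technique, not the World Airlines plots.

```python
# Scatterplot matrix of the three predictors to screen for nonlinear
# patterns (simulated data; the original analysis uses the SPSS dialogs).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "OUTDOOR": rng.integers(0, 31, 244),
    "CONVIV":  rng.integers(0, 36, 244),
    "CONSERV": rng.integers(0, 26, 244),
})
# One panel for each pairing of predictors; histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```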
Slide 13
Assumption of equal dispersion for dependent variable groups
Box's M statistic tests for homogeneity of dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we can request classification using separate-group dispersion matrices to see if this improves the model's accuracy rate.
Box's M statistic is produced by the SPSS discriminant procedure, so we will defer this question until we have obtained the discriminant analysis output.
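For readers who want to see the computation itself, Box's M and its chi-square approximation can be sketched in numpy/scipy as below. The groups are simulated, so the statistic illustrates the formula rather than reproducing the SPSS output.

```python
# Box's M test for equality of group covariance matrices, sketched in
# numpy/scipy (SPSS computes this inside the discriminant procedure).
import numpy as np
from scipy import stats

def box_m(groups):
    """groups: list of (n_i x p) arrays. Returns (chi-square, df, p-value)."""
    k = len(groups)                      # number of groups
    p = groups[0].shape[1]               # number of variables
    ns = np.array([g.shape[0] for g in groups])
    N = ns.sum()
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    # M compares the pooled log-determinant to the group log-determinants
    M = (N - k) * np.log(np.linalg.det(pooled)) - sum(
        (n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    # Box's correction factor for the chi-square approximation
    c = (sum(1.0 / (n - 1) for n in ns) - 1.0 / (N - k)) * \
        (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (k - 1))
    chi2 = M * (1 - c)
    df = p * (p + 1) * (k - 1) / 2
    return chi2, df, stats.chi2.sf(chi2, df)

rng = np.random.default_rng(3)
# Simulated groups with the World Airlines group sizes and 3 variables
groups = [rng.normal(size=(n, 3)) for n in (85, 93, 66)]
chi2, df, pval = box_m(groups)
print(f"Box's M chi-square = {chi2:.2f}, df = {df:.0f}, p = {pval:.4f}")
```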
Slide 14
Stage 4: Estimation of Discriminant Functions and Overall Fit: The Discriminant Functions
In this stage, the following issues are addressed:
• Compute the discriminant analysis
• Overall significance of the discriminant function(s)
Compute the discriminant analysis
The steps to obtain a discriminant analysis are detailed on the following screens.
Slide 15
Requesting a Discriminant Analysis
Slide 16
Specifying the Dependent Variable
Slide 17
Specifying the Independent Variables
Slide 18
Specifying Statistics to Include in the Output
Slide 19
Specifying the Stepwise Method for Selecting Variables
Slide 20
Specifying the Classification Requirement
Slide 21
Complete the Discriminant Analysis Request
Slide 22
Overall significance of the discriminant function(s) - 1
Our first task is to determine whether or not there is a statistically significant relationship between the independent variables and the dependent variable. We navigate to the section of output titled "Summary of Canonical Discriminant Functions" to locate the following outputs:
Recall that the maximum number of discriminant functions is equal to the number of groups in the dependent variable minus one, or the number of variables in the analysis, whichever is smaller. For this problem, the maximum number of discriminant functions is two.
Slide 23
Overall significance of the discriminant function(s) - 2
In the Wilks' Lambda table, SPSS successively tests models with an increasing number of functions. The first line of the table tests the null hypothesis that the mean discriminant scores for the two possible functions are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that there is at least one statistically significant function. Had the probability for this test been larger than 0.05, we would have concluded that there are no discriminant functions which separate the groups of the dependent variable.
The second line of the Wilks' Lambda table tests the null hypothesis that the mean discriminant scores for the second possible discriminant function are equal in the three groups of the dependent variable. Since the probability of the chi-square statistic for this test is less than 0.0001, we reject the null hypothesis and conclude that the second discriminant function, as well as the first, is statistically significant. Had the probability for this test been larger than 0.05, we would have concluded that there is only one discriminant function to separate the groups of the dependent variable.
Our conclusion from this output is that there are two statistically significant discriminant functions for this problem.
Slide 24
Stage 4: Estimation of Discriminant Functions and Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
• Assumption of equal dispersion for dependent variable groups
• Classification accuracy chance criteria
• Press's Q statistic
• Presence of outliers
Slide 25
Assumption of equal dispersion for dependent variable groups
In discriminant analysis, the best measure of overall fit is classification accuracy. The appropriateness of using the pooled covariance matrix in the classification phase is evaluated by Box's M statistic.
We examine the probability of Box's M to determine whether or not we meet the assumption of equal dispersion of the covariance matrices (a multivariate measure of variance). This test is very sensitive, so we select a conservative alpha value of 0.01. At that alpha level, we fail to reject the null hypothesis of homogeneity for this analysis.
Had we failed this test, our remedy would be to re-run the discriminant analysis requesting the use of separate covariance matrices in classification.
Slide 26
Classification accuracy chance criteria - 1
The classification matrix for this problem computed by SPSS is shown below:
Slide 27
Classification accuracy chance criteria - 2
Following the text, we compare the accuracy rate for the cross-validated sample (75.0%) to each of the by-chance accuracy rates.
In the table of Prior Probabilities for Groups, we see that the three groups contained .348, .381, and .270 of the sample of 244 cases used to derive the discriminant model.
The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample, in this case (0.348 x 0.348) + (0.381 x 0.381) + (0.270 x 0.270) = 0.339. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 0.339 = 0.424. Our model accuracy rate of 75% exceeds this standard.
The maximum chance criterion is the proportion of cases in the largest group, 38.1% in this problem. Based on the requirement that model accuracy be 25% better than the chance criterion, the standard to use for comparing the model's accuracy is 1.25 x 38.1% = 47.6%. Our model accuracy rate of 75% exceeds this standard.
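The two chance criteria reduce to a few lines of arithmetic; as a check, in Python:

```python
# By-chance accuracy benchmarks, using the prior probabilities
# reported by SPSS for the three groups (Agents, Mechanics, Operations).
props = [0.348, 0.381, 0.270]

proportional_chance = sum(p * p for p in props)   # sum of squared proportions
maximum_chance = max(props)                       # largest group's share

print(round(proportional_chance, 3))              # 0.339
print(round(1.25 * proportional_chance, 3))       # 0.424
print(round(1.25 * maximum_chance, 3))            # about 0.476
```

The cross-validated accuracy of 0.75 clears both 25%-inflated standards.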
Slide 28
Press's Q statistic
Substituting the values for this problem (244 cases, 183 correct classifications, and 3 groups) into the formula for Press's Q statistic, we obtain Q = [244 - (183 x 3)]^2 / [244 x (3 - 1)] = 190.6. This value exceeds the critical value of 6.63 (Text, page 305), so we conclude that the prediction accuracy is greater than that expected by chance.
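As a quick check of the arithmetic:

```python
# Press's Q from the values above: 244 cases, 183 correct, 3 groups.
N, n_correct, K = 244, 183, 3

press_q = (N - n_correct * K) ** 2 / (N * (K - 1))
print(press_q)          # 190.625, which rounds to the reported 190.6
print(press_q > 6.63)   # True: exceeds the chi-square critical value
```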
By all three criteria, we would interpret our model as having an accuracy above that expected by chance. Thus, this is a valuable or useful model that supports predictions of the dependent variable.
Slide 29
Presence of outliers - 1
SPSS prints Mahalanobis distance scores for each case in the table of Casewise Statistics, so we can use this as a basis for detecting outliers.
According to the SPSS Applications Guide, p. 227, cases with large values of the Mahalanobis distance from their group mean can be identified as outliers. For large samples from a multivariate normal distribution, the squared Mahalanobis distance from a case to its group mean is approximately distributed as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. The critical value of chi-square with 3 degrees of freedom (the stepwise procedure entered three variables into the function) and an alpha of 0.01 (we only want to detect major outliers) is 11.345.
We can request this figure from SPSS using the following compute command:
COMPUTE mahcutpt = IDF.CHISQ(0.99,3).
EXECUTE.
Here, 0.99 is the cumulative probability up to the significance level of interest and 3 is the number of degrees of freedom. SPSS will create a column of values in the data set that contains the desired value.
We scan the table of Casewise Statistics to identify any cases that have a squared Mahalanobis distance greater than 11.345 for the group to which the case is most likely to belong, i.e. under the column labeled 'Highest Group.'
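Outside SPSS, the same cutoff comes from the chi-square inverse CDF in scipy:

```python
# scipy equivalent of the SPSS command IDF.CHISQ(0.99, 3):
# the 0.99 quantile of a chi-square distribution with 3 df.
from scipy import stats

cutoff = stats.chi2.ppf(0.99, df=3)
print(round(cutoff, 3))   # 11.345
```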
Slide 30
Presence of outliers - 2
In this particular analysis, I find one case, number 23, with a large enough Mahalanobis distance to indicate that it is an outlier and might be considered for removal from the analysis. However, since there is only one case out of 244, it is not likely to make any difference, so we will forego re-running the analysis without this case.
Slide 31
Stage 5: Interpret the Results
In this section, we address the following issues:
• Number of functions to be interpreted
• Relationship of functions to categories of the dependent variable
• Assessing the contribution of predictor variables
• Impact of multicollinearity on solution
Number of functions to be interpreted
As indicated previously, there are two significant discriminant functions to be interpreted.
Slide 32
Role of functions in differentiating categories of the dependent variable
The combined-groups scatterplot enables us to link the discriminant functions to the categories of the dependent variable. I have modified the SPSS output by changing the symbols for the different points so that we can easily detect the group members. In addition, I have added reference lines at the zero value for each axis.
Analyzing this plot, we see that the first function differentiates Passenger Agents from Mechanics and Operations Control personnel. The second function differentiates Mechanics from Operations Control staff.
Slide 33
Assessing the contribution of predictor variables - 1
Identifying the statistically significant predictor variables
The summary table of variables entering and leaving the discriminant functions is shown below. We can see that all three independent variables are included in the analysis, in the order shown in the table. We would conclude that all three of the independent variables (Outdoor Activity Score, Convivial Score, and Conservative Score) make a statistically significant contribution to group membership on the dependent variable.
Slide 34
Assessing the contribution of predictor variables - 2
Importance of Variables and the Structure Matrix
To determine which predictor variables are more important in predicting group membership when we use a stepwise method of variable selection, we can simply look at the order in which the variables entered, as shown in the following table.
Slide 35
Assessing the contribution of predictor variables - 3
While we know which variables were important to the overall analysis, we are also concerned with which variables are important to which discriminant function. This information is provided by the structure matrix, which is a rotated correlation matrix containing the correlations between each of the independent variables and the discriminant function scores.
From the structure matrix, we see that two of the three variables entered into the functions (Convivial Score and Conservative Score) are the important variables in the first discriminant function, while Outdoor Activity Score is the important variable on the second function.
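To make the structure matrix concrete, here is a simplified sketch of how those correlations are formed, using total-sample correlations between each predictor and the discriminant scores (SPSS uses pooled within-groups correlations); the data are simulated, so the values do not reproduce the SPSS table.

```python
# Structure matrix sketch: correlations between each predictor and the
# two discriminant function scores (simulated, simplified version).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = rng.normal(size=(244, 3))
y = np.repeat([1, 2, 3], [85, 93, 66])
# Shift group means so two discriminant functions emerge
X[y == 1, 1] += 2.0
X[y == 3, 0] += 2.0

scores = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)
# One row per predictor, one column per discriminant function
structure = np.array([[np.corrcoef(X[:, j], scores[:, f])[0, 1]
                       for f in range(2)] for j in range(3)])
print(structure.shape)   # (3, 2)
```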
Slide 36
Assessing the contribution of predictor variables - 4
Comparing Group Means to Determine Direction of Relationships
If we examine the pattern of means for the three statistically significant variables for the three job classifications, we can provide a fuller discussion of the relationships between the independent variables, the dependent variable groups, and the discriminant functions.
Slide 37
Assessing the contribution of predictor variables - 5
The first discriminant function distinguishes Passenger Agents from Mechanics and Operations Control staff. The two variables that are important on the first function are convivial score and conservative score. Passenger agents had higher convivial scores and lower conservative scores than the other two groups.
Operations Control staff are distinguished from Mechanics by the second discriminant function which contains only a single variable, the outdoor activity Score. Mechanics had a higher average on the outdoor activity score than did Operations Control staff.
In sum, Passenger Agents are more outgoing (convivial) and more tolerant (less conservative) than Mechanics and Operations Control personnel. Mechanics differ from Operations Control personnel in their stronger preference for outdoor oriented activities.
Slide 38
Impact of Multicollinearity on solution
In SPSS discriminant analysis, multicollinearity is indicated by very small tolerance values for variables, e.g. less than 0.10 (0.10 is the size of the tolerance, not a significance value).
If we look at the table of Variables Not In The Analysis, we see that it did not print anything for step 3, indicating that all variables were in the analysis. Multicollinearity is not an issue in this problem.
Slide 39
Stage 6: Validate The Model
In this stage, we are normally concerned with the following issues:
• Conducting the Validation Analysis
• Generalizability of the Discriminant Model
Conducting the Validation Analysis
To validate the discriminant analysis, we can randomly divide our sample into two groups: a screening sample and a validation sample. The analysis is computed for the screening sample and used to predict membership on the dependent variable in the validation sample. If the model in the screening sample is valid, we would expect the accuracy rates for both samples to be about the same.
In the double cross-validation strategy, we reverse the designation of the screening and validation sample and re-run the analysis. We can then compare the discriminant functions derived for both samples. If the two sets of functions contain a very different set of variables, it indicates that the variables might have achieved significance because of the sample size and not because of the strength of the relationship. Our findings about these individual variables would be that the predictive utility of these variables is not generalizable.
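The double cross-validation strategy can be sketched as follows; the data are simulated, and the random 0/1 split plays the role of the selection variable computed in SPSS.

```python
# Double cross-validation sketch: split the sample in half, fit on each
# half in turn, and score the held-out half (simulated stand-in data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.normal(size=(244, 3))
y = np.repeat([1, 2, 3], [85, 93, 66])
# Shift group means so the groups are partly separable
X[y == 1, 1] += 2.0
X[y == 3, 0] += 2.0

split = rng.integers(0, 2, 244)     # random 0/1 selection variable
for screen in (0, 1):               # each half takes a turn as screening sample
    model = LinearDiscriminantAnalysis().fit(X[split == screen],
                                             y[split == screen])
    acc = model.score(X[split != screen], y[split != screen])
    print(f"screening half {screen}: validation accuracy = {acc:.3f}")
```

Comparing the two fitted models' entered variables and accuracy rates is then the generalizability check described above.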
Slide 40
Set the Starting Point for Random Number Generation
Slide 41
Compute the Variable to Randomly Split the Sample into Two Halves
Slide 42
Specify the Cases to Include in the First Screening Sample
Slide 43
Specify the Value of the Selection Variable for the First Validation Analysis
Slide 44
Specify the Value of the Selection Variable for the Second Validation Analysis
Slide 45
Generalizability of the Discriminant Model
We base our decisions about the generalizability of the discriminant model on a table comparing key outputs from the analysis of the full data set with those from each of the validation runs.
                                      Full Model   Split=0      Split=1
Number of Significant Functions       2            2            2
Cross-validated Accuracy              75.0%        74.8%        76.0%
Accuracy Rate for Validation Sample   --           77.7%        76.4%
Significant Coefficients (p < 0.05),  1. OUTDOOR   1. CONVIV    1. OUTDOOR
  in order of entry                   2. CONVIV    2. OUTDOOR   2. CONVIV
                                      3. CONSERV   3. CONSERV   3. CONSERV
In both of the validation analyses, two significant discriminant functions were found. The cross-validated accuracy rates and the accuracy rates for the validation samples were approximately the same. Both validation analyses included all three available independent variables, though the order of entry differed. The results of the validation analyses are similar to those for the model with the full data set. We can conclude that the model is generalizable.