
SAS and JMP are at work together in corporate, academic, government and research settings to help you make discoveries through data visualization.

Discriminant Analysis: Pathological Gambling Tonya Mauldin, SAS Institute Technical Support

Suppose you want to evaluate a 12-item questionnaire that is designed to identify various levels of gambling behavior. The questions are based on DSM-IV® diagnostic criteria for pathological gambling (DSM-IV-TR, 2000). This reference is the Diagnostic and Statistical Manual of Mental Disorders, which lists the categories of mental disorders and the criteria mental health professionals use to diagnose them.

The goal is to predict groups. You want to determine whether this questionnaire can adequately classify people previously identified as binge gamblers, steady gamblers, and non-gamblers.

The following example has these objectives:

1) Predict group membership of a sample using discriminant analysis.

2) Evaluate the predictive ability of the resulting discriminant function by applying it to a new sample.

This goal sounds similar to logistic regression. Discriminant analysis can be thought of as a multivariate generalization of logistic regression. Think of discriminant analysis as a way of learning what variables predict membership in various groups.

Also, be sure not to confuse discriminant analysis with cluster analysis. Discriminant analysis requires prior knowledge of the group membership, whereas the purpose of cluster analysis is to create the groups. Cluster analysis data does not contain previously known information on group membership.
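The idea that discriminant analysis predicts known groups can be sketched outside JMP. The following Python sketch (the synthetic two-group data and all variable names are illustrative, not the article's gambling data) computes Fisher's linear discriminant from labeled observations and classifies them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-group data standing in for known group labels
# (in discriminant analysis the labels are known in advance).
g0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
g1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X = np.vstack([g0, g1])
y = np.array([0] * 50 + [1] * 50)

# Fisher's linear discriminant: direction w = S_pooled^{-1} (m1 - m0)
m0, m1 = g0.mean(axis=0), g1.mean(axis=0)
S = np.cov(g0, rowvar=False) * 49 + np.cov(g1, rowvar=False) * 49
S_pooled = S / (len(X) - 2)
w = np.linalg.solve(S_pooled, m1 - m0)

# Classify each point by which side of the projected midpoint it falls on
midpoint = w @ (m0 + m1) / 2
pred = (X @ w > midpoint).astype(int)
accuracy = (pred == y).mean()
print(accuracy)
```

Note that, unlike a clustering routine, this sketch needs the labels y to build the rule; it only decides which known group a new point most resembles.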

The data for this example can be found on the Web page for this issue of JMPer Cable at

www.jmp.com/about/newsletters/

Predict Group Membership with Discriminant Analysis The data table has a variable, type, that identifies group membership, and variables dsm1-dsm12 that contain the responses to the pathological gambling questionnaire. To follow along with this example:

1) Open gamblegrp.jmp.

2) Choose Analyze > Multivariate Methods > Discriminant.

3) Select dsm1-dsm12 as Y, Covariates.

4) Select type as X, Categories.

5) Click OK to see the initial discriminant analysis results.

By default, JMP uses equal priors, which means a 33% chance of being a steady gambler, a 33% chance of being a binge gambler, and a 33% chance of being a non-gambler. But also suppose that through prior knowledge you believe the true

A Technical Publication for JMP® Users Issue 24 Spring 2008

Inside This Newsletter

Discriminant Analysis: Pathological Gambling 1

Finding Sample Data 4

Statistical Intervals: Confidence, Prediction, Enclosure 5

Book Discussion: Elementary Statistics Using JMP 8

Discovery 2008: The Data Exploration Conference 8

Why Shouldn't I Delete That Model Term? 9


population is similar to the observations in the sample. You can use Analyze > Distribution to see that in the sample 33% are binge gamblers, 48% are control (nongamblers), and 19% are steady gamblers. You want to use these priors in your analysis.

To use priors that are proportional to the group occurrence in the data table:

6) Select Specify Priors > Proportional to Occurrence from the red triangle menu on the Discriminant Analysis title bar, as shown in Figure 1.

Figure 1 Specify Proportional Priors
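Proportional-to-occurrence priors are simply the group relative frequencies in the data table. A minimal Python sketch (the label list below is constructed to mirror the proportions quoted in the article, 33% binge, 48% control, 19% steady, n = 100):

```python
from collections import Counter

# Hypothetical group labels mirroring the article's sample proportions
types = ["binge"] * 33 + ["control"] * 48 + ["steady"] * 19

counts = Counter(types)
n = len(types)
priors = {grp: cnt / n for grp, cnt in counts.items()}
print(priors)  # proportional-to-occurrence priors
```

These are the values JMP substitutes for the equal (one-third each) default when you choose Proportional to Occurrence.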

The prior probabilities and the Discriminant Scores report appear beneath the Canonical Plot. Misclassified observations are flagged with an asterisk (*).

You can see in Figure 2 that observations 12 and 13 were misclassified. Overall, eleven out of 100 observations were misclassified in this example. Given this model, it is reasonable to expect 11% of the total population to be misclassified. The table of classifications (inset in Figure 2) appears beneath the Discriminant Scores.

The correct classification rate (89%) is inflated when the same data are used both to develop the classification equations and to test the results. You should always compare your results to a simple classification rule. In this example, 48% of the observations are in the control group. If you had classified all observations into the control group, you would have been correct 48% of the time. In this case, the discriminant model outperforms the simple model.
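The comparison against the simple rule can be made explicit. A small Python sketch using the figures quoted above (48% control; 11 of 100 misclassified):

```python
# Majority-class baseline: always predict the largest group.
priors = {"binge": 0.33, "control": 0.48, "steady": 0.19}
baseline_accuracy = max(priors.values())   # classify everyone as control
model_accuracy = 1 - 11 / 100              # 89 of 100 classified correctly
print(baseline_accuracy, model_accuracy)
```

Any candidate classifier should clear this majority-class baseline before being taken seriously.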

Test the Prediction Probabilities The next step is to validate your findings with a validation data set. Discriminant analysis capitalizes on chance association in the data. If you apply the discriminant equations to a new sample, you get a better estimate of the expected error rate for the population.

Figure 2 Prior Probabilities and Partial List of Discriminant Scores


To save the discriminant equations, choose Score Options > Save Formulas from the red triangle menu on the Discriminant Analysis title bar. Notice that the data table now contains new columns with probabilities and a column that predicts group membership based on these probabilities.

A way to test these prediction probabilities is to apply the probabilities computed from the sample data to new observations. The next example adds new observations to the sample data table, as follows:

1) Select Edit > Select All to select all observations in gamblegrp, the current data table.

2) Select Rows > Exclude/Unexclude to exclude all observations in the current data table from the next part of the analysis.

3) Select Rows > Hide/Unhide to hide all points in the current data table for the next part of the analysis.

4) Open gamblegrp2.jmp.

5) Select Tables > Concatenate.

6) When the Concatenate dialog appears, select gamblegrp and click Add.

Click OK to see the combined tables (see Figure 3).

Note: Be sure to check the Save and evaluate formulas check box on the Concatenate dialog.

Now you have combined the original data with a test data table in a new JMP table with 200 observations. Note in Figure 3 that the original 100 observations used to compute the discriminant formulas are hidden and excluded from further analysis. The saved formulas computed the probabilities and predictions for the test observations.

An easy way to view the classification error rate is with a contingency table of the computed classifications:

1) Choose Analyze > Fit Y by X.

2) Select Type as Y, Response.

3) Select Pred Type as X, Factor and click OK.

Figure 4 shows the contingency table analysis. The misclassifications show on the off diagonal. The total number of misclassifications is 23 (the error rate is 23%). Using the discriminant model, created from the gamblegrp data, you can expect an error rate of 23%.

Figure 3 Concatenate to Combine Tables

This error rate is more than twice that for the calibration data table, but it is still less than half the error rate of the simple-minded approach, which misclassified 56% of observations when all observations were assigned to the control group.
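The contingency-table computation itself is straightforward. A Python sketch with tiny hypothetical label arrays (in the article these would be the Type and Pred Type columns of the concatenated table):

```python
import numpy as np

# Hypothetical actual vs. predicted labels, for illustration only
actual = np.array(["binge", "binge", "control", "steady", "control", "steady"])
pred   = np.array(["binge", "control", "control", "steady", "control", "binge"])

labels = ["binge", "control", "steady"]
table = np.zeros((3, 3), dtype=int)
for a, p in zip(actual, pred):
    table[labels.index(a), labels.index(p)] += 1

# Misclassifications are the off-diagonal counts
misclassified = table.sum() - np.trace(table)
error_rate = misclassified / table.sum()
print(table)
print(error_rate)
```

The diagonal holds correct classifications; summing the off-diagonal cells and dividing by the total reproduces the error rate read from the JMP contingency table.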

Figure 4 Contingency Table of Type by Predicted Type

References

Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR) (2000), Arlington, VA: American Psychiatric Publishing, Inc.

SAS Institute Inc. (2005), Multivariate Statistical Method: Practical Research Methods Course Notes, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (2005), JMP Statistics and Graphics Guide, Cary, NC: SAS Institute Inc.


Finding Sample Data Lee Creighton, JMP Development

When reading reference guides or teaching a JMP class, it is often important to quickly locate the sample data installed with JMP.

If you are interested in finding sample data based on what statistical concept or industrial area they illustrate, use the Help > Sample Data Directory command. This command opens a JMP journal that has links to many (but not all) of the sample data files, categorized by the area of use (see Figure 1).

However, if you know the name of the file you want to locate, use your operating system’s file browser to get an alphabetical list of files. You can then quickly identify the file by its name. The Sample Data directory location depends on the operating system you are running. JMP conforms with the recommendations of each operating system’s creators, so the location of the files varies.

• If you are using Microsoft Windows, the default sample data location is

C:\Program Files\SAS\JMP7\Support Files English\Sample Data

• If you are using Macintosh OS X, the default Sample Data location is

/Library/Application Support/JMP/7/English/Sample Data

• If you are running Linux, the default Sample Data location is

/opt/JMP8/Support Files English/Sample Data

or

/home/userid/JMP8/Support Files English

There are several options to mark these locations. Each operating system has a facility for making a shortcut or alias that points to a location. Frequent JMP users often make shortcuts on their desktop for quick access to the sample data.

Additionally, Windows allows you to create Favorites that are displayed in every File Open dialog box. To do this, use File > Open within JMP and navigate to the Sample Data folder at the address shown above. Then click the add-to-favorites folder icon, as illustrated in Figure 2. Thereafter, whenever you use File > Open to display the Open Data File dialog, click the go-to-favorites folder icon to see the Sample Data folder listed in the favorites list.

Figure 1 Use the Help > Sample Data Directory Command to see the Sample Data Listed by Category

Figure 2 Use the Open Data File dialog and Navigate to the Sample Data Directory

See Favorites

Add to Favorites


Statistical Intervals: Confidence, Prediction, Enclosure José G. Ramírez, Ph.D., W.L. Gore and Associates, Inc.

Introduction This article uses an example to explain three kinds of statistical intervals that can be confusing, even in the minds of those who use them often:

• confidence intervals

• prediction intervals

• tolerance intervals

Suppose you are in charge of a coaxial cable manufacturing operation, and that one type of coaxial cable has a target resistance of 50 Ohms with a standard deviation of 2 Ohms. During one month, you took a random and representative sample of 40 cables from your production process and measured their resistance. The data table, ResistanceData.jmp, is included with this issue of JMPer Cable at

http://www.jmp.com/about/newsletters/

JMP (Analyze > Distribution) generated the histogram with basic statistics shown in Figure 1.

Figure 1 Distribution and Summary Statistics for Resistance Data

The average resistance is 49.86 Ohms and the standard deviation is 1.96 Ohms. Based on these estimates of the mean and the standard deviation, it appears you are meeting the target values of 50 Ohms resistance with a standard deviation of 2 Ohms. But, what else can be said about the resistance data?

Confidence Intervals A confidence interval is ideal for quantifying the degree of uncertainty around common parameters of interest such as the mean or standard deviation of a population. For normally distributed data, a confidence interval for the mean is

X̄ ± t(1−α/2, n−1) · s/√n    (1)

The confidence interval adds a margin of error to the estimate of the population mean. For sample size n, the margin of error is a function of a t quantile with n−1 degrees of freedom and the standard deviation estimate s.

Similarly, a confidence interval for the standard deviation gives lower and upper bounds for the variation of the standard deviation estimate. Yes, even the estimate of noise, s, has noise in it!

To generate a confidence interval in JMP, click the red triangle on the Resistance title bar and select Confidence Interval. Then complete the dialog as shown in Figure 2.

Figure 2 Confidence Intervals for Mean and Standard Deviation (95%)

The 95% lower confidence bound for the mean is 49.23 Ohms and the upper bound is 50.49 Ohms—with 95% confidence, the population mean of the resistance measurements lies in the interval [49.23, 50.49] Ohms. Since this interval contains the target value of 50 Ohms, you are 95% confident of being that close to the target.

Likewise, for the standard deviation you can say with 95% confidence that the true standard deviation of the resistance population lies in the interval [1.61, 2.52] Ohms. Since the target standard deviation of 2 Ohms falls within this interval, you are meeting the target with 95% confidence.
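Both intervals can be reproduced from the article's summary statistics. A Python sketch with scipy, applying equation (1) for the mean and the standard chi-square interval for the standard deviation (n, the mean, and s are taken from the report above):

```python
import numpy as np
from scipy import stats

n, xbar, s = 40, 49.86, 1.96   # summary statistics from the article
alpha = 0.05

# Equation (1): confidence interval for the mean
t = stats.t.ppf(1 - alpha / 2, n - 1)
mean_ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# Chi-square-based confidence interval for the standard deviation
chi_hi = stats.chi2.ppf(1 - alpha / 2, n - 1)
chi_lo = stats.chi2.ppf(alpha / 2, n - 1)
sd_ci = (s * np.sqrt((n - 1) / chi_hi), s * np.sqrt((n - 1) / chi_lo))

print(mean_ci)  # approximately (49.23, 50.49)
print(sd_ci)    # approximately (1.61, 2.52)
```

The small discrepancies from the JMP report come only from rounding the summary statistics to two decimals.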

What Does Degree of Confidence Mean? A typical first reaction is to interpret 95% confidence as meaning there is a 95% chance that the interval contains the true mean of my population. But this is a bit simplistic. Think of confidence intervals as nets that can capture the true population parameter (mean or standard deviation), so that for 95% confidence, on average, 95% of these nets


capture the true population mean, while 5% of them, on average, are going to fail—that is, the net will not capture the true population mean. In other words, before you compute the 95% confidence interval there is a 95% chance that your net is going to work and capture the true population value. However, after the net is created, it either encloses the true population mean, or not.

It’s like a game of poker: before a hand is dealt you have a 42.3% chance of getting one pair (you feel 42.3% confident you will get a pair) but after the hand is dealt, you either have the pair or you don’t. Therefore, the degree of confidence is a statement about the quality (yield) of the procedure for generating the confidence interval. Equation (1) is the procedure for generating a confidence interval for the mean.

JMP provides a nice script that simulates the confidence interval around the mean to help you visualize and understand the interpretation of degree of confidence. To use the simulator, run the script called Confidence Intervals, saved with the Resistance Data table. For each sample, the script calculates a 95% confidence interval for the mean using equation (1), and shows the yield of this process (how many intervals out of a hundred contain the true mean of 50 Ohms). Figure 3 shows one of these simulations, for which 93% of the intervals include the true population mean of 50.

Click New Sample to generate a new set of intervals. By repeatedly resampling, you can see that sometimes 94% of the intervals contain the true mean, other times 97%, and so on.
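The script's behavior can be imitated with a short simulation. This Python sketch draws 100 samples of size 40 from a normal population with mean 50 and standard deviation 2, and counts how many 95% intervals capture the true mean (the random seed is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 50, 2, 40, 100
t = stats.t.ppf(0.975, n - 1)

hits = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    xbar, s = sample.mean(), sample.std(ddof=1)
    half = t * s / np.sqrt(n)           # margin of error from equation (1)
    if xbar - half <= mu <= xbar + half:
        hits += 1

print(hits, "of", reps, "intervals capture the true mean")
```

Re-running with different seeds shows the hit count hovering around 95, exactly the "yield" interpretation described above.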

Prediction Intervals What if you want to make a claim about the resistance of a future cable, or the average resistance of a group of cables that you are going to manufacture in the future?

The confidence intervals for the mean and standard deviation calculated in Figure 2 refer to the population of cables manufactured during the month in which the 40 cables were sampled, and not to an individual observation, or group of observations in the future. A prediction interval for a single future observation resembles a confidence interval for the mean but it is wider because it takes into account the prediction noise. The formula for the prediction interval is

X̄ ± t(1−α/2, n−1) · s · √(1 + 1/n)

To generate a 95% prediction interval for one future observation, click the red triangle on the Resistance title bar and select Prediction Interval (Figure 4). The results show the 95% lower prediction bound to be 45.84 Ohms, and the 95% upper prediction bound is 53.88 Ohms. The claim you can make is that, with 95% confidence, you expect a future cable to have a resistance between 45.84 and 53.88 Ohms.
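The prediction interval for one future observation, X̄ ± t·s·√(1 + 1/n), can be checked numerically with the article's summary statistics (a Python sketch with scipy):

```python
import numpy as np
from scipy import stats

n, xbar, s = 40, 49.86, 1.96   # summary statistics from the article
t = stats.t.ppf(0.975, n - 1)

# The extra "1 +" under the root is the prediction noise of a new observation
half = t * s * np.sqrt(1 + 1 / n)
pi = (xbar - half, xbar + half)
print(pi)  # approximately (45.85, 53.87)
```

Compare half here with the margin of error for the mean: the √(1 + 1/n) factor is what makes the prediction interval so much wider.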

Figure 4 Prediction Interval for One Future Observation

If you make a batch of 10 cables, where do you expect the mean and standard deviation resistance of the 10 cables to fall? To answer this question, again select Prediction Interval but this time enter 10 in the Enter number of future samples box.

Figure 3 Simulation of Confidence Interval for the Mean

The Confidence Intervals script simulates and plots 100 samples of size 40 with mean = 50 and standard deviation = 2. In this example, 93% of the samples cross (include) the mean reference line.


The results in Figure 5 show that 95% of the time, you can expect the average resistance of the batch of 10 future cables to fall within 48.46 and 51.26 Ohms, and the standard deviation to be within 1.05 and 3.08 Ohms. Note that the prediction intervals for the mean and the standard deviation of 10 future observations contain the targets of 50 Ohms and 2 Ohms.
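The interval for the average of m future observations replaces the factor under the root with √(1/m + 1/n). Plugging in the article's summary statistics (Python with scipy; the standard-deviation interval in Figure 5 uses a separate F-based formula not computed here):

```python
import numpy as np
from scipy import stats

n, xbar, s, m = 40, 49.86, 1.96, 10   # m = number of future observations
t = stats.t.ppf(0.975, n - 1)

# Prediction interval for the mean of m future observations
half = t * s * np.sqrt(1 / m + 1 / n)
pi_mean = (xbar - half, xbar + half)
print(pi_mean)  # approximately (48.46, 51.26)
```

As m grows, the 1/m term shrinks and the interval tightens toward the confidence interval for the mean.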

Figure 5 Prediction Intervals for 10 Future Observations

What can you say about the capability of your process?

Tolerance Intervals A tolerance interval is an interval that encloses a specified proportion of the sampled population, not its mean or standard deviation. For a specified confidence level, you may want to determine lower and upper bounds that contain 99% of the population.

Tolerance bounds can be used to set up specification limits by finding the lower and upper values that correspond to stated yield or process capability goals. For example, in engineering and science it is common practice to set a specification limit by adding plus or minus 3 standard deviations (s or sigma) to the estimate of the mean. For normally distributed data, this specification limit encloses 99.73% of the population in the interval defined by the mean plus or minus 3s. However, this is true only if the true population parameters are known, a fact that practitioners may not be aware of. Usually the mean and variance are estimated from a random sample that is often small.

A tolerance interval is also constructed to take estimation noise and the sample size into account via a factor g(confidence, proportion, sample size):

X̄ ± g(1−α/2, p, n) · s

A (95%, ±3s)-equivalent tolerance interval is then given by X̄ ± g(0.975, 0.9973, n) · s. Note that 0.975 refers to the confidence, while 0.9973 (the ±3s equivalent) refers to the proportion of the population covered by the tolerance bounds.

Click the red triangle on the Resistance histogram title bar and select Tolerance Interval to see the Tolerance Intervals dialog. To generate a 95% tolerance interval that covers 99.73% of the resistance data population, type 0.95 in the Specify Confidence (1-Alpha) box and 0.9973 in the Specify Proportion to Cover box. Figure 6 shows the resulting interval.
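The article does not spell out JMP's g function; one common way to approximate the two-sided normal tolerance factor is Howe's method, sketched below in Python. With the article's summary statistics it reproduces the Figure 6 bounds to within rounding:

```python
import numpy as np
from scipy import stats

n, xbar, s = 40, 49.86, 1.96
conf, p = 0.95, 0.9973                  # confidence, proportion to cover

# Howe's approximation to the two-sided normal tolerance factor g
z = stats.norm.ppf((1 + p) / 2)         # 3.0 for p = 0.9973 (the 3-sigma case)
chi2 = stats.chi2.ppf(1 - conf, n - 1)  # lower chi-square quantile
g = z * np.sqrt((n - 1) * (1 + 1 / n) / chi2)

ti = (xbar - g * s, xbar + g * s)
print(ti)  # approximately (42.53, 57.19)
```

Note that g ≈ 3.74 here, noticeably larger than the naive factor of 3: with only 40 observations, the interval must widen to account for the noise in X̄ and s.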

Figure 6 Tolerance Interval for the Resistance Population

Summary Statistical intervals help quantify the uncertainty surrounding estimates calculated from data, such as the mean and standard deviation. The three types of intervals presented here (confidence, prediction, and tolerance) are particularly relevant for applications in science and engineering because they support practical claims about sampled data, as shown in Table 1. JMP makes it easy to obtain these intervals; all you have to do is apply them in the right way.

References

Ramírez, J.G. and Ramírez, B.S. (in production), Analyzing and Interpreting Continuous Data Using JMP: A Step-by-Step Guide, SAS Press.

Hahn, G.J. and Meeker, W.Q. (1991), Statistical Intervals: A Guide for Practitioners, John Wiley and Sons, Inc.

Table 1 Summary of 95% Statistical Intervals and Interpretation for the Resistance Data Example

Interval Type                                   Lower (Ohms)  Upper (Ohms)  Interpretation
Confidence for the Population Mean              49.23         50.49         The population mean of the resistance measurements is within these bounds.
Prediction for 1 Future Observation             45.84         53.88         You expect a single future resistance measurement to be within these bounds.
Prediction for the Average of 10 Future Obs.    48.46         51.26         You expect the average of 10 future resistance measurements to be within these bounds.
Tolerance to Enclose 99.73% of the Population   42.52         57.20         You expect 99.73% of the resistance measurements to be within these bounds.


Book Discussion: Elementary Statistics Using JMP

By Sandra Schlotzhauer Published by SAS Publishing, April 2007

This reader-friendly guide bridges the gap between statistics texts and JMP documentation. The book opens with an explanation of the basics of JMP data tables, demonstrating how to use JMP for descriptive statistics and graphs. The author continues with a lucid discussion of fundamental statistical concepts, including normality and hypothesis testing. Using a step-by-step approach, she shows analyses for comparing two groups, comparing multiple groups, fitting regression equations, and exploring contingency tables. For each analysis, the author clearly explains assumptions, the statistical approach, the JMP steps and results, and how to make conclusions from the results.

Understand how to interpret both the graphs and text reports, as well as how to customize JMP results to meet your needs. Packed with examples from a broad range of industries, this text is ideal for novice to intermediate JMP users. Prior statistical knowledge, JMP experience, or programming skills are not required.

Don Lifke, Certified Six Sigma Black Belt at Sandia National Laboratories, says, “This book is a great tool for learning both statistics and JMP simultaneously. It is ideal for those wanting to learn the power of JMP, along with learning the power of statistical data analyses. This book, used in conjunction with the powerful graphical features of JMP, will provide the reader the knowledge necessary to move to data-based decision making. It is a must for those who have data they'd like to analyze but don't know where to start.”

Discovery 2008 The Data Exploration Conference:

June 16-17 Formerly the JMP User Conference, Discovery 2008 offers a chance to explore new dimensions in modeling and data visualization. The conference this year will focus on how SAS and JMP are at work together in corporate, academic, government and research settings to make discoveries through data visualization. JMP and SAS have joined together to offer a conference that will benefit a wide range of customers. The conference agenda will appeal to business users, data analysts, programmers, statisticians, scientists, engineers, as well as students and academicians.

Whether you are a SAS® or a JMP® user, or are simply interested in statistical analysis, this event offers best practices, proven statistical techniques, and the latest trends in data visualization and data modeling.

There are session speakers from SAS Institute as well as all walks of business and industry.

There are one-day pre-conference workshops scheduled before the conference, and training courses during the three days following the conference.

Join us June 16-17 at SAS Institute in Cary, North Carolina, and gain insight into visualization techniques that will enable you to do your job more efficiently. For more details about the conference and information about how to register, go to

http://www.jmp.com/

and click on the Discovery 2008 section.

Discovery 2008: The Data Exploration Conference: June 16-17 Pre-Conference workshops: June 15 Post-Conference training: June 18-20

Registration Opens for 2008 Innovators' Summit Plan now to attend the second annual Innovators' Summit, scheduled for Sept. 24-26 in San Francisco. Like last year's inaugural event, the 2008 summit, sponsored by JMP, will bring together creative thinkers from business and academia for three days of networking, benchmarking and idea sharing about ways to make analytic excellence the rule, not the exception, across any organization.

Keynote speakers will include innovation experts from Google, NASA, McDonald’s, Stanford and other institutions large and small who will share strategies for the effective use of analytics and visualization in product improvement and process enhancement. In keynote presentations, panel discussions, breakout sessions and one-on-one conversations, participants will have the opportunity to share challenges and discuss strategies for overcoming them.

For more information and registration, visit the Innovators' Summit Web site at www.jmp.com/summit.


Why Shouldn't I Delete That Model Term? Mark Bailey and Laura Ryan, SAS Institute Education

Introduction Students are often told that certain terms should not be deleted from a linear regression model. For example, when the model is a higher-order polynomial with interaction or crossed terms, a non-significant term such as X1 that is involved in a significant higher-order term such as X1^2 or X1*X2 should not be removed. On the other hand, using the simplest, most parsimonious model means that all non-significant terms should be removed.

If you remove a non-significant term that is present in significant higher-order effects, the regression procedure and the model won't break, but removing the term can fundamentally change the model. First, eliminating this low-order term jeopardizes the scale invariance property. Moreover, the reduced model is less flexible and distorts the fit by introducing artificial restrictions. In the case of a simple quadratic model with one predictor variable, eliminating the lower-order term before removing the quadratic term forces the intercept, or the minimum or maximum of the response, to be zero. Even if these terms are not significant, these conditions might not be realistic and can therefore introduce bias.

This advice is not a rule. The final decision still rests with the analyst. Draper and Smith (1998) offer two practical criteria as guidance to help determine if removing a term adversely affects your model.

1. The form of the model should not change with a shift of the origin.

2. The form of the model should not change with a rotation of the axes.

The First Criterion: Shifting the Origin Should not Change the Form of the Model The form of the model might change when a predictor variable has measurement units that can be converted from one form to another. For example, temperature can be expressed in degrees on the Fahrenheit (F) or Celsius (C) scale, which involves a shift of origin. Suppose for a Fahrenheit temperature, F, you fit a response, Y, as Y = β0 + β1F + β11F^2 and find the term β1F to be non-significant. If you drop that term from the model and substitute Celsius temperature, C, into the equation, you have Y = β0 + β11C^2. Now, if you substitute the conversion C = 5/9*(F − 32) back into the equation, the algebraic result gives not only a new coefficient for the quadratic term, but also causes a first-order term to reappear.
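The reappearing first-order term can be verified by expanding the substitution directly. A Python sketch with exact rational arithmetic (the coefficient value b11 = 1 is purely illustrative):

```python
from fractions import Fraction

# Model after dropping the linear term, in Celsius: Y = b0 + b11 * C^2.
# Substitute C = 5/9 * (F - 32) and expand in powers of F:
#   b11 * (5/9)^2 * (F - 32)^2 = b11 * (25/81) * (F^2 - 64*F + 1024)
b11 = Fraction(1)                 # illustrative coefficient
a = Fraction(5, 9) ** 2           # (5/9)^2 = 25/81

coef_F2 = b11 * a                 # new quadratic coefficient
coef_F1 = b11 * a * (-64)         # a first-order term reappears
coef_F0 = b11 * a * 1024          # absorbed into the intercept

print(coef_F1)  # -1600/81, nonzero: the linear term is back
```

Because coef_F1 is nonzero whenever b11 is, the Fahrenheit form of the "reduced" model is not actually free of a first-order term; the reduction only held in the Celsius coordinates.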

The same kind of change often occurs in the analysis of data from a designed experiment, where factor coding is automatically implemented in JMP analyses. This change of scale is defined in terms of the mid-point M = (L + H)/2 and the half-interval h = (H − L)/2 of the original scale, where L and H are the low and high values of the factor. The coded level C = (X − M)/h falls on a unitless scale [−1, 1]. (Note that you can convert any coded value back to the original scale.)
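The coding transformation and its inverse can be written out directly (a minimal Python sketch; the 100-to-200 degree factor range is illustrative):

```python
def code(x, low, high):
    """Map a factor level x from [low, high] onto the unitless scale [-1, 1]."""
    mid = (low + high) / 2
    half = (high - low) / 2
    return (x - mid) / half

def uncode(c, low, high):
    """Invert the coding back to the original measurement scale."""
    return (low + high) / 2 + c * (high - low) / 2

# Example: a temperature factor run from 100 to 200 degrees
print(code(100, 100, 200), code(150, 100, 200), code(200, 100, 200))
print(uncode(0.5, 100, 200))
```

The low and high settings map to −1 and +1, the mid-point maps to 0, and uncode recovers the original units from any coded value.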

If you fit a model using coded levels, eliminate lower order terms while keeping higher-order terms, and then revert back to the original scale, you encounter the same bias as in the example of changing the temperature scale. Consider the following example.

In Figure 1, an optimal design in sixteen runs tests the full quadratic model in three factors, Inlet Temperature, Reaction Temperature, and Reactant Concentration, for a chemical reaction. The initial analysis indicates that some terms are not significant, including the main effect of Inlet Temperature. This example uses the JMP table Criterion1.jmp. The data tables for the examples are on the JMP web site:

www.jmp.com/about/newsletters/

Figure 1 Results of Quadratic Model Fit

The usual approach removes the non-significant terms (the fourth, seventh, eighth, and ninth terms), but leaves the first term in the model because of a significant interaction. However, if you remove all of these terms, including the first term, and fit the model again, JMP gives a warning because of the first criterion (Figure 2). You can continue in spite of the warning, and the new parameter estimates are then based on coded factor levels.


Figure 2 Error when Lower Order Effect is Missing

Next, save the prediction formula in a new column. The formula shows at the top of Figure 3. Notice the absence of the main effect, Inlet Temperature.

Use the Simplify command from the menu on the formula editor to see the formula at the bottom in Figure 3. Simplifying the formula algebraically introduces the main effect Inlet Temperature back into the model. See the section, Algebraic Details, later in this article for more information.

Figure 3 Nonhierarchical Predicted Formula

The Second Criterion: The Model Form Should Not Change Due to Axis Rotation

The second criterion states that the form of the model should not change with a rotation of the axes. However, sometimes it is advantageous to transform the data to a new coordinate system. For example, principal components analysis (PCA) is a useful way to obtain new predictors (Wi) when the original variables (Xi) exhibit collinearity. PCA creates new predictors from orthogonal linear combinations of the original variables. In effect, PCA rotates the original data into a new coordinate system.

To illustrate this situation, consider the Criterion2.jmp table. Types of crime (murder, rape, robbery, assault, burglary, larceny, and auto theft) are used as predictors of house values. There is collinearity among the different types of crime, but removing some of the crimes as predictors would lose unique information. Instead, a model can be built from linear combinations of those predictors.

First, use the multivariate analysis platform to do principal component analysis on the seven predictor variables: Murder, Rape, Robbery, Assault, Burglary, Larceny, and Auto. The principal components are built on correlations, and the Varimax rotation method is used to rotate two factors. The rotated factors are then saved to the data table (see Figure 4). All of the variables are in the model, but the perspective is changed.
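JMP's Multivariate platform performs these steps directly. As a rough outside-JMP sketch, the unrotated correlation-based PCA step might look like this in NumPy (random synthetic data stands in for the seven crime-rate predictors, and the varimax rotation is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))  # synthetic stand-in for the crime-rate predictors

# PCA on correlations: standardize, then eigendecompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]   # largest-variance components first
loadings = eigvecs[:, order[:2]]    # keep the first two components

# Component scores, analogous to the factor columns saved to the data table
scores = Z @ loadings
print(scores.shape)  # (50, 2)
```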

Figure 4 Rotated Principal Components in Data Table

Next, use the data table to create the chart shown in Figure 5, which helps show factor membership. It appears that the first rotated factor (Col0) contains the crimes Auto, Burglary, Larceny, and Robbery. The second factor (Col1) contains Assault, Murder, and Rape. Therefore, the first factor could be described as crimes against property and renamed Property Crime; the second describes crimes against people and can be renamed People Crime. These rotated factors are two new predictor variables that can be used to fit a model.

Figure 5 Variables and Principal Components


Now, fit a model using the two rotated principal components and Housing Value (a fictitious response for this example). Construct a model that has the two main effects, the two-factor interaction, and the two quadratic terms.
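In JMP this model is specified in Fit Model. As a hedged outside-JMP sketch, the same six-term quadratic fit can be expressed as ordinary least squares (synthetic data; the variable names and true coefficients below are placeholders, not the article's actual data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
prop = rng.normal(size=n)    # stand-in for Property Crime
people = rng.normal(size=n)  # stand-in for People Crime
value = 3 + 1.5*prop - 2.0*people + 0.5*prop*people + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, two main effects, interaction, two quadratics
D = np.column_stack([np.ones(n), prop, people,
                     prop*people, prop**2, people**2])
coef, *_ = np.linalg.lstsq(D, value, rcond=None)
print(coef.round(2))  # estimates near the simulated coefficients
```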

The Effect Tests report in Figure 6 shows that the quadratic term Property Crime*Property Crime is not significant. It is tempting to use a model having only the four terms that are all significant at the 5% level. However, the second criterion tells you to leave the quadratic term in the model unless you remove all of the terms of that same order. Otherwise, the model in the original variables will not have the same form as the final model with rotated axes.

Figure 6 Results of Fitting Principal Component Factors

Algebraic Details

The first criterion concerns changes in the model when a predictor variable has its measurement units converted from one form to another, or a shift in origin. The example fit a model with temperature expressed in degrees Fahrenheit (F) and removed the main effect term. When the Celsius (C) model is converted back to Fahrenheit, the main effect reappears, as follows.

Y = β0 + β1F + β11F², remove main effect: Y = β0 + β11F²

Y = β0 + β11F², substitute Celsius: Y = β0 + β11C²

Y = β0 + β11C², substitute the conversion C = (5/9)(F – 32):

Y = β0 + β11((5/9)(F – 32))²
Y = β0 + β11(25/81)(F² – 64F + 1024)
Y = (β0 + (25/81)·1024·β11) – ((25/81)·64·β11)F + ((25/81)β11)F²
Y = γ0 + γ1F + γ11F²

The second criterion concerns changes in model form when there is a rotation of the axes. The example created two rotated principal components, fit a model using the components as predictors and removed a lower-order term but left higher-order terms involving that term. Converting back to the original variables re-introduces the term that was removed.

The quadratic model for two predictors is written

Y = β0 + β1X1 + β2X2 + β11X1² + β22X2² + β12X1X2 + ε

The linear combination of the predictors, W1 and W2, is a special case that represents a rotation of the axes:

W1 = aX1 + bX2
W2 = –bX1 + aX2, with a² + b² = 1

This system of simultaneous equations can be written in matrix notation as W = CX, where C is the rotation matrix with rows (a, b) and (–b, a). The reverse operation, to transform back to the original coordinates, is X = C⁻¹W (here C⁻¹ = Cᵀ, because C is orthogonal), which simplifies to

X1 = aW1 – bW2
X2 = bW1 + aW2

Substitute the combinations above for the original predictors in the quadratic model:

Y = β0 + β1(aW1 – bW2) + β2(bW1 + aW2) + β11(aW1 – bW2)² + β22(bW1 + aW2)² + β12(aW1 – bW2)(bW1 + aW2) + ε

This model can also be written as

Y = γ0 + γ1W1 + γ2W2 + γ11W1² + γ22W2² + γ12W1W2 + ε

where

γ0 = β0
γ1 = β1a + β2b
γ2 = –β1b + β2a
γ11 = β11a² + β22b² + β12ab
γ22 = β11b² + β22a² – β12ab
γ12 = –2β11ab + 2β22ab + β12(a² – b²)
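These γ expressions can be spot-checked numerically in plain Python (the β values and the rotation angle are arbitrary illustrative choices):

```python
import math

a, b = math.cos(0.3), math.sin(0.3)  # any rotation satisfies a**2 + b**2 == 1
b0, b1, b2, b11, b22, b12 = 1.0, 2.0, -1.5, 0.7, -0.4, 1.1

def y_original(x1, x2):
    return b0 + b1*x1 + b2*x2 + b11*x1**2 + b22*x2**2 + b12*x1*x2

# Gamma coefficients from the derivation above
g0 = b0
g1 = b1*a + b2*b
g2 = -b1*b + b2*a
g11 = b11*a**2 + b22*b**2 + b12*a*b
g22 = b11*b**2 + b22*a**2 - b12*a*b
g12 = -2*b11*a*b + 2*b22*a*b + b12*(a**2 - b**2)

def y_rotated(w1, w2):
    return g0 + g1*w1 + g2*w2 + g11*w1**2 + g22*w2**2 + g12*w1*w2

# The two forms agree at any point (x1, x2)
x1, x2 = 0.8, -0.6
w1, w2 = a*x1 + b*x2, -b*x1 + a*x2
print(abs(y_original(x1, x2) - y_rotated(w1, w2)) < 1e-12)  # True
```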

Summary

Given the two Draper and Smith (1998) criteria discussed in this article, it is important to understand how each applies to a particular problem.

• If a model involves only a shift in its origin, the only applicable criterion is that lower-order terms cannot be dropped while higher-order terms remain in the model.

• If a model involves both a shift in its origin and a linear combination of the original predictor variables, then both criteria should be observed. Terms must be removed in groups based on their order: if one second-order term is removed, all terms of that order should also be removed, and lower-order terms cannot be dropped while higher-order terms remain in the model, even if those terms are not significant.

The JMP Fit Model platform checks for these violations.

Draper, Norman R., and Smith, Harry (1998), Applied Regression Analysis, Third Edition, New York: John Wiley & Sons, pp. 266–271.


About JMPer Cable Issue 24 Spring 2008

JMPer Cable is mailed to JMP users who are registered with SAS Institute. It is also available online at www.jmp.com

Contributors Mark Bailey, Lee Creighton, Tonya Mauldin, José G. Ramírez, Laura Ryan

Editor Ann Lehman

Printing SAS Institute Print Center

Questions, comments, or for more information about JMP, call 1-877-594-6567 or visit us online at www.jmp.com

To Order JMP Software

1-877-594-6567

Copyright© 2006 SAS Institute Inc. All rights reserved. SAS, JMP, JMPer Cable, and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Six Sigma is a registered trademark of Motorola, Inc.