23
qwertyuiopasdfghjklzxcvbnmq wertyuiopasdfghjklzxcvbnmqw ertyuiopasdfghjklzxcvbnmqwe rtyuiopasdfghjklzxcvbnmqwer tyuiopasdfghjklzxcvbnmqwert yuiopasdfghjklzxcvbnmqwerty uiopasdfghjklzxcvbnmqwertyu iopasdfghjklzxcvbnmqwertyui opasdfghjklzxcvbnmqwertyuio pasdfghjklzxcvbnmqwertyuiop asdfghjklzxcvbnmqwertyuiopa DISCRIMINANT ANALYSIS AND ITS APPLICATION BDM&DM Term Paper 10/12/2009 Ankoor Das, 0811277

Discriminant Analysis Term Paper

Embed Size (px)

Citation preview

Page 1: Discriminant Analysis Term Paper

qwertyuiopasdfghjklzxcvbnmqwertyui

opasdfghjklzxcvbnmqwertyuiopasdfgh

jklzxcvbnmqwertyuiopasdfghjklzxcvb

nmqwertyuiopasdfghjklzxcvbnmqwer

tyuiopasdfghjklzxcvbnmqwertyuiopas

dfghjklzxcvbnmqwertyuiopasdfghjklzx

cvbnmqwertyuiopasdfghjklzxcvbnmq

wertyuiopasdfghjklzxcvbnmqwertyuio

pasdfghjklzxcvbnmqwertyuiopasdfghj

klzxcvbnmqwertyuiopasdfghjklzxcvbn

mqwertyuiopasdfghjklzxcvbnmqwerty

uiopasdfghjklzxcvbnmqwertyuiopasdf

ghjklzxcvbnmqwertyuiopasdfghjklzxc

Discriminant Analysis and its application

BDM&DM Term Paper

10/12/2009

Ankoor Das, 0811277

Page 2: Discriminant Analysis Term Paper

October 12, 2009

CONTENTS

Contents.......................................................................................................2

Introduction.....................................................................................................3

Basics of Discriminant Analysis....................................................................3

How does DA work?......................................................................................3

When to use Discriminant Analysis? – the Basic Assumptions.....................5

Types of Discriminant Analysis........................................................................6

Linear Discriminant Analysis.........................................................................6

Applications of Discriminant Analysis..............................................................8

Social Sciences.............................................................................................8

Medicine and Diagnostics...........................................................................10

Insurance Companies.................................................................................13

Marketing....................................................................................................16

Conclusions....................................................................................................18

References.....................................................................................................19

BDM&DM Term Paper | Ankoor Das (0811277) 2

Page 3: Discriminant Analysis Term Paper

October 12, 2009

INTRODUCTION

BASICS OF DISCRIMINANT ANALYSIS

Discriminant analysis is a statistical method which allows the researcher to

study the difference between two or more groups of objects (or set of

events) with respect to several variables simultaneously. They are mostly

used to predict possible outcomes of situations based on certain variables

which can influence the outcome.

For example, a researcher may want to investigate which variables

discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels.

For that purpose, the researcher could collect data on numerous fruit

characteristics of those species eaten by each of the animal groups. Most

fruits will naturally fall into one of the three categories. Discriminant analysis

could then be used to determine which variables are the best predictors of

whether a fruit will be eaten by birds, primates, or squirrels.

Discriminant function analysis is multivariate analysis of variance (MANOVA)

reversed. In MANOVA, the independent variables are the groups and the

dependent variables are the predictors. In DA, the independent variables are

the predictors and the dependent variables are the groups. Usually, several

variables are included in a study to see which ones contribute to the

discrimination between groups.

HOW DOES DA WORK?

Discriminant function analysis is broken into a 2-step process:

(1)Testing significance of a set of discriminant functions

(2)Classification.

In the first step, there is a matrix of total variances and covariances;

likewise, there is a matrix of pooled within-group variances and covariances.

The two matrices are compared via multivariate F tests in order to determine

BDM&DM Term Paper | Ankoor Das (0811277) 3

Page 4: Discriminant Analysis Term Paper

Event1

Event2

Event3

X1

X2

Outcomes/groupsDiscriminating Variables

Xn

October 12, 2009

whether or not there are any significant differences (with regard to all

variables) between groups. One first performs the multivariate test, and, if

statistically significant, proceeds to see which of the variables have

significantly different means across the groups.

Once group means are found to be statistically significant, classification of

variables is undertaken. DA automatically determines some optimal

combination of variables so that the first function provides the most overall

discrimination between groups; the second provides second most, and so on.

Moreover, the functions will be independent or orthogonal so the

contributions to the discrimination between groups will not overlap. The first

function picks up the most variation; the second function picks up the

greatest part of the unexplained variation, etc.... Classification is then

possible from the canonical functions hence formed. Subjects are classified

in the groups in which they had the highest classification scores. The

maximum number of discriminant functions will be equal to the smaller of

either the degrees of freedom, or the number of variables in the analysis.

Standardized beta coefficients are obtained for each variable in each

discriminant function, and the larger the standardized coefficient, the greater

is the contribution of the respective variable to the discrimination between

groups.

BDM&DM Term Paper | Ankoor Das (0811277) 4

Page 5: Discriminant Analysis Term Paper

October 12, 2009

WHEN TO USE DISCRIMINANT ANALYSIS? – THE BASIC ASSUMPTIONS

Sample size: Unequal sample sizes are acceptable. The sample size of the

smallest group needs to exceed the number of predictor variables. As a

“rule of thumb”, the smallest sample size should be at least 20 for a few (4

or 5) predictors. The maximum number of independent variables is n - 2,

where n is the sample size. While this low sample size may work, it is not

encouraged, and generally it is best to have 4 or 5 times as many

observations and independent variables.

No. of outcomes/groups: There should be at least 2 or more possible

outcomes for the Discriminant analysis to be performed. These outcomes or

groups should not overlap (the functions hence would be orthogonal or non-

interfering). Though there is no limit on the no. of outcomes, but it is usually

limited to 5 or less.

Normal distribution: It is assumed that the data (for the variables)

represent a sample from a multivariate normal distribution. You can examine

whether or not variables are normally distributed with histograms of

frequency distributions. However, note that violations of the normality

assumption are not "fatal" and the resultant significance test are still reliable

as long as non-normality is caused by skewness and not outliers .

Homogeneity of variances/covariances: DA is very sensitive to

heterogeneity of variance-covariance matrices. Before accepting final

conclusions for an important study, it is a good idea to review the within-

groups variances and correlation matrices. Homoscedasticity is evaluated

through scatterplots and corrected by transformation of variables.

Outliers: DA is highly sensitive to the inclusion of outliers. Run a test for

univariate and multivariate outliers for each group, and transform or

eliminate them. If one group in the study contains extreme outliers that

impact the mean, they will also increase variability. Overall significance tests

BDM&DM Term Paper | Ankoor Das (0811277) 5

Page 6: Discriminant Analysis Term Paper

October 12, 2009

are based on pooled variances, that is, the average variance across all

groups. Thus, the significance tests of the relatively larger means (with the

large variances) would be based on the relatively smaller pooled variances,

resulting erroneously in statistical significance.

Non-multicollinearity: If one of the independent variables is very highly

correlated with another, or one is a function (e.g., the sum) of other

independents, then the tolerance value for that variable will approach 0 and

the matrix will not have a unique discriminant solution. There must also be

low multicollinearity of the independent variables.

TYPES OF DISCRIMINANT ANALYSIS

LINEAR DISCRIMINANT ANALYSIS

Linear Discriminant model (LDA) is used in the case when the groups are

separable by linear combinations of the discriminating variables. If only two

features, the separators between objects group will become lines. If the

features are three, the separator is a plane and the number of features (i.e.

independent variables) is more than 3, the separators become a hyper-

plane. The final value of the Discriminant function will determine the group

the particular observation belongs to. Appropriate threshold values and

relative significance of individual Discriminant function will lead to the final

outcome/group.

LDA is closely related to ANOVA (analysis of variance) and regression

analysis, which also attempt to express one dependent variable as a linear

combination of other features or measurements. In the other two methods

however, the dependent variable is a numerical quantity, while for LDA it is

a categorical variable (i.e. the class label).

The threshold values can be obtained by proper training of the function with

known outcomes from past observations. For example, in the case of the

BDM&DM Term Paper | Ankoor Das (0811277) 6

Page 7: Discriminant Analysis Term Paper

October 12, 2009

fruits, previous data on which fruit was eaten by any one of the three

animals and the obtained Discriminant function values based on the

variables will help one attain a training set of possible values of outcomes for

each fruit or animal. One can obtained threshold values of the function,

which can help identify which animal will eat the next unknown fruit based

on its properties (variables).

BDM&DM Term Paper | Ankoor Das (0811277) 7

Page 8: Discriminant Analysis Term Paper

October 12, 2009

APPLICATIONS OF DISCRIMINANT ANALYSIS

SOCIAL SCIENCES

Prediction of Elections:

In this case the variables can be various social and economic factors,

coupled with party effort parameters. Some of these variables can be as

follows

(1)No. of new projects implemented by incumbent party

(2)No. of candidates in fray

(3)National reach of the party (no .of states active in)

(4)SEC division of the Electorate (in form of ratios)

(5)Profession wise division of the Electorate

(6)Age wise division of the Electorate

 The variables mentioned above are few of the representative parameters

that might have a bearing on the coming elections. Nowadays another

important parameter is the result of exit polls, which are conducted by

various media agencies. They provide the general expectations of the

electorate in view.

Outcome of terrorist attacks with hostages:

With the increasing occurrences of terrorist attacks, it becomes very

important for the law and order enforcing body and governments to ensure

minimal collateral damage during rescue operations. Lot of times it can be

prudent to predict the possibility of such an operation going bad i.e. casualty

while rescue. Research on this front has already been initiated. The basic

hypothesis is based on the fact that various variables may be good

predictors of the safe release or execution of the hostages. Some of these

variables are as follows

(1)Number of terrorists

(2)Strength of their support in the local population

BDM&DM Term Paper | Ankoor Das (0811277) 8

Page 9: Discriminant Analysis Term Paper

October 12, 2009

(3)Number of weapons and amount of ammunition with the terrorists

(4)Type of weapons wielded by the attackers

(5)Ratio of terrorists to hostages

(6)Whether the terrorists are independent operators or they belong to

some large scale terrorist outfit

(7)Time since the hostages were taken

(8)Female/male ratio among the hostages

(9)Children/adults ratio among the hostages

A careful training with past cases can help the government take a decision

on whether to use force or negotiations to neutralize the terrorist threat.

BDM&DM Term Paper | Ankoor Das (0811277) 9

Page 10: Discriminant Analysis Term Paper

October 12, 2009

MEDICINE AND DIAGNOSTICS

The application of multivariate analysis, and especially discriminant analysis,

to the study of trace elements in food and environmental fields has been

largely used in various occasions. In the clinical field, Discriminant analysis

has been tentatively used to improve the predictive value of tomography

images in differential diagnosis between AD and frontotemporal dementia.

Similarly, the need for non-invasive, specific and sensitive test led to study

whether levels of some proteins considered markers of neuronal

degeneration were useful to discriminate between patients and control

groups.

Neuro-degenerative diseases

Neurotoxic substances are thought to play a major role in several

neurodegenerative diseases like Alzheimer’s disease (AD), multiple sclerosis

(MS), Parkinson’s disease (PD). The association among certain neurodege-

nerative disorders and the chronic exposure to low doses of neurotoxins has

been also demonstrated. Some reports have shown Al, Ca, Fe and Mg

implicated in neuro-pathologies. A clear increase of Al concentration in PD

specific brain regions and, at the same time, a decrease in Mg concentration

in the basal ganglia may be involved in the central nervous system

degeneration in PD. Aluminium toxicity has been recognized to reduce the

enzyme activity and to increase the number of neurofibrillary tangles in AD.

Deposition of Fe observed in Lewy bodies may be related to an involvement

of this metal in the ethiology of PD. Abnormalities in the metabolism of the

transition metals Cu and Fe have been demonstrated to play a crucial role in

the pathogenesis of various neurodegenerative diseases; in particular,

alterations in specific Cu- and Fe-containing metalloenzymes have been

observed. For Ca, one of the most important intracellular messengers in the

brain, several studies have reported a destabilization in intracellular

homeostasis associated with β-amyloid peptide neurotoxicity and with

BDM&DM Term Paper | Ankoor Das (0811277) 10

Page 11: Discriminant Analysis Term Paper

October 12, 2009

mitochondrial dysfunction in AD. Oxidative stress has also been established

as a feature of these pathologies. It seems to provide a critical link of

environmental factors and heavy metals with endogenous and genetic risk

factors in the pathogenic mechanism of neurodegeneration.

Test results can provide one with the concentration of these metals in the

brain. With a trained Discriminant analysis function, we can feed in this

numbers, to detect any of the mentioned neuro-degenerative disease at its

initiation. AD and PD are life threatening diseases, but are controllable

through proper medication. Their early detection is necessary to mitigate the

long term effects of both diseases, which when left unattended leads to the

patient becoming a vegetable.

Hepatitis Disease Detection

Research has been going in this domain. The basic diagnostic flowchart

follows. Here LDA is useful in determining the most important features

impacting the advent of the disease. Once the reduction is done, the actual

classification is done through a fuzzy network based classifier. Here the LDA

is like a data conditioning function, instead of being a predictor.

BDM&DM Term Paper | Ankoor Das (0811277) 11

Page 12: Discriminant Analysis Term Paper

October 12, 2009

Table: Features to determine advent of Hepatitis

BDM&DM Term Paper | Ankoor Das (0811277) 12

Page 13: Discriminant Analysis Term Paper

October 12, 2009

The study hence conducted attained 94.16% accuracy in detection on

Hepatitis, which is very high. This would help quick medication and hence

recovery for the patient.

BDM&DM Term Paper | Ankoor Das (0811277) 13

Page 14: Discriminant Analysis Term Paper

October 12, 2009

INSURANCE COMPANIES

Insolvency prediction (Case study on Spanish Banks)

Unlike other financial problems, there are a great number of agents facing

business failure, so research in this topic has been of growing interest in the

last decades. Insolvency, early detection of financial distress, or conditions

leading to insolvency of insurance companies have been a concern of parties

such as insurance regulators, investors, management, financial analysts,

banks, auditors, policy holders and consumers. This concern has arised from

the necessity of protecting the general public against the consequences of

insurer’s insolvencies, as well as minimizing the costs associated to this

problem such as the effects on state insurance guaranty funds or the

responsibilities for management and auditors.

It has long been recognized that there needs to be some form of supervision

of

such entities to attempt to minimize the risk of failure. Nowadays, Solvency II

project is intended to lead to the reform of the existing solvency rules in

European Union. Many insolvency cases appeared after the insurance cycles

of the 1970s and 1980s in the United States and in European Union. Several

surveys have been devoted to identify the main causes of insurers’

insolvency, in particular, the Müller Group Report (1997) analyses the main

identified causes of insurance insolvencies in the European Union. The main

reasons can be summarized as follows: operational risks (operational failure

related to inexperienced or incompetent management, fraud); underwriting

risks (inadequate reinsurance programme and failure to recover from

reinsurers, higher losses due to rapid growth, excessive operating costs,

poor underwriting process); insufficient provisions and imprudent

investments. On the other hand, many insurance companies, specially larger

companies, have developed internal risk models for a number of purposes.

There is an absence of such standardized systems in Spain, where most

insurance companies have internal check mechanism to predict insolvency.

BDM&DM Term Paper | Ankoor Das (0811277) 14

Page 15: Discriminant Analysis Term Paper

October 12, 2009

A recent study by academicians from Madrid performed a LDA to predict

insolvency of Spanish banks using historical data from 72 banks. The data

was collected 1,2,3 years prior to the insolvency. Some of the results of the

study are as given below.

Here Model 1, 2 and 3 are predictors with data 1, 2, and 3 years prior to

insolvency respectively.

Table: List of Financial Ratios, used as variables for the Predictor model

BDM&DM Term Paper | Ankoor Das (0811277) 15

Page 16: Discriminant Analysis Term Paper

October 12, 2009

Table: Final Results of the LDA performed in the three models

From the above results we see that the LDA model, was probably not the

best model to apply here as the accuracy was very low, and only slightly

better than 0.5 probability in the case of the test cases. Maybe some other

high level classification method would work better here.

BDM&DM Term Paper | Ankoor Das (0811277) 16

Page 17: Discriminant Analysis Term Paper

October 12, 2009

MARKETING

In marketing, discriminant analysis is often used to determine the factors

which distinguish different types of customers and/or products on the basis

of surveys or other forms of collected data. The use of discriminant analysis

in marketing is usually described by the following steps:

(1)Formulate the problem and gather data - Identify the salient attributes

consumers use to evaluate products in this category - Use quantitative

marketing research techniques (such as surveys) to collect data from a

sample of potential customers concerning their ratings of all the product

attributes. The data collection stage is usually done by marketing

research professionals. Survey questions ask the respondent to rate a

product from one to five (or 1 to 7, or 1 to 10) on a range of attributes

chosen by the researcher. Anywhere from five to twenty attributes are

chosen. They could include things like: ease of use, weight, accuracy,

durability, colourfulness, price, or size. The attributes chosen will vary

depending on the product being studied. The same question is asked

about all the products in the study. The data for multiple products is

codified and input into a statistical program such as R, SPSS or SAS. (This

step is the same as in Factor analysis).

(2)Estimate the Discriminant Function Coefficients and determine the

statistical significance and validity - Choose the appropriate discriminant

analysis method. The direct method involves estimating the discriminant

function so that all the predictors are assessed simultaneously. The

stepwise method enters the predictors sequentially. The two-group

method should be used when the dependent variable has two categories

or states. The multiple discriminant method is used when the dependent

variable has three or more categorical states. Use Wilks’s Lambda to test

for significance in SPSS or F stat in SAS. The most common method used

to test validity is to split the sample into an estimation or analysis sample,

BDM&DM Term Paper | Ankoor Das (0811277) 17

Page 18: Discriminant Analysis Term Paper

October 12, 2009

and a validation or holdout sample. The estimation sample is used in

constructing the discriminant function. The validation sample is used to

construct a classification matrix which contains the number of correctly

classified and incorrectly classified cases. The percentage of correctly

classified cases is called the hit ratio.

(3)Plot the results on a two dimensional map, define the dimensions, and

interpret the results. The statistical program (or a related module) will

map the results. The map will plot each product (usually in two

dimensional space). The distance of products to each other indicate either

how different they are. The dimensions must be labelled by the

researcher. This requires subjective judgement and is often very

challenging.

BDM&DM Term Paper | Ankoor Das (0811277) 18

Page 19: Discriminant Analysis Term Paper

October 12, 2009

CONCLUSIONS

As seen, Linear Discriminant Analysis is used in various practical situations,

both in Business and otherwise. It’s a relatively simpler way of prediction but

sometimes entails huge costs, mainly because collection of data for each

predictor variable can be a huge task at times. Also, the method is not

dynamic as artificial nueral networks or fuzzy logic, because of which it is

unable to take into consideration macro level changes in the environment

and general change in trends. Moreover, some of the conditions of

performing an LDA like normal distribution, homogeneity of variances/co-

variances also make the method unsuitable for many cases.

The versatility of the LDA method in classification has made it one of the

most used classification methods. But now it is losing shine to other

methods.

BDM&DM Term Paper | Ankoor Das (0811277) 19

Page 20: Discriminant Analysis Term Paper

October 12, 2009

REFERENCES

(1)Klecka, R. William, Discriminant Analysis, Sage Publications, 1980

(2)Martinez, Z. Diaz, Vargas, M.J. Segovia and Menendez, J. Fernandez,

See5 vs Discriminant Analysis. An Application to Predict Insolvency of

Spanish Non-Life Insurance Companies, University of Madrid, 2001

(3)Anna PINO, Sonia BRESCIANINI, Cristina D’IPPOLITO, Corrado FAGNANI,

Alessandro ALIMONTI and Maria Antonietta STAZI, Discriminant

analysis to study trace elements in biomonitoring: an application on

neurodegenerative diseases, 223-228, Ann Ist Super Sanità 2005

(4)Poulsen, John and French, Aaron, DISCRIMINANT FUNCTION ANALYSIS

(DA)

(5)Jieping, Ye, Janardan, Ravi and Li, Qi, Two-Dimensional Linear

Discriminant Analysis, Class Notes, University of Minnesota

(6)Srivastava, Santosh, Gupta, Maya R. and Frigyik, Bela A., Bayesian

Quadratic Discriminant Analysis, Journal of Machine Learning Research,

June 2007

(7)Esin Dogantekina, Akif Dogantekinb and Derya Avc, Automatic hepatitis

diagnosis system based on Linear Discriminant Analysis and Adaptive

Network based on Fuzzy Inference System, 11282–11286, Expert

Systems with Applications 36 (2009)

(8)www.wikipedia.org

(9)www.Wolfram-maths.org

(10) http://people.revoledu.com/kardi/tutorial/LDA/LDA.html#LDA

BDM&DM Term Paper | Ankoor Das (0811277) 20