Upload
ankoordas
View
111
Download
0
Embed Size (px)
Citation preview
qwertyuiopasdfghjklzxcvbnmqwertyui
opasdfghjklzxcvbnmqwertyuiopasdfgh
jklzxcvbnmqwertyuiopasdfghjklzxcvb
nmqwertyuiopasdfghjklzxcvbnmqwer
tyuiopasdfghjklzxcvbnmqwertyuiopas
dfghjklzxcvbnmqwertyuiopasdfghjklzx
cvbnmqwertyuiopasdfghjklzxcvbnmq
wertyuiopasdfghjklzxcvbnmqwertyuio
pasdfghjklzxcvbnmqwertyuiopasdfghj
klzxcvbnmqwertyuiopasdfghjklzxcvbn
mqwertyuiopasdfghjklzxcvbnmqwerty
uiopasdfghjklzxcvbnmqwertyuiopasdf
ghjklzxcvbnmqwertyuiopasdfghjklzxc
Discriminant Analysis and its application
BDM&DM Term Paper
10/12/2009
Ankoor Das, 0811277
October 12, 2009
CONTENTS
Contents.......................................................................................................2
Introduction.....................................................................................................3
Basics of Discriminant Analysis....................................................................3
How does DA work?......................................................................................3
When to use Discriminant Analysis? – the Basic Assumptions.....................5
Types of Discriminant Analysis........................................................................6
Linear Discriminant Analysis.........................................................................6
Applications of Discriminant Analysis..............................................................8
Social Sciences.............................................................................................8
Medicine and Diagnostics...........................................................................10
Insurance Companies.................................................................................13
Marketing....................................................................................................16
Conclusions....................................................................................................18
References.....................................................................................................19
BDM&DM Term Paper | Ankoor Das (0811277) 2
October 12, 2009
INTRODUCTION
BASICS OF DISCRIMINANT ANALYSIS
Discriminant analysis is a statistical method which allows the researcher to
study the difference between two or more groups of objects (or set of
events) with respect to several variables simultaneously. They are mostly
used to predict possible outcomes of situations based on certain variables
which can influence the outcome.
For example, a researcher may want to investigate which variables
discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels.
For that purpose, the researcher could collect data on numerous fruit
characteristics of those species eaten by each of the animal groups. Most
fruits will naturally fall into one of the three categories. Discriminant analysis
could then be used to determine which variables are the best predictors of
whether a fruit will be eaten by birds, primates, or squirrels.
Discriminant function analysis is multivariate analysis of variance (MANOVA)
reversed. In MANOVA, the independent variables are the groups and the
dependent variables are the predictors. In DA, the independent variables are
the predictors and the dependent variables are the groups. Usually, several
variables are included in a study to see which ones contribute to the
discrimination between groups.
HOW DOES DA WORK?
Discriminant function analysis is broken into a 2-step process:
(1)Testing significance of a set of discriminant functions
(2)Classification.
In the first step, there is a matrix of total variances and covariances;
likewise, there is a matrix of pooled within-group variances and covariances.
The two matrices are compared via multivariate F tests in order to determine
BDM&DM Term Paper | Ankoor Das (0811277) 3
Event1
Event2
Event3
X1
X2
Outcomes/groupsDiscriminating Variables
Xn
October 12, 2009
whether or not there are any significant differences (with regard to all
variables) between groups. One first performs the multivariate test, and, if
statistically significant, proceeds to see which of the variables have
significantly different means across the groups.
Once group means are found to be statistically significant, classification of
variables is undertaken. DA automatically determines some optimal
combination of variables so that the first function provides the most overall
discrimination between groups; the second provides second most, and so on.
Moreover, the functions will be independent or orthogonal so the
contributions to the discrimination between groups will not overlap. The first
function picks up the most variation; the second function picks up the
greatest part of the unexplained variation, etc.... Classification is then
possible from the canonical functions hence formed. Subjects are classified
in the groups in which they had the highest classification scores. The
maximum number of discriminant functions will be equal to the smaller of
either the degrees of freedom, or the number of variables in the analysis.
Standardized beta coefficients are obtained for each variable in each
discriminant function, and the larger the standardized coefficient, the greater
is the contribution of the respective variable to the discrimination between
groups.
BDM&DM Term Paper | Ankoor Das (0811277) 4
October 12, 2009
WHEN TO USE DISCRIMINANT ANALYSIS? – THE BASIC ASSUMPTIONS
Sample size: Unequal sample sizes are acceptable. The sample size of the
smallest group needs to exceed the number of predictor variables. As a
“rule of thumb”, the smallest sample size should be at least 20 for a few (4
or 5) predictors. The maximum number of independent variables is n - 2,
where n is the sample size. While this low sample size may work, it is not
encouraged, and generally it is best to have 4 or 5 times as many
observations and independent variables.
No. of outcomes/groups: There should be at least 2 or more possible
outcomes for the Discriminant analysis to be performed. These outcomes or
groups should not overlap (the functions hence would be orthogonal or non-
interfering). Though there is no limit on the no. of outcomes, but it is usually
limited to 5 or less.
Normal distribution: It is assumed that the data (for the variables)
represent a sample from a multivariate normal distribution. You can examine
whether or not variables are normally distributed with histograms of
frequency distributions. However, note that violations of the normality
assumption are not "fatal" and the resultant significance test are still reliable
as long as non-normality is caused by skewness and not outliers .
Homogeneity of variances/covariances: DA is very sensitive to
heterogeneity of variance-covariance matrices. Before accepting final
conclusions for an important study, it is a good idea to review the within-
groups variances and correlation matrices. Homoscedasticity is evaluated
through scatterplots and corrected by transformation of variables.
Outliers: DA is highly sensitive to the inclusion of outliers. Run a test for
univariate and multivariate outliers for each group, and transform or
eliminate them. If one group in the study contains extreme outliers that
impact the mean, they will also increase variability. Overall significance tests
BDM&DM Term Paper | Ankoor Das (0811277) 5
October 12, 2009
are based on pooled variances, that is, the average variance across all
groups. Thus, the significance tests of the relatively larger means (with the
large variances) would be based on the relatively smaller pooled variances,
resulting erroneously in statistical significance.
Non-multicollinearity: If one of the independent variables is very highly
correlated with another, or one is a function (e.g., the sum) of other
independents, then the tolerance value for that variable will approach 0 and
the matrix will not have a unique discriminant solution. There must also be
low multicollinearity of the independent variables.
TYPES OF DISCRIMINANT ANALYSIS
LINEAR DISCRIMINANT ANALYSIS
Linear Discriminant model (LDA) is used in the case when the groups are
separable by linear combinations of the discriminating variables. If only two
features, the separators between objects group will become lines. If the
features are three, the separator is a plane and the number of features (i.e.
independent variables) is more than 3, the separators become a hyper-
plane. The final value of the Discriminant function will determine the group
the particular observation belongs to. Appropriate threshold values and
relative significance of individual Discriminant function will lead to the final
outcome/group.
LDA is closely related to ANOVA (analysis of variance) and regression
analysis, which also attempt to express one dependent variable as a linear
combination of other features or measurements. In the other two methods
however, the dependent variable is a numerical quantity, while for LDA it is
a categorical variable (i.e. the class label).
The threshold values can be obtained by proper training of the function with
known outcomes from past observations. For example, in the case of the
BDM&DM Term Paper | Ankoor Das (0811277) 6
October 12, 2009
fruits, previous data on which fruit was eaten by any one of the three
animals and the obtained Discriminant function values based on the
variables will help one attain a training set of possible values of outcomes for
each fruit or animal. One can obtained threshold values of the function,
which can help identify which animal will eat the next unknown fruit based
on its properties (variables).
BDM&DM Term Paper | Ankoor Das (0811277) 7
October 12, 2009
APPLICATIONS OF DISCRIMINANT ANALYSIS
SOCIAL SCIENCES
Prediction of Elections:
In this case the variables can be various social and economic factors,
coupled with party effort parameters. Some of these variables can be as
follows
(1)No. of new projects implemented by incumbent party
(2)No. of candidates in fray
(3)National reach of the party (no .of states active in)
(4)SEC division of the Electorate (in form of ratios)
(5)Profession wise division of the Electorate
(6)Age wise division of the Electorate
The variables mentioned above are few of the representative parameters
that might have a bearing on the coming elections. Nowadays another
important parameter is the result of exit polls, which are conducted by
various media agencies. They provide the general expectations of the
electorate in view.
Outcome of terrorist attacks with hostages:
With the increasing occurrences of terrorist attacks, it becomes very
important for the law and order enforcing body and governments to ensure
minimal collateral damage during rescue operations. Lot of times it can be
prudent to predict the possibility of such an operation going bad i.e. casualty
while rescue. Research on this front has already been initiated. The basic
hypothesis is based on the fact that various variables may be good
predictors of the safe release or execution of the hostages. Some of these
variables are as follows
(1)Number of terrorists
(2)Strength of their support in the local population
BDM&DM Term Paper | Ankoor Das (0811277) 8
October 12, 2009
(3)Number of weapons and amount of ammunition with the terrorists
(4)Type of weapons wielded by the attackers
(5)Ratio of terrorists to hostages
(6)Whether the terrorists are independent operators or they belong to
some large scale terrorist outfit
(7)Time since the hostages were taken
(8)Female/male ratio among the hostages
(9)Children/adults ratio among the hostages
A careful training with past cases can help the government take a decision
on whether to use force or negotiations to neutralize the terrorist threat.
BDM&DM Term Paper | Ankoor Das (0811277) 9
October 12, 2009
MEDICINE AND DIAGNOSTICS
The application of multivariate analysis, and especially discriminant analysis,
to the study of trace elements in food and environmental fields has been
largely used in various occasions. In the clinical field, Discriminant analysis
has been tentatively used to improve the predictive value of tomography
images in differential diagnosis between AD and frontotemporal dementia.
Similarly, the need for non-invasive, specific and sensitive test led to study
whether levels of some proteins considered markers of neuronal
degeneration were useful to discriminate between patients and control
groups.
Neuro-degenerative diseases
Neurotoxic substances are thought to play a major role in several
neurodegenerative diseases like Alzheimer’s disease (AD), multiple sclerosis
(MS), Parkinson’s disease (PD). The association among certain neurodege-
nerative disorders and the chronic exposure to low doses of neurotoxins has
been also demonstrated. Some reports have shown Al, Ca, Fe and Mg
implicated in neuro-pathologies. A clear increase of Al concentration in PD
specific brain regions and, at the same time, a decrease in Mg concentration
in the basal ganglia may be involved in the central nervous system
degeneration in PD. Aluminium toxicity has been recognized to reduce the
enzyme activity and to increase the number of neurofibrillary tangles in AD.
Deposition of Fe observed in Lewy bodies may be related to an involvement
of this metal in the ethiology of PD. Abnormalities in the metabolism of the
transition metals Cu and Fe have been demonstrated to play a crucial role in
the pathogenesis of various neurodegenerative diseases; in particular,
alterations in specific Cu- and Fe-containing metalloenzymes have been
observed. For Ca, one of the most important intracellular messengers in the
brain, several studies have reported a destabilization in intracellular
homeostasis associated with β-amyloid peptide neurotoxicity and with
BDM&DM Term Paper | Ankoor Das (0811277) 10
October 12, 2009
mitochondrial dysfunction in AD. Oxidative stress has also been established
as a feature of these pathologies. It seems to provide a critical link of
environmental factors and heavy metals with endogenous and genetic risk
factors in the pathogenic mechanism of neurodegeneration.
Test results can provide one with the concentration of these metals in the
brain. With a trained Discriminant analysis function, we can feed in this
numbers, to detect any of the mentioned neuro-degenerative disease at its
initiation. AD and PD are life threatening diseases, but are controllable
through proper medication. Their early detection is necessary to mitigate the
long term effects of both diseases, which when left unattended leads to the
patient becoming a vegetable.
Hepatitis Disease Detection
Research has been going in this domain. The basic diagnostic flowchart
follows. Here LDA is useful in determining the most important features
impacting the advent of the disease. Once the reduction is done, the actual
classification is done through a fuzzy network based classifier. Here the LDA
is like a data conditioning function, instead of being a predictor.
BDM&DM Term Paper | Ankoor Das (0811277) 11
October 12, 2009
Table: Features to determine advent of Hepatitis
BDM&DM Term Paper | Ankoor Das (0811277) 12
October 12, 2009
The study hence conducted attained 94.16% accuracy in detection on
Hepatitis, which is very high. This would help quick medication and hence
recovery for the patient.
BDM&DM Term Paper | Ankoor Das (0811277) 13
October 12, 2009
INSURANCE COMPANIES
Insolvency prediction (Case study on Spanish Banks)
Unlike other financial problems, there are a great number of agents facing
business failure, so research in this topic has been of growing interest in the
last decades. Insolvency, early detection of financial distress, or conditions
leading to insolvency of insurance companies have been a concern of parties
such as insurance regulators, investors, management, financial analysts,
banks, auditors, policy holders and consumers. This concern has arised from
the necessity of protecting the general public against the consequences of
insurer’s insolvencies, as well as minimizing the costs associated to this
problem such as the effects on state insurance guaranty funds or the
responsibilities for management and auditors.
It has long been recognized that there needs to be some form of supervision
of
such entities to attempt to minimize the risk of failure. Nowadays, Solvency II
project is intended to lead to the reform of the existing solvency rules in
European Union. Many insolvency cases appeared after the insurance cycles
of the 1970s and 1980s in the United States and in European Union. Several
surveys have been devoted to identify the main causes of insurers’
insolvency, in particular, the Müller Group Report (1997) analyses the main
identified causes of insurance insolvencies in the European Union. The main
reasons can be summarized as follows: operational risks (operational failure
related to inexperienced or incompetent management, fraud); underwriting
risks (inadequate reinsurance programme and failure to recover from
reinsurers, higher losses due to rapid growth, excessive operating costs,
poor underwriting process); insufficient provisions and imprudent
investments. On the other hand, many insurance companies, specially larger
companies, have developed internal risk models for a number of purposes.
There is an absence of such standardized systems in Spain, where most
insurance companies have internal check mechanism to predict insolvency.
BDM&DM Term Paper | Ankoor Das (0811277) 14
October 12, 2009
A recent study by academicians from Madrid performed a LDA to predict
insolvency of Spanish banks using historical data from 72 banks. The data
was collected 1,2,3 years prior to the insolvency. Some of the results of the
study are as given below.
Here Model 1, 2 and 3 are predictors with data 1, 2, and 3 years prior to
insolvency respectively.
Table: List of Financial Ratios, used as variables for the Predictor model
BDM&DM Term Paper | Ankoor Das (0811277) 15
October 12, 2009
Table: Final Results of the LDA performed in the three models
From the above results we see that the LDA model, was probably not the
best model to apply here as the accuracy was very low, and only slightly
better than 0.5 probability in the case of the test cases. Maybe some other
high level classification method would work better here.
BDM&DM Term Paper | Ankoor Das (0811277) 16
October 12, 2009
MARKETING
In marketing, discriminant analysis is often used to determine the factors
which distinguish different types of customers and/or products on the basis
of surveys or other forms of collected data. The use of discriminant analysis
in marketing is usually described by the following steps:
(1)Formulate the problem and gather data - Identify the salient attributes
consumers use to evaluate products in this category - Use quantitative
marketing research techniques (such as surveys) to collect data from a
sample of potential customers concerning their ratings of all the product
attributes. The data collection stage is usually done by marketing
research professionals. Survey questions ask the respondent to rate a
product from one to five (or 1 to 7, or 1 to 10) on a range of attributes
chosen by the researcher. Anywhere from five to twenty attributes are
chosen. They could include things like: ease of use, weight, accuracy,
durability, colourfulness, price, or size. The attributes chosen will vary
depending on the product being studied. The same question is asked
about all the products in the study. The data for multiple products is
codified and input into a statistical program such as R, SPSS or SAS. (This
step is the same as in Factor analysis).
(2)Estimate the Discriminant Function Coefficients and determine the
statistical significance and validity - Choose the appropriate discriminant
analysis method. The direct method involves estimating the discriminant
function so that all the predictors are assessed simultaneously. The
stepwise method enters the predictors sequentially. The two-group
method should be used when the dependent variable has two categories
or states. The multiple discriminant method is used when the dependent
variable has three or more categorical states. Use Wilks’s Lambda to test
for significance in SPSS or F stat in SAS. The most common method used
to test validity is to split the sample into an estimation or analysis sample,
BDM&DM Term Paper | Ankoor Das (0811277) 17
October 12, 2009
and a validation or holdout sample. The estimation sample is used in
constructing the discriminant function. The validation sample is used to
construct a classification matrix which contains the number of correctly
classified and incorrectly classified cases. The percentage of correctly
classified cases is called the hit ratio.
(3)Plot the results on a two dimensional map, define the dimensions, and
interpret the results. The statistical program (or a related module) will
map the results. The map will plot each product (usually in two
dimensional space). The distance of products to each other indicate either
how different they are. The dimensions must be labelled by the
researcher. This requires subjective judgement and is often very
challenging.
BDM&DM Term Paper | Ankoor Das (0811277) 18
October 12, 2009
CONCLUSIONS
As seen, Linear Discriminant Analysis is used in various practical situations,
both in Business and otherwise. It’s a relatively simpler way of prediction but
sometimes entails huge costs, mainly because collection of data for each
predictor variable can be a huge task at times. Also, the method is not
dynamic as artificial nueral networks or fuzzy logic, because of which it is
unable to take into consideration macro level changes in the environment
and general change in trends. Moreover, some of the conditions of
performing an LDA like normal distribution, homogeneity of variances/co-
variances also make the method unsuitable for many cases.
The versatility of the LDA method in classification has made it one of the
most used classification methods. But now it is losing shine to other
methods.
BDM&DM Term Paper | Ankoor Das (0811277) 19
October 12, 2009
REFERENCES
(1)Klecka, R. William, Discriminant Analysis, Sage Publications, 1980
(2)Martinez, Z. Diaz, Vargas, M.J. Segovia and Menendez, J. Fernandez,
See5 vs Discriminant Analysis. An Application to Predict Insolvency of
Spanish Non-Life Insurance Companies, University of Madrid, 2001
(3)Anna PINO, Sonia BRESCIANINI, Cristina D’IPPOLITO, Corrado FAGNANI,
Alessandro ALIMONTI and Maria Antonietta STAZI, Discriminant
analysis to study trace elements in biomonitoring: an application on
neurodegenerative diseases, 223-228, Ann Ist Super Sanità 2005
(4)Poulsen, John and French, Aaron, DISCRIMINANT FUNCTION ANALYSIS
(DA)
(5)Jieping, Ye, Janardan, Ravi and Li, Qi, Two-Dimensional Linear
Discriminant Analysis, Class Notes, University of Minnesota
(6)Srivastava, Santosh, Gupta, Maya R. and Frigyik, Bela A., Bayesian
Quadratic Discriminant Analysis, Journal of Machine Learning Research,
June 2007
(7)Esin Dogantekina, Akif Dogantekinb and Derya Avc, Automatic hepatitis
diagnosis system based on Linear Discriminant Analysis and Adaptive
Network based on Fuzzy Inference System, 11282–11286, Expert
Systems with Applications 36 (2009)
(8)www.wikipedia.org
(9)www.Wolfram-maths.org
(10) http://people.revoledu.com/kardi/tutorial/LDA/LDA.html#LDA
BDM&DM Term Paper | Ankoor Das (0811277) 20