
DATA REDUCTION (Lecture# 03)


Page 1: DATA REDUCTION  (Lecture# 03)

DATA REDUCTION (Lecture# 03)

Dr. Tahseen Ahmed Jilani, Assistant Professor

Member IEEE-CIS, IFSA, IRSS

Department of Computer Science, University of Karachi

Page 2: DATA REDUCTION  (Lecture# 03)


• Identify the differences in dimensionality reduction based on features, cases, and reduction-of-value techniques.

• Explain the advantages of performing dimensionality reduction as a preprocessing step of a data-mining process, prior to applying the data-mining techniques.

• Understand the basic principles of feature-selection and feature-composition tasks using the corresponding statistical methods.

• Apply principal component analysis and entropy-based techniques, and compare them.

Chapter Objectives

Page 3: DATA REDUCTION  (Lecture# 03)


• Data preprocessing steps are sufficient for moderate data sets.

• For really large data sets, there is an increased likelihood that an intermediate, additional step (data reduction) should be performed prior to applying the data-mining techniques.

• Although large data sets have the potential for better mining results, there is no guarantee that they will yield better knowledge than small data sets.

• For large databases (data sets), it is possible that huge amounts of data carry relatively little information (knowledge).

Data Reduction

Page 4: DATA REDUCTION  (Lecture# 03)


• The three basic operations in a data-reduction process are delete a column, delete a row, and reduce the number of values in a column.

• These operations attempt to preserve the character of the original data by deleting data that are nonessential.

• There are other operations that reduce dimensions, but the new data are unrecognizable when compared to the original data set; these operations are mentioned here only briefly because they are highly application-dependent.

Types of Data Reduction

Page 5: DATA REDUCTION  (Lecture# 03)


• Computing time: simpler data should lead to a reduction in the time needed for data mining.

• Predictive/descriptive accuracy

It measures how well the data are summarized and generalized into the model. We generally expect that, by using only relevant features, a data-mining algorithm can learn not only faster but also with higher accuracy. Irrelevant data may mislead the learning process and the final model, while redundant data may complicate the task of learning and cause unexpected data-mining results.

• Representation

If the simplicity of representation improves, a relatively small decrease in accuracy may be tolerable. A balanced view between accuracy and simplicity is necessary, and dimensionality reduction is one of the mechanisms for obtaining this balance.

Why Data Reduction

Page 6: DATA REDUCTION  (Lecture# 03)


Dimension Reduction

• The main question is whether some of these prepared and preprocessed data can be discarded without sacrificing the quality of results (the principle of parsimony).

• Can the prepared data be reviewed and a subset found in a reasonable amount of time and space?

• If the complexity of algorithms for data reduction increases exponentially, then there is little to gain in reducing dimensions in big data.

Page 7: DATA REDUCTION  (Lecture# 03)


Dimensions of Large Data Sets

• The choice of data representation, and the selection, reduction, or transformation of features, is probably the most important issue that determines the quality of a data-mining solution.

• A large number of features can make the available samples of data relatively insufficient for mining. In practice, the number of features can be as many as several hundred.

• If we have only a few hundred samples for analysis, dimensionality reduction is required in order for any reliable model to be mined or to be of any practical use.

• On the other hand, data overload because of high dimensionality can make some data-mining algorithms non-applicable, and the only solution is again a reduction of data dimensions.

Page 8: DATA REDUCTION  (Lecture# 03)


Main Objectives in Data Reduction

• The three basic operations in a data-reduction process are
– Delete a column (principal component analysis)
– Delete a row (profile analysis, self-organization analysis, classification, and clustering)
– Reduce the number of values in a column (smooth a feature)

• These operations attempt to preserve the character of the original data by deleting data that are nonessential.

Page 9: DATA REDUCTION  (Lecture# 03)


ENTROPY MEASURE FOR RANKING FEATURES

• A method for unsupervised feature selection or ranking based on an entropy measure is a relatively simple technique, but with a large number of features its complexity increases significantly.

• The basic assumption is that all samples are given as vectors of feature values, without any classification of output samples.

• The approach is based on the observation that removing an irrelevant feature, a redundant feature, or both from a set may not change the basic characteristics of the data set.

• The idea is to remove as many features as possible while maintaining the level of distinction between the samples in the data set as if no features had been removed.

Page 10: DATA REDUCTION  (Lecture# 03)


ENTROPY MEASURE FOR RANKING FEATURES

Algorithm

• The algorithm is based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples.

• The distance measure D is small for close samples (close to zero) and large for distinct pairs (close to one). When the features are numeric, the similarity measure S of two samples can be defined as

$$S_{ij} = e^{-\alpha D_{ij}}$$

• where D_ij is the distance between samples x_i and x_j, and α is a parameter mathematically expressed as

$$\alpha = \frac{-\ln 0.5}{\overline{D}}$$

Page 11: DATA REDUCTION  (Lecture# 03)


ENTROPY MEASURE FOR RANKING FEATURES (Continued)

• D̄ is the average distance among samples in the data set; hence, α is determined by the data. In a successfully implemented practical application, however, a constant value of α = 0.5 was used. A normalized Euclidean distance measure is used to calculate the distance D_ij between two samples x_i and x_j:

$$D_{ij} = \left[ \sum_{k=1}^{n} \left( \frac{x_{ik} - x_{jk}}{\max_k - \min_k} \right)^{2} \right]^{1/2}$$

• where n is the number of dimensions, and max(k) and min(k) are the maximum and minimum values used for normalization of the k-th dimension.

• Not all features are numeric. The similarity for nominal variables is measured directly using the Hamming distance:
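A minimal NumPy sketch of these two formulas, assuming numeric samples arranged as rows of a matrix; the function name and the toy data are illustrative, not part of the lecture:

```python
import numpy as np

def numeric_similarity_matrix(X):
    """Pairwise similarities S_ij = exp(-alpha * D_ij) for numeric samples X (rows)."""
    X = np.asarray(X, dtype=float)
    # Normalize each feature to [0, 1] so every dimension contributes equally.
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0                      # avoid division by zero for constant features
    Xn = (X - X.min(axis=0)) / rng
    n = len(Xn)
    # Normalized Euclidean distances D_ij.
    D = np.sqrt(((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(axis=2))
    # alpha = -ln(0.5) / mean distance, so an "average" pair gets similarity 0.5.
    d_bar = D[np.triu_indices(n, k=1)].mean()
    alpha = -np.log(0.5) / d_bar
    return np.exp(-alpha * D)

# Example usage with a tiny made-up numeric data set.
S = numeric_similarity_matrix([[1.0, 10.0], [2.0, 12.0], [9.0, 50.0]])
print(np.round(S, 3))
```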

Page 12: DATA REDUCTION  (Lecture# 03)


ENTROPY MEASURE FOR RANKING FEATURES (Continued)

$$S_{ij} = \frac{\sum_{k=1}^{n} \left| x_{ik} = x_{jk} \right|}{n}$$

where

$$\left| x_{ik} = x_{jk} \right| = \begin{cases} 1, & x_{ik} = x_{jk} \\ 0, & \text{otherwise} \end{cases}$$

• The total number of variables is equal to n. For mixed data, we can discretize numeric values (binning) and transform numeric features into nominal features before applying this similarity measure.

• Figure 3.1 is an example of a simple data set with three categorical features; the corresponding similarities are given in Table 3.1.

Page 13: DATA REDUCTION  (Lecture# 03)


Similarity Measures for Nominal Data

Figure 3.1: A simple data set with three categorical features

Sample   F1   F2   F3
R1       A    X    1
R2       B    Y    2
R3       C    Y    2
R4       B    X    1
R5       C    Z    3

Table 3.1: A tabular representation of similarity measures S

       R1     R2     R3     R4     R5
R1     1      0/3    0/3    2/3    0/3
R2     -      1      2/3    1/3    0/3
R3     -      -      1      0/3    1/3
R4     -      -      -      1      0/3
R5     -      -      -      -      1

ENTROPY MEASURE FOR RANKING FEATURES (Continued)
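The similarities in Table 3.1 can be reproduced with a short Python sketch of the Hamming-style similarity above; the dictionary layout and the function name are illustrative:

```python
from fractions import Fraction

# Figure 3.1 data: five samples, three categorical features.
samples = {
    "R1": ("A", "X", 1),
    "R2": ("B", "Y", 2),
    "R3": ("C", "Y", 2),
    "R4": ("B", "X", 1),
    "R5": ("C", "Z", 3),
}

def nominal_similarity(x, y):
    # Fraction of features on which the two samples agree.
    return Fraction(sum(a == b for a, b in zip(x, y)), len(x))

names = list(samples)
for i, ri in enumerate(names):
    row = [str(nominal_similarity(samples[ri], samples[rj])) for rj in names[i:]]
    print(ri, row)
# e.g. similarity(R1, R4) = 2/3, matching Table 3.1.
```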

Page 14: DATA REDUCTION  (Lecture# 03)


ENTROPY MEASURE FOR RANKING FEATURES (Continued)

• The distribution of all similarities (distances) for a given data set is a characteristic of the organization and order of data in an n-dimensional space.

• This organization may be more or less ordered. Changes in the level of order in a data set are the main criteria for inclusion or exclusion of a feature from the feature set; these changes may be measured by entropy.

• From information theory, we know that entropy is a global measure: it is lower for ordered configurations and higher for disordered configurations.

• The proposed technique compares the entropy measure for a given data set before and after removal of a feature.

Page 15: DATA REDUCTION  (Lecture# 03)


Entropy Function

• If the two measures are close, then the reduced set of features will satisfactorily approximate the original set. For a data set of N samples, the entropy measure is

$$E = -\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( S_{ij} \log S_{ij} + (1 - S_{ij}) \log (1 - S_{ij}) \right)$$

• where S_ij is the similarity between samples x_i and x_j. This measure is computed in each of the iterations as a basis for deciding the ranking of features. We rank features by gradually removing the feature that is least important for maintaining the order in the configurations of data. The steps of the algorithm are based on sequential backward ranking, and they have been successfully tested on several real-world applications.
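A small NumPy sketch of this entropy measure, assuming a pairwise similarity matrix S such as the one produced by the earlier similarity sketch; the clipping constant is my own guard so that log(0) never occurs when S_ij is exactly 0 or 1:

```python
import numpy as np

def entropy_measure(S, eps=1e-12):
    """E = -sum over pairs i<j of [S_ij*log(S_ij) + (1 - S_ij)*log(1 - S_ij)]."""
    iu = np.triu_indices(len(S), k=1)        # each unordered pair counted once
    s = np.clip(S[iu], eps, 1 - eps)         # guard against log(0)
    return float(-np.sum(s * np.log(s) + (1 - s) * np.log(1 - s)))
```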

Page 16: DATA REDUCTION  (Lecture# 03)


Entropy Function Algorithm

1. Start with the initial full set of features F.

2. For each feature f in F, remove f from F and obtain a subset F_f. Find the difference between the entropy for F and the entropy for each F_f. Let f_k be the feature for which this difference is minimum.

3. Update the set of features: F = F − {f_k}, where − is a difference operation on sets. In our example, if the difference (E_F − E_{F−F1}) is minimum, then the reduced set of features is {F2, F3}; F1 becomes the bottom of the ranked list.

4. Repeat steps 2 and 3 until there is only one feature in F.
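A hedged Python sketch of this sequential backward ranking, assuming the similarity and entropy helpers sketched earlier; all names here are illustrative, not from the text:

```python
import numpy as np

def rank_features(X, similarity_fn, entropy_fn):
    """Sequential backward feature ranking by entropy.

    similarity_fn(X_subset) -> similarity matrix; entropy_fn(S) -> scalar
    (for example the numeric_similarity_matrix and entropy_measure sketches above).
    Returns feature indices ordered from least to most important.
    """
    X = np.asarray(X, dtype=float)
    remaining = list(range(X.shape[1]))
    ranking = []                                   # filled bottom-up
    while len(remaining) > 1:
        e_full = entropy_fn(similarity_fn(X[:, remaining]))
        # Remove the feature whose removal changes the entropy the least.
        diffs = []
        for f in remaining:
            subset = [g for g in remaining if g != f]
            diffs.append(abs(e_full - entropy_fn(similarity_fn(X[:, subset]))))
        least_important = remaining[int(np.argmin(diffs))]
        ranking.append(least_important)
        remaining.remove(least_important)
    ranking.extend(remaining)
    return ranking  # ranking[0] is the least important feature
```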

Page 17: DATA REDUCTION  (Lecture# 03)


Entropy Function Algorithm

• A ranking process may be stopped in any iteration, and it may be transformed into a process of selecting features by using the additional criterion below.

• This criterion is that the difference between the entropy for F and the entropy for F_f should be less than an approved threshold value in order to remove feature f_k from the set F.

• Computational complexity is the basic disadvantage of this algorithm, and its parallel implementation could overcome the problem of working sequentially with large data sets and a large number of features.

Page 18: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis

• Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are

• Data reduction

• Interpretation

• If we have p components to describe the complete variability of the system, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components.

Page 19: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.

• Analyses of principal components also provide intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression model, a cluster analysis, or a factor analysis.

Page 20: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• Algebraically, principal components are particular linear combinations of the p random variables X_1, X_2, ..., X_p.

• Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system with X_1, X_2, ..., X_p as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.

• The principal components depend solely on the covariance matrix Σ (or the correlation matrix) of X_1, X_2, ..., X_p.

Page 21: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• An important characteristic is that principal components do not require the assumption of a multivariate normal distribution. If the data do follow a multivariate normal distribution, however, additional interpretations in terms of constant-density contours become available, and inferences can be made using the sample principal components.

• Let the random vector X' = [X_1, X_2, ..., X_p] have the covariance matrix Σ with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.

• Consider the linear combinations

$$Y_1 = a_1'X = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p$$
$$Y_2 = a_2'X = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p$$
$$\vdots$$
$$Y_p = a_p'X = a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p$$

Page 22: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• Then, we can obtain

$$\mathrm{Var}(Y_i) = a_i' \Sigma a_i, \qquad i = 1, 2, \ldots, p$$
$$\mathrm{Cov}(Y_i, Y_k) = a_i' \Sigma a_k, \qquad i, k = 1, 2, \ldots, p$$

• The principal components are those uncorrelated linear combinations Y_1, Y_2, ..., Y_p whose variances are as large as possible.

• The first principal component is the linear combination with maximum variance; that is, it maximizes Var(Y_1) = a_1'Σa_1.

• It is clear that Var(Y_1) = a_1'Σa_1 can be increased by multiplying a_1 by some constant. To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define:

Page 23: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• First principal component = the linear combination a_1'X that maximizes Var(a_1'X) subject to a_1'a_1 = 1.

• Second principal component = the linear combination a_2'X that maximizes Var(a_2'X) subject to a_2'a_2 = 1 and Cov(a_1'X, a_2'X) = 0.

• At the i-th step, the i-th principal component is the linear combination a_i'X that maximizes Var(a_i'X) subject to a_i'a_i = 1 and Cov(a_i'X, a_k'X) = 0 for k < i.

Important Results

• If (λ_1, e_1), (λ_2, e_2), ..., (λ_p, e_p) are the eigenvalue-eigenvector pairs of Σ, then the i-th principal component is

$$Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p$$

with

$$\mathrm{Var}(Y_i) = e_i' \Sigma e_i = \lambda_i, \qquad i = 1, 2, \ldots, p$$

Page 24: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• Proportion of total population variance due to the k-th principal component:

$$\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \qquad k = 1, 2, \ldots, p$$

• The components are uncorrelated,

$$\mathrm{Cov}(Y_i, Y_k) = e_i' \Sigma e_k = 0, \qquad i \neq k$$

and the total variance is preserved:

$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p} \mathrm{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p} \mathrm{Var}(Y_i)$$

• If Y_1 = e_1'X, Y_2 = e_2'X, ..., Y_p = e_p'X are the principal components obtained from the covariance matrix Σ, then

$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$

are the correlation coefficients between Y_i and X_k. Here (λ_1, e_1), (λ_2, e_2), ..., (λ_p, e_p) are the eigenvalue-eigenvector pairs of Σ.

Page 25: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

Example

• Suppose the random variables X_1, X_2 and X_3 have the covariance matrix

$$\Sigma = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$$

• It may be verified that the eigenvalue-eigenvector pairs are

$$\lambda_1 = 5.83, \quad e_1' = [\,0.383,\; -0.924,\; 0\,]$$
$$\lambda_2 = 2.00, \quad e_2' = [\,0,\; 0,\; 1\,]$$
$$\lambda_3 = 0.17, \quad e_3' = [\,0.924,\; 0.383,\; 0\,]$$

• Therefore, the principal components become

$$Y_1 = e_1'X = 0.383\,X_1 - 0.924\,X_2$$
$$Y_2 = e_2'X = X_3$$
$$Y_3 = e_3'X = 0.924\,X_1 + 0.383\,X_2$$
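The eigen-decomposition above can be checked with a few lines of NumPy; this is only a verification sketch, and the sign of each eigenvector may come out flipped, which does not change the principal components:

```python
import numpy as np

# Covariance matrix from the example above.
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # sort descending: lambda_1 >= lambda_2 >= lambda_3
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 2))                  # [5.83 2.   0.17]
print(np.round(eigvecs.T, 3))                # rows ~ e_1', e_2', e_3' (up to sign)
print(np.round(eigvals / eigvals.sum(), 2))  # proportions of total variance: [0.73 0.25 0.02]
```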

Page 26: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• The variable X_3 is one of the principal components, because it is uncorrelated with the other two variables. This implies

$$\mathrm{Corr}(X_1, X_3) = \mathrm{Corr}(X_2, X_3) = 0$$

• Furthermore,

$$\mathrm{Var}(Y_1) = \mathrm{Var}(0.383\,X_1 - 0.924\,X_2) = (0.383)^2\,\mathrm{Var}(X_1) + (0.924)^2\,\mathrm{Var}(X_2) - 2(0.383)(0.924)\,\mathrm{Cov}(X_1, X_2)$$
$$= 0.147(1) + 0.854(5) - 0.708(-2) = 5.83 = \lambda_1$$

$$\mathrm{Cov}(Y_1, Y_2) = \mathrm{Cov}(0.383\,X_1 - 0.924\,X_2,\; X_3) = 0.383\,\mathrm{Cov}(X_1, X_3) - 0.924\,\mathrm{Cov}(X_2, X_3) = 0$$

• Also,

$$\sigma_{11} + \sigma_{22} + \sigma_{33} = 1 + 5 + 2 = 8 = \lambda_1 + \lambda_2 + \lambda_3 = 5.83 + 2.00 + 0.17$$

so the proportions of total variance accounted for by the principal components are

$$\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} = \frac{5.83}{8} = 0.73, \qquad \frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} = \frac{2.00}{8} = 0.25, \qquad \frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3} = \frac{0.17}{8} = 0.02$$

Page 27: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued)

• Therefore, the first two principal components account for 98% of the total variance. In this case the components Y_1, Y_2 could replace the original three variables with little loss of information. The correlations of the original variables with the principal components are

$$\rho_{Y_1, X_1} = \frac{e_{11}\sqrt{\lambda_1}}{\sqrt{\sigma_{11}}} = \frac{0.383\sqrt{5.83}}{\sqrt{1}} = 0.925$$

$$\rho_{Y_1, X_2} = \frac{e_{12}\sqrt{\lambda_1}}{\sqrt{\sigma_{22}}} = \frac{-0.924\sqrt{5.83}}{\sqrt{5}} = -0.998$$

$$\rho_{Y_2, X_1} = \rho_{Y_2, X_2} = 0, \qquad \rho_{Y_2, X_3} = \frac{e_{23}\sqrt{\lambda_2}}{\sqrt{\sigma_{33}}} = \frac{\sqrt{2}}{\sqrt{2}} = 1$$

Since Y_3 is neglected, there is no need to calculate its correlations.
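A short NumPy sketch that reproduces these correlations from the formula ρ(Y_i, X_k) = e_ik √λ_i / √σ_kk; again this is only a check of the worked example, and eigenvector signs may differ from the slide:

```python
import numpy as np

# Continuing the example above.
Sigma = np.array([[1.0, -2.0, 0.0], [-2.0, 5.0, 0.0], [0.0, 0.0, 2.0]])
lam, E = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]             # columns of E are e_1, e_2, e_3

# corr[k, i] = e_ik * sqrt(lambda_i) / sqrt(sigma_kk)
corr = E * np.sqrt(lam) / np.sqrt(np.diag(Sigma))[:, None]
print(np.round(corr.T, 3))                   # row i holds rho(Y_i, X_1..X_3); |rho(Y1, X1)| ~ 0.925
```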

Page 28: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued): The Number of Principal Components

• There is always a question of how many components to retain. There is no definitive answer to this question. Things to consider include:

• The amount of total sample variance explained

• The relative sizes of the eigenvalues (that is, the variances of the sample components)

• The subject-matter interpretations of the components

Page 29: DATA REDUCTION  (Lecture# 03)


Principal Component Analysis (Continued): Scree Plot

• A useful visual aid for determining an appropriate number of principal components is a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is a plot of λ̂_i versus i, that is, the magnitude of an eigenvalue versus its number; the number of components to retain is suggested by the elbow (bend) in the scree plot.

[Figure: Scree plot of eigenvalue (y-axis, 0 to 4) versus component number (x-axis, 1 to 9).]
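A minimal matplotlib sketch of such a scree plot, using for illustration the eigenvalues from the SPSS output shown on a later slide:

```python
import matplotlib.pyplot as plt

# Eigenvalues ordered from largest to smallest (from the SPSS example later in this lecture).
eigenvalues = [3.191, 1.714, 1.065, 0.937, 0.674, 0.669, 0.375, 0.295, 0.079]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Component Number")
plt.ylabel("Eigenvalue")
plt.title("Scree Plot")
plt.show()                                   # look for the 'elbow' where the curve flattens
```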

Page 30: DATA REDUCTION  (Lecture# 03)


SPSS FACTOR ANALYSIS OF CUSTOMER.SAV

Descriptive Statistics (a)

Variable                        Mean    Std. Deviation   Analysis N
Size of hometown                3.71    1.307            559
Gender                           .54     .499            559
Age category                    4.25    1.227            559
Level of education              2.59    1.175            559
Years with current employer     9.78    9.359            559
Years with current employer     2.97    1.453            559
Income category in thousands    2.42    1.214            559
Job satisfaction                2.96    1.391            559
Spouse level of education       2.40    1.152            559

a. Only cases for which Geographic indicator = Zone 5 are used in the analysis phase.

Page 31: DATA REDUCTION  (Lecture# 03)


SPSS FACTOR ANALYSIS OF CUSTOMER.SAV

Total Variance Explained (a)

                 Initial Eigenvalues                     Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %     Total   % of Variance   Cumulative %
1           3.191   35.452           35.452          3.191   35.452           35.452
2           1.714   19.045           54.497          1.714   19.045           54.497
3           1.065   11.831           66.328          1.065   11.831           66.328
4            .937   10.414           76.743
5            .674    7.493           84.236
6            .669    7.438           91.674
7            .375    4.164           95.838
8            .295    3.282           99.120
9            .079     .880          100.000

Extraction Method: Principal Component Analysis.
a. Only cases for which Geographic indicator = Zone 5 are used in the analysis phase.

Page 32: DATA REDUCTION  (Lecture# 03)


Component Matrix (a, b)

                                      Component
Variable                          1       2       3
Size of hometown               -.138   -.074    .668
Gender                         -.197   -.083    .725
Age category                    .806    .121   -.089
Level of education             -.327    .832   -.028
Years with current employer     .921    .015    .050
Years with current employer     .944    .046    .053
Income category in thousands    .418    .547    .164
Job satisfaction                .622    .141    .225
Spouse level of education      -.274    .820    .023

Extraction Method: Principal Component Analysis.
a. 3 components extracted.
b. Only cases for which Geographic indicator = Zone 5 are used in the analysis phase.

SPSS FACTOR ANALYSIS OF CUSTOMER.SAV

Page 33: DATA REDUCTION  (Lecture# 03)


Factor Analysis

• Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables.

• Factor analysis is often used in data reduction to identify a small number of factors that explain most of the variance observed in a much larger number of manifest variables.

• Factor analysis can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis (for example, to identify collinearity prior to performing a linear regression analysis).

Page 34: DATA REDUCTION  (Lecture# 03)


Factor Analysis (Continued)

• The factor analysis procedure offers a high degree of flexibility:
– Seven methods of factor extraction are available.
– Five methods of rotation are available, including direct oblimin and promax for non-orthogonal rotations.
– Three methods of computing factor scores are available, and scores can be saved as variables for further analysis.

• The essential purpose of factor analysis is to describe, if possible, the covariance (correlation) relationships among many variables in terms of a few underlying, but unobserved, random quantities called factors.

Page 35: DATA REDUCTION  (Lecture# 03)


• Factor analysis can be considered an extension of principal component analysis. Both can be viewed as attempts to approximate the covariance matrix. However, the approximation based on the factor analysis model is more elaborate.

• The primary question in factor analysis is whether the data are consistent with a prescribed structure.

Factor Analysis (Continued)

Page 36: DATA REDUCTION  (Lecture# 03)


The Orthogonal Factor Model

• The observed random vector X, with p components, has mean μ and covariance matrix Σ.

• The factor model postulates that X is linearly dependent upon a few unobserved random variables F_1, F_2, ..., F_m, called common factors, and p additional sources of variation ε_1, ε_2, ..., ε_p, called errors or sometimes specific factors (these include measurement errors).

• In particular, the factor analysis model is as follows.

Page 37: DATA REDUCTION  (Lecture# 03)


The Orthogonal Factor Model (Continued)

• In particular, the factor analysis model is

$$X_1 - \mu_1 = l_{11} F_1 + l_{12} F_2 + \cdots + l_{1m} F_m + \varepsilon_1$$
$$X_2 - \mu_2 = l_{21} F_1 + l_{22} F_2 + \cdots + l_{2m} F_m + \varepsilon_2$$
$$\vdots$$
$$X_p - \mu_p = l_{p1} F_1 + l_{p2} F_2 + \cdots + l_{pm} F_m + \varepsilon_p$$

• or, in matrix notation,

$$X - \mu = L F + \varepsilon$$

where X, μ and ε are p × 1, L is p × m, and F is m × 1.

• The coefficient l_ij is called the loading of the i-th variable on the j-th factor, so the matrix L is the matrix of factor loadings.

Page 38: DATA REDUCTION  (Lecture# 03)


The Orthogonal Factor Model (Continued)

• Note that the i-th specific factor ε_i is associated only with the i-th response X_i.

• Here the p deviations (of the given data) X_1 − μ_1, X_2 − μ_2, ..., X_p − μ_p are expressed in terms of the p + m random variables F_1, F_2, ..., F_m, ε_1, ε_2, ..., ε_p.

• For the orthogonal factor model it is assumed that

$$E(F) = 0_{(m \times 1)}, \qquad \mathrm{Cov}(F) = E(FF') = I_{(m \times m)}$$

$$E(\varepsilon) = 0_{(p \times 1)}, \qquad \mathrm{Cov}(\varepsilon) = E(\varepsilon \varepsilon') = \Psi = \begin{bmatrix} \psi_1 & 0 & \cdots & 0 \\ 0 & \psi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_p \end{bmatrix}_{(p \times p)}$$
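For orientation only, a hedged scikit-learn sketch of fitting such a model with m = 3 common factors; the random data is merely a placeholder standing in for customer.sav, and scikit-learn's estimation and rotation options are not necessarily the same as those used by SPSS:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(559, 9))                # placeholder data standing in for customer.sav

fa = FactorAnalysis(n_components=3)          # m = 3 common factors, as in the SPSS output
fa.fit(X)

L = fa.components_.T                         # p x m matrix of estimated loadings l_ij
psi = fa.noise_variance_                     # specific variances psi_1, ..., psi_p
print(L.shape, psi.shape)                    # (9, 3) (9,)
```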

Page 39: DATA REDUCTION  (Lecture# 03)


VALUES REDUCTION (BINNING)

• A reduction in the number of discrete values for a given feature is based on the second set of techniques in the data-reduction phase: the feature-discretization techniques.

• The task is to discretize the values of continuous features into a small number of intervals, where each interval is mapped to a discrete symbol.

• The benefits of these techniques are a simplified data description and easy-to-understand data and final data-mining results. Also, more data-mining techniques are applicable with discrete feature values. An "old-fashioned" discretization is made manually, based on our a priori knowledge about the feature.

Page 40: DATA REDUCTION  (Lecture# 03)


VALUES REDUCTION (BINNING): Example

• Example: binning the age feature. The age feature, continuous (measurable) at the beginning of a data-mining process and taking values between 0 and 150 years, may be classified into categorical segments: child, adolescent, adult, middle age, and elderly. The cut-off points are subjectively defined (a small sketch follows the questions below).

Two main questions exist about this reduction process:
– What are the cut-off points?
– How does one select the representatives of intervals?
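A small pandas sketch of this manual binning; the ages and cut-off points are illustrative, since the slide deliberately leaves them subjective:

```python
import pandas as pd

ages = pd.Series([3, 11, 16, 25, 42, 58, 71, 90])

# Subjectively chosen cut-off points; labels map each interval to a discrete symbol.
bins = [0, 12, 19, 40, 60, 150]
labels = ["child", "adolescent", "adult", "middle age", "elderly"]

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
```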

Page 41: DATA REDUCTION  (Lecture# 03)


VALUES REDUCTION (BINNING)

Note: A reduction in feature values usually is not harmful for real-world data-mining applications, and it leads to a major decrease in computational complexity.

Page 42: DATA REDUCTION  (Lecture# 03)


BINNING (Continued)

Page 43: DATA REDUCTION  (Lecture# 03)


BINNING (Continued)

Page 44: DATA REDUCTION  (Lecture# 03)


Feature Discretization: CHI-MERGE Technique

• An automated discretization algorithm that analyzes the quality of multiple intervals for a given feature by using χ² statistics.

• The algorithm determines similarities between the distributions of data in two adjacent intervals based on the output classification of samples.

• If the null hypothesis (that the two adjacent intervals have similar class distributions) is true, the two consecutive intervals are merged to form a single larger interval; the intervals are assumed to be non-overlapping. The test statistic is

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$

Page 45: DATA REDUCTION  (Lecture# 03)


CHI-MERGE Technique (Continued)

A contingency table for 2 × 2 categorical data

             Class 1   Class 2   ∑
Interval-1   A11       A12       R1
Interval-2   A21       A22       R2
∑            C1        C2        N
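A short NumPy sketch of the χ² computation for such a table, using the standard expected counts E_ij = R_i · C_j / N; the example counts and the 0.1 guard for empty cells are assumptions, not values from the lecture:

```python
import numpy as np

def chi2_statistic(table):
    """Chi-square for a 2 x k contingency table A_ij (intervals x classes)."""
    A = np.asarray(table, dtype=float)
    R = A.sum(axis=1, keepdims=True)          # row totals R_i
    C = A.sum(axis=0, keepdims=True)          # column totals C_j
    N = A.sum()
    E = R @ C / N                             # expected counts E_ij = R_i * C_j / N
    E[E == 0] = 0.1                           # guard against division by zero for empty cells
    return float(((A - E) ** 2 / E).sum())

# Two adjacent intervals with hypothetical class counts; a small chi-square value
# suggests similar class distributions, so the intervals would be merged.
print(round(chi2_statistic([[4, 1], [3, 2]]), 3))
```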

Page 46: DATA REDUCTION  (Lecture# 03)


CHI-MERGE Technique (Continued)

Page 47: DATA REDUCTION  (Lecture# 03)


VALUES REDUCTION (BINNING)

Page 48: DATA REDUCTION  (Lecture# 03)


• Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003.

• William Johnson, Applied Multivariate Analysis, Pearson Education, Low Price Edition, 2005.

References

Page 49: DATA REDUCTION  (Lecture# 03)


Thank You