PCA and ICA


    Principal Components Analysis & Independent Components Analysis

Aaron Clarke
SN: 206071237

    Prof. Robert Cribbie

    Statistics 6130


    Introduction:

    A common problem in information theory is that of representing a message space

    with the smallest possible set of message components (Cottrell et al., 1987; Oja, 1983).

    That is, to find a basis set of message components that could be used to form every

    message, given a particular set of possible messages. For example, the Morse code forms

a basis set for the set of possible messages that can be transmitted by Morse code. If one wanted to form the message SOS, then one would simply combine the element for S ("...") with the element for O ("---") such that the full message would be "... --- ...". If the only message that anyone ever sent by Morse code was "... --- ...", then "..." and "---" would be the basis set for the set of all messages sent by Morse code. The question then

    arises: is there a basis set for the set of images that the human visual system was likely

    to encounter in its evolutionary environment? If so then one would expect that the

    human visual system would be adapted to optimally perceive this basis set and would use

    it to reconstruct observed image information. It is already known that Fourier analysis

    can be used to decompose any given image into a set of spatial frequency components of

varying phase, orientation and amplitude; however, the elements of the set of all possible

    spatial frequency components are not equally distributed in natural images (Field, 1987).

    Thus one would expect that there might exist a smaller basis set of spatial frequency

    components that could be used to compose the set of natural images that the human

    visual system is exposed to. Evidence supporting this theory can be found in the

    neurophysiological literature where it has been shown that single neurons in the visual

    cortex respond to a finite set of Gaussian enveloped Fourier components (called Gabor

    patches) of particular spatial frequencies and orientations (Hubel & Wiesel, 1968). Thus,


    it seems that the visual system somehow de-correlates the incoming visual information to

    produce a useful basis set of image components with which to filter the incoming images.

    A possible method for computing this basis set lies in principal components analysis

    (PCA).

    PCA

    The idea behind principal components analysis is that a given message set, or a

given data set, is linearly transformed into a lower-dimensional data set with the property that the transformed variables are mutually uncorrelated (Gill, 2002). PCA

    generates a basis set for a set of messages by rotating the message data in the sample

    space of observed messages (Gill, 2002). For example, in the figure below, the original

    data are presented on the left and the PCA rotated data are presented on the right.

    Figure 1: Left: original data. Right: PCA rotated data.

    Note that the shape of the distribution is preserved, but the regression line through the

    PCA rotated data set is now aligned with the x-axis, thereby de-correlating the data. This


is achieved mathematically by representing each message i as a column vector Ui and by placing each column vector in a matrix X.

Ui = [ Ui1
       Ui2
        :
       Uim ]

and

X = [ U11 U21 U31 ... Un1
      U12 U22 U32 ... Un2
       :   :   :        :
      U1m U2m U3m ... Unm ]

    Each row of a message column is treated as a separate variable, and each variable defines

a separate axis (Gill, 2002). The variance-covariance matrix R of the message matrix X is computed, and then the eigenvectors E and the eigenvalues of that variance-covariance matrix are computed (Gill, 2002). The eigenvalues constitute the (diagonal) variance-covariance matrix of the data under the rotation defined by the principal components (Gill, 2002). Thus each eigenvalue is the variance of one principal component, with the first principal component accounting for the largest variance (Gill, 2002). The eigenvector matrix E provides the transformation of the data points

    from the original message matrix X to the PCA metric Y through simple matrix

    multiplication (Gill, 2002).

    Y = XE

    Here, the principal component scores matrix equals the original matrix multiplied by the

    eigenvector matrix.
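As a minimal sketch of this computation (a generic illustration rather than the Appendix A code, which places each image in a column of X), the rotation Y = XE can be written in Matlab as follows, assuming a data matrix X whose rows are observations and whose columns are centred variables:

% Minimal PCA sketch: rows of X are observations, columns are variables.
X = randn(100,5)*randn(5,5);             % stand-in correlated data
X = X - repmat(mean(X),size(X,1),1);     % centre each column

R = cov(X);                              % variance-covariance matrix
[E,L] = eig(R);                          % eigenvectors E, eigenvalues on diag(L)

% Order the components by decreasing variance (eigenvalue)
[vals,order] = sort(diag(L),'descend');
E = E(:,order);

Y = X*E;                                 % principal component scores
% cov(Y) is now (approximately) diagonal, with diag(cov(Y)) equal to vals

The Appendix A code instead uses Matlab's pcacov function, which returns the eigenvectors already sorted by decreasing eigenvalue, along with the percentage of variance explained by each component.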

    For a practical example of how PCA can be used in images, suppose that one

    were given the set of faces illustrated in figure 2.


    Figure 2: Original set of faces.

    If this set represented the full set of faces that a visual system were exposed to, then one

    could compute the set of basis faces required to fully represent those faces. This basis set

    is shown in figure 3.

    Figure 3: Complete basis set (i.e. the principal components) for the set of faces given in

    figure 2. Going from left to right and from top to bottom the variances accounted for by

    each face are: 72.9242%, 6.8400%, 4.7808%, 3.4802%, 2.9461%, 2.1323%, 2.0361%,

1.8905%, 1.6516% and 1.3180%.

One can also see what percentage of the total variance in the face set is accounted for by each individual basis face. Here it can be seen that the first basis face accounts for most of the variance in the face set (~73%). Subjectively, this basis face looks the most face-like of any in the set. The next face accounts for a

    much smaller percentage of the total variance in the face set, as do all of the subsequent


faces. If one were to arbitrarily set a cut-off level of 2% for the variance explained by a basis face, then one could roughly represent all of the faces in the original face set using only the first seven basis faces, as can be seen in figure 4.

    Figure 4: Formulation of the first face using the basis set. The top left-hand face was

    composed using only the first principal component. The face to the right of it was

    composed using the first two principal components. This pattern continues from left to

right and from top to bottom. The last face uses all ten principal components and is exactly the same as the original face. Note that an excellent approximation is achieved

    using 7 or more of the 10 basis faces.

    The Matlab code that I wrote to compute the basis faces can be found in Appendix A.

One benefit of using PCA, then, is that it allows information to be compressed without the loss of the subjective qualities of the information. Specifically, if one wanted to transmit the full set of faces given in figure 2, then one would need only to transmit the first seven basis faces from figure 3 together with the amplitudes by which each basis face must be multiplied to regenerate each original face. In this case, the use of

    PCA results in roughly a 30% reduction in the amount of information that would need to

    be sent. This procedure has been extended by other researchers to massive sets of natural

    images where it was found that the components of the natural images tended to resemble

    the Gaussian enveloped Fourier components noted by Hubel & Wiesel (1968) to be the

    optimal stimuli for exciting neurons in the visual cortex (Olshausen & Field, 1997).
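Returning to the reconstruction and compression step described above, a minimal sketch (continuing the variables E, Y and vals from the previous sketch, and using the seven-component cut-off from the text) might look like this:

% Reconstruct the (centred) data from the first k principal components.
% Because E is orthonormal, inv(E) equals E', so E(:,1:k)' plays the role
% of Einv(1:i,:) in the Appendix A code.
k = 7;
Xhat = Y(:,1:k)*E(:,1:k)';

% Fraction of the total variance retained by the first k components
retained = sum(vals(1:k))/sum(vals);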


    ICA

Assume that one has the following neural network:

[Diagram: inputs X1, X2, ..., Xm feed into a neural model, which produces outputs Y1, Y2, ..., Ym.]

where the column vector X represents the sensory inputs from an external stimulus U, and:

U = [ U1
      U2
      :
      Um ]

    If the external stimulus (U) is subject to mixing where A is a mixing matrix of size m-by-

    m, then the sensory information received by the brain (X) is given as:

    X = AU

(Haykin, 1999). In this case, in order for the brain to pick out the original signal vector

    U, it is necessary to develop a neural model that unmixes the mixing done by A, and

    transforms the inputs X into the output Y such that the elements of Y are as statistically

    independent as is possible (Haykin, 1999). In order to do this, it is necessary to compute

    an unmixing matrix W that reverses the effects of A, such that

    Y = WX

and

Y = [ Y1
      Y2
      :
      Ym ]

    (Haykin, 1999). The unmixing matrix W in this case would be the m-by-m matrix that

    when multiplying X makes the elements of the resultant product Y as statistically

    independent as is possible. Thus, the elements of Y would be the independent

    components present in the original signal U, although rescaled and permuted (Haykin,

    1999).
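As an illustrative sketch of this signal model (a toy simulation rather than the face example, and using the ideal unmixer W = inv(A) instead of a learned one):

% Toy illustration of the ICA signal model X = A*U and Y = W*X.
t = linspace(0,1,1000);
U = [sin(2*pi*5*t); sign(sin(2*pi*3*t))];   % two independent source signals
A = rand(2);                                % mixing matrix (unknown in practice)
X = A*U;                                    % observed, mixed signals

W = inv(A);             % ideal unmixing matrix; a real ICA algorithm must
Y = W*X;                % estimate W from X alone, recovering the sources
                        % only up to rescaling and permutation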

    In order to make the elements of Y as statistically independent as is possible, it is

    necessary to minimize the mutual information conveyed by any pair of elements in Y

(Haykin, 1999). Mutual information measures the reduction in uncertainty about Yi that results from observing Yj (Haykin, 1999). The mutual information I(Yi;Yj) between Yi and Yj,

    then, is the entropy of Yi minus the conditional entropy of Yi given Yj:

I(Yi;Yj) = H(Yi) − H(Yi|Yj)

(Haykin, 1999). In particular, if Yi and Yj are independent, then H(Yi|Yj) = H(Yi) and the mutual information between them is zero. This situation is represented in the following Venn diagram.

[Venn diagram: two overlapping circles, H(Yi) and H(Yj), drawn inside the joint entropy H(Yi,Yj); the non-overlapping regions are the conditional entropies H(Yi|Yj) and H(Yj|Yi), and the overlap is the mutual information I(Yi;Yj).]

In order for all of the elements of Y to be statistically independent, the Kullback-Leibler divergence between the joint probability density function of Y and the product of the marginal probability density functions of its elements Yi (where i goes from 1 to m) must be minimized (Haykin, 1999).
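Written out explicitly (following Haykin's treatment), this criterion is the divergence

D( f_Y || Π_i f_Yi ) = ∫ f_Y(y, W) log[ f_Y(y, W) / ( Π_i=1..m f_Yi(yi, W) ) ] dy

between the joint probability density of Y and the product of the marginal densities of its elements; the divergence is zero exactly when the elements of Y are statistically independent. Minimizing this quantity with respect to W is the objective that the learning rule in Appendix B approximates.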

    Some Matlab code that I wrote to accomplish this objective using the set of faces

    in figure 2 can be found in Appendix B. In the code that I wrote, it is implicitly assumed

    that the unmixing matrix W converges by 1200 iterations. This may not necessarily be

the case; however, I have found it to work in some preliminary tests with the face stimuli.

In order to obtain a quantitative index of the demixer's performance, one may calculate a

global rejection index ρ as:

ρ = Σ_i=1..m [ Σ_j=1..m ( |p_ij| / max_k |p_ik| ) − 1 ] + Σ_j=1..m [ Σ_i=1..m ( |p_ij| / max_k |p_kj| ) − 1 ]

where P = {p_ij} = WA (Haykin, 1999). The performance index ρ is a measure of the

diagonality of matrix P (Haykin, 1999). If the matrix P is perfectly diagonal, ρ = 0

    (Haykin, 1999). For a matrix P whose elements are not concentrated on the principal

    diagonal, the performance index will be high (Haykin, 1999). A good performance

    index is around 0.05 (Haykin, 1999). This index could be used in the iterative code that I

    wrote for computing W. Instead of iterating the loop for 1200 cycles, one could instead

    use a while loop, evaluating the performance index at each iteration of the loop, and

    exiting only when the performance index reached a certain threshold level. The

calculation of this index, however, is computationally intensive, and so I didn't include it in my code, in the hope that any time I lose by iterating the loop calculating W past the threshold performance index, I'll make up in the speed of my loop.
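For illustration only, a hedged sketch of that alternative stopping rule is given below. It assumes that the variables W, X, Y, phi and eta have been initialized as in Appendix B, and that the true mixing matrix A is known (as it would be in a simulation where the sources are mixed deliberately; no such A is available for the face set). The cap maxIter is an arbitrary safeguard and not part of the original code.

% Hypothetical while-loop version of the main ICA loop that stops once the
% performance index of P = W*A drops below Haykin's rule-of-thumb of 0.05.
rho = Inf;
iter = 0;
maxIter = 5000;                            % arbitrary safety cap
while rho > 0.05 && iter < maxIter
    W = W + eta*(eye(size(W)) - phi*Y')*W;
    Y = W*X;
    phi = 1/2*Y.^5 + 2/3*Y.^7 + 15/2*Y.^9 + 2/15*Y.^11 + 112/3*Y.^13 + ...
        128*Y.^15 - 512/3*Y.^17;

    % Performance (global rejection) index: sums of row-wise and column-wise
    % ratios of each |p_ij| to the largest element in its row or column.
    P = abs(W*A);
    rowMax = repmat(max(P,[],2),1,size(P,2));
    colMax = repmat(max(P,[],1),size(P,1),1);
    rho = sum(sum(P./rowMax,2) - 1) + sum(sum(P./colMax,1) - 1);

    iter = iter + 1;
end

Whether the 0.05 threshold is ever reached depends on the learning rate and on the data, which is one more reason to keep the iteration cap.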

    In the end, ICA may be viewed as an extension of PCA. Whereas PCA can only

    impose independence up to the second order while constraining the direction vectors to


    be orthogonal, ICA imposes statistical independence on the individual components of the

    output vector Y and has no orthogonality constraint.

    An example of the application of ICA to images can be found in figure 5.

Figure 5: Independent components derived from the original image set given in figure 2. Note the marked differences between the independent components of the image set presented here and the principal components of the image set depicted in figure 3. Also note that since the independent components are maximally statistically independent, they all account for an equal percentage of the variance in the image set.

    Here the same initial set of faces from figure 2 that was used in the PCA demonstration is

    used again. Each image was vectorized by taking each row of the image and

concatenating it with the previous row to produce a row vector of length (image length × image height). Each row vector was then placed in a matrix X, providing the input vector

    to the above diagrammed neural network.
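A small sketch of that vectorization step (using stand-in random images; note that the Appendix B code reshapes column-wise, whereas the row-wise concatenation described here needs a transpose because Matlab stores arrays column-major):

% Vectorize a set of images into the observation matrix X, one image per row.
imgs = {rand(64,64), rand(64,64), rand(64,64)};   % stand-in images
X = zeros(numel(imgs), 64*64);
for i = 1:numel(imgs)
    X(i,:) = reshape(imgs{i}',1,[]);              % rows concatenated in order
end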

    In the algorithm for calculating the independent components, W is calculated, and

    the independent components matrix Y can be calculated as Y = WX, where the rows of Y

    are the independent components of the original images. These components were

    presented in figure 5.

    Note that in order to re-construct any of the original images, one must simply

multiply the inverse of the unmixing matrix W by the matrix Y, where


X = W⁻¹Y.

    The top left-hand image from figure 2 is reconstructed in this manner in figure 6.

    Figure 6: Re-constitution of the top left-hand corner face from figure 2 using the

independent components for the image set. Going from left to right and from top to bottom, each image uses incrementally more independent components in its re-constitution of the original face. Note that each component adds a lot of information, reflecting the high statistical independence of each component.

    Note here, however, that each independent component contributes a substantial amount to

    the subjective impression of the face as resembling the original face. This property

    reflects the statistical independence of the components derived from the original face set

used to reconstruct the faces. In the end, ICA doesn't compress the image information as much as PCA does; however, it encodes the components more efficiently, making each

    component a valuable contributor to the original image set. This property is desirable in

    neural networks where it is necessary to make the most efficient use possible of the

    neurons that are available for encoding information. That is, given a set of neurons that

    are to be used to represent information about images in the real world, it would be

    efficient to have the outputs of those neurons as statistically independent as possible.

This result also explains the neurophysiological findings of Hubel and Wiesel (1968), as

    noted in Bell and Sejnowski (1997).


References:

    Bell, A.J., and Sejnowski, T.J. (1997). The independent components of natural scenes

    are edge filters. Vision Research, 37, 3327-3338.

    Cottrell, G.W., Munro, P.W., and Zipser, D. (1987). Image Compression by Back-

    Propagation: A Demonstration of Extensional Programming. Technical Report 8720,

    University of California, San Diego, Institute of Cognitive Science.

    Field, D.J. (1987). Relations between the statistics of natural images and the response

    properties of cortical cells. Journal of the Optical Society of America A, 4, 2379-2394.

Gill, J. (2002). What Is Principle Components Analysis Anyway? Retrieved January 2,

    2003, from http://www.clas.ufl.edu/~jgill/papers/pca.pdf

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation (2nd ed.). New Jersey: Prentice Hall.

Oja, E. (1983). Subspace Methods of Pattern Recognition. Letchworth, England: Research

    Studies Press and Wiley.

    Appendix A

    % PCA

    % Load the original set of faces

% (compliments of Prof. Jason Gould, University of Indiana)


    load FaceStruct.mat;

    names = fieldnames(images);

    % Initialize the matrix of images

    Raw = zeros(prod(size(images.andrea)),length(names));

    % Put the images into the matrix of images

    for i = 1:length(names)

eval(['Raw(:,i) = reshape(images.',char(names(i)),',[length(Raw(:,1)) 1]);']);
end

    % Normalize the image matrix to have zero mean

% and unit standard deviation
ColMeans = repmat(mean(Raw),length(Raw(:,1)),1);

    ColStd = repmat(std(Raw),length(Raw(:,1)),1);

    X = (Raw-ColMeans)./ColStd;

    % Calculate the variance-covariance matrix for the

% normalized image matrix
R = cov(X);
% Calculate the eigenvectors and eigenvalues of the
% variance-covariance matrix

    [E, LATENT, EXPLAINED] = pcacov(R);

% Calculate the principal component scores
% (These are the filters for the images)

    Y = X*E;

    % Calculate the inverse of the eigenvector matrix

    Einv = inv(E);

    % Build and display the first face using the principal components

    for i = 1:length(names)

eval(['ReformFace1',num2str(i),' = Y(:,1:i)*Einv(1:i,:);']);
figure
% scale is a user-defined helper (not a Matlab built-in) that appears to
% rescale an array's values to the range [0, 1] for display.
eval(['img',num2str(i),' = scale(reshape(ReformFace1',num2str(i),'(:,1),size(images.andrea)));']);
eval(['image(repmat(img',num2str(i),',[1 1 3]));']);
axis equal
eval(['imwrite(img',num2str(i),',''MakeAnd',num2str(i),'.jpg'',''jpg'');']);
end

    % Display the principal components

    for i = 1:length(names)


    eval([char(names(i)),' = 2*scale(reshape(X(:,i),size(images.andrea)))-1;']);

    figure

eval(['image(repmat(scale(',char(names(i)),'),[1 1 3]));']);
axis equal

    eval(['title(''Variance explained = ',num2str(EXPLAINED(i)),''');']);

    end

    Appendix B

    % ICA

    % Load the original face set

% (Courtesy of Professor Jason Gould, University of Indiana)

load FaceStruct.mat
names = fieldnames(images);

    % Calculate the length of the column vector composed of

    % the concatenated columns of one image.

    ColLength = prod(size(images.andrea));

    % Initialize the observation vector

    X = zeros(length(names),ColLength);

    % Fill in the observation vector

    for i = 1:length(names)

eval(['X(i,:) = reshape(images.',char(names{i}),',[1 ColLength]);']);
end

% Initialize the unmixing matrix
W = rand(length(names))*0.05;

    % Initialize the unmixed matrix

    Y = W*X;

    % Calculate the updating parameter phi for the given W and X

phi = 1/2*Y.^5 + 2/3*Y.^7 + 15/2*Y.^9 + 2/15*Y.^11 + 112/3*Y.^13 + ...
    128*Y.^15 - 512/3*Y.^17;

% Learning rate
eta = 0.1;

    % Initialize waitbar (this isn't a necessary part of the code,

    % it just lets you see how far along the algorithm is as it's


% iterating.)

    h = waitbar(0,'Calculating matrix W...');

    n = 1200;

% Main ICA loop, repeat n times, so that the matrix W converges
for i = 1:n

    W = W + eta*(eye(size(W)) - phi*Y')*W;

Y = W*X;
phi = 1/2*Y.^5 + 2/3*Y.^7 + 15/2*Y.^9 + 2/15*Y.^11 + 112/3*Y.^13 + ...
    128*Y.^15 - 512/3*Y.^17;

    waitbar(i/n,h);

    end

    % Close waitbar

    close(h)

    % Display and save the images for the independent components

ImageMatrix = scale(Y);
for i = 1:length(names)

    img{i} = repmat(reshape(ImageMatrix(i,:),size(images.andrea)),[1 1 3]);

figure
image(img{i});

    axis equal

    eval(['imwrite(img{i},''IndComp',num2str(i),'.jpg'',''jpg'');']);

    end

    % Calculate the inverse of the unmixing matrix W

    Winv = inv(W);

    % Rebuild and display the first image using the independent components

for i = 1:length(names)
A = Winv(:,1:i)*Y(1:i,:);

    ImageMatrix = scale(A);

    img{i} = repmat(reshape(ImageMatrix(1,:),size(images.andrea)),[1 1 3]);

figure
image(img{i});

    axis equal

eval(['imwrite(img{i},''RebuiltUsing',num2str(i),'.jpg'',''jpg'');']);
end