Clustering Report

    CLUSTERING

    Clustering involves grouping data points together according to some measure of

    similarity. One goal of clustering is to extract trends and information from raw data sets.

    An alternative goal is to develop a compact representation of a data set by creating a set

    of models that represent it [1].

    There are two general types of clustering that are used: supervised and

    unsupervised clustering. Supervised clustering uses a set of example data to classify the

rest of the data set. This can be called classification: here the task is to learn to

    assign instances to pre-defined classes [2]. For example, consider a set of colored balls

    (all colors) that you want to classify into three groups: red, green, and blue. A logical way

    to do this is to pick out one example of each class--a red ball, a green ball, and a blue

    ball--and set them each next to a bucket. Then go through the remaining balls, compare

    each ball to the three examples and put each ball in the bucket whose example it matches

    the best.

This example of supervised clustering is illustrative because it highlights two potential problems. First, the result you get is going to depend on the balls you

    select as examples. If you were to select a red, an orange, and a blue ball, then it might be

    difficult to classify a green ball. Second, unless you are careful about selecting examples,

    you may select examples that don't represent the distribution of data. For example, you

    might select red, green, and blue balls, only to discover that most of the colored balls

were cyan, purple, and magenta (which lie in between the three primary colors). This


    shows the importance of selecting representative samples when you execute supervised

    clustering.

    Unsupervised clustering, on the other hand, tries to discover the natural groupings

    inside a data set without any input from a trainer. The main input a typical unsupervised

    clustering algorithm takes is the number of classes it should find. In the colored balls

    case, this would be like dumping them into an automatic sorting machine and telling it to

    create three piles. The goal of unsupervised clustering is to create three piles where the

    balls within each pile are very similar, but the piles are different from one another. Here

    no pre-defined classification is required. The task is to learn a classification from the

    data.

    One of the most important characteristics of any supervised or unsupervised

    clustering process is how to measure the similarity of two data points. Clustering

algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other; they share certain properties.

    Clustering algorithms can have different properties [2]:

    Hierarchical: These methods include those techniques where the input data are

    not partitioned into the desired number of classes in a single step. Instead, a series

    of successive fusions of data are performed until the final number of clusters is

    obtained [3].

Non-hierarchical or iterative: These methods include those techniques in which a desired number of clusters is assumed at the start; instances are then reassigned between clusters to improve the grouping.


Hard and soft: Hard clustering assigns each instance to exactly one cluster. Soft clustering assigns each instance a probability of belonging to a cluster.

Disjunctive: Instances can be part of more than one cluster.

The figure below shows an illustration of the properties of clustering.

Figure 1: Illustration of the properties of clustering

Unsupervised Clustering:

One of the most commonly used unsupervised clustering algorithms is the K-means algorithm. The algorithm is as follows.

    Specify k, the number of clusters


    Choose k points randomly as cluster centers

Assign each instance to its closest cluster center using Euclidean distance

Calculate the mean of each cluster and use it as the new cluster center

    Reassign all instances to the closest cluster center

    Iterate until the cluster centers do not change any more
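For concreteness, a minimal Python sketch of these steps is given below; the NumPy usage, the convergence check, and the example data in the comments are illustrative assumptions rather than anything taken from this report.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Minimal K-means sketch: data is an (n_points, n_features) array."""
    rng = np.random.default_rng(seed)
    # Choose k points randomly as the initial cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each instance to its closest center (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the instances assigned to it.
        new_centers = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical usage with made-up 2-D data:
# data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
# centers, labels = k_means(data, k=2)
```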

    The figure below explains the concept of K-means clustering

    Figure 2: Illustration of K-means algorithm [4]

A demo of the K-means algorithm is shown below. The pictures depict the movement of the centers for 4 clusters over 4 iterations.

(Snapshots of the cluster centers at iterations 2, 3, and 4 of the K-means demo.)

After the fourth iteration the centers do not move much, and hence the centers are fixed at this position. The disadvantages of the K-means algorithm are that the number of clusters must be specified up front, and that a different set of initial random centers can yield different final cluster centers.


    SUPERVISED CLUSTERING ALGORITHMS:

In this section four different types of supervised clustering algorithms are presented: vector quantization, fuzzy clustering, artificial neural nets, and fuzzy-neural algorithms. Though the fuzzy and neural-net approaches initially go through an unsupervised step to determine the cluster centers, only the supervised clustering algorithms are discussed here.

VECTOR QUANTIZATION:

The origin of this algorithm is Shannon's source coding theory, which is used for the transmission and encoding of data. The algorithm is as follows. A vector quantizer maps k-dimensional vectors in the vector space R^k into a finite set of vectors Y = {y_i : i = 1, 2, ..., N} [5]. Each vector y_i is called a code vector or a codeword, and the set of all the codewords is called a codebook. Associated with each codeword y_i is a nearest-neighbor region called its Voronoi region, defined by

V_i = { x ∈ R^k : ||x − y_i|| ≤ ||x − y_j|| for all j ≠ i }

The set of Voronoi regions partitions the entire space R^k such that

∪_{i=1}^{N} V_i = R^k   and   V_i ∩ V_j = ∅ for all i ≠ j

As an example we take vectors in the two-dimensional case. Figure 3 shows some vectors in space. Associated with each cluster of vectors is a representative codeword (a cluster center or cluster representative obtained by the k-means algorithm or similar algorithms). Each codeword resides in its own Voronoi region. These regions are separated with imaginary lines in Figure 3 for illustration. Given an input vector, the codeword that is chosen to represent it is the one in the same Voronoi region.

Figure 3: Vector quantization illustration in 2-D space showing the Voronoi regions formed by imaginary lines

The representative codeword (cluster center) is determined to be the closest in Euclidean distance to the input vector (instance). The Euclidean distance is defined by

d(x, y_i) = sqrt( Σ_{j=1}^{k} (x_j − y_{ij})² )

where x_j is the jth component of the input vector, and y_{ij} is the jth component of the codeword y_i.
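To make the encoding step concrete, here is a minimal Python sketch that selects the nearest codeword by this Euclidean distance; the small codebook and input vector are hypothetical values chosen only for illustration.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codeword closest to x in Euclidean distance.
    x: (k,) input vector; codebook: (N, k) array of codewords y_i."""
    dists = np.sqrt(((codebook - x) ** 2).sum(axis=1))
    return int(dists.argmin())

# Hypothetical 2-D codebook with N = 3 codewords:
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
x = np.array([0.9, 1.2])
print(vq_encode(x, codebook))  # -> 1, the codeword whose Voronoi region contains x
```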

FUZZY SUPERVISED CLUSTERING:


    Fuzzy logic is becoming popular in the field of automatic control. Fuzzy logic

    requires no analytical model of the system, and offers the chance to combine heuristic

    knowledge with any model knowledge which may be available [6]. Fuzzy logic can also

    deal with vague or imprecise data. In the field of fault diagnosis, fuzzy logic has been

    used successfully in many applications, both as a means of residual generation, and to aid

    in the decision making process of residual evaluation.

    The idea behind fuzzy clustering is basically that of pattern recognition. Training

    data is used off-line to determine relevant cluster centers for each of the faults of interest.

    On-line, the degree to which the current data belongs to each of the pre-defined clusters is

    determined, and this results in a degree-of-membership to each of the pre-determined

    faults. This method is useful in cases where there are many residuals, or in which no

    expert knowledge of the system is available. Fuzzy clustering is different from fuzzy

reasoning, which is also used in residual analysis. Fuzzy reasoning mainly comprises IF-THEN reasoning based on the sign of the residual. Examples of fuzzy reasoning:

IF residual1 is positive AND residual2 is negative THEN fault1 is present

IF residual1 is zero AND residual2 is zero THEN the system is fault free

and so on.

    Clustering is the allocation of data points to a certain number of classes. Each class is

represented by a cluster center, or prototype, which can be considered as the point which

    best represents the data points in the cluster. The idea behind fuzzy clustering is that each

    data point belongs to all classes with a certain degree of membership. The degree to

which a data point belongs to a certain class is dependent upon the distance to all cluster


    centers. For fault diagnosis, each class could correspond to a particular fault. The general

principle is shown for three inputs and three clusters in Fig. 4.

    Figure 4: Fuzzy clustering concept showing the cluster centers and the membership

    grade of a data point

    The fuzzy clustering fault isolation procedure consists of the following two steps:

    Off-line phase: this is a learning phase which consists of the determination of the

    characteristics (i.e. cluster centers) of the classes. A learning data set is necessary

for this off-line phase, which must contain residuals for all known faults. (For more details on the origin of the idea of fuzzy clustering, refer to [7].)

    On-line phase: This phase calculates the membership degree of the current

    residuals to each of the known classes. In this way each data point does not

    belong to only one cluster, but its membership is distributed among all clusters

    according to the varying degree of resemblance of its features with respect to

    those cluster centers [8].


    It is important that the training data contains all faults of interest, otherwise they cannot

    be isolated on-line - though unknown faults can in some cases be detected.

    The fuzzy membership matrix and the cluster centers are computed by minimizing the

    following partition formula:

J_m(U, v) = Σ_{i=1}^{C} Σ_{k=1}^{N} (u_{i,k})^m (d_{i,k})²,   subject to   Σ_{i=1}^{C} u_{i,k} = 1        (1)

where C denotes the number of clusters, N the number of data points, u_{i,k} the fuzzy membership of the k-th point to the i-th cluster, d_{i,k} the Euclidean distance between the data point and the cluster center, and m ∈ (1, ∞) a fuzzy weighting factor which defines the degree of fuzziness of the results. The data classes become more fuzzy and less discriminating with increasing m. In general, m = 2 is chosen (it is mentioned that this value of m does not produce the optimal solution for all problems).

    The constraint in eq. (1) implies that each point must entirely distribute its

    membership among all the clusters. The cluster centers (centroids or prototypes) are

defined as the fuzzy weighted center of gravity of the data x_k:

v_i = Σ_{k=1}^{N} (u_{i,k})^m x_k / Σ_{k=1}^{N} (u_{i,k})^m,   i = 1, 2, ..., C        (2)

Since u_{i,k} affects the computation of the cluster center v_i, data with a high membership will influence the prototype location more than points with a low membership. For the fuzzy C-means algorithm, the distance d_{i,k} is defined as follows:

d_{i,k}² = ||x_k − v_i||²        (3)
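A minimal Python sketch of the resulting fuzzy C-means iteration is given below. The center update follows eq. (2) and the distance follows eq. (3); the membership update used here is the standard fuzzy C-means update obtained by minimizing (1) under the constraint, and the random initialization and fixed iteration count are assumptions made only for illustration.

```python
import numpy as np

def fuzzy_c_means(x, C, m=2.0, n_iter=100, seed=0):
    """Fuzzy C-means sketch. x: (N, d) data.
    Returns cluster centers v (C, d) and memberships u (C, N)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Random initial memberships, normalized so each point's memberships sum to 1.
    u = rng.random((C, N))
    u /= u.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Eq. (2): centers are the fuzzy weighted centers of gravity of the data.
        v = (um @ x) / um.sum(axis=1, keepdims=True)
        # Eq. (3): squared Euclidean distances between data points and centers.
        d2 = ((x[None, :, :] - v[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # avoid division by zero
        # Standard membership update that minimizes (1) under the constraint.
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)
    return v, u
```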


Figure 5: Matlab fuzzy-logic toolbox demo of fuzzy C-means clustering for 4 clusters

ARTIFICIAL NEURAL NET CLUSTERING:

Before discussing the supervised clustering technique in neural nets, the basics of artificial neural networks are discussed.

An artificial neural network is a system loosely modeled on the human brain [9]. It is an attempt to simulate, within specialized hardware or sophisticated software, multiple layers of simple processing elements called neurons. Each neuron is linked to

    certain of its neighbors with varying coefficients of connectivity that represent the


    strengths of these connections. Learning is accomplished by adjusting these strengths to

cause the overall network to output appropriate results. The most basic components of

    neural networks are modeled after the structure of the brain. The most basic element of

    the human brain is a specific type of cell, which provides us with the abilities to

    remember, think, and apply previous experiences to our every action. These cells are

known as neurons; each of these neurons can connect with up to 200,000 other neurons.

    The power of the brain comes from the numbers of these basic components and the

    multiple connections between them.

    All natural neurons have four basic components, which are dendrites, soma, axon,

    and synapses. Basically, a biological neuron receives inputs from other sources,

    combines them in some way, performs a generally nonlinear operation on the

result, and then outputs the final result. The figure below shows a simplified

    biological neuron and the relationship of its four components.


    Figure 6 : Four main parts of human nerve cells, based on which artificial neurons are

    designed

The basic unit of neural networks, the artificial neuron, simulates the four basic

    functions of natural neurons. Artificial neurons are much simpler than the biological

    neuron; the figure below shows the basics of an artificial neuron.

Figure 7: Structure of an artificial neuron with Hebbian learning ability (weights are adjustable)

    D. Hebb has postulated a principle for a learning process (Hebb, 1949) at the cellular

    level: if Neuron A is stimulated repeatedly by Neuron B at times when Neuron A is

active, then Neuron A will become more sensitive to stimuli from Neuron B (the correlation principle) [10]. It implicitly involves adjustments of the strengths of the synaptic inputs, which led to the incorporation of adjustable synaptic weights on the input

    lines to excite or inhibit incoming signals.
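As a rough sketch of this principle, a Hebbian adjustment strengthens a weight in proportion to the correlation between pre- and post-synaptic activity; the learning rate and the example values below are assumptions made purely for illustration.

```python
import numpy as np

def hebbian_update(w, x, y, lr=0.01):
    """Hebbian rule sketch: strengthen each weight in proportion to the
    correlation between input activity x and output activity y."""
    return w + lr * y * x

w = np.array([0.1, -0.2, 0.05])  # assumed initial synaptic weights
x = np.array([1.0, 0.0, 1.0])    # pre-synaptic (input) activity
y = float(np.dot(w, x))          # post-synaptic (output) activity
w = hebbian_update(w, x, y)      # inputs that co-occur with activity get stronger weights
```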


The architecture for a network that consists of a layer of M perceptrons is shown in Figure 9. An input feature vector x = (x_1, ..., x_N) is input to the network via the set of N branching nodes. The lines fan out at the branching nodes so that each perceptron receives an input from each component of x. At each neuron, the lines fan in from all of the input (branching) nodes. Each incoming line is weighted with a synaptic coefficient (weight parameter) from the set {w_nm}, where w_nm weights the line from the nth component x_n coming into the mth perceptron.

    Figure 9 : One layer of perceptrons network with N inputs and M perceptrons
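In code, such a layer is simply a weighted sum per perceptron followed by a threshold; the sketch below assumes a small weight matrix and threshold vector chosen only for illustration.

```python
import numpy as np

def perceptron_layer(x, W, b):
    """x: (N,) input; W: (M, N) weights, W[m, n] = w_nm; b: (M,) thresholds.
    Returns the M thresholded outputs y_m = T(sum_n w_nm * x_n + b_m)."""
    s = W @ x + b
    return np.where(s >= 0, 1, -1)

# Assumed values: N = 3 inputs feeding M = 2 perceptrons.
W = np.array([[0.5, -1.0, 0.2],
              [1.0,  0.3, -0.7]])
b = np.array([0.0, 0.1])
print(perceptron_layer(np.array([1.0, 0.5, -1.0]), W, b))
```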

    The Perceptron as Hyperplane Separator:

Consider a perceptron as shown in Figure 7. The input vector x = (x_1, ..., x_N) is linearly combined with the weights to obtain

s = w_1 x_1 + ... + w_N x_N + b

where b is the threshold. Then s is activated by a threshold function T(·) to produce the output y = T(s) = 1 when s >= 0, else y = T(s) = -1. The set of all input vectors x such that

s = w_1 x_1 + ... + w_N x_N + b = 0

forms a hyperplane H in the input vector space. H partitions the feature vector space into right and left halfspaces H+ and H-.


An example: consider a single perceptron with two inputs. Let w1 = 2, w2 = -1, and b = 0; then 2x1 - x2 = 0 determines H. The points (0,0) and (1,2) belong to H.

The feature vector x = (x1, x2) = (2,3) is summed into

s = 2(2) - 1(3) = 1 > 0, so that the activated output is y = T(1) = 1

(which corresponds to H+ in the plane, i.e. the right half).

(x1, x2) = (0,2) activates the output y = T(2(0) - 1(2)) = T(-2) = -1,

which indicates that (0,2) is in the left halfspace H-. The figure below shows these points.

Figure 10: Illustration of H+ and H- relative to the hyperplane
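The arithmetic of this example can be checked with a few lines of Python (a sketch using the weights from the example above):

```python
import numpy as np

def perceptron(x, w, b):
    """Output T(s) = 1 if s = w.x + b >= 0, else -1."""
    s = float(np.dot(w, x)) + b
    return 1 if s >= 0 else -1

w, b = np.array([2.0, -1.0]), 0.0                # the weights from the example
print(perceptron(np.array([2.0, 3.0]), w, b))    # s = 1  -> +1, right halfspace H+
print(perceptron(np.array([0.0, 2.0]), w, b))    # s = -2 -> -1, left halfspace H-
```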

    The above example is a simple linear mapping between the input and the output. Now

consider another example which illustrates how a non-linear relation between input and output is implemented. Consider the XOR logic function, or 2-bit parity problem. There are N = 2 inputs, M = 1 output, Q = 4 sample (input/output) vector pairs for training, and K = 2 clusters (even and odd).


    Table below shows the mapping of input and output for this 2-bit parity data.

    Table 1: Logic for 2-bit parity data

However, we see from Figure 11 below that a single hyperplane cannot separate the four

    feature vectors into the required 2 classes, no matter how it is oriented (rotated and

    translated) by the weights.

    Figure 11: Hyperplane diagram for 2-bit parity data, showing one hyperplane is not

    sufficient to separate the data into two clusters

    The power of a single neuron can be greatly amplified by using multiple neurons in a

    network of layered connectionist architecture, as displayed in Figure 12 below. Such a

multiple layered perceptron (MLP) is also called a feed-forward artificial neural network, abbreviated FANN. The modifier "feed forward" distinguishes these from feedback (recursive) networks. On the left is the layer of input, or branching, nodes, which are not artificial neurons. The hidden layer (the middle layer here) contains neural nodes, as does the output layer on the right. This is the architecture of a two-layered NN (so called because there are two layers of neuronal units).

    Figure 12 : A typical two layered network where the middle layer introduces the required

    non-linearity between input and output layers

    Neural networks may also have multiple hidden layers for the sake of extra power in

    learning to separate nonlinearly separable classes. The Hornik-Stinchcombe-White

    theorem, states that a layered artificial neural network with two layers of neurons is

    sufficient to approximate as closely as desired any piecewise continuous map of a closed

    bounded subset of a finite dimensional space into another finite dimensional space,

    provided there are sufficiently many neurons in the single hidden layer. There is no

    theoretical need to use more than two layers of neurons, which would increase the

    computational complexity and instability in training, and slow down the operation

    because the extra layers cause delays in processing (the idea is that the neurons in a single

    layer are to process in parallel, while the different layers process sequentially). But extra


    layers can prevent the necessity of using an excessive number of neurons in a single

    hidden layer to achieve highly nonlinear classification.

    Consider the same XOR implementation using the two layered network shown in

    the figure below:

    Figure 13 : A two layered network for XOR logic implementation

Let the weights and thresholds of the hidden layer be chosen as indicated in Figure 13. The result is two parallel hyperplanes that yield three convex regions. The hyperplanes are determined by the weighted sums at the two hidden neurons: the threshold at the first neuron in the hidden layer yields one hyperplane, and the threshold at the second hidden neuron yields the other.

    This forces the results listed in Table 2, where we use 0.1 for 0 and 0.9 for 1 (this is the

    usual procedure in using neural networks, because 0 and 1 have special properties that


inhibit gradient training). The four sets of outputs above yield the three unique vectors (y1,y2) = (0,1), (y1,y2) = (1,1), and (y1,y2) = (0,0) that identify the three linearly separable regions shown in Figure 14. We see from the figure that Regions 1 and 3 make up the even parity (Class 1), while Region 2 is odd parity (Class 2). We saw in the previous example that a network of a single layer cannot output the two correct classes, no matter how we orient the hyperplanes via translation and rotation. In all cases of non-coincidental hyperplanes, we obtain three or four convex regions (the lower and upper bounds, respectively).

Table 2: Hidden layer mapping for the 2-bit parity function

    To show that the network with a second layer of perceptrons can learn the nonlinearly

separable classes of even and odd parity (XOR logic), we take the new weights at the single output neuron to be as shown in Figure 13. These weight the lines on which y1 and y2 enter the output neuron (perceptron). Using the hyperplane defined by this output perceptron,


    we need to map y = (1,1) and y = (0,0) into the same class, Class 1, as shown in Figure 14

    below.

    Figure 14 : The Partitioning of the 2-bit Parity Feature Space with Two Perceptron

    Layers

This is done by choosing the weights (u) and the threshold as above. The result is shown in the table below.

    Table 3: The 2-bit Parity Mapping by Two Layers of Perceptrons
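For illustration, the sketch below implements a two-layer threshold network of this kind for the 2-bit parity problem. The particular weights and thresholds are a common textbook choice assumed here for the example (they are not the values of Figure 13, which are not reproduced in this text), but they give the same qualitative behaviour: two parallel hyperplanes in the hidden layer and a combining output neuron.

```python
import numpy as np

def step(s):
    return np.where(s >= 0, 1, 0)

def xor_net(x):
    """Two-layer threshold network for 2-bit parity (XOR).
    Hidden layer: two parallel hyperplanes x1 + x2 = 0.5 and x1 + x2 = 1.5."""
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    y = step(W1 @ x + b1)                  # hidden outputs (y1, y2)
    # Output layer: fires only in the middle region, where (y1, y2) = (1, 0).
    u = np.array([1.0, -2.0])
    return int(step(np.array([u @ y - 0.5]))[0])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # 0 for even parity, 1 for odd
```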

    There are many different kinds of learning rules used by neural networks. The most

common class of ANNs is called backpropagational neural networks (BPNNs) [11]. Backpropagation is an abbreviation for the backwards propagation of error. Here learning is a supervised process that occurs with each cycle or epoch (i.e. each time the network is presented with a new input pattern). It consists of a forward activation pass, in which the inputs and outputs of the neurons flow through the network, and the


    backward weight adjustment schema based on the error calculated. More simply, when a

    neural network is initially presented with a pattern it makes a random guess as to what it

    might be. It then sees how far its answer was from the actual one and makes an

    appropriate adjustment to its connection weights.

    Backpropagation performs a gradient descent within the weight space towards a

    global minimum. The global minimum is the theoretical solution with the lowest possible

    error. In most problems, the solution space is quite irregular with numerous pits and hills

    which may cause the network to settle down in a local minimum which is not the best

overall solution. This idea is depicted in the figure below.

Figure 15: The weights-versus-error space

Here, for clarity, the graph is drawn in two dimensions; in practice there are often many weights, say n, and the graph would be in n+1 dimensions.

Since the nature of the error-versus-weights space cannot be known a priori, one has to run several neural network analyses with different parameters to determine the best solution. The speed of learning can be controlled by the learning rate. Another parameter, momentum, helps the network to overcome obstacles (local minima) in the error surface and settle down at or near the global minimum. The issue of when to stop training is non-trivial. Training should not necessarily proceed to the global minimum: that point is by definition optimal for the training set, but may not be for an independent data set.
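As a sketch of how these two parameters enter the weight update, a gradient-descent step with a learning rate and a momentum term can be written as follows; the parameter values and the gradient function named in the usage comment are assumptions for illustration.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One gradient-descent step with momentum. The momentum term carries part
    of the previous update, helping the search roll over small local minima."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Hypothetical usage inside a training loop (grad_of_error is assumed):
# w, velocity = np.zeros(5), np.zeros(5)
# w, velocity = sgd_momentum_step(w, grad_of_error(w), velocity)
```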

The math and algorithm are as follows [12].

The main objective in neural model development is to find an optimal set of weight parameters w, such that y = y(x, w) closely represents (approximates) the original problem behavior. This is achieved through a process called training (that is, optimization in w-space). A set of training data is presented to the neural network. The data are pairs (x_k, d_k), k = 1, 2, ..., P, where d_k is the desired output of the neural model for input x_k, and P is the total number of training samples.

During training, the neural network performance is evaluated by computing the difference between the actual network outputs and the desired outputs for all the training samples. The difference, also known as the error, is quantified by

E(w) = (1/2) Σ_{k ∈ T_r} Σ_j ( d_{jk} − y_j(x_k, w) )²        (1)

where d_{jk} is the jth element of d_k, y_j(x_k, w) is the jth neural network output for the input x_k, and T_r is an index set of the training data. The weight parameters w are adjusted during training such that this error is minimized.


Training Process:

The first step in training is to initialize the weight parameters w; small random values are usually suggested. During training, w is updated along the negative direction of the gradient of E, as

w ← w − η (∂E/∂w)

until E becomes small enough. Here, the parameter η is called the learning rate. If we use just one training sample at a time to update w, then a per-sample error function E_k, given by

E_k(w) = (1/2) Σ_j ( d_{jk} − y_j(x_k, w) )²        (2)

is used and w is updated as w ← w − η (∂E_k/∂w). The following sub-section describes how the error back-propagation process can be used to compute the gradient information ∂E_k/∂w.

Error Back Propagation:

Using the definition of E_k in (2), the derivative of E_k with respect to the weight parameters of the lth layer can be computed by simple differentiation as

(3)

and

(4)

The gradient ∂E_k/∂z_i^L can be initialized at the output layer as

(5)

using the error between the neural network outputs and the desired outputs (training data). Subsequent derivatives ∂E_k/∂z_i^l are computed by back-propagating this error from the (l+1)th layer to the lth layer (see the figure below) as

(6)


Figure 16: Relationship between the ith neuron of the lth layer and the neurons of layers l-1 and l+1

For example, if the MLP uses the sigmoid as the hidden neuron activation function,

(7)

(8)

and

(9)

For the same MLP network, let δ_i^l be defined as the derivative of E_k with respect to the activity of the ith neuron of the lth layer, i.e. the local gradient at that neuron. The back-propagation process is then given by

(10)

(11)

and the derivative with respect to the weights is

(12)
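A compact Python sketch of per-sample training with error back-propagation for a one-hidden-layer sigmoid network is shown below. The network size, learning rate, initialization, and the use of the 2-bit parity data are assumptions made for illustration; this is not the Matlab demo and does not reproduce the source's exact notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, b1, W2, b2, lr=0.5):
    """One per-sample update minimizing E_k = 0.5 * sum((d - y)^2)."""
    # Forward pass.
    z1 = sigmoid(W1 @ x + b1)            # hidden layer outputs
    y = sigmoid(W2 @ z1 + b2)            # network outputs
    # Backward pass: local gradients (deltas) at the output and hidden layers.
    delta2 = (y - d) * y * (1.0 - y)
    delta1 = (W2.T @ delta2) * z1 * (1.0 - z1)
    # Gradient-descent weight updates (in place).
    W2 -= lr * np.outer(delta2, z1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return 0.5 * float(((d - y) ** 2).sum())

# Example: train on the 2-bit parity (XOR) data, using 0.1 and 0.9 as targets.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
samples = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.1)]
for epoch in range(5000):
    for x, d in samples:
        backprop_step(np.array(x, dtype=float), np.array([d]), W1, b1, W2, b2)
```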

The algorithm in pictorial representation is given in the figure below.

Figure 17: Error back-propagation algorithm steps

The Matlab neural network toolbox has a demonstration of the error back-propagation algorithm, showing the change of the error with respect to different combinations of weights for a two-layered network. It also shows how it is possible to end up with the weights corresponding to a local minimum. The figures below show the Matlab demo.


    Figure 18 : Variation of error with respect to layer one weights


Figure 19: Two arbitrarily chosen points on the graph depict the weight values that will be obtained by the algorithm

Integration of Fuzzy Systems and Neural Networks:

    Neural networks process numerical information and exhibit learning capability. Fuzzy

    systems can process linguistic information and represent, say, experts' knowledge by

    fuzzy rules. Thus, the fusion of these two technologies is the current research trend. The

    aim is to be able to create machines with more intelligent behavior [13].


    Some of the motivations for considering both fuzzy systems and Neural Networks:

    (1) The Knowledge Base of a fuzzy system consists of a collection of "If... Then..." rules

    in which linguistic labels are modeled by membership functions.

    Neural Networks can be used to produce membership functions when available data are

    numerical.

    (2) Moreover, one can take advantage of the learning capability of neural networks to

    adjust membership functions, say in control strategies, to enhance control precision.

    (3) Neural Networks can be used to provide learning methods for fuzzy inference

    procedures.

(4) In the opposite direction, one can use fuzzy reasoning architectures to construct new neural networks.

    (5) One can also fuzzify the Neural Networks architecture to enlarge the domain of

    applications.

    (6) The fusion of Neural Networks and Fuzzy Systems is essentially based upon the fact

    that Neural Networks can learn experts' knowledge (through numerical data) and Fuzzy

    Systems can represent experts' knowledge (through the representation of in-out relation

    by fuzzy reasoning).

The literature basically describes two types of combination.

Neural-fuzzy systems: In this type of system the learning ability of neural networks is utilized to realize the key components of a general fuzzy logic inference system; for example, neural networks are used to realize fuzzy membership functions.


Fuzzy-neural network systems: These models incorporate fuzzy principles into neural networks to create more flexible and robust systems. The neural network model and algorithm can themselves be fuzzified, for example as fuzzy neurons, fuzzified neural models, and neural networks with fuzzy training.

Development in this field is still in progress; there are different proposals for building these integrated systems, and many of the algorithms are still at the proposal stage. For a more detailed explanation of the different types of combinations and proposals, refer to [14].

    REFERENCES

    [1] http://www.palantir.swarthmore.edu/loicz/help/clustering.htm

[2] Clustering, Connectionist and Statistical Language Processing, Frank Keller, Saarland University.

    [3] http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust3_frm.html


[4] Refining Initial Points for K-Means Clustering, P. S. Bradley, Computer Sciences Department, University of Wisconsin, and Usama M. Fayyad, Microsoft Research, Redmond, WA.

    [5] http://www.geocities.com/mohamedqasem/vectorquantization/vq.html

    [6] Fuzzy Logic In Fault Diagnosis, Dr. Tracy Dalton, University of Duisburg,

    Germany

    [7] Bezdek J.C., Pattern recognition with fuzzy objective functions algorithms, Plenum

    Press, New York, 1991.

[8] Adaptive Fuzzy Monitoring and Fault Detection, Stefano Marsili-Libelli.

[9] An individual project within MISB-420-0, Author: Daniel Klerfors, Professor: Dr. Terry L. Huston, St. Louis University (http://hem.hj.se/~de96klda/NeuralNetworks.htm).

[10] Posted notes of Prof. Carl G. Looney, Computer Science Department, University of Nevada (http://ultima.cs.unr.edu/cs773b/CHAP3.pdf).

    [11] http://www-binf.bio.uu.nl/BPA/NIntro.pdf

    [12] http://www.ieee.cz/knihovna/Zhang/Zhang100-ch03.pdf

    [13] Collection from various websites

[14] Chin-Teng Lin and C. S. George Lee, Neural Fuzzy Systems, Prentice Hall, NJ, 1996.